jsonlines: Why Python Developers Keep Reinventing the JSON Lines Wheel

Hook

Every Python developer has written this code at least once: for line in f: json.loads(line). Then they've debugged why their production logs fail to parse at 3 AM.

Context

JSON Lines (also called JSONL or NDJSON) has become the de facto standard for streaming JSON data, log aggregation, and machine learning datasets. The format is brilliantly simple: one valid JSON object per line, separated by newline characters. Tools like Elasticsearch, Logstash, and BigQuery natively support it. Data scientists use it for training datasets because it's trivially parallelizable—you can split a 100GB file by lines and process chunks independently.

But here's the paradox: despite its conceptual simplicity, correctly handling JSON Lines in Python is surprisingly error-prone. Do you use \n or \r\n? What happens when JSON strings contain embedded newlines? How do you handle trailing newlines in files? Should you open files in text or binary mode? These questions lead to Stack Overflow searches and copy-pasted code that works in development but breaks in production when encountering real-world data quirks. The wbolster/jsonlines library emerged to solve this repetitive problem with a minimal, focused API.
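Several of these questions can be probed with the standard library alone. The snippet below is an illustrative sketch with a made-up record, not anything taken from jsonlines itself: it checks that json.dumps escapes newlines inside strings (so a record never spans two lines), that json.loads tolerates a leftover \r\n, and that a blank line is rejected.

```python
import json

# A made-up record whose string value contains a real newline character.
record = {"msg": "first line\nsecond line"}

# json.dumps escapes the newline as a backslash followed by "n",
# so the serialized record still fits on one physical line.
encoded = json.dumps(record)
has_raw_newline = "\n" in encoded

# json.loads ignores surrounding whitespace, so a Windows line ending
# left on the line is harmless.
decoded = json.loads(encoded + "\r\n")

# A blank line, however, is not valid JSON and raises an error.
try:
    json.loads("\n")
    blank_line_ok = True
except json.JSONDecodeError:
    blank_line_ok = False
```

The blank-line case is the one that tends to bite in practice: a trailing empty line at the end of a file is enough to crash a loop that calls json.loads on every line.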

Technical Insight

The jsonlines library provides two primary interfaces: a Reader for consuming JSONL data and a Writer for producing it. The architecture is deliberately thin—it's a focused wrapper around Python's standard library rather than a feature-bloated framework. Let's examine the read path first:

import json
import jsonlines

# The traditional approach (fragile)
with open('events.jsonl', 'r') as f:
    for line in f:
        if line.strip():  # Don't forget this!
            event = json.loads(line)
            process(event)

# With jsonlines (robust)
with jsonlines.open('events.jsonl') as reader:
    for event in reader:
        process(event)

The difference appears cosmetic, but the library handles several edge cases for you. It accepts both \n and \r\n line endings and reports a line that fails to parse as its own InvalidLineError rather than a bare json.JSONDecodeError. One caveat: unlike the hand-rolled loop above, the reader does not skip blank lines by default; iterating with reader.iter(skip_empty=True) opts into that behavior. It also provides a proper iterator interface that integrates cleanly with Python's iteration protocol.

The write path showcases the library's real value proposition. Writing JSON Lines correctly requires precise newline handling and ensuring compact JSON output (no pretty-printing that would introduce line breaks within objects):

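import json
import jsonlines
from decimal import Decimal
from functools import partial

events = [
    {'timestamp': '2024-01-15', 'revenue': Decimal('129.99')},
    {'timestamp': '2024-01-16', 'revenue': Decimal('89.50')}
]

# json.dumps cannot serialize Decimal by default, so pass a dumps callable
# that falls back to str() for types it does not know
decimal_dumps = partial(json.dumps, default=str)

# Writes one JSON document per line, each terminated by a newline
with jsonlines.open('output.jsonl', mode='w', dumps=decimal_dumps) as writer:
    writer.write_all(events)

# Or write one at a time, flushing after every record
# (generate_events() stands in for your own event source)
with jsonlines.open('stream.jsonl', mode='w', flush=True) as writer:
    for event in generate_events():
        writer.write(event)  # Flushed to the underlying file right away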
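For comparison, here is roughly what "compact output with precise newline handling" means when done by hand. This is a stdlib-only sketch; write_jsonl is a hypothetical helper name, and it is not a copy of how jsonlines implements its Writer.

```python
import io
import json

def write_jsonl(fp, records):
    """Write each record as one compact JSON line ending in a newline."""
    for record in records:
        # separators=(",", ":") drops the optional spaces; indent is left
        # unset so the output never contains a line break inside a record.
        fp.write(json.dumps(record, separators=(",", ":")) + "\n")

buffer = io.StringIO()
write_jsonl(buffer, [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}])
output = buffer.getvalue()
```

Every record ends with exactly one newline, including the last, which is the convention most JSONL consumers expect.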

One architectural decision worth highlighting: rather than re-exporting the json module's parameters one by one, the library accepts custom loads and dumps callables (alongside a few writer options such as compact and sort_keys). This means you can plug in a custom encoder, or wrap json.loads with parse_float=Decimal for financial data, without losing jsonlines' conveniences:

import json
import jsonlines
from datetime import datetime

class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)

with jsonlines.open('logs.jsonl', mode='w', dumps=CustomEncoder().encode) as writer:
    writer.write({'event': 'login', 'time': datetime.now()})
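The parse_float=Decimal case mentioned earlier can be handled the same way, by building the callables with functools.partial. The sketch below uses only the standard library; loads_decimal, dumps_decimal, and decimal_default are hypothetical names chosen for illustration.

```python
import json
from decimal import Decimal
from functools import partial

# A loads callable that parses JSON floats as Decimal instead of float.
loads_decimal = partial(json.loads, parse_float=Decimal)

def decimal_default(obj):
    """Serialize Decimal as a string; reject anything else unknown."""
    if isinstance(obj, Decimal):
        return str(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")

# A dumps callable that can handle Decimal values.
dumps_decimal = partial(json.dumps, default=decimal_default)

line = dumps_decimal({"revenue": Decimal("129.99")})
parsed = loads_decimal('{"revenue": 129.99}')
```

Assuming jsonlines.open takes a loads keyword in the same way the example above passes dumps, these could be supplied as jsonlines.open(path, loads=loads_decimal) for reading and jsonlines.open(path, mode='w', dumps=dumps_decimal) for writing. Note that this sketch writes Decimal values as JSON strings, so they do not round-trip back to numbers without extra handling.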

The library also handles the distinction between text and binary file objects. jsonlines.open() opens paths in text mode as UTF-8; when you instead wrap a binary file object in a Reader or Writer, lines are decoded from or encoded to UTF-8 for you. This matters when JSONL arrives from sources that yield bytes, such as sockets or compressed streams. Input that is not valid UTF-8 still fails, but the failure is tied to a specific line instead of surfacing as a bare UnicodeDecodeError from deep inside a loop.
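To make the bytes case concrete, here is a stdlib-only sketch of reading JSONL from a binary stream. The function name iter_jsonl_bytes and the sample data are invented for illustration; this shows the general technique, not the library's actual code.

```python
import io
import json

def iter_jsonl_bytes(fp):
    """Yield parsed records from a binary file-like object, decoding UTF-8."""
    for lineno, raw in enumerate(fp, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            # Report which line was undecodable instead of failing blindly.
            raise ValueError(f"line {lineno} is not valid UTF-8") from exc
        yield json.loads(text)

data = io.BytesIO('{"city": "Zürich"}\n{"city": "東京"}\n'.encode("utf-8"))
cities = [rec["city"] for rec in iter_jsonl_bytes(data)]
```

Decoding line by line keeps the error local: one bad byte sequence identifies its own line number rather than aborting an entire read.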

Performance-wise, the library introduces minimal overhead: benchmarks reportedly put it within about 5% of hand-rolled implementations using the json module directly. The abstraction cost is negligible because the library doesn't buffer entire files or do unnecessary data copying; it is essentially a thin loop that reads one line, parses it, and yields the result.

Gotcha

The jsonlines library shines for straightforward use cases but has clear boundaries you'll hit in production scenarios. First, it offers no throughput optimizations for genuinely massive files. It does stream line by line, so memory use stays flat even on a 50GB JSONL file, but parsing is sequential and no faster than a naive implementation. There's no chunking, no parallel processing support, and no memory mapping. For data engineering pipelines processing terabyte-scale logs, you'll need to build your own multiprocessing wrapper or reach for tools like Dask or Spark.
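A "build your own wrapper" approach might look like the sketch below: group lines into batches and hand each batch to an executor. All names here are hypothetical. The demo uses a ThreadPoolExecutor only to stay simple and portable; because JSON parsing is CPU-bound, a real pipeline would more likely pass a ProcessPoolExecutor, which exposes the same map interface.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def iter_batches(lines, batch_size):
    """Group an iterable of lines into lists of at most batch_size lines."""
    iterator = iter(lines)
    while batch := list(islice(iterator, batch_size)):
        yield batch

def parse_batch(batch):
    """Parse one batch of JSONL lines, skipping blank ones."""
    return [json.loads(line) for line in batch if line.strip()]

def parse_in_batches(lines, executor, batch_size=1000):
    """Parse batches on the given executor and yield records in input order.

    Note: executor.map submits every batch up front, so a pipeline over a
    huge file would need to bound the number of in-flight batches.
    """
    for records in executor.map(parse_batch, iter_batches(lines, batch_size)):
        yield from records

sample = [json.dumps({"n": i}) + "\n" for i in range(10)]
with ThreadPoolExecutor(max_workers=2) as pool:
    parsed = list(parse_in_batches(sample, pool, batch_size=3))
```

Batching matters because sending lines to workers one at a time would spend more on scheduling than on parsing.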

Second, error handling is basic. When a malformed JSON line appears at line 10,000 in your file, the library raises jsonlines.InvalidLineError, which records the offending line and its line number, and reader.iter(skip_invalid=True) lets iteration skip bad lines and continue. Beyond that there is little: skipped lines are dropped silently, and there is no schema validation beyond a simple type check and no logging hooks. Production data pipelines often need to handle "mostly valid" JSONL files where 0.1% of lines are corrupted and must be set aside for inspection; for that scenario you are back to writing your own try/except loop. Additionally, the library has seen minimal updates recently (despite being in conda distributions), which suggests maintenance is sporadic rather than active. For mission-critical applications, this maintenance velocity matters.
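When corrupted lines need to be recorded rather than lost, a small loop over the standard library is enough. The following is a minimal sketch under that assumption; read_tolerant is a hypothetical helper, and the "dirty" input is fabricated for the example.

```python
import io
import json

def read_tolerant(fp):
    """Return (records, errors); errors holds (line_number, message) pairs."""
    records, errors = [], []
    for lineno, line in enumerate(fp, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Keep the line number so the bad record can be found later.
            errors.append((lineno, exc.msg))
    return records, errors

dirty = io.StringIO('{"ok": 1}\n{broken\n\n{"ok": 2}\n')
records, errors = read_tolerant(dirty)
```

Returning the errors alongside the records leaves the policy decision (log, quarantine, or abort above some failure rate) to the caller.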

Verdict

Use if: You're building data processing scripts, ETL pipelines, or log parsers where JSON Lines is the input/output format and you want clean, readable code without reinventing newline handling. It's perfect for medium-sized datasets (gigabytes, not terabytes) where correctness and code clarity matter more than squeezing out microseconds. The library is especially valuable in team environments where you want to enforce consistent JSONL handling patterns across codebases.

Skip if: You're processing truly massive files requiring parallel processing or streaming optimizations, you need sophisticated error recovery for malformed data, or you're already using pandas (which has read_json(lines=True) built-in). Also skip if you're writing a library yourself and want to minimize dependencies: the jsonlines API is simple enough to implement in 30 lines if you study its source code. For quick scripts where you're already comfortable with json.loads(), the abstraction might feel like unnecessary indirection.
