Choosing the right data serialization format affects performance, storage costs, and developer experience. Different formats excel in different scenarios. Here's a comparison to help you choose wisely.
Format Overview
Comparison Matrix
Before diving into each format, this comparison matrix provides a quick reference for choosing based on your primary requirements. Consider human readability for debugging and configuration files, schema requirements for type safety, and size/speed for high-throughput systems.
Format      | Human    | Schema   | Size   | Speed  | Use Case
            | Readable |          |        |        |
------------+----------+----------+--------+--------+----------------------
JSON        | Yes      | Optional | Medium | Medium | APIs, configs
YAML        | Yes      | Optional | Medium | Slow   | Configs, k8s
XML         | Yes      | Optional | Large  | Slow   | Enterprise, SOAP
Protobuf    | No       | Required | Small  | Fast   | Microservices
Avro        | No       | Required | Small  | Fast   | Big data, Kafka
MessagePack | No       | Optional | Small  | Fast   | Cache, sockets
CBOR        | No       | Optional | Small  | Fast   | IoT, constrained
Parquet     | No       | Required | Small  | Fast   | Analytics, data lake
The trade-offs between human readability and performance are fundamental. Development-time convenience often favors readable formats, while production performance favors binary formats.
JSON
Characteristics
JSON has become the lingua franca of web APIs due to its simplicity and universal support. Its self-describing nature means you can understand the data structure without external documentation, making it excellent for debugging and exploration.
This example shows a typical JSON payload with various data types including nested objects, arrays, and null values.
{
    "user": {
        "id": 12345,
        "name": "John Doe",
        "email": "john@example.com",
        "roles": ["admin", "user"],
        "active": true,
        "metadata": null
    }
}
Pros and Cons
Understanding JSON's limitations helps you decide when to use alternatives. The lack of binary and date types requires workarounds that add overhead and potential for inconsistency.
Pros:
+ Human readable/writable
+ Universal support (every language)
+ No schema required
+ Easy debugging
+ Browser native (JavaScript)
Cons:
- No native binary type (base64 encoding)
- No date type (string representation)
- Verbose (field names repeated)
- No comments allowed
- Slower parsing than binary formats
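The binary and date limitations lead to a standard pair of workarounds, sketched below in Python (field names are illustrative): binary data is base64-encoded into a string, and timestamps become ISO 8601 strings. Both only work if the receiver knows which fields to decode.

```python
import base64
import json
from datetime import datetime, timezone

# JSON has no binary or date types, so both are smuggled in as strings.
payload = {
    "avatar": base64.b64encode(b"\x00\x01\x02\xff").decode("ascii"),  # 4 bytes -> 8 chars
    "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
}
encoded = json.dumps(payload)

# The receiver must know which fields to decode back.
decoded = json.loads(encoded)
raw = base64.b64decode(decoded["avatar"])
when = datetime.fromisoformat(decoded["created_at"])
```

Note the overhead: base64 inflates binary data by roughly a third, since every 3 raw bytes become 4 characters.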
Best Practices
When working with JSON in production, optimize encoding settings and consider streaming for large datasets. The following PHP examples show common optimizations for reducing payload size and handling large data efficiently.
For small to medium payloads, encoding options can reduce size. For large datasets, streaming prevents memory exhaustion by processing records incrementally.
// Efficient JSON encoding
$data = ['users' => $users, 'total' => count($users)];

// Default
$json = json_encode($data);

// Compact: skip escaping of slashes and multibyte characters
$json = json_encode($data, JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);

// Streaming for large data: write one record at a time
$stream = fopen('php://output', 'w');
fwrite($stream, '[');
$first = true;
foreach ($users as $user) {
    if (!$first) {
        fwrite($stream, ',');
    }
    fwrite($stream, json_encode($user));
    $first = false;
}
fwrite($stream, ']');
fclose($stream);
The streaming approach processes records one at a time, avoiding memory exhaustion when encoding millions of records. This is essential for export functionality and bulk data operations.
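The same incremental pattern translates to other languages. Here is a Python sketch (names are illustrative) that writes a JSON array element by element through any `write` callable, so memory stays flat regardless of record count:

```python
import io
import json

def stream_json_array(records, write):
    """Write a JSON array incrementally instead of building one giant string."""
    write("[")
    for i, record in enumerate(records):
        if i:
            write(",")  # comma before every element except the first
        write(json.dumps(record, separators=(",", ":")))
    write("]")

# Works with any sink: a file, a socket, or an HTTP response body.
buf = io.StringIO()
stream_json_array(({"id": i} for i in range(3)), buf.write)
print(buf.getvalue())  # [{"id":0},{"id":1},{"id":2}]
```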
Protocol Buffers
Schema Definition
Protocol Buffers (Protobuf) use a strongly-typed schema that defines message structure. The schema serves as documentation and enables code generation for type-safe serialization in any supported language.
The field numbers in a Protobuf schema are crucial for wire format compatibility. They identify fields in the binary encoding and must remain stable across schema versions.
// user.proto
syntax = "proto3";

package myapp;

import "google/protobuf/timestamp.proto";

message User {
  int64 id = 1;
  string name = 2;
  string email = 3;
  repeated string roles = 4;
  bool active = 5;
  optional string metadata = 6;
  google.protobuf.Timestamp created_at = 7;
}

message UserList {
  repeated User users = 1;
  int32 total = 2;
}
Because the wire format identifies fields by number rather than name, you can rename a field freely, and adding new fields with fresh numbers never breaks existing consumers: old code simply skips numbers it doesn't recognize.
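To see why only the number matters, here is a minimal sketch of the proto3 wire encoding: each field is prefixed by a tag combining its number and wire type, and integers use base-128 varints. The helpers below are illustrative, not the official library.

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Tag = (field_number << 3) | wire_type; wire type 0 is varint."""
    return encode_varint((field_number << 3) | wire_type)

# Field 1 (int64 id) with value 12345: only the number 1 goes on the wire.
wire = encode_tag(1, 0) + encode_varint(12345)
print(wire.hex())  # 08b960
```

The field name "id" appears nowhere in those three bytes, which is why renaming is free and reusing a number is dangerous.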
Usage
After defining your schema, use the Protocol Buffers compiler to generate language-specific classes. These generated classes provide type-safe methods for building and parsing messages.
This example shows the workflow from generating code to serializing and deserializing data. The size comparison demonstrates why Protobuf is preferred for high-throughput systems.
// Generate classes: protoc --php_out=. user.proto
use Myapp\User;
use Myapp\UserList;
// Serialize
$user = new User();
$user->setId(12345);
$user->setName('John Doe');
$user->setEmail('john@example.com');
$user->setRoles(['admin', 'user']);
$user->setActive(true);
$binary = $user->serializeToString();
// Deserialize
$decoded = new User();
$decoded->mergeFromString($binary);
echo $decoded->getName(); // "John Doe"
// Size comparison for same data:
// JSON: 156 bytes
// Protobuf: 52 bytes (67% smaller)
The 67% size reduction compounds across millions of messages. For high-throughput systems processing billions of events, this translates to significant bandwidth and storage savings.
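The arithmetic behind that claim, using the byte counts above (the daily message volume is an illustrative assumption):

```python
json_size, proto_size = 156, 52   # bytes per message, from the comparison above
messages_per_day = 1_000_000_000  # assumed: 1B events/day

saved = (json_size - proto_size) * messages_per_day
print(f"{saved / 1e9:.0f} GB/day saved")  # 104 GB/day saved
```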
Schema Evolution
Protobuf supports safe schema evolution as long as you follow certain rules. New fields can be added, and old clients will simply ignore them, maintaining backward compatibility.
These evolution rules ensure that old clients can read new data and new clients can read old data. The key is never reusing field numbers and remembering that in proto3, fields absent from the wire decode to their zero-value defaults.
// v1
message User {
  int64 id = 1;
  string name = 2;
}

// v2 - Adding fields (backward compatible)
message User {
  int64 id = 1;
  string name = 2;
  string email = 3;       // New field - old clients ignore
  optional int32 age = 4; // Optional field
}

// Rules:
// - Never reuse field numbers
// - Use optional for nullable fields
// - Add new fields with new numbers
// - Deprecated fields: keep number reserved
Reserve field numbers of deleted fields to prevent accidental reuse. This protects against subtle bugs when different versions of your schema interpret the same field number differently.
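In proto syntax, reserving looks like this (assuming the `email` field from the earlier example was later removed):

```proto
message User {
  reserved 3;       // old email field number
  reserved "email"; // optionally reserve the name as well

  int64 id = 1;
  string name = 2;
  optional int32 age = 4;
}
```

With the reservation in place, `protoc` rejects any future attempt to declare a field with number 3.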
Apache Avro
Schema
Avro uses JSON-based schemas that are stored with the data, making it self-describing while still being compact. This is particularly valuable in data processing pipelines where data may be read years after it was written.
The schema is verbose but human-readable, and it's stored alongside the data in Avro files, ensuring you can always decode the data even without access to the original schema definition.
{
    "type": "record",
    "name": "User",
    "namespace": "com.myapp",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "roles", "type": {"type": "array", "items": "string"}},
        {"name": "active", "type": "boolean"},
        {"name": "metadata", "type": ["null", "string"], "default": null}
    ]
}
Kafka Integration
Avro shines when combined with a Schema Registry for managing schema evolution in event-driven architectures. The registry ensures producers and consumers agree on message format while allowing controlled evolution.
This Python example shows the typical Avro-Kafka integration pattern. The Schema Registry validates schema compatibility before accepting new versions.
from confluent_kafka import Producer
from confluent_kafka.serialization import SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

# Schema Registry for schema management
schema_registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})

# Avro serializer with schema evolution
avro_serializer = AvroSerializer(
    schema_registry,
    schema_str,
    to_dict=lambda user, ctx: user.__dict__
)

# Produce message
producer = Producer({'bootstrap.servers': 'kafka:9092'})
producer.produce(
    topic='users',
    key=str(user.id),
    value=avro_serializer(user, SerializationContext('users', MessageField.VALUE))
)
Because compatibility is checked at registration time, an incompatible schema is rejected before any producer can publish with it, so consumers never encounter messages they cannot decode.
Schema Evolution
Avro provides explicit compatibility modes that control how schemas can evolve. Understanding these modes helps you design schemas that can grow with your system without breaking existing data or consumers.
When adding new fields to an Avro schema, you must provide default values for backward compatibility. This allows old records without the new field to be read by the new schema.
// v1 schema
{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"}
    ]
}

// v2 schema - backward compatible
{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""} // Default required!
    ]
}

// Compatibility modes:
// BACKWARD: New schema can read old data
// FORWARD: Old schema can read new data
// FULL: Both directions
The default value is essential for backward compatibility. Without it, old records that lack the email field cannot be read by the new schema.
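A simplified Python sketch of how a reader applies those defaults during schema resolution (real Avro resolution also handles type promotion, unions, and aliases; the function and names here are illustrative):

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Fill fields missing from an old record with the reader schema's defaults."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out

v2_fields = [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]

# An old v1 record lacks 'email'; the default makes it readable.
print(resolve({"id": 1, "name": "Alice"}, v2_fields))
# {'id': 1, 'name': 'Alice', 'email': ''}
```

Drop the `"default"` key from the email field and the same old record raises an error, which is exactly the failure the Schema Registry's compatibility check prevents.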
MessagePack
Usage
MessagePack provides a binary alternative to JSON that's faster and smaller while maintaining JSON's schemaless flexibility. It's excellent for caching and inter-process communication where you want JSON's convenience with better performance.
The API is nearly identical to JSON, making it easy to adopt. The size and speed improvements shown here are typical for structured data.
// Very similar to JSON but binary
// Encode
$data = [
'id' => 12345,
'name' => 'John Doe',
'roles' => ['admin', 'user'],
];
$packed = msgpack_pack($data);
// Decode
$unpacked = msgpack_unpack($packed);
// Size comparison:
// JSON: 68 bytes
// MessagePack: 45 bytes (34% smaller)
// Speed comparison (1M iterations):
// JSON encode: 2.1 seconds
// MessagePack: 0.8 seconds (2.6x faster)
MessagePack distinguishes integers from floats on the wire, whereas JSON numbers carry no such distinction and are decoded as IEEE 754 doubles by JavaScript and many other parsers. This matters when you need exact large integers or want to tell 1 apart from 1.0.
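You can see the double-precision limit directly: above 2^53, consecutive integers collapse to the same float, which is what happens to large JSON numbers in any IEEE 754-based parser.

```python
# 2**53 is the largest integer a 64-bit float can count up to exactly.
n = 2**53
print(float(n + 1) == float(n))  # True: n + 1 is lost in double precision
print(n + 1 == n)                # False: Python ints are exact
```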
Redis Caching
MessagePack's compact size and fast serialization make it ideal for caching. The smaller payloads reduce Redis memory usage, and the faster serialization reduces CPU time for cache operations.
This cache wrapper demonstrates the pattern for using MessagePack with Redis. The implementation is straightforward and provides immediate benefits for memory and CPU usage.
class MessagePackCache
{
    public function __construct(private \Redis $redis)
    {
    }

    public function set(string $key, mixed $value, int $ttl = 3600): void
    {
        $packed = msgpack_pack($value);
        $this->redis->setex($key, $ttl, $packed);
    }

    public function get(string $key): mixed
    {
        $packed = $this->redis->get($key);
        if ($packed === false) { // phpredis returns false on a cache miss
            return null;
        }
        return msgpack_unpack($packed);
    }
}
// Benefits for caching:
// - Smaller memory footprint
// - Faster serialization
// - Preserves types (unlike JSON numbers)
The memory savings compound when caching large datasets. A 34% reduction in cached object size means your Redis instance can store 50% more objects before needing to evict or upgrade.
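The capacity math behind that figure: a 34% size reduction means each object takes 66% of the space, so the same memory holds about 1.5x as many objects.

```python
reduction = 0.34                       # from the size comparison above
capacity_factor = 1 / (1 - reduction)  # ~1.52x as many objects in the same memory
print(f"{(capacity_factor - 1) * 100:.0f}% more objects fit")  # 52% more objects fit
```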
Parquet (Columnar)
For Analytics
Parquet stores data in columnar format, which is fundamentally different from row-based formats like JSON or Protobuf. This columnar layout provides dramatic performance improvements for analytical queries that access only a subset of columns.
When reading Parquet files, you can specify which columns to load, avoiding I/O for columns you don't need. Predicate pushdown further optimizes by skipping row groups that don't match your filter.
import pyarrow as pa
import pyarrow.parquet as pq

# Create table
data = {
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'amount': [100.5, 200.0, 150.75, 300.25, 50.0],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'],
}
table = pa.Table.from_pydict(data)

# Write to Parquet
pq.write_table(table, 'data.parquet', compression='snappy')

# Read specific columns only (columnar benefit)
table = pq.read_table('data.parquet', columns=['id', 'amount'])

# Predicate pushdown
table = pq.read_table('data.parquet', filters=[('amount', '>', 100)])
Predicate pushdown allows Parquet readers to skip entire row groups that don't match your filter criteria, reading only the data that could possibly satisfy your query.
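A toy simulation of the mechanism, assuming per-group min/max statistics like those stored in a Parquet file footer (the structure and names here are illustrative, not the pyarrow API):

```python
# Each "row group" carries min/max stats, as in a Parquet file footer.
row_groups = [
    {"min": 50.0, "max": 100.5, "values": [100.5, 50.0, 75.0]},
    {"min": 150.75, "max": 300.25, "values": [200.0, 150.75, 300.25]},
    {"min": 10.0, "max": 90.0, "values": [10.0, 90.0, 42.0]},
]

def scan_amount_gt(groups, threshold):
    """Skip whole groups whose max rules out `amount > threshold`."""
    skipped, matches = 0, []
    for group in groups:
        if group["max"] <= threshold:
            skipped += 1  # group eliminated without reading its values
            continue
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, skipped

matches, skipped = scan_amount_gt(row_groups, 100)
print(matches, skipped)  # [100.5, 200.0, 150.75, 300.25] 1
```

The third group is eliminated from its statistics alone, which in a real file means its pages are never read from disk.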
Why Columnar for Analytics
The columnar format fundamentally changes I/O patterns for analytical queries. Instead of reading entire rows to access a single column, you read only the columns you need.
This comparison illustrates why columnar formats dominate in analytics. An aggregation query over one column reads only that column's data, regardless of how many other columns exist.
Row format (JSON, Avro):
Row 1: [id:1, name:Alice, amount:100.5, date:2024-01-01]
Row 2: [id:2, name:Bob, amount:200.0, date:2024-01-02]
Query: SELECT SUM(amount) FROM users
Reads: All columns for all rows
Columnar format (Parquet):
Column 'id': [1, 2, 3, 4, 5]
Column 'name': [Alice, Bob, Charlie, Diana, Eve]
Column 'amount': [100.5, 200.0, 150.75, 300.25, 50.0]
Column 'date': [2024-01-01, ...]
Query: SELECT SUM(amount) FROM users
Reads: Only 'amount' column
Benefits:
- Much less I/O for analytics queries
- Better compression (similar values together)
- Vectorized processing
Columnar compression works exceptionally well because similar values are stored together. A column of country codes compresses much better than countries scattered across rows.
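The compression claim is easy to verify with the standard library: a column of grouped country codes compresses far better than the same values interleaved in row order (the data here is illustrative).

```python
import random
import zlib

codes = ["US"] * 400 + ["DE"] * 300 + ["JP"] * 300

# Columnar layout: identical values sit next to each other.
columnar = "".join(codes).encode()

# Row-like layout: the same values scattered across rows.
random.seed(0)
shuffled = codes[:]
random.shuffle(shuffled)
row_like = "".join(shuffled).encode()

print(len(zlib.compress(columnar)) < len(zlib.compress(row_like)))  # True
```

Long runs of identical values collapse to almost nothing, while the shuffled layout forces the compressor to encode every transition.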
Choosing the Right Format
Decision Guide
This decision tree helps you select the appropriate format based on your specific use case. Start with your primary requirement and follow the branches to find the best match.
API responses:
├── External API → JSON (universal)
├── Internal microservices → Protobuf (performance)
└── GraphQL → JSON (native)
Configuration:
├── Human-edited → YAML or JSON
├── Machine-generated → JSON
└── Complex/nested → YAML
Caching:
├── Simple values → Native (string, int)
├── Complex objects → MessagePack
└── Large datasets → Compressed JSON
Message Queues:
├── Schema-enforced → Avro with Schema Registry
├── High throughput → Protobuf
└── Flexible → JSON
Data Lake/Analytics:
├── Columnar queries → Parquet
├── Streaming → Avro
└── Log files → JSON Lines (JSONL)
Performance Benchmarks
These benchmarks demonstrate the performance differences between formats for typical workloads. Your actual results will vary based on data characteristics and hardware.
Serialization speed (1M objects):
MessagePack: 0.8s
Protobuf: 1.2s
JSON: 2.1s
YAML: 15.0s
Deserialization speed (1M objects):
Protobuf: 0.9s
MessagePack: 1.1s
JSON: 3.5s
YAML: 20.0s
Size (same data):
Protobuf: 52 bytes (baseline)
MessagePack: 65 bytes (+25%)
JSON: 156 bytes (+200%)
YAML: 180 bytes (+246%)
YAML's slow performance makes it unsuitable for high-throughput data processing, but its readability makes it excellent for configuration files that are parsed once at startup.
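A minimal harness for producing such numbers on your own data, using only the standard library (json as the text format, with pickle standing in for a binary one; absolute times depend entirely on hardware and data shape):

```python
import json
import pickle
import timeit

record = {"id": 12345, "name": "John Doe", "roles": ["admin", "user"], "active": True}
n = 100_000

# Measure encode time and payload size for each format.
json_time = timeit.timeit(lambda: json.dumps(record), number=n)
pickle_time = timeit.timeit(
    lambda: pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL), number=n
)

print(f"json:   {json_time:.3f}s for {n} encodes, {len(json.dumps(record))} bytes each")
print(f"pickle: {pickle_time:.3f}s for {n} encodes, {len(pickle.dumps(record))} bytes each")
```

Swap in msgpack, protobuf, or yaml encoders to reproduce the table above for your actual payloads.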
Hybrid Approaches
Real systems often use multiple formats at different boundaries. An API gateway might accept JSON from external clients, convert to Protobuf for internal communication, and use MessagePack for caching. This combines the benefits of each format where they matter most.
This example shows format translation at system boundaries. External clients use familiar JSON while internal services benefit from binary formats. The cache uses MessagePack for efficiency.
// API Gateway pattern
class ApiGateway
{
    public function handleRequest(Request $request): Response
    {
        // Accept JSON from external clients
        $data = json_decode($request->getContent(), true);

        // Convert to Protobuf for internal services
        $proto = $this->toProtobuf($data);
        $response = $this->internalService->call($proto);

        // Convert back to JSON for the response
        return response()->json($this->fromProtobuf($response));
    }
}

// Cache with MessagePack, respond with JSON
class UserService
{
    public function getUser(int $id): array
    {
        // Try the MessagePack cache first
        $cached = $this->cache->get("user:{$id}");
        if ($cached !== false) {
            return msgpack_unpack($cached);
        }

        $user = $this->repository->find($id);
        $this->cache->set("user:{$id}", msgpack_pack($user));

        return $user; // serialized to JSON by the HTTP layer
    }
}
The conversion overhead is usually negligible compared to network latency. The performance gains from using efficient internal formats typically far outweigh the cost of format conversion at boundaries.
Conclusion
JSON remains the default for APIs due to universal support and human readability. Use Protobuf or MessagePack when performance and size matter for internal services. Avro with Schema Registry works well for Kafka and data pipelines with schema evolution needs. Parquet excels for analytics workloads with columnar queries. Match the format to your use case rather than using one format everywhere. Consider hybrid approaches where external interfaces use JSON while internal communication uses binary formats.