
Kafka Serializers Avro Pt1

Serializing with Apache Avro


🧩 What is Avro?

Apache Avro is a data serialization system: it defines how data is structured, stored, and transmitted between systems in a compact and efficient way.

It was created by Doug Cutting (also the creator of Hadoop and Lucene) to solve a common problem in distributed systems:

“How do you exchange complex structured data between systems written in different languages, without losing type information or causing compatibility issues?”


⚙️ How Avro Works

At its core, Avro works with two components:

  1. Schema – defines the structure (fields, types, defaults) of your data.
  2. Data – the actual binary or JSON-encoded data following that schema.

So Avro separates “what the data looks like” (schema) from “what the data is” (values).


📄 Avro Schema Example (in JSON)

Here's how a simple Avro schema looks:

{
  "type": "record",
  "name": "Employee",
  "namespace": "com.company",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "department", "type": ["null", "string"], "default": null}
  ]
}

This defines a record with three fields:

  • id (integer)
  • name (string)
  • department (optional string that can be null)

💾 Serialization and Deserialization

  • Serialization: Converting data (e.g., a Python object) into Avro binary using the schema.
  • Deserialization: Reading Avro binary data back into a usable form using the same or compatible schema.

Because the schema is embedded in Avro files, they are self-describing: any system that supports Avro can read them.
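
To make this concrete, here is a minimal sketch of writing and reading the Employee record above in Python. It assumes the third-party fastavro library (the standard avro package works similarly). Because the schema travels inside the file, reading back needs no extra schema argument.

# Minimal Avro round trip with fastavro (pip install fastavro)
from io import BytesIO
import fastavro

schema = fastavro.parse_schema({
    "type": "record",
    "name": "Employee",
    "namespace": "com.company",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "department", "type": ["null", "string"], "default": None},
    ],
})

records = [{"id": 1, "name": "Alice", "department": "Data"}]

# Serialization: write an Avro object container file (schema embedded in the header)
buf = BytesIO()
fastavro.writer(buf, schema, records)

# Deserialization: the file is self-describing, so no schema argument is needed
buf.seek(0)
for record in fastavro.reader(buf):
    print(record)   # {'id': 1, 'name': 'Alice', 'department': 'Data'}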


🧠 Why Avro Is Special

Here are Avro's key advantages:

| Feature | Description |
| --- | --- |
| 🗂️ Schema-based | Data has a well-defined structure, described in JSON. |
| ⚡ Compact Binary Format | Binary encoding reduces file size and improves I/O performance. |
| 🔄 Schema Evolution | Schemas can change over time (adding or removing fields with defaults) without breaking existing consumers. |
| 🌐 Language Neutral | Works with many languages (Java, Python, C++, Go, etc.). |
| 💬 Great for Messaging Systems | Used in Kafka, Redpanda, and Schema Registry setups. |
| 🧱 Splittable and Compressible | Ideal for big data systems (Hadoop, Spark, Hive). |

🔄 Schema Evolution in Avro

This is the most powerful part, and the reason Avro is heavily used in Kafka.

Suppose your producer's old schema had only an id field:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"}
  ]
}

Now you update it to:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

✅ Avro allows backward and forward compatibility:

  • Backward-compatible: consumers using the new schema can read data written with the old schema (missing fields are filled in from their defaults), as sketched below.
  • Forward-compatible: consumers using the old schema can read data written with the new schema (new fields they don't know about are ignored).
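
As a quick illustration, here is a sketch using fastavro (an assumed dependency) with schemas mirroring the Employee example above: data written with the old schema is read through the new reader schema, and the missing email field falls back to its default.

# Backward compatibility: old data, new reader schema
from io import BytesIO
import fastavro

old_schema = fastavro.parse_schema({
    "type": "record", "name": "Employee",
    "fields": [{"name": "id", "type": "int"}],
})

new_schema = fastavro.parse_schema({
    "type": "record", "name": "Employee",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

buf = BytesIO()
fastavro.writer(buf, old_schema, [{"id": 7}])   # written before the schema change

buf.seek(0)
for record in fastavro.reader(buf, reader_schema=new_schema):
    print(record)   # {'id': 7, 'email': None}  <- the default fills the new field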

That's why Kafka uses Avro with a Schema Registry: to ensure producers and consumers can evolve independently.


🧰 Where Avro is Commonly Used

| Use Case | Description |
| --- | --- |
| 🪣 Data Lakes (HDFS, S3) | Store schema-defined data for Spark/Hive. |
| 🧵 Kafka Messaging | Producers publish Avro messages; Schema Registry keeps schema versions. |
| 🧬 ETL Pipelines | Efficient and schema-safe data transfer between stages. |
| 🧮 Analytics | Compact binary format makes large datasets efficient to query. |

📊 Avro vs. Other Formats

| Feature | Avro | JSON | Parquet | ORC |
| --- | --- | --- | --- | --- |
| Storage Type | Row-based | Text | Columnar | Columnar |
| Schema | Yes (JSON-defined) | No | Yes | Yes |
| Compression | Yes | Poor | Excellent | Excellent |
| Best For | Streaming, Kafka | Debugging, APIs | Analytics, OLAP | Analytics, OLAP |
| Human Readable | No | Yes | No | No |

So Avro is ideal for data interchange (Kafka, APIs, log pipelines), while Parquet/ORC are better for data storage and querying.


🧩 Summary

Apache Avro = Schema + Compact Binary + Evolution Support. It's designed for:

  • Cross-language data exchange
  • Streaming and message serialization
  • Schema evolution without breaking systems

And that's why it's a favorite in modern data pipelines, especially with Kafka, Redpanda, Spark, and Schema Registry.


Caveats While Using Avro


🧩 Caveat 1:

“The schema used for writing the data and the schema expected by the reading application must be compatible.”

๐Ÿ” What this means๐Ÿ”—

When Avro serializes (writes) and deserializes (reads) data, both sides must agree on the structure of the data โ€” at least enough to interpret the fields correctly.

If the producer (writer) and consumer (reader) use incompatible schemas, the data canโ€™t be decoded properly.


🧠 Example

Writer's Schema (v1)

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}

Reader's Schema (v2)

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "full_name", "type": "string"}
  ]
}

Here, the field name changed from "name" to "full_name".

💥 Result: not compatible. Avro won't know how to map "name" to "full_name", so the reader will fail to interpret the data.


✅ Compatible Evolution Example

If we add a new optional field, Avro can handle that gracefully.

Writer's Schema (old)

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}

Reader's Schema (new)

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

🟢 Compatible: the new field email has a default value, so old data still works fine.


⚖️ Compatibility Rules (Simplified)

Avro defines several compatibility types:

| Compatibility Type | Meaning |
| --- | --- |
| Backward | New readers can read old data. |
| Forward | Old readers can read new data. |
| Full | Both backward and forward. |
| Breaking | Schema changes that break both directions (e.g., removing required fields). |

You can configure which rule to enforce in systems like the Confluent Schema Registry.
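
For example, the compatibility rule can be set per subject through the registry's REST config endpoint. A hedged sketch using Python's requests library; the URL and subject name are placeholders:

# Set the compatibility level for one subject via the Schema Registry REST API
import requests

registry_url = "http://schema-registry-host:8081"
subject = "users-avro-value"   # by convention, topic name + "-value"

resp = requests.put(
    f"{registry_url}/config/{subject}",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"compatibility": "BACKWARD"},   # or FORWARD, FULL, NONE, *_TRANSITIVE
)
resp.raise_for_status()
print(resp.json())   # {'compatibility': 'BACKWARD'}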


🧩 Caveat 2:

“The deserializer will need access to the schema that was used when writing the data, even when it is different from the schema expected by the application that accesses the data.”

๐Ÿ” What this means๐Ÿ”—

To read Avro data, you always need:

  • The writer's schema (used when the data was created)
  • The reader's schema (used by your application)

Avro uses both together to resolve differences.

If your application only knows its own schema (the reader schema) but not the original one, it can't interpret the binary data, because Avro binary doesn't include field names, just encoded values.


🧠 Analogy

Think of Avro data as a compressed shopping list:

  • The writer's schema says the order and type of the items: 1️⃣ "Milk" → string, 2️⃣ "Eggs" → int
  • The data only stores the values: "Amul", 12

If the reader doesn't know the original order (the schema), it can't tell which value corresponds to which field.
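
A small Python sketch of this, again assuming fastavro and reusing the User schema from Caveat 1: raw "schemaless" Avro bytes contain no field names, so decoding only works if the reader supplies the writer's schema.

# Raw Avro binary needs the writer's schema to be decoded
from io import BytesIO
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
})

buf = BytesIO()
fastavro.schemaless_writer(buf, writer_schema, {"id": 1, "name": "Alice"})
print(buf.getvalue())   # a handful of bytes, no field names anywhere

# Decoding works only because we pass the writer's schema back in
buf.seek(0)
print(fastavro.schemaless_reader(buf, writer_schema))   # {'id': 1, 'name': 'Alice'}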


🧩 Caveat 3:

“In Avro files, the writing schema is included in the file itself, but there is a better way to handle this for Kafka messages.”

๐Ÿ” What this means๐Ÿ”—

When Avro is used for files (like on HDFS or S3), the schema is embedded inside the file header. That's fine for static data: every file carries its own schema.

But in Kafka, embedding the full schema inside every message would be inefficient:

  • Each message might be just a few KB,
  • But the schema (JSON) could be several hundred bytes.

That would cause massive duplication and network overhead.


🧠 The “Better Way” (Hinted at in the text)

Instead of embedding schemas in every Kafka message, systems like Confluent Schema Registry store schemas centrally.

Here's how it works:

  1. When a producer sends a message:
     • It registers its schema once with the Schema Registry.
     • It gets back a schema ID (like 42).
     • It sends the binary Avro data prefixed with that schema ID.
  2. When a consumer reads a message:
     • It looks up schema ID 42 in the Schema Registry.
     • It retrieves the writer's schema and deserializes the message correctly.

This way:

  • ✅ Small message sizes
  • ✅ Centralized schema management
  • ✅ Schema evolution tracking
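
A hedged sketch of that flow using the SchemaRegistryClient from the confluent_kafka package (URL and subject name are placeholders, error handling omitted):

# Register a schema once, get its ID, and later resolve the ID back to the schema
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://schema-registry-host:8081"})

user_schema_str = """
{
  "type": "record", "name": "User", "namespace": "example.avro",
  "fields": [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}]
}
"""

# Producer side: register once and receive a small integer ID
schema_id = client.register_schema("users-avro-value", Schema(user_schema_str, "AVRO"))
print("schema id:", schema_id)

# Consumer side: resolve the ID back to the writer's schema
writer_schema = client.get_schema(schema_id)
print(writer_schema.schema_str)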

๐Ÿ—๏ธ Summary Table๐Ÿ”—

| Concept | In Avro Files | In Kafka Messages |
| --- | --- | --- |
| Where is the schema stored? | Embedded inside the file | Stored in Schema Registry |
| Message overhead | Higher (includes schema) | Lower (schema ID only) |
| Schema evolution | File-based | Centrally managed (versioned) |
| Typical use case | Batch systems (HDFS, S3, Hive) | Streaming systems (Kafka, Redpanda) |

🧩 In Short

| Caveat | Meaning | Solution |
| --- | --- | --- |
| 1️⃣ Writer/reader schemas must be compatible | Data won't deserialize if incompatible | Follow Avro compatibility rules |
| 2️⃣ Reader needs the writer's schema | Required to decode the binary data | Include the schema in the file or fetch it from a registry |
| 3️⃣ Don't embed the schema per message in Kafka | Wasteful and redundant | Use Schema Registry (messages carry schema IDs) |

Demo Producer Code with Schema Registry in Python

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Avro value schema as a JSON string (could also be loaded from a .avsc file)
user_schema_str = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

# Keys are plain strings here, but AvroProducer still needs an Avro schema for them
key_schema_str = '{"type": "string"}'

# Kafka + Schema Registry configuration (example placeholders)
conf = {
    'bootstrap.servers': 'kafka-broker1:9092,kafka-broker2:9092',
    'schema.registry.url': 'http://schema-registry-host:8081',
    # If the registry needs auth, add 'schema.registry.basic.auth.user.info': 'user:password'
}

# AvroProducer takes the parsed Avro schemas used for the value and (optionally) the key
producer = AvroProducer(
    conf,
    default_key_schema=avro.loads(key_schema_str),
    default_value_schema=avro.loads(user_schema_str),
)

topic = "users-avro"

def produce_user(user_obj):
    # The key is the user id as a string; the value is the full record
    key = str(user_obj["id"])
    producer.produce(topic=topic, value=user_obj, key=key)
    producer.flush()   # flush per message for the demo; batch/async flush in production

if __name__ == "__main__":
    user = {"id": 1, "name": "Alice", "email": "alice@example.com"}
    produce_user(user)
    print("Sent Avro message to", topic)

Demo Producer Code In Java

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("value.serializer",
    "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", schemaUrl);   // e.g. "http://localhost:8081"

String topic = "customerContacts";
Producer<String, Customer> producer = new KafkaProducer<>(props);

// Keep producing new events until someone hits Ctrl-C
while (true) {
    Customer customer = CustomerGenerator.getNext();
    System.out.println("Generated customer " + customer.toString());
    ProducerRecord<String, Customer> record =
        new ProducerRecord<>(topic, customer.getName(), customer);
    producer.send(record);
}

🧩 The Big Picture

This Java code is an Avro Kafka producer. It:

  1. Connects to Kafka.
  2. Uses Confluent's Avro serializers.
  3. Generates Avro objects (Customer records).
  4. Sends them continuously to a Kafka topic.

🧱 Step-by-step Explanation

1️⃣ Create configuration properties

Properties props = new Properties();

👉 Properties is a Java Map-like object that stores key-value configuration settings for the Kafka producer.


2️⃣ Configure Kafka connection and serialization

props.put("bootstrap.servers", "localhost:9092");

🔹 bootstrap.servers tells the producer where the Kafka brokers are located. The producer will connect to this address to find the rest of the cluster.


props.put("key.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

🔹 These lines configure how keys and values are converted into bytes before being sent over the network.

  • Both the key and value use the Confluent Avro serializer.
  • The serializer:
    • Converts Avro objects (like Customer) into Avro binary format.
    • Registers the Avro schema with the Schema Registry.
    • Sends the schema ID (a small integer) along with the serialized message.

So the consumer can later retrieve the schema and deserialize properly.


props.put("schema.registry.url", schemaUrl);

🔹 This tells the serializer where the Schema Registry is (e.g., http://localhost:8081).

The serializer uses this URL to:

  • Check if the schema for Customer already exists in the registry.
  • Register it if not.
  • Retrieve schema IDs for encoding messages.

3️⃣ Create the Kafka topic variable

String topic = "customerContacts";

Just defines the target Kafka topic to send messages to.


4️⃣ Create the producer instance

Producer<String, Customer> producer = new KafkaProducer<>(props);

🔹 This creates the Kafka producer client.

  • Producer<K, V> is a generic interface parameterized by key and value types. Here:
    • K = String (the customer name, used as the key)
    • V = Customer (the Avro object)

The producer will use the serializers defined earlier to encode these before sending.


5️⃣ Generate and send Avro messages in a loop

while (true) {
    Customer customer = CustomerGenerator.getNext();
    System.out.println("Generated customer " + customer.toString());
    ProducerRecord<String, Customer> record =
        new ProducerRecord<>(topic, customer.getName(), customer);
    producer.send(record);
}

Let's break this down line by line.


🧠 Customer customer = CustomerGenerator.getNext();

This uses a helper class CustomerGenerator (not shown here) that probably:

  • Creates random or mock customer data.
  • Returns a Customer object (which is an Avro-generated Java class).

🖨️ System.out.println(...)

Just logs the generated customer to the console for visibility.


🧾 ProducerRecord<String, Customer> record = new ProducerRecord<>(topic, customer.getName(), customer);

This creates the message record to be sent to Kafka.

  • The topic: "customerContacts".
  • The key: customer.getName().
  • The value: the entire Customer Avro object.

Kafka uses the key to determine which partition the message goes to (same key → same partition).


🚀 producer.send(record);

This sends the record asynchronously to Kafka.

The Avro serializer will:

  1. Serialize Customer into binary Avro.
  2. Register or look up the schema in Schema Registry.
  3. Encode the schema ID + binary payload.
  4. Send the message to the Kafka broker.

🌀 while (true) {...}

Runs indefinitely, producing customer messages until you stop it with Ctrl + C.

In a real-world app, you might:

  • Add a Thread.sleep() for pacing,
  • Limit message count, or
  • Gracefully close the producer with producer.close().

🧩 What the Customer class is

This class is likely generated automatically from an Avro schema using the Avro Java code generator.

For example, the Avro schema could look like:

{
  "namespace": "example.avro",
  "type": "record",
  "name": "Customer",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "phoneNumber", "type": ["null", "string"], "default": null}
  ]
}

After generating Java classes from this schema (avro-maven-plugin or avro-tools), you can use the Customer object directly in code; the Avro serializer knows how to handle it.


🧠 Summary Table

| Line | What it does | Why it matters |
| --- | --- | --- |
| bootstrap.servers | Broker list | Connects to the Kafka cluster |
| key.serializer / value.serializer | Avro serializers | Convert Java objects to bytes |
| schema.registry.url | Registry URL | Store and retrieve schemas |
| KafkaProducer<>(props) | Creates the producer | Main client that talks to Kafka |
| CustomerGenerator.getNext() | Generates an Avro object | Produces mock data |
| new ProducerRecord<>(...) | Wraps the message | Defines topic, key, and value |
| producer.send(record) | Sends asynchronously | Pushes data to Kafka |
| while (true) | Infinite loop | Keeps producing |

⚙️ What Happens Behind the Scenes

  1. You produce a Customer Avro object.
  2. Serializer looks up (or registers) its schema in Schema Registry.
  3. Schema Registry returns a unique schema ID.
  4. Producer encodes the message as:

[Magic Byte 0][Schema ID (4 bytes)][Avro serialized data]

  5. Kafka stores the message.
  6. The consumer later reads the message → uses the schema ID → fetches the schema → deserializes back to a Customer object.
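
A tiny Python sketch of splitting that wire format apart; the payload bytes below are made up, only the 5-byte header layout matters.

# Split a Confluent-framed message: 1 magic byte (0), 4-byte big-endian schema ID, Avro payload
import struct

def split_confluent_message(raw: bytes):
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != 0:
        raise ValueError("Not Confluent wire format")
    avro_payload = raw[5:]
    return schema_id, avro_payload

# Example: a message whose schema ID is 42, followed by (fake) Avro bytes
schema_id, payload = split_confluent_message(b"\x00\x00\x00\x00\x2a" + b"\x02\x0aAlice")
print(schema_id)   # 42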


🧩 In Simple Terms

This Java code:

Continuously generates random customer data, serializes each record in Avro format using Confluent Schema Registry, and sends it to a Kafka topic named customerContacts.


The Customer class is not a regular Plain Old Java Object but a specialized one generated from an Avro schema. We can generate these classes using avro-tools.jar or the Avro Maven plugin.