Data-format deep dive - Avro, CSV, JSON, Parquet - and how to tune Avro for performance
1) Quick comparison (at-a-glance)
Format | Schema | Layout | Splittable | Best for | Pros | Cons |
---|---|---|---|---|---|---|
CSV | none (ad-hoc) | row text | yes (if uncompressed / splittable compression) | simple exports, small datasets, ad-hoc loads | human readable, simple | no types, ambiguous parsing, large on disk, slow for big data, no nested types |
JSON / NDJSON | implicit (can infer) | row text (JSON objects) | yes (if newline-delimited + splittable compression) | logs, event transport, semi-structured data | flexible, supports nesting | heavier than binary, slow parsing, schema inference costly |
Avro | explicit schema in file | row-oriented binary (blocks) | yes (block synchronization markers make it splittable) | streaming / message formats, ingest layer, schema evolution | compact binary, schema evolution, fast serialization/deserialization, splittable, compressible | row format (not ideal for OLAP), less efficient column pruning than Parquet |
Parquet | schema (stored in file metadata) | columnar | yes | analytics, OLAP, read-heavy workloads | excellent compression & IO for columnar reads, predicate pushdown, column pruning | heavier write cost, not ideal for single-record streaming |
2) More detail on each format
CSV
- Plain text, each row is a line; columns separated by delimiter.
- No typed schema: everything is a string until you parse it; date/number parsing is ambiguous. (A Spark read with an explicit schema is sketched after this list.)
- Compression: you can gzip/bzip2 the files, but gzip is not splittable, so parallel reads on S3/HDFS are limited; bzip2 is splittable but slow.
- Use when: small/medium tabular exports, exchange with tools that expect CSV. Avoid as a canonical lake format.
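A minimal Spark sketch of the explicit-schema approach, assuming a hypothetical comma-delimited orders export with a header row (the path and column names are illustrative, and `spark` is an existing SparkSession as in the other snippets below):

```scala
import org.apache.spark.sql.types._

// Hypothetical column layout; adjust names/types to the actual export.
val ordersCsvSchema = StructType(Seq(
  StructField("order_id", LongType, nullable = false),
  StructField("amount", DoubleType, nullable = true),
  StructField("order_ts", TimestampType, nullable = true)
))

val csvDF = spark.read
  .option("header", "true")   // first line holds column names
  .option("delimiter", ",")   // be explicit rather than relying on defaults
  .schema(ordersCsvSchema)    // typed columns, no inference pass
  .csv("/lake/landing/orders_csv/")
```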
JSON (and NDJSON / JSON Lines)
- Supports nested records/arrays; NDJSON (one JSON object per line) is the scalable variant.
- Schema inference is slower and brittle; in Spark, pass an explicit schema to avoid a full scan of the files (see the snippet after this list).
- Multiline JSON (arrays) is expensive and should be avoided for large datasets.
- Good for flexible ingestion / logs / event vocabularies, but heavy for analysis.
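A short sketch of reading NDJSON with an explicit schema; the field names and paths are made up for illustration:

```scala
import org.apache.spark.sql.types._

// Hypothetical event shape; adjust to the real payload.
val eventSchema = StructType(Seq(
  StructField("event_id", StringType, nullable = false),
  StructField("ts", TimestampType, nullable = true),
  StructField("attributes", MapType(StringType, StringType), nullable = true)
))

// NDJSON: one object per line, splittable, and no inference scan thanks to the schema.
val eventsDF = spark.read.schema(eventSchema).json("/lake/landing/events_ndjson/")

// Multiline JSON (whole-file arrays) needs the multiLine option and is read per file:
// spark.read.option("multiLine", "true").schema(eventSchema).json("/lake/landing/events_array/")
```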
Avro (focus: internals and strengths)
- Binary, row-oriented format. Each file contains a header with the writer schema, then a series of blocks (each holding many records), with each block followed by a sync marker. Because compression happens per block and sync markers mark block boundaries, Avro container files are splittable and suitable for parallel processing.
- Schema is stored in the file header, which enables robust schema evolution (writer/reader schema resolution, default values, etc.). Great for message buses (Kafka) and ingest pipelines; a reader/writer resolution sketch follows this list.
- Compression: Avro supports block-level codecs (null/deflate/snappy/bzip2; many ecosystems also support xz/zstandard). Codec support depends on the library/runtime. In Spark you can control the codec via the `compression` option or `spark.sql.avro.compression.codec`.
- When to use Avro: streaming/ingest, durable event storage, and anywhere schema evolution and small-record serialization speed matter.
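A minimal sketch of writer/reader schema resolution with the Avro Java API, assuming a hypothetical `Order` record and an `orders.avro` file written before the `currency` field existed; the default value fills in the missing field on read:

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// Hypothetical reader schema: adds a field with a default; old files are resolved against it.
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[
    |  {"name":"order_id","type":"long"},
    |  {"name":"amount","type":"double"},
    |  {"name":"currency","type":"string","default":"USD"}
    |]}""".stripMargin)

// The writer schema is taken from the file header; records are resolved to readerSchema.
val datumReader = new GenericDatumReader[GenericRecord](readerSchema)
val fileReader  = new DataFileReader[GenericRecord](new File("orders.avro"), datumReader)
while (fileReader.hasNext) {
  val rec = fileReader.next()
  println(rec.get("currency"))   // "USD" for records written without the field
}
fileReader.close()
```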
Parquet
- Columnar storage: data is stored as column chunks, giving huge IO savings when queries touch only a few columns. Excellent compression (dictionary encoding, bit-packing) and predicate pushdown. Best format for analytical queries and BI workloads; see the read example just below. ([Upsolver][3])
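A quick illustration of why this matters on the read side, using hypothetical `dt` and `amount` columns on the curated path from the worked example in section 6 (only the two referenced columns are fetched, and the filter can be pushed down):

```scala
import org.apache.spark.sql.functions.{col, sum}

// Only the two referenced column chunks are read; row groups whose min/max
// statistics exclude the date can be skipped entirely.
val dailyRevenue = spark.read.parquet("/lake/curated/orders_parquet/")
  .select("dt", "amount")
  .filter(col("dt") === "2024-01-15")
  .groupBy("dt")
  .agg(sum("amount").as("revenue"))
```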
3) Why Avro vs Parquet (practical pattern)
- Use Avro for ingestion, event logs, Kafka messages, or when schema evolution and fast row serialization are required.
- Convert Avro -> Parquet for analytics and dashboards (Parquet's columnar layout + predicate pushdown + column pruning gives much faster query times and lower IO).
4) Tuning Avro for performance: principles + actionable steps
Below are the levers you can pull when you write/produce Avro files and when you read them (Spark/Hadoop/Java examples included).
A. Choose the right compression codec
Tradeoffs: CPU cost vs compression ratio vs read speed.
- snappy: fast compression/decompression, moderate ratio; generally best for throughput-sensitive pipelines.
- deflate (zlib): better compression ratio; adjustable level (higher levels cost more CPU).
- zstandard / xz / bzip2: higher compression ratio but more CPU; zstd is often a good space/speed compromise if available.
Recommendation: start with snappy for ingestion pipelines; consider deflate or zstd for archival data where IO cost is important. In Spark you can configure the codec via `spark.sql.avro.compression.codec` or `option("compression", "<codec>")`. ([Apache Spark][5])
Spark example
// Option 1: set global conf
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
spark.conf.set("spark.sql.avro.deflate.level", "6") // if using deflate
// Then write
df.write.format("avro").mode("overwrite").save("/mnt/data/orders.avro")
// Or explicitly (option is supported too)
df.write.format("avro").option("compression", "snappy").save("/mnt/data/orders.avro")
(If the `compression` option is not set, `spark.sql.avro.compression.codec` is used.)
B. Tune the block / sync interval (the Avro "block size")
- Avro groups many records into a block and writes a sync marker after each block. Larger blocks mean fewer markers, better compression (more data per compression call), and fewer IO operations, but also more memory used during compression and a larger buffer when re-reading a block.
- The Avro API exposes `setSyncInterval(...)` on the `DataFileWriter` so you can adjust the number of bytes per block. Suggested values historically range from a few KB up to a few MB; Avro's docs recommend values between ~2KB and 2MB, and a commonly used default was increased to 64KB for better compression. ([Apache Avro][6])
Java/Scala example (low-level writer)
import org.apache.avro.Schema
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord}
// illustrative schema; replace with your own
val schema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Order","fields":[{"name":"order_id","type":"long"}]}""")
val datumWriter = new GenericDatumWriter[GenericRecord](schema)
val dfw = new DataFileWriter[GenericRecord](datumWriter)
dfw.setCodec(CodecFactory.snappyCodec()) // choose codec (set before create)
dfw.setSyncInterval(64 * 1024) // e.g. 64KB block size, in bytes
dfw.create(schema, new java.io.File("orders.avro"))
// append records here, e.g. dfw.append(record)
dfw.close()
(Use larger sync interval for higher throughput/space efficiency; use smaller interval for lower memory footprint and faster single-record availability.)
C. File size and partitioning
- Avoid tiny files: small-file overhead kills performance on distributed filesystems and object stores. Aim for hundreds of MB per file (common heuristic: 128MB-1GB, depending on workload and compute). For Avro ingestion, batch/compact files to reasonable sizes.
- Partition on query-selective columns of low-to-moderate cardinality (date/year/month, region) to enable pruning on read; a write sketch follows this list.
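A write-side sketch under the assumptions of the worked example in section 6 (a `dfIngest` DataFrame with a `dt` column; the path and codec are illustrative). Repartitioning by the partition column before the write means each date directory ends up with one larger file instead of one small file per input task:

```scala
import org.apache.spark.sql.functions.col

dfIngest
  .repartition(col("dt"))            // all rows for a given dt land in one task
  .write
  .format("avro")
  .option("compression", "snappy")
  .partitionBy("dt")                 // directory-per-date layout enables pruning on read
  .mode("append")
  .save("/lake/landing/orders/")
```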
D. Schema design and types
- Prefer primitive types where possible (ints/longs/dates) instead of stringly-typed fields. Use Avro logical types (date/timestamp/decimal) appropriately to avoid costly conversions; a schema sketch follows this list.
- Avoid very deep nesting or very wide records if your queries often need only a small subset of fields - Avro is row-oriented, so you pay to read full rows. For analytics consider storing the same data in Parquet.
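A hypothetical order schema showing logical types in place of stringly-typed fields (the record name, fields, and decimal precision are illustrative):

```scala
import org.apache.avro.Schema

val orderSchema = new Schema.Parser().parse(
  """{
    |  "type": "record", "name": "Order", "namespace": "example.sales",
    |  "fields": [
    |    {"name": "order_id",   "type": "long"},
    |    {"name": "order_date", "type": {"type": "int",   "logicalType": "date"}},
    |    {"name": "created_at", "type": {"type": "long",  "logicalType": "timestamp-micros"}},
    |    {"name": "amount",     "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2}}
    |  ]
    |}""".stripMargin)
```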
E. Read-time optimizations in Spark
- Partition pruning (partition columns) is your friend; Avro does not give the same column pruning benefits as Parquet, so pruning by file layout is essential.
- Use predicate pushdown only where supported by the engine - Parquet has stronger pushdown/pruning. For Avro rely on partitioning + filter pushdown at file level when possible.
- Avoid schema inference on read: pass explicit schemas to `spark.read.format("avro").schema(mySchema).load(...)` to speed parsing, as in the sketch below.
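A read-side sketch (the schema and path are illustrative; `dt` is the partition column from the ingest layout):

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Keep a StructType alongside the Avro schema so reads never infer.
val orderReadSchema = StructType(Seq(
  StructField("order_id", LongType, nullable = false),
  StructField("amount", DoubleType, nullable = true),
  StructField("dt", StringType, nullable = true)
))

val ordersDF = spark.read
  .format("avro")
  .schema(orderReadSchema)             // no inference pass over the files
  .load("/lake/landing/orders/")
  .filter(col("dt") === "2024-01-15")  // prunes partition directories before reading blocks
```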
F. Compaction + lifecycle
- For continuous ingestion (many small Avro files), run periodic compaction jobs to merge small files into larger Avro files with tuned block sizes and codec settings. This reduces metadata overhead and improves read throughput; a sketch follows.
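A minimal compaction sketch, assuming the day-partitioned landing layout used elsewhere in this section (the paths, date, and target file count are illustrative):

```scala
// Rewrite one day's many small Avro files as a handful of larger ones.
spark.conf.set("spark.sql.avro.compression.codec", "snappy")

val smallFiles = spark.read.format("avro").load("/lake/landing/orders/dt=2024-01-15/")

smallFiles
  .coalesce(4)   // choose a count that yields roughly 128MB-1GB per output file
  .write
  .format("avro")
  .mode("overwrite")
  .save("/lake/landing/orders_compacted/dt=2024-01-15/")

// Swap or repoint the directory after validating the compacted output.
```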
5) Concrete tuning checklist: quick actions you can apply now
- Codec: use `snappy` for throughput; `deflate`/`zstd` for smaller archive sizes. (Spark: `spark.conf.set("spark.sql.avro.compression.codec", "snappy")`.)
- Block size / sync interval: increase from the default to e.g. 64KB-1MB (measure CPU/memory). Use `DataFileWriter.setSyncInterval(...)` for low-level writers.
- File size: avoid tiny files; compact to 128MB-1GB per file.
- Partitioning: partition on query-selective columns so reads can prune files.
- Schema: provide explicit schemas at read time; use logical types for decimals/timestamps.
- Convert to Parquet for analytics: ingest with Avro, transform to Parquet for OLAP workloads. Parquet gives column pruning and faster scans.
6) Small worked example: ingest Avro, then convert to Parquet (Spark)
// 1) write Avro (ingest)
spark.conf.set("spark.sql.avro.compression.codec", "snappy")
dfIngest.write.format("avro")
.mode("append")
.save("/lake/landing/orders/")
// (compaction step can merge small Avro files here)
// 2) convert to Parquet for analytics (optimised)
val avroDF = spark.read.format("avro").load("/lake/landing/orders/")
avroDF.write.mode("overwrite").partitionBy("dt").parquet("/lake/curated/orders_parquet/")
This pattern gives you the ingest-friendly Avro layer + analytics-friendly Parquet layer.
7) Limitations / when Avro tuning is not enough
- If analytic queries require selecting a few columns from very wide rows, Avro will still read whole rows; use Parquet/ORC.
- If your workloads are read-heavy and interactive, convert to columnar formats and use appropriate file sizes and clustering.