1. What is Parquet?πŸ”—

Apache Parquet is a columnar storage file format designed for:

  • Big data processing
  • Analytical workloads (OLAP)
  • Efficient compression and query performance

Instead of storing data row by row, it stores data column by column.


2. Row vs Column Storage (Core Idea)πŸ”—

Row-based (CSV, JSON)πŸ”—

Row1: id, name, age
Row2: id, name, age

Columnar (Parquet)πŸ”—

Column id   β†’ [1, 2, 3]
Column name β†’ [A, B, C]
Column age  β†’ [20, 25, 30]
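The difference can be sketched in plain Python, using a hypothetical three-record dataset (the `id`/`name`/`age` values above):

```python
# Toy dataset: three records, laid out row-wise (like CSV/JSON).
rows = [
    {"id": 1, "name": "A", "age": 20},
    {"id": 2, "name": "B", "age": 25},
    {"id": 3, "name": "C", "age": 30},
]

# Columnar layout (like Parquet): one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns["age"])  # the whole age column in one place: [20, 25, 30]
```

Reading one column now touches one list, not every record.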

3. Why Columnar MattersπŸ”—

3.1 Reads only required columnsπŸ”—

df.select("name")

Parquet reads only the name column, not the entire row.


3.2 Better compressionπŸ”—

  • Same column β†’ similar data
  • Compression algorithms work better

Example:

age: 25, 25, 25, 25 β†’ highly compressible

3.3 Faster aggregationπŸ”—

SELECT avg(age)

Only age column scanned β†’ faster


4. Parquet File Structure (Deep Dive)πŸ”—

A Parquet file is not just β€œdata”; it has a hierarchical structure:

File
 β”œβ”€β”€ Row Groups
 β”‚     └── Column Chunks
 β”‚           └── Pages
 └── Footer (Metadata)

4.1 Row GroupπŸ”—

  • Horizontal partition of data
  • Contains multiple rows
  • Typical size: 128 MB (configurable)

Each row group is processed independently β†’ parallelism


4.2 Column ChunkπŸ”—

Inside each row group:

  • Data is stored column-wise

Example:

Row Group 1:
   id column chunk
   name column chunk
   age column chunk

4.3 Pages (smallest unit)πŸ”—

Each column chunk is split into pages:

  • Data pages
  • Dictionary pages

This enables:

  • Fine-grained reading
  • Compression

4.4 Footer (Metadata)πŸ”—

Stored at the end of the file.

Contains:

  • Schema
  • Column metadata
  • Statistics (min, max, null count)
  • Row group locations

Spark reads the footer first to decide:

  • What to scan
  • What to skip

5. Predicate Pushdown (Huge Advantage)πŸ”—

Example:

df.filter("age > 30")

Parquet metadata contains:

RowGroup1: age min=10, max=20
RowGroup2: age min=25, max=60

Spark skips RowGroup1 entirely, because its max age (20) can never satisfy age > 30.

This is called predicate pushdown.
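The skip decision can be sketched in a few lines, assuming toy per-row-group min/max statistics like the ones above (the field names here are illustrative, not Parquet's actual footer schema):

```python
# Hypothetical row-group statistics, as a Parquet footer would store them.
row_group_stats = [
    {"name": "RowGroup1", "age_min": 10, "age_max": 20},
    {"name": "RowGroup2", "age_min": 25, "age_max": 60},
]

def groups_to_scan(stats, threshold):
    """Keep only row groups whose age range could satisfy age > threshold."""
    return [g["name"] for g in stats if g["age_max"] > threshold]

print(groups_to_scan(row_group_stats, 30))  # ['RowGroup2']
```

RowGroup1 is never read from disk; only its footer statistics were consulted.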


6. Encoding Techniques in ParquetπŸ”—

Parquet uses smart encoding before compression:

6.1 Dictionary EncodingπŸ”—

name column:
["A", "A", "B", "A"]

Dictionary:
A β†’ 0
B β†’ 1

Stored as:
[0, 0, 1, 0]

6.2 Run Length Encoding (RLE)πŸ”—

[25, 25, 25, 25] β†’ (25, count=4)

6.3 Bit PackingπŸ”—

  • Uses minimal bits for integers

These encodings reduce size before the compression codec even runs.
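All three techniques can be sketched with the stdlib, on the same toy columns used above (a minimal illustration, not Parquet's actual on-disk encoding):

```python
from itertools import groupby

# 6.1 Dictionary encoding: replace repeated strings with small integer codes.
names = ["A", "A", "B", "A"]
dictionary = {v: i for i, v in enumerate(dict.fromkeys(names))}  # {'A': 0, 'B': 1}
encoded = [dictionary[v] for v in names]                         # [0, 0, 1, 0]

# 6.2 Run-length encoding: collapse runs of identical values.
ages = [25, 25, 25, 25]
rle = [(value, len(list(run))) for value, run in groupby(ages)]  # [(25, 4)]

# 6.3 Bit packing: codes 0..1 need only 1 bit each, not a full 32-bit int.
bits_needed = max(encoded).bit_length()                          # 1
```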


7. Compression in ParquetπŸ”—

Common codecs:

  • Snappy (default in Spark)
  • Gzip
  • LZO
  • ZSTD

Why compression works well:πŸ”—

  • Columnar + encoding β†’ highly compressible

8. How Spark Uses ParquetπŸ”—

8.1 Schema inferenceπŸ”—

spark.read.parquet("path")

No extra inference job is needed; the schema is stored in the footer.


8.2 Column pruningπŸ”—

df.select("name")

Only reads required columns


8.3 Predicate pushdownπŸ”—

df.filter("age > 30")

Uses footer stats β†’ skips data


8.4 Partition pruning (directory level)πŸ”—

/data/year=2025/month=04/

Spark skips entire folders


9. Why Parquet is Faster than CSV/JSONπŸ”—

Feature              CSV/JSON     Parquet
Storage              Row          Column
Compression          Poor         Excellent
Schema               Not stored   Stored
Column pruning       No           Yes
Predicate pushdown   No           Yes
Read speed           Slow         Fast

10. Real Example (Spark Execution)πŸ”—

df = spark.read.parquet("data")

df.filter("age > 30").select("name").show()

What Spark actually does:πŸ”—

  1. Reads the footer
  2. Identifies that only the name and age columns are needed
  3. Applies predicate pushdown on age
  4. Skips unnecessary row groups
  5. Reads only the required column chunks
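The steps above can be sketched end to end, assuming toy footer metadata (field names are illustrative):

```python
# Hypothetical footer: schema plus per-row-group statistics.
footer = {
    "columns": ["id", "name", "age"],
    "row_groups": [
        {"offset": 0,   "age_min": 10, "age_max": 20},
        {"offset": 128, "age_min": 25, "age_max": 60},
    ],
}

# Column pruning: the query only touches name and age.
needed_columns = ["name", "age"]

# Predicate pushdown: drop row groups that cannot contain age > 30.
scanned = [g for g in footer["row_groups"] if g["age_max"] > 30]

# Only the name/age chunks of the surviving row group are actually read.
```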

11. Key Interview PointsπŸ”—

11.1 Why is Parquet columnar?πŸ”—

β†’ Improves read efficiency for analytics


11.2 What is Row Group?πŸ”—

β†’ Horizontal partition enabling parallel reads


11.3 What does the footer contain?πŸ”—

β†’ Schema + metadata + statistics


11.4 Why no inferSchema job?πŸ”—

β†’ Schema already stored


11.5 Why is Parquet efficient?πŸ”—

β†’ Column pruning + predicate pushdown + compression