1. What is Parquet?
Apache Parquet is a columnar storage file format designed for:
- Big data processing
- Analytical workloads (OLAP)
- Efficient compression and query performance
Instead of storing data row by row, it stores data column by column.
2. Row vs Column Storage (Core Idea)
Row-based (CSV, JSON)
All values of a record are stored together, one record after another.
Columnar (Parquet)
All values of a single column are stored together, column after column.
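A minimal pure-Python sketch of the two layouts (the records and field names are made up for illustration):

```python
# Hypothetical records; names and fields are illustrative only.
rows = [
    {"name": "Alice", "age": 34, "city": "Paris"},
    {"name": "Bob",   "age": 28, "city": "Lyon"},
    {"name": "Cara",  "age": 41, "city": "Nice"},
]

# Row-based layout (CSV/JSON style): one whole record after another.
row_storage = [list(r.values()) for r in rows]

# Columnar layout (Parquet style): all values of one column together.
column_storage = {key: [r[key] for r in rows] for key in rows[0]}

print(row_storage[1])         # one record: ['Bob', 28, 'Lyon']
print(column_storage["age"])  # one column: [34, 28, 41]
```

Same data, different grouping: fetching one record is cheap in the first layout, fetching one column is cheap in the second.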
3. Why Columnar Matters
3.1 Reads only required columns
When a query needs only a subset of columns, Parquet reads just those columns, not entire rows.
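Column pruning can be sketched in plain Python over a toy columnar store (column names and values are hypothetical):

```python
# Columnar data for a hypothetical table.
columns = {
    "name":  ["Alice", "Bob", "Cara"],
    "age":   [34, 28, 41],
    "city":  ["Paris", "Lyon", "Nice"],
    "email": ["a@x.io", "b@x.io", "c@x.io"],
}

def read_columns(columns, wanted):
    """Column pruning: touch only the columns the query asks for."""
    return {c: columns[c] for c in wanted}

projected = read_columns(columns, ["name", "age"])
print(projected)  # 'city' and 'email' are never read
```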
3.2 Better compression
- Same column → similar data
- Compression algorithms work better
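A rough stdlib demonstration of why adjacent similar values compress better than the same values interleaved with other fields (the data is synthetic, and `zlib` stands in for a real columnar codec like Snappy):

```python
import json
import random
import zlib

random.seed(0)
# A low-cardinality column; in columnar storage its values sit next to
# each other (here even sorted), which compressors exploit well.
countries = sorted(random.choice(["DE", "FR", "IT"]) for _ in range(10_000))

# Columnar layout: the column alone, similar values adjacent.
columnar_bytes = ",".join(countries).encode()

# Row layout: the same values interleaved with other fields.
rows = [{"id": i, "country": c} for i, c in enumerate(countries)]
row_bytes = json.dumps(rows).encode()

ratio_col = len(zlib.compress(columnar_bytes)) / len(columnar_bytes)
ratio_row = len(zlib.compress(row_bytes)) / len(row_bytes)
print(f"columnar: {ratio_col:.1%} of original, row: {ratio_row:.1%}")
```

The columnar bytes shrink to a far smaller fraction of their original size than the row-oriented bytes do.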
3.3 Faster aggregation
For an aggregate like AVG(age), only the age column is scanned → faster.
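In a columnar layout the aggregation touches a single contiguous list; a row scan would have to visit every field of every record. A tiny sketch with made-up values:

```python
# Toy columnar store; only the 'age' column is touched for AVG(age).
columns = {
    "name": ["Alice", "Bob", "Cara", "Dan"],
    "age":  [34, 28, 41, 37],
    "city": ["Paris", "Lyon", "Nice", "Metz"],
}

ages = columns["age"]            # one contiguous column
avg_age = sum(ages) / len(ages)  # no other column is read
print(avg_age)  # 35.0
```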
4. Parquet File Structure (Deep Dive)
A Parquet file is not just "data"; it has a hierarchical structure:

File
├── Row Groups
│   └── Column Chunks
│       └── Pages
└── Footer (Metadata)
4.1 Row Group
- Horizontal partition of data
- Contains multiple rows
- Typical size: 128 MB (configurable)
Each row group is processed independently → parallelism.
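The independence of row groups is what makes parallel reads possible: each group yields a partial result that depends on no other group. A simplified sketch (tiny group size and data chosen for illustration; a real engine would run the per-group work on separate tasks):

```python
# A column of 1..1000, split into "row groups" of 250 rows each.
ages = list(range(1, 1001))
GROUP_SIZE = 250  # tiny, for demonstration; Parquet groups are ~128 MB

row_groups = [ages[i:i + GROUP_SIZE] for i in range(0, len(ages), GROUP_SIZE)]

# Each (sum, count) pair depends only on its own row group,
# so the pairs could be computed in parallel and merged afterwards.
partials = [(sum(g), len(g)) for g in row_groups]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
print(total / count)  # 500.5
```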
4.2 Column Chunk
Inside each row group, data is stored column-wise: each column gets its own column chunk.
4.3 Pages (smallest unit)
Each column chunk is split into pages:
- Data pages
- Dictionary pages
This enables:
- Fine-grained reading
- Compression
4.4 Footer (Very Important)
Stored at the end of the file
Contains:
- Schema
- Column metadata
- Statistics (min, max, null count)
- Row group locations
Spark reads the footer first to decide:
- What to scan
- What to skip
5. Predicate Pushdown (Huge Advantage)
Suppose a query filters on age (e.g. WHERE age > 40). Parquet metadata contains min/max statistics for each row group. If a row group's statistics show it cannot contain matching rows (say, max(age) = 35), Spark skips that row group entirely.
This is called predicate pushdown.
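The skipping logic can be sketched with footer-style statistics (the group names and min/max values are hypothetical):

```python
# Per-row-group statistics, as a footer might record them.
row_group_stats = [
    {"id": "RowGroup1", "min_age": 21, "max_age": 35},
    {"id": "RowGroup2", "min_age": 30, "max_age": 58},
    {"id": "RowGroup3", "min_age": 44, "max_age": 67},
]

def groups_to_scan(stats, threshold):
    """Keep only row groups that *might* contain rows with age > threshold.

    A group whose max is at or below the threshold can be skipped
    without reading any of its data.
    """
    return [s["id"] for s in stats if s["max_age"] > threshold]

print(groups_to_scan(row_group_stats, 40))  # ['RowGroup2', 'RowGroup3']
```

Note the check is conservative: statistics can only prove a group is irrelevant, never that every row in it matches; the surviving groups are still filtered row by row.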
6. Encoding Techniques in Parquet
Parquet uses smart encoding before compression:
6.1 Dictionary Encoding
Repeated values are replaced with small integer codes pointing into a dictionary.
6.2 Run Length Encoding (RLE)
Consecutive repeated values are stored once with a count, instead of being written out each time.
6.3 Bit Packing
- Uses minimal bits for integers
These reduce size before compression.
7. Compression in Parquet
Common codecs:
- Snappy (default in Spark)
- Gzip
- LZO
- ZSTD
Why compression works well:
- Columnar layout + encoding → highly compressible
8. How Spark Uses Parquet
8.1 Schema inference
No extra job is needed: the schema is stored in the footer.
8.2 Column pruning
Only reads required columns
8.3 Predicate pushdown
Uses footer stats → skips data
8.4 Partition pruning (directory level)
Spark skips entire folders
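Partition pruning works on Hive-style directory names before any file is opened. A sketch with a hypothetical `sales` table partitioned by `year` (paths and the helper are illustrative, not Spark's actual API):

```python
# Hive-style partitioned layout: one directory per partition value.
paths = [
    "sales/year=2022/part-0.parquet",
    "sales/year=2023/part-0.parquet",
    "sales/year=2024/part-0.parquet",
]

def prune_partitions(paths, wanted_year):
    """Partition pruning: decide from the path alone which whole
    directories to skip, before reading any file contents."""
    keep = []
    for p in paths:
        # Extract the 'year=NNNN' segment from the path.
        seg = next(s for s in p.split("/") if s.startswith("year="))
        if int(seg.split("=")[1]) == wanted_year:
            keep.append(p)
    return keep

print(prune_partitions(paths, 2024))  # only the year=2024 folder is read
```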
9. Why Parquet is Faster than CSV/JSON
| Feature | CSV/JSON | Parquet |
|---|---|---|
| Storage | Row | Column |
| Compression | Poor | Excellent |
| Schema | Not stored | Stored |
| Column pruning | No | Yes |
| Predicate pushdown | No | Yes |
| Read speed | Slow | Fast |
10. Real Example (Spark Execution)
What Spark actually does:
- Reads footer
- Identifies: only name, age needed
- Applies predicate pushdown on age
- Skips unnecessary row groups
- Reads only required column chunks
11. Key Interview Points
11.1 Why is Parquet columnar?
→ Improves read efficiency for analytics
11.2 What is a Row Group?
→ A horizontal partition enabling parallel reads
11.3 What is stored in the footer?
→ Schema + metadata + statistics
11.4 Why is no inferSchema job needed?
→ The schema is already stored in the file
11.5 Why is Parquet efficient?
→ Column pruning + predicate pushdown + compression