## 1. File Size Optimization (Most Important)

### Problem: Small files

- Too many small Parquet files → too many tasks
- High scheduling overhead
- Poor I/O throughput

### Target

128 MB – 1 GB per file (the sweet spot for Spark)

### Fix
#### Option 1: Repartition before write
#### Option 2: Control file size directly
### Key Insight

- More files ≠ faster
- Balanced file sizes = optimal parallelism
## 2. Partitioning Strategy (Critical)

Partitioning = directory-level organization:

```
/data/year=2025/month=04/
```
### Works well when

- Column is used in filters
- Column has moderate cardinality
### Bad Partitioning

Partitioning on a high-cardinality column (e.g. user ID) causes:

- Millions of folders
- Metadata explosion
- Slow queries
### Rule of Thumb

| Column Type | Use for Partition? |
|---|---|
| Date | Yes |
| Region | Yes |
| User ID | No (too high cardinality) |
## 3. Column Pruning Optimization

Always select only the columns you need:
Why:
- Parquet reads only required columns
- Reduces I/O significantly
## 4. Predicate Pushdown
Write queries like:
Spark:
- Uses Parquet metadata (min/max)
- Skips entire row groups
### Important Tip
Avoid wrapping the filtered column in a function or UDF — that breaks pushdown.
## 5. Compression Optimization

### Recommended codecs
| Codec | Use Case |
|---|---|
| Snappy | Default, fast |
| Gzip | Better compression, slower |
| ZSTD | Best balance |
### Set compression
## 6. Row Group Size Tuning
Row groups affect:
- Parallelism
- Predicate pushdown efficiency
### Default

~128 MB per row group
### Tune if needed
### Insight

- Larger row groups → better compression
- Smaller row groups → better skipping
## 7. Sorting Data Before Writing
Sorting improves:
- Compression
- Predicate pushdown efficiency
### Example
Why:
- Values in column become clustered
- Min/max stats become more useful
## 8. Avoid Python UDFs Before Writing
Bad: applying row-by-row Python UDFs just before the write.

Why:

- Breaks Catalyst optimization
- Rows are serialized out to Python workers, slowing execution

Better: use built-in Spark SQL functions.
## 9. Merge Small Files (Compaction)

If small files already exist, read them back and rewrite at the target size:
### Production approach
- Run periodic compaction jobs
- Especially in streaming pipelines
## 10. Partition Pruning (Query Side)
Query like:
Spark reads only:

```
/year=2025/
```
Applying a function or cast to the partition column breaks pruning.
## 11. Schema Optimization

### Use correct data types
Bad: storing IDs and dates as strings.
Good: native numeric and date types.

Why:

- Better compression
- Faster comparisons
## 12. Caching (When Reused)
Use when:
- Same data reused multiple times
## 13. Advanced: Bucketing (Less Common Now)
Useful for:
- Joins
## 14. Real-World Optimization Example
### Bad pipeline
Problems:
- Small files
- No partitioning
- No compression tuning
### Optimized pipeline
```python
df = df.repartition(200)

df.write \
    .partitionBy("year", "month") \
    .option("compression", "zstd") \
    .parquet("path")
```
## 15. Key Interview Summary
If asked “How to optimize Parquet?”:
You can answer:
- Control file size (avoid small files)
- Use proper partitioning (low/moderate cardinality)
- Enable predicate pushdown
- Use column pruning
- Choose efficient compression (Snappy/ZSTD)
- Sort data before writing
- Avoid UDFs in pipeline
- Periodically compact files