# Checkpoints and Savepoints in Flink\

Here is the clearest and most complete explanation of checkpoints vs savepoints in Flink — one of the most important topics for real-time systems and interviews.

1. What are checkpoints in Flink?🔗

A checkpoint is an automatic, periodic, lightweight snapshot of the job’s state.

Flink uses checkpoints to recover from failures without data loss.

Key characteristics of checkpoints:🔗

Triggered automatically (e.g., every 1 minute)
Managed by the Job Manager
Used for fault recovery, not for stopping jobs
Taken while the job is running continuously
Stored in checkpoint storage (S3, HDFS, etc.)
Flink automatically deletes old checkpoints

Purpose:🔗

Recover the job automatically after failures with exactly-once guarantees.

Example:🔗

If your job crashes at 12:05 and last checkpoint was at 12:04:

Job restarts from the 12:04 checkpoint
No double-processing
No lost data

2. How checkpoints work (simplified)🔗

Job Manager triggers a checkpoint
All Task Managers snapshot their operator and keyed state
Snapshot goes to durable storage (S3, HDFS)
Offsets/positions in Kafka/file sources are also saved
When failure happens, Flink restores from this snapshot

This ensures exactly-once consistency.

3. What are savepoints in Flink?🔗

A savepoint is a manual, intentional snapshot of the job’s state.

It is triggered by the user, not the system.

Key characteristics of savepoints:🔗

Triggered manually (CLI, REST API)
Used for job upgrades, migrations, changes
Stable and long-lived
Stored wherever you choose
Not automatically deleted
Internally similar to checkpoints, but more portable and versioned
Might be larger/heavier than checkpoints

Purpose:🔗

Perform controlled upgrades, rescaling, or moves of a Flink job without losing state.

4. Why savepoints exist🔗

Kafka consumers, Flink jobs, and streaming pipelines often need upgrading:

New code
New logic
New parallelism
Migrating to a new cluster
Patching bugs

You cannot rely on automatic checkpoints for this because checkpoints are:

Not stable
Deleted automatically
Not intended for upgrades
Not portable between versions sometimes

Savepoints are explicit snapshots that the user manages.

5. Key differences between checkpoints and savepoints🔗

Feature	Checkpoint	Savepoint
Triggered by	Flink (automatic)	User (manual)
Purpose	Fault recovery	Upgrades & maintenance
Retention	Automatically cleaned	Must be preserved manually
Frequency	Regular (e.g., every minute)	Occasional
Portability	Not guaranteed	More portable
Job stopping needed?	No	Often yes (stop-with-savepoint)
Can resume job?	Yes	Yes (recommended for upgrades)

6. When do you use each?🔗

Use checkpoints for:🔗

Automatic failure recovery
Exactly-once guarantees
Continuous stability

Use savepoints for:🔗

Deploying new versions of the pipeline
Changing parallelism
Migrating clusters
Debugging production jobs
Pausing a job for maintenance

7. Example workflows🔗

A. Failure recovery🔗

Job fails → Automatically restarts → Loads last checkpoint

B. Code upgrade with state preservation🔗

1. Stop job with savepoint
2. Deploy new code
3. Restore from savepoint

C. Rescale a job🔗

savepoint → restore with new parallelism

This moves state partitions to the new number of Task Managers/slots.

8. One-line summary🔗

Checkpoint = automatic, frequent, lightweight snapshots for fault recovery
Savepoint = manual, durable snapshot for upgrades, rescaling, and controlled restarts