Stateful Upgrades in Flink
Stateful upgrades in Flink mean upgrading a running streaming job while preserving its state, without starting from scratch or losing data. This is one of Flink's most important capabilities because real-time pipelines often need continuous improvements.
1. What is a Stateful Upgrade?
A stateful upgrade = deploying a new version of your Flink job using its existing state.
You are changing code, logic, or parallelism while keeping:
- Keyed state
- Window state
- Timers
- Operator state
- Kafka offsets
- Any other internal state
This prevents:
- Reprocessing from the beginning
- Losing state
- Writing wrong results
- Outages
2. How Flink performs a stateful upgrade
Stateful upgrades are done using a savepoint, not a checkpoint.
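Typical steps:
Step 1: Trigger a savepoint
You trigger a savepoint for the running job, either with the Flink CLI (`flink savepoint <jobId> [targetDirectory]`, or `flink stop --savepointPath <targetDirectory> <jobId>` to stop the job with a final savepoint) or via the REST API.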
This savepoint contains:
- Operator/Keyed state
- Metadata
- Offsets (Kafka source's position)
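Triggering a savepoint through the REST API can be sketched as below. This is a minimal sketch that only builds the request to `POST /jobs/<jobId>/savepoints` and does not send it. The host, job ID, and target directory are hypothetical placeholders, and `build_savepoint_request` is a helper named here for illustration, not part of Flink.

```python
import json
import urllib.request


def build_savepoint_request(base_url, job_id, target_dir, cancel_job=False):
    """Build (but do not send) the HTTP request that triggers a savepoint."""
    body = {"target-directory": target_dir, "cancel-job": cancel_job}
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/savepoints",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_savepoint_request(
    "http://localhost:8081", "a1b2c3", "s3://my-bucket/savepoints"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen(request)`) starts the savepoint asynchronously. The response should carry a trigger ID that is then polled for completion and for the final savepoint path.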
Step 2: Deploy new job version
Change your code:
- New business logic
- Fixed bugs
- Added operators
- Changed state type (carefully)
- Changed parallelism
Flink allows you to evolve operators as long as:
- Operator UIDs remain consistent
- State schema evolution is compatible
- Parallelism changes follow Flink rules
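Step 3: Restore from savepoint
You launch the new job by pointing it at the savepoint, for example with `flink run -s <savepointPath>` followed by the new job JAR.
The new job starts from the exact same state as the previous version.
No state is lost and nothing is recomputed from the start. If the old job kept running after the savepoint was taken, the records it processed since then are processed again by the new job, so duplicate-free output also depends on stopping the job with the savepoint or on idempotent or transactional sinks.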
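Launching the new version from a savepoint with the Flink CLI can be sketched as follows. The helper `flink_run_command`, the JAR name, and the savepoint path are hypothetical. The flags it emits (`-s`, `-p`, `--allowNonRestoredState`) are the standard `flink run` options, placed before the JAR path.

```python
import shlex


def flink_run_command(jar_path, savepoint_path, parallelism=None,
                      allow_non_restored_state=False):
    """Return the argv list for `flink run` restoring from a savepoint."""
    argv = ["flink", "run", "-s", savepoint_path]
    if parallelism is not None:
        # Optional: restore with a different parallelism (rescaling).
        argv += ["-p", str(parallelism)]
    if allow_non_restored_state:
        # Optional: drop savepoint state that no operator in the new job claims.
        argv.append("--allowNonRestoredState")
    argv.append(jar_path)
    return argv


# Hypothetical JAR and savepoint path, for illustration only.
argv = flink_run_command(
    "my-job-v2.jar", "/savepoints/savepoint-abc123", parallelism=10
)
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv, check=True)` on a machine where the `flink` client is installed.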
3. Why savepoints are required for stateful upgrades
Checkpoints are not suitable for upgrades because:
- They are deleted automatically by default (when newer checkpoints supersede them or the job is cancelled)
- They are optimized for recovery, not portability
- They may not be compatible across versions
- They aren't stable references
Savepoints, on the other hand:
- Are stable snapshots
- Are kept forever unless manually deleted
- Are portable across code versions
- Are designed for upgrades, migration, scaling
4. What is required for a successful stateful upgrade?
A. Stable Operator IDs (UIDs)
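Each operator must have a fixed UID, assigned explicitly with `uid(...)` on the operator.
If UIDs change, Flink treats the operators as new ones: the state saved under the old UIDs cannot be mapped to them, and the restore fails unless that state is explicitly allowed to be dropped (`--allowNonRestoredState`).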
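The UID matching that happens on restore can be illustrated with a small model. This is a simplified sketch in plain Python, not Flink code. The operator UIDs are made up, and `match_savepoint_state` is a hypothetical helper that only mimics the rule "state is handed to the operator with the same UID".

```python
def match_savepoint_state(savepoint_uids, job_uids, allow_non_restored_state=False):
    """Model how savepoint state is matched to operators by UID."""
    savepoint_uids, job_uids = set(savepoint_uids), set(job_uids)
    restored = savepoint_uids & job_uids   # state handed to the same-UID operator
    fresh = job_uids - savepoint_uids      # new operators start with empty state
    orphaned = savepoint_uids - job_uids   # state with no operator to receive it
    if orphaned and not allow_non_restored_state:
        raise RuntimeError(f"cannot map savepoint state for: {sorted(orphaned)}")
    return {"restored": restored, "fresh": fresh, "dropped": orphaned}


old_job = ["kafka-source", "dedupe", "sum-per-user"]
added = old_job + ["enrich"]                              # UIDs kept, operator added
renamed = ["kafka-source", "dedupe", "sum-per-user-v2"]   # one UID changed by accident

print(match_savepoint_state(old_job, added))
print(match_savepoint_state(old_job, renamed, allow_non_restored_state=True))
```

Adding an operator is harmless because it simply starts empty. Renaming a UID is the dangerous case: without the explicit opt-in the model raises, and with it the old state is silently dropped.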
B. Compatible state schema
If you change the type of the state (e.g., ValueState from int to string), Flink needs:
- State schema compatibility
- Or migration serializers
- Or manual transformations
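What "compatible" means can be sketched with a small checker. It assumes the rules Flink documents for POJO state types: fields may be added or removed, but the declared type of an existing field may not change. The field names and the helper `check_pojo_evolution` are hypothetical, and other serializers (for example Avro or custom ones) follow their own rules.

```python
def check_pojo_evolution(old_fields, new_fields):
    """Return (compatible, reasons) for a change to a POJO-like state type.

    Both arguments map field name -> declared type name.
    """
    reasons = []
    for name, old_type in old_fields.items():
        # Removed fields are fine; only a changed type on a kept field is not.
        if name in new_fields and new_fields[name] != old_type:
            reasons.append(
                f"field '{name}' changed type {old_type} -> {new_fields[name]}"
            )
    return (not reasons, reasons)


old = {"userId": "String", "count": "int"}
added_field = {"userId": "String", "count": "int", "lastSeen": "long"}
retyped = {"userId": "String", "count": "String"}

print(check_pojo_evolution(old, added_field))
print(check_pojo_evolution(old, retyped))
```

A change that fails a check like this is the case that needs a migration serializer or an offline transformation of the savepoint.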
C. Same partitioning
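KeyBy fields must be consistent.
For example, if the old job keys a stream by one field and the new job keys the same stream by a different field, the state saved under the old keys no longer lines up with the records that arrive under the new keys.
This breaks state compatibility. If the key type changes, the restore typically fails; if only the meaning of the key changes, the job may start but with state attached to the wrong keys.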
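A toy model shows why a changed key breaks keyed state. Plain dictionaries stand in for Flink's keyed state here, and the event fields (`user_id`, `country`, `amount`) are invented for the illustration.

```python
from collections import defaultdict

events = [
    {"user_id": "u1", "country": "DE", "amount": 10},
    {"user_id": "u2", "country": "DE", "amount": 5},
    {"user_id": "u1", "country": "FR", "amount": 7},
]


def run_sum(events, key_field, state=None):
    """Keep a running sum per key, starting from optional restored state."""
    state = defaultdict(int, state or {})
    for event in events:
        state[event[key_field]] += event["amount"]
    return dict(state)


# Old version: keyed by user_id.
old_state = run_sum(events, "user_id")

# New version: keyed by country, but restored from the old state.
new_state = run_sum(events[:1], "country", old_state)
print(old_state)
print(new_state)
```

The restored sums for `u1` and `u2` are still present, but no record keyed by country ever reaches them, so the new version effectively starts from zero while carrying dead state.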
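D. Parallelism rules
You can change parallelism during upgrades, as long as the new parallelism does not exceed the job's maximum parallelism and the maximum parallelism itself stays unchanged.
Flink will:
- Redistribute keyed state across the new number of parallel subtasks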
- Migrate state partitions
- Maintain correctness
This is called rescaling.
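Rescaling works because keyed state is organized into key groups, whose number equals the maximum parallelism, and each subtask owns a contiguous range of them. The sketch below models that assignment. The value 128 for the maximum parallelism is an assumption, and `zlib.crc32` is only a stand-in for the hash Flink applies to keys.

```python
import zlib

MAX_PARALLELISM = 128  # number of key groups; assumed fixed across the upgrade


def key_group_for(key):
    """Map a key to a key group. crc32 is a stand-in for Flink's own hash."""
    return zlib.crc32(str(key).encode("utf-8")) % MAX_PARALLELISM


def subtask_for(key_group, parallelism):
    """Assign a key group to the subtask that owns its contiguous range."""
    return key_group * parallelism // MAX_PARALLELISM


def groups_per_subtask(parallelism):
    """Count how many key groups each subtask owns at a given parallelism."""
    counts = [0] * parallelism
    for kg in range(MAX_PARALLELISM):
        counts[subtask_for(kg, parallelism)] += 1
    return counts


kg = key_group_for("user-42")   # does not depend on parallelism
print(kg, subtask_for(kg, 4), subtask_for(kg, 10))
print(groups_per_subtask(4))
print(groups_per_subtask(10))
```

A key's key group never changes; only the owner of that key group does. Going from parallelism 4 to 10, each subtask goes from owning 32 key groups to owning 12 or 13, and state moves in whole key groups rather than key by key.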
5. Stateful Upgrade Examples
Example 1: Add a new field to event processing
The old job processed events with a fixed set of fields; the new job adds a field to the event type.
As long as the UIDs are the same and the state serializer evolves correctly, the upgrade works.
Example 2: Change parallelism from 4 to 10
Flink transparently repartitions the state for the new parallelism.
Example 3: Upgrade business logic
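The old and new versions differ only in the logic inside an operator; its UID and state type stay the same.
Flink restores the old state and applies the new logic.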
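As a toy illustration of a logic-only upgrade, the sketch below keeps the same per-user counter state across two versions of a processing function. The function names, the threshold, and the JSON snapshot are hypothetical stand-ins; a real job would use Flink's keyed state and a savepoint.

```python
import json


def process_v1(state, event):
    """Old logic: count events per user."""
    state[event["user"]] = state.get(event["user"], 0) + 1
    return None


def process_v2(state, event, threshold=3):
    """New logic: same counter state, but emit an alert past a threshold."""
    count = state.get(event["user"], 0) + 1
    state[event["user"]] = count
    return f"alert:{event['user']}" if count > threshold else None


# The old version runs, then its state is snapshotted (stand-in for a savepoint).
state = {}
for _ in range(3):
    process_v1(state, {"user": "u1"})
snapshot = json.dumps(state)

# The new version restores the snapshot and continues with the new logic.
restored = json.loads(snapshot)
result = process_v2(restored, {"user": "u1"})
print(result)
```

Because the restored counter already stands at 3, the first event seen by the new version crosses the threshold. Without the restored state, the new logic would need three more events before it could alert.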
6. Stateful vs Stateless upgrades
| Type | Requires savepoint? | Can change parallelism? | Keeps state? |
|---|---|---|---|
| Stateless upgrade | No | Yes | Not applicable |
| Stateful upgrade | Yes | Yes | Yes |
Most real pipelines (fraud detection, aggregations, CEP, joins, windows) require stateful upgrades.
7. What happens internally during a stateful upgrade
When restoring from a savepoint:
- Flink reads the metadata
- Matches operator UIDs with new job graph
- Loads state for each operator/key
- Reassigns state to subtasks based on new parallelism
- Initializes timers + windows
- Connects Kafka sources starting from saved offsets
- Starts processing from the exact previous point
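This preserves exactly-once state consistency. Downtime is limited to the window between stopping the old job and the new job finishing its restore.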
8. One-line Summary
A stateful upgrade in Flink is deploying a new version of a streaming job while preserving all its state by restoring from a savepoint. It enables upgrades with minimal downtime, parallelism changes, and safe evolution of long-running pipelines.