Data & System Architecture, from the ground up Lesson 38 / 80

Idempotent batch: making jobs safely re-runnable

Overwrite vs append vs upsert. The MERGE pattern. Why 'this job ran twice' should be a non-event.

Lesson 16, in Module 2, covered idempotency for messaging: how to design a consumer so that processing the same message twice is harmless. This lesson is the batch counterpart. The same idea, applied to scheduled jobs that read large slices of input and write large slices of output: running the same job twice should produce the same end state.

If you take one piece of advice from this whole module, take this one. Idempotent batch is the property that turns the rest of the data platform from a pile of one-shot scripts into something you can operate calmly. Failures stop being incidents. Backfills stop being expeditions. The phrase “this job ran twice” should make nobody flinch. Almost every operational pain in a data team that I have seen, traced to its root, comes down to a job somewhere that is not safely re-runnable.

Why it matters

Batch jobs fail. They fail because of network blips, transient cloud-provider issues, a worker node OOMing, an upstream dataset being late, a bug, or a deploy. The orchestrator (Airflow, Dagster, whatever) retries. If the first attempt got halfway through and wrote some output before crashing, and the retry runs the whole thing again, what does the output look like?

In a non-idempotent job, you now have two halves of the first attempt’s data plus the entire second attempt, all mixed together. You may have duplicate rows. You may have rows from a half-complete run that look real but are based on incomplete input. You probably do not know without inspecting which rows are which.

The same pattern shows up in three other places that are not failure-driven:

  • Backfills. You realise that last week’s run had a bug. You want to re-run it for the affected dates. If the job is idempotent, re-running for any date range produces correct output. If it is not, you have to first delete what is there, then re-run, hoping you delete the right thing.
  • Overlapping schedules. A daily job runs at 02:00. The 02:00 run for the 18th is delayed and finishes at 02:50. The 02:00 run for the 19th is now starting. If they read overlapping input, they had better not produce overlapping incorrect output.
  • Replays. Something downstream got corrupted and you need to rebuild it from upstream. Re-running every job in the dependency graph for the affected window had better produce the same answer as the original runs would have.

All four (failure retry, backfill, overlap, replay) are routine in any data team. All four work cleanly if your jobs are idempotent and become forensic exercises if they are not.

The good news: idempotency for batch is not hard. It is a small set of patterns, applied consistently. The rest of this lesson is those patterns.

The three write modes

Every batch job’s output write falls into one of three modes, and the mode determines the idempotency story.

Append. The job writes new rows to the existing table. Re-running adds the rows again. By default, this is not idempotent: a re-run produces duplicates. Append is fine when paired with deduplication downstream (the next consumer collapses duplicates by key) or when the application is happy with a log-shaped table that is allowed to contain duplicates. As a primary write mode for analytics tables, it is the worst choice.

Overwrite. The job replaces a partition or the entire table atomically with the output of the run. Re-running with the same input produces the same output, and the atomic replace means you never see a half-state. This is the simplest path to idempotency. The unit of overwrite is usually a partition, not the whole table: you write event_date=2026-01-21 and replace whatever was there.

Upsert / MERGE. The job inserts new rows and updates existing ones based on a key. Re-running with the same input is idempotent because the second run sees the rows the first run wrote and merges into them, leaving the same end state. This is the most flexible mode and the most expensive.

The rule of thumb: prefer overwrite for partition-shaped data, prefer MERGE for entity-shaped data, avoid plain append unless you have a specific reason.

flowchart TB
    subgraph A[Append + dedup]
      A1[Job writes new rows] --> A2[Downstream dedup]
      A2 --> A3[Idempotent end state]
    end
    subgraph O[Partition overwrite]
      O1[Job computes partition] --> O2[Atomic replace partition]
      O2 --> O3[Idempotent end state]
    end
    subgraph M[MERGE / upsert]
      M1[Job stages records] --> M2[MERGE INTO target ON key]
      M2 --> M3[Idempotent end state]
    end

The partition-overwrite pattern

This is the workhorse pattern of modern batch. Most analytics tables are partitioned by date, and most batch jobs run once per day to produce one date’s worth of output. The shape is:

INSERT OVERWRITE TABLE silver.orders PARTITION (event_date = '2026-01-21')
SELECT ...
FROM bronze.orders
WHERE event_date = '2026-01-21'

That statement is idempotent. Run it once, twice, ten times for the same date: the partition ends up with the same content each time. The table format (Delta, Iceberg, or Hudi) makes the partition replace atomic, so readers never see a half-replaced partition.

Backfilling is trivial. To re-run for the week of January 14th-21st, loop over the dates and run the job for each. The output of each run is bounded to its own partition, so the backfill cannot accidentally affect other dates’ data.

The discipline this pattern requires is that the job is a function of the partition: the same input partition produces the same output partition, with no dependencies on when the job runs or what other jobs have run. If the silver job for the 21st depends on what was in the silver table for the 20th, the function-of-partition property breaks, and re-running becomes hazardous.

This is why partition-aligned batch jobs in modern stacks are almost always written as read partition X, transform, overwrite partition X. The simplicity of the contract is the whole point.

The MERGE pattern

For tables that are not partition-shaped, the partition-overwrite trick does not work. A silver.customers table has one row per customer, not partitions of dated events. Updating a customer’s record means changing one row in place, not replacing a partition.

The MERGE pattern, supported by Delta, Iceberg, and Hudi (lesson 37), is the idempotent answer:

MERGE INTO silver.customers AS t
USING staging.customers_today AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The MERGE is atomic at the table-format level: either the whole merge applies or none of it does. It is idempotent because the second run sees the rows the first run wrote, matches them, and updates them to the same values. The end state after run number two is identical to the end state after run number one.

Two practical notes. First, the ON clause has to use a real key. If the key is not unique in the source, MERGE will fail or produce garbage depending on the engine; deduplicate the source first. Second, if you are merging changed records only (not the full snapshot), the output is correct only if the changed-records feed is itself complete. Missing one update on the source side means the target stays stale until the next time that record appears in the feed.

The watermark / high-water-mark pattern

Both partition-overwrite and MERGE need the job to know “what input am I processing this time”. For time-partitioned input, the answer is “the partition I am running for”. For a continuously-growing source where you want incremental processing, the answer is a watermark: the timestamp (or sequence number) of the most recent record processed in the last run.

The pattern:

  1. The job reads the watermark from its state store (often a metadata table in the lakehouse itself).
  2. It selects the input records with event_time > watermark.
  3. It processes them and writes the output (idempotently, via overwrite or MERGE).
  4. It advances the watermark to the maximum event_time it just processed.

Combined with MERGE, this becomes the canonical incremental-and-idempotent recipe. Each run picks up where the last one left off. Failures and reruns do not double-process: the watermark is advanced only after the output write succeeds, so a failed run leaves the watermark unchanged and the next attempt processes the same input again, idempotently.

Watch out for late-arriving data. If records can arrive with event_time earlier than the current watermark, a strict “greater than watermark” filter will miss them. The fix is either to lag the watermark behind the wall clock by a tolerance window (process records older than now() - 1h), or to allow the job to look back a fixed number of partitions on each run.

The append-with-dedup pattern

Sometimes append is forced on you. The cheapest write to an object-store-backed table, especially in a streaming-to-batch handoff, is an append. If you cannot avoid it, the dedup pattern preserves idempotency at the cost of a little more downstream work:

  1. Append the records, including the source’s natural key and an ingestion timestamp.
  2. The downstream consumer (or a subsequent batch step) groups by key and keeps the row with the latest ingestion timestamp.

The end state is idempotent: re-running the append step adds duplicates, but the dedup step collapses them. The cost is that the dedup is extra work on every read, or extra batch infrastructure to materialise a deduplicated view.

This is the pattern I recommend least. It works, and there are situations where it is the only realistic option, but it is operationally heavier than overwrite or MERGE. If you have a choice, choose one of the others.

Pitfalls

A few traps that make jobs that look idempotent quietly not idempotent.

Side effects that aren’t idempotent. A batch job that sends an email on each run, charges a credit card, or POSTs to a webhook is not idempotent: re-running re-sends, re-charges, re-posts. Move the side effect out of the batch path, or wrap it in an idempotency-key check (lesson 16) so the receiving system absorbs duplicates.

Random or wall-clock dependencies. A job that uses now() to label records, or a random() to break ties, will produce different output on every run. The fix is to pin the timestamp to the schedule’s logical run-time (the orchestrator usually exposes this as a parameter, e.g. execution_date or logical_date), and to seed any randomness from a deterministic source. The discipline is: every value the job writes should be a function of the input, not of when the job ran.

External resources that change. A job that hits an external API with a query like “give me the latest exchange rates” is not idempotent: re-running tomorrow returns different rates. If you cannot avoid this, snapshot the external data into bronze first, with a logical-date label, and have the rest of the job read from the snapshot. Bronze becomes the boundary between the non-idempotent outside world and the idempotent batch pipeline.

Order-dependent transforms. A transform that uses a row-number or rank without a deterministic tie-breaker can produce different output for the same input on different runs (the engine’s parallel execution may order tied rows differently each time). Add an explicit tie-breaker column. If two orders have the same timestamp, also order by order_id, so the rank is fully determined.

Skipping the watermark advance on failure. If your watermark is advanced before the output write commits, a crash between the two leaves the watermark moved past records that were never written. This is the canonical “lost data on retry” bug. Always advance the watermark in the same transaction as the output, or after the output is durable.

The discipline

The end state of all this: every batch job in your platform should be safely re-runnable for any input range, by anyone, at any time, with no coordination. The job’s owner should be able to say “yes, just run it for last week” without having to think about it. The orchestrator should be able to retry without triggering an alert. A bug, found in production three months in, should be fixable by a re-run.

That property is built one job at a time, by the patterns above and by team norms that say “if it isn’t idempotent, it isn’t done”. It is the foundation that the next lesson (backfills and replays) and most of the rest of the module rely on. Without it, every later operation is a careful exercise in not breaking production. With it, the platform stays calm.

Search