Data & System Architecture, from the ground up Lesson 46 / 80

CDC (Change Data Capture) and the dual-write problem

Debezium, Maxwell, AWS DMS. The dual-write problem and the outbox pattern that solves it.

The previous two lessons treated streams as a thing that already exists. Events arrive at Kafka, the processor consumes them, watermarks advance, transactions commit. None of those lessons asked the obvious upstream question: where do the events come from in the first place?

For some systems the answer is “a service produces events directly.” A user clicks “buy now,” the order service writes a OrderPlaced event to Kafka, downstream consumers do their thing. That works when the only consequence of the click is the event. It stops working the moment the order also has to be written to a database, which is most of the time, because services in real systems own state and the state lives in OLTP databases.

The moment a service has to update its database and publish an event, you are facing the dual-write problem. It is one of those problems that looks trivial at first and is actually a small distributed-transactions question in disguise, with the same answers (and the same trade-offs) as every other distributed-transactions question in this course.

The dual-write problem

The naive shape:

def place_order(order):
    db.insert(order)
    kafka.publish("orders", OrderPlaced(order))

Two operations. Two systems. No shared transaction. There is no protocol that makes “write to Postgres and write to Kafka” atomic, because Postgres and Kafka have separate notions of commit and no shared coordinator. The pair of writes either both succeed, or both fail, or one succeeds while the other fails.

The first two outcomes are fine. The third is the bug. There are two flavours of it.

DB succeeds, Kafka fails. The order is in the database. The event is not in Kafka. Downstream consumers (warehouse, billing, email confirmation) never hear about the order. The user sees a confirmation page (because the API returned 200 after the DB write), but their package never ships. The state of the world is inconsistent with what the rest of the system thinks the state of the world is.

Kafka succeeds, DB fails. The event is in Kafka. The order is not in the database. Downstream consumers process a phantom order. The warehouse picks an order that does not exist. Billing charges a card for a row that nobody can look up. The system has told the world a lie.

Both flavours happen in production. They happen because the network between the service and one of the two systems blips at exactly the wrong moment, because a process is killed between the two writes, because a deploy lands mid-flight, because the database is briefly under heavy load and the second write times out. The window is small, but it is not zero, and at scale “small probability” turns into “happens every Tuesday.”

The instinct is to wrap the two writes in a try/catch and “rollback” if the second one fails. That instinct is wrong. If the Kafka publish fails, you can roll back the database transaction, sure. If the database commit succeeds and then the Kafka publish fails, you cannot roll back the database, because it has already committed. If the Kafka publish succeeds and the database fails, you cannot un-publish from Kafka. There is no symmetric rollback available, because the two systems do not coordinate.

The failure mode you cannot avoid by code-level defensiveness is the one where the network partitions you mid-flight. The first write returned success. The process is killed. The second write never happens. There is no finally block in the world that fixes this, because the process is not running.

The dual-write problem has to be solved at the architecture layer, not the function layer. Two patterns solve it well, and they target slightly different shapes of the problem. Both are common in production. Many systems use both, for different event types.

Solution 1: Change Data Capture

CDC inverts the problem. Instead of asking the application to write to two systems, you let the application write to one, the database, and you let a separate process turn the database’s commit log into events.

Every transactional database keeps a write-ahead log of every change it has ever made. Postgres calls it the WAL. MySQL calls it the binary log. SQL Server, Oracle, and the cloud databases all have their equivalents. The log is the source from which the database itself reconstructs state on crash recovery and replicates to read replicas. It is durable, ordered, and includes every committed change.

CDC tooling reads that log, turns each row change into a structured event, and publishes the event to Kafka. The application service writes to the database normally. It does not know CDC exists. The database commit and the event publication are no longer two writes the application has to coordinate; the event publication is a downstream consequence of the commit, derived by reading the same log the database itself uses for replication.

The dominant tool in the open-source world is Debezium, a CDC framework built on Kafka Connect. Debezium has connectors for Postgres, MySQL, SQL Server, MongoDB, Oracle, and several others. Each connector parses the database-specific log format and emits a normalised event shape (before/after row state, operation type, source metadata) to a Kafka topic, one topic per source table by default.

Other tools cover similar ground. Maxwell is a long-running MySQL-only CDC tool. AWS DMS offers managed CDC across most major databases, with output to Kinesis or directly to S3 or Redshift. Flink CDC wraps Debezium connectors so they can be used as Flink sources directly. Confluent Cloud sells a hosted Debezium-equivalent. The choice between them is a function of which database you are reading from, whether you want managed or self-hosted, and what you are doing on the consumer side.

flowchart LR
    App[Application service] -->|INSERT/UPDATE/DELETE| DB[(Postgres)]
    DB -->|WAL| CDC[Debezium connector]
    CDC -->|row-change events| Kafka[(Kafka topics)]
    Kafka --> C1[Warehouse consumer]
    Kafka --> C2[Search index consumer]
    Kafka --> C3[Analytics consumer]

CDC has two strong properties. First, it captures every change to the database, including changes made by other services, by manual SQL, by data migrations, by anything that hits the database. The event stream is a complete log of state changes, which is exactly what a downstream search index, cache invalidator, or analytics warehouse usually wants. Second, the application code does not change at all. Adding CDC to a system is a deployment-layer change, not a code-layer change.

It also has limits. CDC events describe row-level changes (“the orders table now has this row in this state”), not business-level intent (“an order was placed and shipped to address X with discount code Y”). Reconstructing intent from row-level changes is sometimes easy and sometimes a nightmare. If five rows change inside a single transaction, the CDC stream has five events. The downstream consumer has to know that they go together. Some Debezium configurations preserve transaction boundaries, but consuming them requires care.

CDC also exposes the database schema directly. Every column rename, every dropped column, every type change, becomes a breaking change to the event schema. Teams downstream of the CDC stream find themselves coupled to the upstream database schema in a way they were not when the upstream service was emitting business events. The coupling is real and has to be managed, often with a translation layer that converts CDC events into stable business events before they cross team boundaries.

Solution 2: the transactional outbox

The outbox pattern keeps the application in charge of what events it emits, while still solving the dual-write problem.

The shape: in the same database transaction that writes the business state, the application writes a row to a dedicated outbox table. The outbox row contains the event payload, an event ID, the destination topic, and a timestamp. The transaction commits both rows together. There is a single write, to a single system, and the database’s transaction guarantees handle the atomicity.

A separate process, the outbox poller, periodically scans the outbox table for unpublished rows, publishes each one to Kafka, and marks it published (or deletes it). The poller is at-least-once: it might publish a row, crash before marking it published, and on restart publish the same row a second time. That is fine, because the consumers are idempotent (lesson 16 and lesson 45) and dedupe by the event ID that the application generated when it wrote the outbox row.

flowchart LR
    App[Application service] -->|tx: insert order + outbox row| DB[(Postgres)]
    DB -->|both committed atomically| DB
    Poller[Outbox poller] -->|SELECT FROM outbox WHERE published=false| DB
    Poller -->|publish event| Kafka[(Kafka)]
    Poller -->|UPDATE outbox SET published=true| DB

The atomicity is what makes the pattern work. Either the order and the outbox row are both committed, or neither is. There is no failure mode where the order exists without a corresponding outbox entry. The publication to Kafka is a downstream concern that is allowed to be at-least-once, because the consumer absorbs duplicates.

The poller can be a separate small service, a cron-driven job, or, increasingly common, a Debezium connector pointed at the outbox table. That last option is the elegant combination: the outbox solves the dual-write problem at the application layer, and Debezium’s CDC on the outbox table provides the publishing mechanism without anyone writing a custom poller.

The outbox pattern’s main cost is the extra table and the small operational overhead of the poller. The poller has to be monitored (a stuck poller means events are not flowing, and the outbox table grows unbounded). The outbox table needs an index on the unpublished column so polling is cheap. The poller has to handle ordering carefully if order matters per entity. None of these is hard. They are the kind of work you only have to do once per service.

When to use which

CDC and outbox solve overlapping problems with different shapes, and a mature system often uses both.

CDC is right when: the goal is to capture every change to a database, regardless of which code caused it. Search index synchronisation, analytics warehousing, cache invalidation, replication to a different region or stack. The consumer wants a complete log of state changes; it does not care about business intent.

Outbox is right when: the goal is to emit business-meaningful events, only at the moments the application chooses, with payloads designed for downstream consumers. OrderPlaced, PaymentRefunded, UserDeactivated. The consumer cares about what happened in business terms, not which rows changed in database terms. The application owns the event vocabulary.

A typical microservice landscape ends up with both. Debezium captures every change in every operational database and writes it to a tier of “data plane” topics that the analytics warehouse and the search system consume. The application services additionally write to outbox tables and emit business events to a tier of “service plane” topics that other services consume. The two tiers have different audiences, different schemas, and different change-management policies, and they coexist cleanly.

The thread

The dual-write problem is, at its core, a distributed-transaction problem. The lessons of Module 2 apply: there is no off-the-shelf 2PC across an application database and a message bus, the patterns that work are the ones that route around the impossibility, and the property you are after is end-to-end consistency, not commit-at-the-source-and-pray.

CDC routes around the impossibility by collapsing the two writes into one and reading the result from the database’s own log. Outbox routes around it by collapsing the two writes into one transaction, with the publication a downstream consequence. Both are pragmatic, both are widely used, and both replace a fundamentally unsafe pattern (try to write to two systems and hope) with a safe one (write to one system and let a separate process derive the rest).

Module 6 has now covered streaming end to end: the engines, the topologies, the state, the time, the delivery semantics, and the bridge from OLTP databases into the streaming layer. The next lesson moves to ordering, partitioning, and back-pressure, which are the operational realities every streaming pipeline runs into once it has been live for more than a quarter and the workload has grown past what the initial sizing assumed.

Citations and further reading

  • Debezium documentation, “Debezium architecture” and “Outbox event router”, https://debezium.io/documentation/ (retrieved 2026-05-01). The reference for both the CDC mechanics and the Debezium outbox SMT.
  • Gunnar Morling, “Reliable Microservices Data Exchange With the Outbox Pattern”, Red Hat Developer (retrieved 2026-05-01). The canonical write-up of the outbox pattern in Java/Postgres terms.
  • Confluent, “Patterns for streaming microservices” (retrieved 2026-05-01). Covers dual-write, outbox, and CDC in the streaming-microservices context.
  • AWS Database Migration Service documentation, “Working with change data capture (CDC)”, https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html (retrieved 2026-05-01).
  • Chris Richardson, “Microservices Patterns” (Manning, 2018). Chapter on transactional messaging and the outbox pattern. Retrieval 2026-05-01.
Search