Data & System Architecture, from the ground up Lesson 25 / 80

Replication patterns: leader/follower, multi-leader, leaderless

The three families of database replication, the trade-offs each makes for consistency and availability, and where each fits in real systems.

Module 4 opens here. Module 3 was about which database to pick. Module 4 is about what happens once a single instance of that database is no longer enough: when one server cannot hold the data, when one server cannot serve the load, when one server failing means the whole application falls over. The patterns are the same regardless of whether your store is Postgres, MongoDB, Cassandra, or DynamoDB. They are also old. The distributed-systems literature has been working on them since the 1980s, and the trade-offs they make show up in every modern data store, sometimes by configuration and sometimes baked into the design.

This first lesson is about replication: keeping more than one copy of your data on more than one machine. The next lesson is about the consequences. The lesson after that is about partitioning, which is the orthogonal axis. Together they cover most of what you need to know about scaling a database horizontally.

Why replicate at all

Four reasons, in roughly the order teams discover them.

Durability. A single server can fail in many ways: disk dies, kernel panics, datacentre loses power, someone trips on a cable. If your only copy of the data lives on that machine, the failure is a data-loss event. If the data is replicated to a second machine in a separate failure domain, you survive the first failure with the data intact. This is the original reason replication exists.

Latency. A user in Singapore reading from a database in London pays around a hundred and seventy milliseconds of round-trip time per query. If the data is replicated into a Singapore region, the same read takes two milliseconds. Read replicas in geographically diverse regions trade some consistency for very large latency wins for read-heavy workloads.

Read throughput. A single primary can only serve so many reads per second before its CPU, IO, or network saturates. Adding read replicas spreads the read load across more machines. Writes still have to go through the primary, but in most workloads reads outnumber writes ten to a hundred to one, and read replicas absorb most of the pressure.

Upgrade and maintenance safety. When the primary needs a major-version upgrade, an OS patch, or new hardware, you do not want to take a multi-hour outage. With replication you can promote a follower to primary, do the work on the old primary, and switch back. The application sees a brief failover window instead of a long outage.

Three families of replication strategy cover essentially every design in production. They differ on one axis: where writes are accepted.

Leader/follower

One node is the leader. It accepts every write. It applies the write to its own storage, then ships the change to every follower. Followers apply the changes in the same order. Reads can go to any replica that is current enough for the application’s needs.

This is the standard pattern in Postgres, MySQL, MongoDB replica sets, Redis, SQL Server, Oracle, and most relational databases. The names vary: leader and follower, primary and secondary, master and slave (the older terminology, now usually replaced). The shape is the same.

Two important configuration knobs.

Synchronous versus asynchronous. Synchronous replication means the leader waits for at least N followers to confirm they have applied the write before the write is acknowledged to the client. Asynchronous means the leader acknowledges as soon as its own storage has the write, and ships to followers in the background. Synchronous gives stronger durability guarantees: if the leader dies the moment after acknowledging, the write is already on a follower. Asynchronous gives lower write latency but accepts a small window where an acknowledged write can be lost on leader failure. Postgres calls this synchronous_commit and lets you configure it per transaction. Most production deployments run a mix: synchronous to one nearby standby for durability, asynchronous to the further-away replicas for latency.

Failover. When the leader dies, somebody has to promote a follower to be the new leader. This is harder than it looks. The new leader has to be the most up-to-date follower, otherwise writes that were acknowledged are lost. Clients have to be told to send writes to the new leader. The old leader, when it comes back, has to be reconciled with the new state. If two nodes both believe they are the leader at the same time (a split-brain), you can get conflicting writes that the database has no way to merge. Production systems use coordination tools like Patroni, Orchestrator, or built-in mechanisms (MongoDB elections, Redis Sentinel) to make failover automatic and safe. Doing this by hand at 3am is the kind of operational story that ends up in postmortems.

The leader/follower pattern fits the relational mental model well: writes are linearised through one node, transactions have a clear order, the consistency story is simple. The cost is that writes do not scale beyond what the single leader can handle, and that the failover problem is ever-present.

Multi-leader

Two or more nodes accept writes. Each leader replicates its writes to the other leaders. Reads can go to any leader.

The motivating use cases are the ones that leader/follower handles awkwardly.

Multi-region active-active. You have users in three regions. With leader/follower, only one region has a writable database; users in the other two regions pay cross-region latency on every write. With multi-leader, each region has a leader, users write to their local one, and the leaders sync among themselves in the background. Write latency stays low everywhere.

Multi-datacentre on-prem. Same argument, applied to enterprise deployments with two or more datacentres that need to keep working independently if the link between them goes down.

Offline-first applications. Mobile apps, certain collaborative editors, the kind of system where each device is effectively a leader for its own writes and syncs with a central system or with peers when it next has connectivity. CouchDB and similar systems were designed around this pattern.

The hard problem with multi-leader is conflict resolution. Two leaders can accept conflicting writes during the window before they have synced. User A in London updates the order quantity to 5; user B in Frankfurt updates the same order’s quantity to 7; both writes are accepted; when the leaders sync, there are now two valid versions of the same record and the system has to decide which is correct. Three families of resolution.

Last-write-wins. Each write has a timestamp; the later one wins. Simple, but it depends on synchronised clocks (lesson 13’s problem in disguise) and silently discards data: one user’s update is lost without warning. Acceptable for some workloads (a user’s last-known location, a “last seen at” field), unacceptable for most.

CRDTs (Conflict-free Replicated Data Types). Mathematical structures designed so that any sequence of operations from any pair of replicas merges to the same final state, regardless of order. Counters, sets, ordered lists, JSON-like documents all have CRDT formulations. Used in Riak, Redis CRDB, Automerge, Yjs. Powerful but constrains your data model: you have to express your state as CRDT operations, not as arbitrary updates.

Manual or application-level resolution. The system surfaces conflicts to the application or to the user, and they choose. Git’s merge conflicts are the canonical example. Useful for cases where the application has the context to make a sensible choice, but expensive in operational complexity.

Tools in production: Postgres BDR (bidirectional replication), MySQL Group Replication, Couchbase, Riak, certain CockroachDB topologies. Multi-leader is real, it works, but the conflict story is what you are signing up for, and the operational maturity required is significantly higher than leader/follower.

Leaderless

There is no leader. Any node accepts any write. The client (or a coordinator on the client’s behalf) sends the write to N replicas; reads query N replicas; a quorum decides what the value is. This is the Dynamo-style design from Amazon’s 2007 paper, and it is the basis for Cassandra, ScyllaDB, DynamoDB, Riak, and most distributed key-value stores at large scale.

The mechanics rest on three numbers: N (the number of replicas a write goes to), W (the number that must acknowledge for a write to be considered successful), and R (the number that must respond for a read to be considered successful). The application chooses W and R per request, balancing latency against consistency.

The famous rule is W + R > N. If the number of replicas you wrote to plus the number of replicas you read from is greater than the total replica count, then any read is guaranteed to overlap with the most recent write, and you can read your own writes consistently. With N=3, W=2 and R=2 satisfies this and is the common default. W=1 and R=1 (eventual consistency) is faster but allows stale reads. W=3 and R=1 makes writes slower but reads cheaper.

When replicas disagree, the system needs a way to converge. Two mechanisms.

Read repair. When a read returns inconsistent values from different replicas, the coordinator picks the latest (by version vector or timestamp), returns it to the client, and pushes the corrected value back to the replicas that were stale. Repair happens on the read path, opportunistically.

Anti-entropy. A background process compares replicas (typically using Merkle trees so the comparison is cheap) and synchronises any differences it finds. This catches values that no read happened to touch.

The trade-off table for leaderless: writes succeed during many failure modes that would block a leader-based system, because as long as any W replicas are reachable, the write goes through. There is no failover, because there is no leader to fail over from. The cost is that the consistency model is eventual by default, and the application has to think carefully about W, R, and conflict resolution.

The three patterns side by side

flowchart LR
    subgraph LF[Leader/follower]
        L1[Leader] --> F1[Follower]
        L1 --> F2[Follower]
        L1 --> F3[Follower]
    end
    subgraph ML[Multi-leader]
        M1[Leader A] <--> M2[Leader B]
        M2 <--> M3[Leader C]
        M1 <--> M3
    end
    subgraph LL[Leaderless]
        N1[Node] <--> N2[Node]
        N2 <--> N3[Node]
        N1 <--> N3
        N3 <--> N4[Node]
        N1 <--> N4
        N2 <--> N4
    end

The three families make different choices on four axes worth holding in your head.

AxisLeader/followerMulti-leaderLeaderless
Writes during leader failureBlocked until failoverContinue on other leadersContinue, no leader to fail
Default consistencyStrong on leader, lagged on followersEventual (conflicts possible)Eventual, tunable via W and R
Geographic spreadOne writable regionAll regions writableAll nodes writable
Conflict-resolution complexityLow (no concurrent writes)High (CRDTs or app logic)Medium (versioning and quorum)

A pragmatic guide: leader/follower is the default for most applications, and most teams should stop reading there. Multi-leader earns its complexity when geographic write latency or true active-active is a hard requirement. Leaderless makes sense for very large scale, very high availability requirements, and key-value workloads where the consistency story can be designed around quorums.

What the next lessons unpack

Lesson 26 takes the most common production pattern, asynchronous leader/follower, and looks at what its trade-offs actually feel like to a user. The phrase is “replication lag”, and the consequence is the user-saw-stale-data bug that every team encounters and every team handles slightly differently.

Lesson 27 turns to the orthogonal axis: partitioning. Where replication keeps more than one copy of the same data, partitioning splits the data across nodes so that each node only holds a piece. Most real production deployments do both at once, and the choices compose.

Citations and further reading

  • Martin Kleppmann, Designing Data-Intensive Applications (O’Reilly, 2017), Chapter 5. The reference treatment of replication patterns; everything in this lesson is a compression of that chapter.
  • Giuseppe DeCandia et al., “Dynamo: Amazon’s Highly Available Key-value Store”, SOSP 2007, https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf (retrieved 2026-05-01). The paper that defined the leaderless quorum design.
  • Postgres documentation, “High Availability, Load Balancing, and Replication”, https://www.postgresql.org/docs/current/high-availability.html (retrieved 2026-05-01). The reference for streaming replication, synchronous commit, and failover patterns.
  • AWS DynamoDB Developer Guide, “Best Practices for Designing and Architecting with DynamoDB”, https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/best-practices.html (retrieved 2026-05-01). The leaderless design in production.
  • MongoDB documentation, “Replication”, https://www.mongodb.com/docs/manual/replication/ (retrieved 2026-05-01). Replica sets as a leader/follower implementation with automatic election.
Search