Consensus: Paxos and Raft, in plain English

The previous two lessons covered consistency models (what reads can see) and time (how to order events). Both assumed something we have not yet justified: that the system can make decisions. That a group of replicas can agree on a value, on an order, on a leader, on a configuration. Without that ability, the rest of the spectrum collapses: you cannot offer linearizability without a way to agree on which write came first, you cannot offer leader-based replication without a way to agree on the leader, you cannot evolve the cluster’s membership without a way to agree on the new membership.

The technical name for this ability is consensus. Getting a group of nodes to agree on a single value, even when some are slow, missing, or telling lies, is the canonical hard problem of distributed systems. There are well-known impossibility results (FLP, in 1985, proved that no asynchronous protocol can guarantee both safety and liveness with even one faulty node) and well-known solutions that work in practice anyway (Paxos, Raft, and a small family of cousins). This lesson covers the two protocols you will actually encounter in production systems and the trade-offs that made one largely replace the other.

What consensus is, and what we use it for

The consensus problem, formally: a group of N nodes each propose a value. The protocol must end with every non-faulty node deciding on the same value, and that value must be one that someone proposed. The protocol must satisfy two properties:

Safety. All non-faulty nodes that decide, decide on the same value. The system never disagrees with itself, never commits two different values for the same slot.
Liveness. The protocol eventually terminates, and a value is decided. The system makes progress.

These two are in tension. A protocol that always halts at the first sign of trouble is safe but not live. A protocol that always picks something, even when it is unsure, is live but not safe. Real-world consensus protocols guarantee safety unconditionally and provide liveness “when conditions are good enough” (the network is mostly working, a majority of nodes are up, leaders are not being elected and re-elected in a loop). This is the practical compromise around the FLP impossibility result.

Why we want consensus, in concrete terms:

Leader election. Many systems are easier to reason about when one node is the leader for a given role. Choosing that leader, and re-choosing when the current leader fails, requires consensus.
Replicated logs. A consensus-backed log is the foundation of replicated state machines: every replica applies the same sequence of operations in the same order, and ends up in the same state. This is the standard way to build a fault-tolerant database.
Distributed locks and leases. When two clients race for the same lock, somebody has to decide who got it. That decision is consensus.
Configuration storage. The cluster’s membership, the topology, the feature flags. Anything where “every node must agree on the answer” lives in a consensus-backed store.

Almost every reliable distributed system you can name has a consensus protocol at its core, even if the marketing material does not mention it. Kubernetes runs on etcd, which runs Raft. Cloud Spanner runs Multi-Paxos under each tablet. Kafka, since the removal of ZooKeeper, runs its own Raft variant called KRaft. The protocol is the load-bearing wall.

Paxos

The first algorithm to solve consensus correctly in an asynchronous network with crash-stop failures. Leslie Lamport described it in a 1989 technical report under an elaborate metaphor about Greek parliaments, then formally published it in 1998 as “The Part-Time Parliament.” The metaphor was meant to be charming. It mostly succeeded in making the paper hard to follow. Lamport himself wrote a simpler version in 2001 called “Paxos Made Simple,” which is itself not simple.

Stripped of metaphor, basic Paxos works as follows. Three roles: proposers (who want to get a value chosen), acceptors (who form a quorum and vote on values), and learners (who find out what was chosen). One node usually plays multiple roles. A round of Paxos has two phases.

Phase 1 (Prepare). A proposer picks a unique increasing proposal number n and sends a “prepare” message to a majority of acceptors. Each acceptor that has not promised something with a higher number replies with a promise: “I will not accept any proposal with a number lower than n.” If the acceptor has already accepted a value in a previous round, it includes that value and its number in the reply.

Phase 2 (Accept). If the proposer receives promises from a majority, it picks a value to propose. If any acceptor reported a previously accepted value, the proposer must propose that value (specifically, the one with the highest accepted number). Otherwise, the proposer is free to propose its own value. It then sends an “accept” request with the chosen value. If a majority of acceptors accept, the value is chosen.

The protocol guarantees that once a value is chosen, no other value can be chosen for the same slot. The proof relies on the majority overlap: any two majorities of N acceptors must share at least one acceptor, and that overlapping acceptor’s promises and acceptances enforce the safety property.

For a single-decision system this is enough. Real systems need to make a sequence of decisions: every entry in the replicated log is its own consensus instance. Running basic Paxos for every entry is correct but expensive (two round-trips per decision). The optimization is Multi-Paxos: elect a stable leader, let the leader skip Phase 1 for subsequent rounds, and pay one round-trip per decision instead of two. Multi-Paxos is what production systems actually run.

Multi-Paxos has a reputation. It is correct, well-studied, and proven at scale. It is also widely considered hard to implement, hard to debug, and hard to teach. Several teams have published war stories about getting Paxos wrong in subtle ways: leader election races, off-by-one bugs in the proposal-number scheme, configuration changes that violated quorum overlap.

Raft

Diego Ongaro and John Ousterhout, working at Stanford, wrote a paper in 2014 titled “In Search of an Understandable Consensus Algorithm.” Their thesis was simple: Paxos is correct, but the field’s inability to teach it consistently is itself a bug. They designed Raft from scratch with understandability as the primary goal. The paper offers the same safety and liveness guarantees as Paxos but presents them in a way that an engineer can read once and implement.

Raft makes a few opinionated choices that simplify the protocol.

Strong leadership. Raft has a leader at all times (unless an election is in progress). All client requests go through the leader. The leader is responsible for appending to the log and replicating it to followers. There is no concept of “any node can propose”; the leader proposes, the followers acknowledge.

Three roles. Every node is in one of three states at a time: leader, follower, or candidate. A follower receives heartbeats from the leader and accepts log entries. A candidate is a node trying to become leader. A leader is the node currently in charge. State transitions are clearly defined: a follower becomes a candidate when it stops hearing from the leader, a candidate becomes a leader when it wins an election, a leader steps down when it sees evidence of a more recent term.

Term numbers. Raft divides time into terms, each beginning with an election. A term is identified by a monotonically increasing integer. Every message carries a term number, and any node that sees a higher term immediately steps down to follower and updates its term. This single mechanism replaces the proposal-number scheme of Paxos and is much easier to reason about.

Log matching. Raft’s log replication has a strong invariant: if two logs agree on the entry at index i, they agree on all entries up to i. The leader enforces this by sending entries with both the index and the previous index’s term, and followers reject any entry that does not match their own previous entry. This makes log inconsistencies easy to detect and repair.

The cost: Raft is slightly less flexible than Multi-Paxos in a few advanced scenarios. Multi-Paxos can in principle commit log entries out of order, batch decisions in unusual ways, and tolerate some configurations that Raft refuses. In practice, almost no production system needs that flexibility, and the simplicity of Raft has dominated.

A Raft leader election is the canonical example of how clean the protocol is. Followers run an election timeout (randomized between 150 and 300 milliseconds, typically). If a follower times out without hearing from the leader, it increments its term, votes for itself, and asks every other node for their vote. A node votes for at most one candidate per term and only if the candidate’s log is at least as up-to-date as its own. If a candidate receives a majority of votes, it becomes leader and starts sending heartbeats. If two candidates split the vote, both time out and try again with a new term.

sequenceDiagram
    participant F1 as Follower 1
    participant F2 as Follower 2
    participant C as Candidate
    Note over C: election timeout
    Note over C: term = term + 1, vote for self
    C->>F1: RequestVote(term, log info)
    C->>F2: RequestVote(term, log info)
    F1-->>C: VoteGranted
    F2-->>C: VoteGranted
    Note over C: majority received
    Note over C: become Leader
    C->>F1: AppendEntries (heartbeat)
    C->>F2: AppendEntries (heartbeat)

The diagram captures the entire election in five message exchanges. The Paxos equivalent, drawn honestly, requires a footnote about leader leases, ballot numbers, and what happens when two proposers race in Phase 1.

Why Raft mostly replaced Paxos

The shift in the industry between roughly 2014 and 2020 is almost entirely about onboarding cost. New distributed systems pick Raft because:

An engineer can read the Raft paper in an afternoon and implement a working version in a week. Paxos takes much longer, and the first implementation is usually subtly wrong.
Raft has reference implementations in every major language (etcd’s Go implementation, hashicorp/raft, tikv’s Rust raft-rs). Picking one off the shelf is a real option. Multi-Paxos has fewer such libraries, and the ones that exist are tightly coupled to their parent systems.
Debugging Raft in production is easier because the state machine is small and the invariants are explicit. Paxos debugging often comes down to “what proposal numbers were in flight when the leader failed.”

The trade-off is small enough in practice that the industry has voted with its feet. Multi-Paxos remains in older systems because rewriting a working consensus implementation is risky and rarely worth it. New systems pick Raft.

Real systems

A short tour.

Built on Raft. etcd (the Kubernetes brain), Consul (HashiCorp’s service catalog and config store), CockroachDB (each range is its own Raft group), TiDB (same model), RethinkDB, Vault, Nomad, MongoDB’s replica set election protocol since version 3.2 (a Raft-inspired variant). Apache Kafka since the KRaft transition.

Built on Paxos or a Paxos variant. Google Spanner (Multi-Paxos per tablet). Google Chubby (the original lock service that inspired ZooKeeper). Google Megastore. Apache ZooKeeper (which uses Zab, a Paxos variant tuned for ordered broadcasts). Microsoft’s Azure Cosmos DB (Multi-Paxos under several of its consistency levels). Several internal Amazon systems.

The list tells a generational story. Google built on Paxos in the 2000s because that was the best-understood algorithm at the time. The next generation of open-source systems, mostly written after 2014, picked Raft for the reasons above.

The performance cost

Consensus is not free. Every committed operation requires a round-trip to a majority of nodes. In a five-node cluster, that means waiting for at least three acknowledgments. The latency floor is set by the slowest node in the quorum, plus the network round-trip time.

In a single-datacenter cluster, this is around 1 to 2 milliseconds per commit. Across regions, it can be 50 to 200 milliseconds. This is why high-throughput consensus systems batch aggressively: a leader that ships 1000 log entries in a single AppendEntries message pays one round-trip cost for 1000 decisions, not 1000 round-trips. Batching is what makes Raft viable at scale, and it is the first place to look when consensus performance is bad.

The other lever is geography. A Raft group with members on three continents will pay continent-scale latency on every write. Architects who care about both availability and latency usually run multiple Raft groups, each pinned to a single region for its writes, with cross-region replication handled outside the consensus protocol. CockroachDB’s “geo-partitioning” and Spanner’s directory-level placement are both versions of this idea.

Closing the loop

Consensus is the foundation of every reliable distributed system. Without it, you cannot offer linearizable consistency, cannot reliably elect leaders, cannot evolve cluster configuration safely. With it, you can build systems that survive node failures, partitions, and operator mistakes, at the cost of a round-trip per decision and the operational complexity of running a quorum.

The next two lessons cover what happens when teams try to skip consensus and build distributed systems on weaker foundations: the failure modes that result, and the tools (gossip protocols, CRDTs, leaderless replication) that exist for the cases where consensus would be too expensive. Knowing when not to use consensus is as important as knowing how to use it.