Time in distributed systems: clocks, ordering, vector clocks

If consistency is about what reads can see, time is about what order things happened in. The two are tangled, because most consistency definitions are written in terms of order, and order is, on a single machine, easy. The clock ticks. Things happen. The first thing happens before the second.

On a distributed system, the clock does not tick. There are many clocks, on many machines, and they disagree. Two events at different machines have no inherent order, and “when did this happen” turns into one of the hardest questions in computing. The previous lesson assumed we had a way to order operations. This lesson explains why we usually do not, and what tricks the field has invented to recover ordering when it matters.

We will work through three layers: why physical clocks are unreliable, what logical clocks give us instead, and how Google paid serious money to make physical clocks reliable enough to power Spanner.

Why physical clocks are unreliable

The naive plan: every machine has a clock, the clock reports a number, we use the number to order events. The number is the time. Done.

The plan does not survive contact with a real datacenter. Five problems compound.

Clock skew. No two clocks agree exactly. Two machines from the same vendor, racked next to each other, fed the same NTP server, will report times that disagree by a few milliseconds at any given instant. The disagreement varies as the crystals warm up, as the load changes, as the machine ages. Skew of 1 to 10 milliseconds is normal. Skew of seconds is common in misconfigured environments. Skew of minutes happens, and is one of the most fun bugs to debug at 3 a.m.

NTP corrections. Network Time Protocol nudges the clock toward the truth, but the nudges are not always forward. If your clock has run too fast, NTP will slow it down or step it backward. A monotonically increasing clock is not what NTP gives you. Code that assumes “the next reading will be at least as large as the last” can break when NTP corrects.

Leap seconds. Roughly once every few years, the international timekeepers add a second to UTC because the Earth’s rotation does not exactly match atomic time. Some systems repeat 23:59:59. Some systems freeze. Some systems decline to acknowledge leap seconds and slowly diverge from UTC. The 2012 leap second took down Reddit, LinkedIn, Yelp, and a fair amount of the internet, because the Linux kernel of the day did not handle it gracefully. Leap seconds are a famous source of “we did not realise our timestamps were a lie” bugs.

Virtual machine drift. A VM’s clock is a guest of the host. When the host suspends the VM (live migration, oversubscription, anything that takes the CPU away from the guest for a while), the guest’s clock falls behind real time. When the VM resumes, the clock either jumps forward to catch up (breaking monotonicity) or slowly catches up while reading wrong values. Cloud workloads are full of this.

Cross-region uncertainty. Even if every clock in your datacenter is within a few milliseconds of truth, a clock in Singapore and a clock in Virginia, both perfectly synced to their respective NTP sources, can still disagree. The error budget grows with distance, with network conditions, with how often the sync runs.

The pragmatic conclusion: do not trust wall-clock time for anything that requires correctness. It is fine for human-readable log timestamps. It is not fine for deciding which of two writes happened first.

The “happened before” relation

Leslie Lamport, in 1978, wrote a paper called “Time, Clocks, and the Ordering of Events in a Distributed System” that solved this problem so cleanly that the next forty years of distributed systems theory built on it. The core idea is to abandon physical time entirely and define an ordering based on causality.

Two events are ordered, in Lamport’s sense, if one could have caused the other. Specifically, A “happened before” B (written A then B) when one of three things is true:

A and B happened on the same node, and A came first in that node’s local order.
A is the sending of a message and B is the receipt of that same message.
There is some chain of “happened before” relationships connecting A to B (it is transitive).

Anything else is concurrent. Concurrent events have no causal relationship: neither could have caused the other. The “happened before” relation gives a partial order, not a total order, and that is the honest answer. In a distributed system, you cannot in general say which of two events came first. You can sometimes say one caused the other.

The rest of the lesson is the engineering of “sometimes.”

Lamport timestamps

The simplest scheme that respects “happened before.”

Every node holds a single integer counter, starting at zero. Every event on the node increments the counter. When a node sends a message, it includes its current counter. When a node receives a message, it sets its counter to max(local, received) + 1.

The result: if A happened before B, then A’s timestamp is less than B’s. The reverse is not true: a smaller timestamp does not mean an earlier event, because two unrelated nodes can independently reach the same counter value or one can race ahead.

Lamport timestamps give a total order over events (break ties by node id) that is consistent with causality. Two events with timestamps 5 and 7 either have a real causal relationship or are concurrent. You cannot tell which from the timestamps alone, which is the catch: Lamport timestamps detect ordering, not concurrency.

sequenceDiagram
    participant A as Node A
    participant B as Node B
    Note over A: counter = 0
    Note over B: counter = 0
    A->>A: local event (counter = 1)
    A->>B: send msg, ts = 1
    Note over B: receive, counter = max(0,1)+1 = 2
    B->>B: local event (counter = 3)
    B->>A: send msg, ts = 3
    Note over A: receive, counter = max(1,3)+1 = 4
    A->>A: local event (counter = 5)

The diagram shows the counter advancing on each event and on each message receipt. Note that B’s timestamp 2 is greater than A’s timestamp 1, correctly reflecting that A’s send happened before B’s receive. But if some unrelated node C had also stamped an event as 2, we could not tell whether C’s event came before, after, or concurrent to B’s.

Vector clocks

To distinguish ordered from concurrent events, we need more information than a single counter. Vector clocks give us exactly that.

Every node holds a vector of counters, one entry per node in the system. Every local event on node i increments entry i. When node i sends a message, it includes the whole vector. When node j receives a message, it sets its own vector to the element-wise maximum of its current vector and the received one, then increments entry j.

Comparing two vector timestamps:

A is before B if every entry in A is less than or equal to the corresponding entry in B, and at least one is strictly less.
A is after B if the reverse holds.
A and B are concurrent if neither is before the other (some entries say A is older, others say B is older).

Vector clocks give us the full causality information. They tell you exactly which events are ordered and which are concurrent. The price is storage: every event carries a vector with one entry per node, and every message ships that vector. In a system with thousands of nodes, the vectors get expensive. Tricks like pruning entries for nodes that have not been heard from in a while help, but introduce their own subtle bugs.

Real systems that use vector clocks: Riak, the early versions of Dynamo (the Amazon paper, 2007). DynamoDB the product later moved to a slightly different scheme.

Hybrid logical clocks

A compromise between physical and logical time. Hybrid Logical Clocks (HLC) combine wall-clock time with a logical counter. Each timestamp is a pair: the wall clock at the time of the event, plus a small integer that breaks ties when the wall clock has not advanced.

The invariant: if A happened before B, then A’s HLC is less than B’s. Within a single node, the wall-clock part advances normally; the logical counter resets when the wall clock moves forward. Across nodes, the receive-message rule mirrors Lamport’s: take the max, then bump the counter.

HLCs are the modern compromise because they give you something that looks like a real timestamp (you can read it as a date for debugging) while providing the partial-ordering guarantees that matter. They are used in CockroachDB, MongoDB’s causal sessions, YugabyteDB, and elsewhere.

The catch: HLCs assume that wall-clock skew is bounded. If two nodes’ clocks drift by more than the protocol expects, the ordering can break. Most implementations cap how far ahead an HLC can run from the local wall clock and refuse to accept messages from nodes that are too far in the future.

Google TrueTime

The Spanner database, internally at Google, takes a different approach. Instead of working around unreliable clocks, Google built reliable clocks. Every datacenter that runs Spanner has GPS receivers and atomic clocks, redundantly, in multiple racks. The TrueTime API does not return a timestamp. It returns an interval: “the current time is somewhere between now - epsilon and now + epsilon,” where epsilon is the bound on uncertainty.

In a healthy Google datacenter, epsilon is around 1 to 7 milliseconds. The Spanner protocol uses these intervals to provide external consistency, which is a fancy name for linearizability with a real-time ordering guarantee. The trick: when committing a transaction, Spanner waits out the uncertainty. It picks a commit timestamp at the upper end of the current TrueTime interval, then sleeps until the lower bound of the new interval has passed that timestamp. By the time the wait is over, every TrueTime call anywhere in the world will return an interval whose lower bound is greater than the commit timestamp. The transaction is, by definition, in the past.

The cost: every Spanner commit pays a few milliseconds of “commit wait.” The benefit: linearizability across continents, without a global lock, without a global clock, without a single point of coordination. It is the only production system that pulls this off, and it does so by being willing to spend serious money on hardware. TrueTime is not an algorithm; it is an algorithm plus a fleet of GPS antennas plus a building full of atomic clocks. If you do not have Google’s budget, you do not have TrueTime.

When time matters

Three classes of problem where the clock question becomes urgent.

Audit logs. A regulator asks: in what order did these transactions happen? If your system stores wall-clock timestamps from each node, you cannot answer honestly: the timestamps are off by milliseconds at best. If your system stores Lamport or HLC timestamps, you can answer for events that have a causal relationship, and admit “concurrent” for the rest. The honest answer is more useful than the wrong one.

Conflict resolution. Two writes hit different replicas at almost the same instant. Which one wins? “Last writer wins” sounds simple but requires a defined notion of “last,” and physical clocks do not provide one. Better systems use vector clocks to detect the conflict, then resolve it with application logic (CRDTs, for example, are explicitly designed to merge concurrent writes safely).

Distributed transactions. A transaction spans multiple shards and must commit atomically. The commit protocol needs to assign the transaction a timestamp that is greater than every timestamp it depends on, but less than every timestamp it conflicts with. Without TrueTime or its equivalent, this is the hard part of two-phase commit.

Scheduling. “Run this job no earlier than 9:00 AM” needs a notion of 9:00 AM that every node agrees on. Wall-clock timestamps work here, because the consequences of running a job a few seconds early or late are usually bounded. But for tighter scheduling (say, “rotate the encryption key at exactly midnight, atomically”), you need consensus, not just a clock.

Pragmatic advice

Three rules of thumb.

First, do not trust the wall clock for anything that requires correctness. Read it for log timestamps, for human displays, for cron jobs. Do not use it to decide which write wins.

Second, use the right clock for the job. Monotonic clocks (the kind that only ever go forward, like CLOCK_MONOTONIC on Linux) are the right tool for measuring durations: how long did this request take, has the lease expired. Logical clocks (Lamport, vector, HLC) are the right tool for ordering events across nodes. Wall clocks are the right tool for “what time is it” and very little else.

Third, when the requirement is “linearizable across continents,” accept the cost. The cost is either coordination latency (one round-trip per write) or a TrueTime-style hardware investment. There is no third option. Anyone selling you “linearizable, low-latency, multi-region, no coordination” is selling you a bug.

The next lesson goes one layer deeper, into the consensus protocols that turn a group of nodes with disagreeing clocks and unreliable network into a system that can agree on a single value. Paxos, Raft, and the systems that run on top of them.