Data & System Architecture, from the ground up Lesson 28 / 80

Hot keys and the rebalancing problem

The celebrity user with one million followers. How to detect a hot key, three strategies to handle it, and why rebalancing a live cluster is harder than it sounds.

The previous lesson set out the three ways to spread data across machines: hash partitioning, range partitioning, and the directory-based variant. All three assume something that is rarely quite true in production: that the work, once split, is split evenly. In real systems the load distribution is lumpier than the data distribution. One user has a million followers. One product is on the front page of every news site for a day. One hashtag is trending. The partition that holds that key gets disproportionate traffic, and the careful balance you arranged on day one quietly stops working.

This lesson is about that lump. What it looks like in monitoring, why it happens, and the three honest answers to it. Then we turn to the related problem of rebalancing: when a node joins or leaves the cluster, data has to move, and moving data on a live system is a lot harder than the diagrams suggest.

What a hot key looks like

The textbook symptom is one node working and the others coasting. You open the per-node CPU graph and see one line at 90 percent and the rest at 10. You open the per-node request rate and one node is taking five times the queries of its peers. You open the per-task latency in monitoring and the histogram is bimodal: most operations finish quickly, a stubborn tail finishes slowly, and the slow tail comes from one shard.

If your monitoring goes deeper, you can usually pin the lump on a specific key. Most modern data stores expose per-key or per-partition request counters, sampled or aggregated. The shape you are looking for is a very long tail: a top-K listing where the first key is taking ten or a hundred times the traffic of the median key. That is the hot key.

Three things make this worth detecting fast. First, the hot node is a latency outlier for every other key on that shard, because they are all queueing behind the celebrity. Second, the hot node is the one that will run out of CPU, memory, or disk first, and a single-node failure under that load is the start of an incident. Third, your nominal capacity calculation is wrong: if one node is at 90 percent and the cluster average is 30, you do not have 70 percent headroom, you have 10 percent.

Why hot keys happen

The common causes are worth naming, because the right fix depends on which one you have.

Skewed access to a popular entity. A celebrity user, a viral product, a trending topic. The data distribution is fine; the access pattern is not. Most users have a thousand followers, a few have a million. Most products sell ten copies a day, one sells a hundred thousand on launch day. The partition that holds that entity is doing real work that the others are not.

Bad partition key choice. A key that looked balanced in the abstract turns out to concentrate traffic in practice. Partitioning by country_code and discovering that 60 percent of users live in one country. Partitioning a multi-tenant database by tenant_id and discovering that one tenant is bigger than the other thousand combined. The partition function did exactly what you asked; the data is the problem.

Time-based hot spots. Range-partitioned by date: today’s partition gets all the writes and most of the reads, last week’s partition gets occasional reads, last year’s partition is cold. The cluster has the data evenly distributed by volume but the load lives entirely on the most recent shard.

Thundering herd. A cache miss on a popular key causes every web server to query the same backend partition simultaneously. The data store sees a sudden synchronised spike on one key from a hundred clients. The partition that holds that key is fine on average and on fire for five seconds at a time.

Each pattern has the same surface symptom (one hot partition) and a different root cause. Diagnosing which one you have is the work that justifies the fix.

Three honest fixes

There is no clever way to make a hot key not be a hot key. The honest fixes are: spread the key across multiple partitions, front it with something faster, or give it its own infrastructure. Each has costs.

Salting. Append a small random suffix to the key before hashing. Instead of writing follower counts under user:42, you write under user:42:0 through user:42:15, picking a suffix at write time. Reads have to fan out across all sixteen sub-keys and aggregate. The work is now spread across sixteen partitions instead of one, the hot node is no longer hot, and the cost is the read-side fan-out and the application complexity of treating the logical key as a set of physical keys. This pattern recurs in every distributed processing course, including the PySpark course on this site, where it is the standard cure for skewed joins.

Caching the hot key. Front the partitioned store with a separate cache (Redis, in most stacks) that absorbs the read traffic for the popular keys. The partitioned store still owns writes and the long tail of cold keys; the cache owns the small set of keys that are hot. This shifts the hot-key problem from the database to the cache, and Redis is built for exactly that load. The cost is the operational surface of a second store and the cache-invalidation question (lesson 24 covered the polyglot variant of the same conversation).

Dedicated shards for the giants. If one user is enormous, give them their own infrastructure. Twitter is the canonical example: celebrity accounts are not stored or fanned-out the same way as ordinary accounts, because doing so would melt the cluster every time one of them tweets. The cost is operational: a separate code path, a separate set of shards, a separate failure mode. The benefit is that the giant stops contaminating the load profile of everyone else.

The decision tree is roughly: salting if the access pattern is read-heavy and the key set is small enough to fan-out; caching if the bulk of the traffic is reads of a small hot set; dedicated shards if the imbalance is structural and the hot entities are few and named.

Detecting hot keys in monitoring

A few practical notes on detection, because the fix only helps if you spot the problem first.

Most distributed databases expose per-partition metrics: request count, request latency, bytes in, bytes out, sometimes per-key sampling. Cassandra has nodetool tablestats and per-table coordinator latency. MongoDB exposes per-shard operation counters and a shard balancer log. Redis Cluster exposes per-node and per-slot metrics. The pattern to watch is variance across partitions, not the absolute numbers. If the hottest partition is doing five times the work of the median, you have a hot key whether or not you have a problem yet.

Top-K sampling at the application layer is the other half. Log the keys of slow queries, count them, alert when one key crosses a threshold. The crude version is a bounded-size hash map of recent slow keys. The fancy version is a count-min sketch or a heavy-hitters algorithm. Either one tells you which key to investigate when a partition gets hot.

Rebalancing: when nodes join or leave

The other end of the same problem is rebalancing. Hot keys are about uneven load on a static topology. Rebalancing is about moving data when the topology changes. You add a node because traffic grew, or you remove a node because it failed, or the autoscaler decided to grow the cluster overnight. Whichever direction, data has to move, and moving data on a live cluster while it is serving traffic is one of the harder operational problems in distributed systems.

There are two main approaches.

Static rebalancing with many small partitions. Pre-allocate many more partitions than nodes (Cassandra’s vnode model defaults to 256 virtual nodes per physical node) and assign whole partitions to nodes. When a node joins, it claims its share of partitions from the existing nodes; when one leaves, its partitions are redistributed. The number of partitions never changes, only the assignment. This is operationally simple and gives a smooth load profile. The cost is that you have to choose the partition count at creation time and live with the consequences: too few partitions and you cannot scale beyond a small number of nodes, too many and the metadata overhead becomes its own problem.

Dynamic rebalancing with partition splits. Start with a small number of partitions and split them when they grow too large or too hot. HBase splits regions when they cross a size threshold. MongoDB sharded clusters split chunks above a target size and move chunks between shards when the load is uneven. The cluster grows and rebalances itself based on observed conditions. The cost is the complexity of the splitting machinery and the variability in operational behaviour: a partition can split at a moment you would rather it did not, and the rebalancing background traffic competes with user traffic for the same disks and networks.

Range-partitioned systems have an extra wrinkle: a popular range gets hot, a sparse range gets unused, and the rebalancer has to know about both load and data volume. A pure size-based splitter is wrong for range partitioning, because a small but hot range needs to be split too.

Why live rebalancing is harder than it looks

The diagrams of rebalancing are tidy: a partition gets too big, a line splits it in two, the routing layer learns the new layout, traffic continues. The reality is messier.

Clients have to know where data lives. If the topology changes while a client is mid-request, the client has to retry against the new owner, and that retry has to be safe (lesson 16 was about idempotency for exactly this reason). In-flight queries can fail. Cross-partition transactions can be split mid-flight by the rebalancer if you let them, so the rebalancer has to coordinate with the transaction manager. The data being moved still has to serve reads and accept writes, which usually means the source partition stays authoritative until the destination has caught up, then a brief synchronised cutover hands ownership over.

Most real systems delegate the topology question to a small dedicated coordinator: ZooKeeper, etcd, or a built-in consensus group running Raft. Clients ask the coordinator (directly or via cached lookups) where a given key lives, and the coordinator manages the membership and partition-assignment changes through a consensus protocol. This is exactly the role lesson 14 described: the configuration store is the load-bearing wall of the entire cluster, and it is consensus-backed for the same reasons every other shared decision is.

flowchart LR
    Client[Client] --> Router[Routing layer]
    Router --> P1[Partition 1 - node A]
    Router --> P2a[Partition 2a - node B]
    Router --> P2b[Partition 2b - node C]
    Coord[(Coordinator - Raft)] -->|topology| Router
    P2a -.->|was Partition 2| P2b

The diagram shows what changed. Partition 2 grew too hot or too large, the coordinator decided to split it into 2a and 2b, and the routing layer was updated to send traffic for the lower half to node B and the upper half to node C. The coordinator is the only place the truth lives. The router caches it; the partitions implement it; the clients depend on whoever the router currently believes.

What this sets up

Hot keys and rebalancing are the operational tax of partitioning. You can do the partition design correctly on day one and still find, six months in, that one celebrity account is melting one shard, or that the nightly batch job is moving partitions for two hours and competing with morning peak. The fixes exist, but each one has a cost, and the right one for your workload is the one whose cost your team can pay.

The next lesson moves up a level: how do you build a sharded SQL system on top of these primitives? The choices are application-level routing, a Postgres extension like Citus, a MySQL cousin like Vitess, or the built-in sharding of MongoDB and Cassandra. Each makes different trade-offs around control, complexity, and SQL compatibility, and the migration from unsharded to sharded is one of the hardest engineering projects most teams ever face. That is lesson 29.

Citations and further reading

  • Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), Chapter 6 (partitioning) and Chapter 9 (consistency and consensus). The book-length treatment of the hot-key and rebalancing problems.
  • Cassandra documentation, “Adding, replacing, moving and removing nodes”, https://cassandra.apache.org/doc/latest/cassandra/operating/topo_changes.html (retrieved 2026-05-01). The vnode model in practice.
  • MongoDB documentation, “Sharded Cluster Balancer”, https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/ (retrieved 2026-05-01). Chunk splitting and the balancer.
  • HBase Reference Guide, “Region Splitting”, https://hbase.apache.org/book.html#regions.arch (retrieved 2026-05-01). The dynamic-rebalancing model.
  • etcd documentation, https://etcd.io/docs/ (retrieved 2026-05-01). The reference coordinator for cluster topology.
Search