Data & System Architecture, from the ground up Lesson 29 / 80

Sharding strategies and their gotchas

Application-level sharding, database-native sharding, Citus and Vitess. The practical realities of running a sharded SQL database.

The previous lesson left us with the operational problems of partitioning: hot keys, rebalancing, the coordinator that has to track who owns what. All of that was abstract enough to apply to any partitioned data store. This lesson lands in a much more specific question that most teams eventually have to answer: how do you actually shard a SQL database in production, when the single-server Postgres or MySQL you started with is no longer enough?

The vocabulary first. “Sharding” is horizontal partitioning across multiple physical machines. The concept is the same as partitioning, but the connotation is different: when somebody says partitioning they often mean a single distributed system that has multiple internal partitions (Cassandra, Mongo, ScyllaDB), and when they say sharding they usually mean a fleet of independent database instances coordinated above the database layer. The line is fuzzy and the words are used interchangeably in practice. What matters is the concrete architecture, not the label.

This lesson walks through the four real options in 2026. Application-level sharding, where your code routes queries to the right database. Postgres-native sharding via the Citus extension. MySQL sharding via Vitess (and PlanetScale, the popular managed version). And the built-in sharding of database-native systems like MongoDB and Cassandra, which we have already touched on and which we revisit briefly here for completeness.

Application-level sharding

The simplest mental model. You have N independent Postgres or MySQL instances, each with the same schema, each holding a subset of your data, and your application code knows how to route a query to the correct one. The routing key is usually tenant_id or user_id, the routing function is usually a hash modulo the shard count or a range lookup in a configuration table, and every query the application issues includes enough information for the application to pick the right shard.

The pros are real. You have total control: every shard is a vanilla database that you understand, that has the operational tools you already use, that fails in ways you have already debugged. You do not need a fancy database product. Per-shard backup, monitoring, and upgrade are just per-database backup, monitoring, and upgrade, repeated N times. Each shard scales independently. You can run different versions on different shards if you have to. You can stop a shard for maintenance without involving the others.

The cons are also real. Cross-shard queries are your problem. A query like “find me all users created in the last week across the entire fleet” has to be issued to every shard, the application has to assemble the results, and the application has to deal with the case where one shard is slow or down. Schema changes have to be applied to every shard, and the migration tooling has to handle the fleet-wide rollout. Joins across shards are not joins, they are application-level merge code. Distributed transactions across shards are out of scope unless you build them yourself, which you should not.

The honest summary: application-level sharding is the right answer for very large products with dedicated platform teams who are willing to invest in operational tooling and do not need cross-shard transactions in the hot path. Many of the largest SQL deployments in the world (Stripe, Shopify, Notion, Figma) are application-sharded Postgres or MySQL fleets, with thousands of shards and a substantial team running them. The pattern works at extreme scale; it is also expensive.

Citus, Postgres-native sharding

Citus is a Postgres extension that adds distributed tables, distributed transactions, and a distributed query planner on top of standard Postgres. It started as an independent company, was acquired by Microsoft, and is now both an open-source extension and the engine behind Azure’s “Cosmos DB for PostgreSQL”. The architecture is a coordinator node and a fleet of worker nodes; the coordinator parses incoming SQL, plans which workers should run which fragments, and assembles the results.

In Citus, you declare which tables are distributed and on what column. A typical pattern is to distribute a multi-tenant application by tenant_id: every distributed table has tenant_id as part of its primary key, and the planner uses that column to route queries. A query that filters on tenant_id is pushed entirely to the worker that owns that tenant; a query that does not filter on the distribution column fans out to every worker and is reduced at the coordinator. There are also reference tables (small lookup tables replicated to every worker) and local tables (held only at the coordinator) for the data that does not fit the distributed pattern.

The query planner is the cleverest piece. It does cross-shard joins by either co-locating the joined tables on the same distribution key (so the join is local on each worker) or by broadcasting one side to all workers. It supports distributed aggregates by computing per-shard partial results and combining them at the coordinator. It supports inserts, updates, and deletes that touch a single shard with full transactional semantics, and multi-shard writes via two-phase commit (lesson 15) when you opt in.

The case for Citus: you want sharded SQL but you want to keep Postgres semantics. The schema is a Postgres schema, the queries are Postgres queries, the extensions you rely on still work. Your application does not need to know about shards; the coordinator does. The case against: you are now operating a more complex system than vanilla Postgres, the cross-shard query path has its own failure modes, and the coordinator is a hop on every query. For workloads that fit the multi-tenant pattern Citus is excellent. For workloads where most queries naturally cross the distribution key it is less of a fit.

Vitess, MySQL-native sharding

Vitess is the MySQL equivalent. It was built at YouTube to handle their MySQL fleet, was open-sourced, and is now a CNCF graduated project. The architecture is a fleet of MySQL primaries and replicas (the “tablets”), a routing layer (vtgate), and a topology service (etcd or ZooKeeper) holding the metadata about which keys live on which shard.

Vitess presents a single MySQL endpoint to the application: clients connect to vtgate, which parses the SQL and routes it to the appropriate tablets. Like Citus, it does cross-shard joins, distributed aggregates, and per-shard transactions, with the trade-offs you would expect (cross-shard queries are slower and have more failure modes than single-shard queries). It also handles online resharding: split a shard into two, with the rebalancer copying the data and the routing layer cutting over once the destination has caught up.

PlanetScale is the popular managed version. It runs Vitess as a service, exposes a serverless-database experience, and adds a Git-style branching workflow for schema changes. The branching feature is genuinely interesting: schema changes happen on a branch, you merge them into the main branch, and the underlying Vitess machinery handles the safe rollout. For teams that want sharded MySQL without operating Vitess themselves, PlanetScale is the obvious starting point.

The case for Vitess: same as Citus, in MySQL-flavoured form. The case against: same as Citus, in MySQL-flavoured form. The two products solve the same problem in two different SQL ecosystems, and the choice usually comes down to which database your team already runs.

flowchart TB
    App[Application] --> Coord[Coordinator vtgate or Citus]
    Coord --> W1[(Worker 1 - shards A,B)]
    Coord --> W2[(Worker 2 - shards C,D)]
    Coord --> W3[(Worker 3 - shards E,F)]
    Topo[(Topology - etcd)] -->|metadata| Coord

The shape is identical for both products. Application talks to a coordinator, coordinator fans out to workers, topology service holds the truth about which shard lives where. The labels change, the architecture does not.

Database-native sharding

The fourth option is to use a database that was built sharded from the start. MongoDB sharded clusters, Cassandra, ScyllaDB, CockroachDB, YugabyteDB, TiDB. Each one of these systems exposes a single logical database that is internally a partitioned cluster, with the sharding logic built into the engine rather than bolted on as an extension or implemented in the application.

The pros: sharding is transparent to the application; rebalancing is automatic; the operational story is one product instead of two. The cons: you are using a different database than the SQL one your team already knows, and the SQL features you rely on may be partially supported or supported with caveats. CockroachDB and Yugabyte aim for full PostgreSQL wire compatibility and broad SQL feature support; TiDB targets MySQL compatibility; MongoDB and Cassandra do not offer SQL at all in the strict sense.

For teams choosing a database from scratch in 2026, the sharded-from-day-one products are a serious option, especially CockroachDB, Yugabyte, TiDB, and Spanner-style cloud offerings. For teams who already have a Postgres or MySQL fleet and need to grow it, the conversation is usually between application-level sharding and Citus or Vitess.

The migration nightmare

Most teams do not arrive at “should I shard?” until they have a several-year-old unsharded database that they have to convert to a sharded one without taking the application down. This is one of the hardest engineering problems most teams will face, and it deserves explicit treatment.

The dishonest version is “we will pre-shard from day one.” This is overkill for almost every team. You do not know your access patterns yet. You will pick the wrong distribution key and have to redo it. You will pay the operational and complexity cost of a sharded database before your traffic justifies it. Pre-sharding is the right call only for teams that already know, from prior experience, that they will be at the scale that requires it within a year or two. For everyone else it is premature.

The honest version is live migration. The pattern is well-documented and Stripe’s “online migrations” blog post is the canonical reference (https://stripe.com/blog/online-migrations, retrieved 2026-05-01). The shape of the migration is:

  1. Stand up the new sharded system alongside the old monolithic database.
  2. Begin dual-writing: every write that goes to the old database also goes to the new sharded system, idempotently.
  3. Backfill: copy the historical data from the old database to the new one in batches, with checks that the dual-writes have not been overtaken by the backfill.
  4. Validate: run reads against both systems for a period, compare the results, fix the inconsistencies until the systems agree on every key.
  5. Switch reads: route read traffic to the new system. The old system is still receiving writes via the dual-write path.
  6. Stop the dual-write to the old system: writes now go only to the new system.
  7. Decommission: archive and delete the old system.

This takes months, not weeks. There are sub-problems at every step (how do you make every write idempotent if the application was not designed for it; how do you handle in-flight reads during the cutover; how do you back-fill a billion rows without crushing the source database; how do you validate that two systems agree without taking a global lock). Every team that has done this has war stories.

The decision tree for “should I shard, and if so, how?” goes something like this. If your data is comfortably under a terabyte and your write rate is comfortably under a few thousand per second, do not shard. Postgres on a big single instance will handle it for years. If you are pushing those limits, ask whether you can scale vertically first: bigger instance, more memory, faster disks, read replicas for read-heavy workloads. If vertical scaling is not enough, consider whether your workload fits a multi-tenant pattern; if it does, Citus or Vitess is the lowest-friction sharded SQL option. If your workload genuinely needs distributed SQL with strong consistency across shards, look at CockroachDB, Yugabyte, or Spanner. If you need control above all and have the team for it, application-level sharding remains the ceiling.

What you should give up is the idea that sharding is free or that the migration will be quick. Both are unkind myths.

What this sets up

We have walked through the practical realities of partitioning and sharding: how to choose a distribution key, how to handle hot keys, how to rebalance live, how to actually shard a SQL database. The shape of the next question is one that has been hovering throughout: when the system splits, when the network fails, when two halves of a cluster cannot talk to each other, what happens to the leadership and consistency guarantees you spent Module 2 establishing?

The headline failure mode is split brain: a network partition that causes two halves of a cluster to both believe they are the leader, with both halves accepting conflicting writes. Lesson 30 is about that failure mode, why quorum is the only reliable defense, and the deployment recipes that keep it from happening to you.

Citations and further reading

  • Stripe Engineering, “Online migrations at scale”, https://stripe.com/blog/online-migrations (retrieved 2026-05-01). The canonical reference for live database migrations.
  • Citus documentation, https://docs.citusdata.com/ (retrieved 2026-05-01). The Postgres-extension sharding system.
  • Vitess documentation, https://vitess.io/docs/ (retrieved 2026-05-01). The MySQL sharding system.
  • PlanetScale documentation, https://planetscale.com/docs (retrieved 2026-05-01). The managed Vitess service.
  • CockroachDB documentation, https://www.cockroachlabs.com/docs/ (retrieved 2026-05-01). One of the sharded-from-day-one SQL options.
  • Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), Chapter 6. Background on sharding strategies.
Search