Data & System Architecture, from the ground up Lesson 7 / 80

When the first architecture stops being enough

The symptoms that say the single-server app has reached its ceiling. Database contention, deploy-causes-outage, the daily backup that takes longer than a day.

The single-server architecture works. It works well. It works for a long time, often longer than the team building on top of it expects, and that is the central thesis of the first six lessons of this course. One box, one Postgres, one app process, one nginx in front, and you can serve a real business with real users for years.

And then one day it doesn’t work. The interesting question is not “what does the failure look like.” Outages are loud and obvious. The interesting question is: what are the early signs, the quiet symptoms that show up months before the page goes down at three in the morning, and what should you do about each one?

That is what this lesson is about. Six symptoms. For each one, what to look for, what is actually happening underneath, and a one-line preview of the fix that the rest of the course will spend chapters explaining.

Symptom 1: database contention

You start seeing slow queries that, looked at in isolation, should not be slow. The query plan is fine. The index is being used. On a quiet test database the query returns in eight milliseconds. In production, the same query sometimes takes four seconds, sometimes thirty. Connection pool errors start showing up in the logs. Deadlocks appear under normal Tuesday afternoon load.

The first instinct is “the database is slow, we need a bigger box.” Sometimes that is right. More often, the database itself is fine; you have one machine, one database, and a growing number of concurrent requests fighting over the same resources. Two requests want to update the same row. Twenty requests want to read a table that a long-running transaction is holding an exclusive lock on. Your max_connections is 200 and your app server pool size, multiplied by the number of app processes, exceeds it.
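The connection arithmetic in that last sentence is easy to check before it bites. A minimal sketch, assuming a hypothetical helper named connection_headroom and illustrative numbers (a pool of 30 per process and 8 processes are not from any real deployment):

```python
def connection_headroom(max_connections: int, pool_size: int,
                        app_processes: int, reserved: int = 0) -> int:
    """Return spare database connection slots.

    A negative result means the app tier can ask for more connections
    than the database will grant, so some requests will wait or fail.
    `reserved` covers slots kept back for admin or monitoring sessions.
    """
    return max_connections - reserved - pool_size * app_processes


# Illustrative numbers only: 200 slots, 8 processes with a pool of 30 each.
headroom = connection_headroom(max_connections=200, pool_size=30, app_processes=8)
print(headroom)  # negative: the pools are oversubscribed
```

A negative headroom does not cause errors until every pool fills at once, which is why the problem tends to appear only under peak load.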

Things to look for:

  • The 95th percentile latency on simple queries climbing while the median stays flat. That is the shape of contention, not slowness.
  • Lock wait time as a percentage of total query time. If it is above ten percent, you have a contention problem, not a query problem.
  • Connection pool errors that say “timed out waiting for a connection” rather than “could not connect.”
  • Deadlocks that did not exist a quarter ago, on tables that did not change schema.
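The first two checks in that list can be sketched in a few lines. This is an illustration, not a monitoring tool: the function names are hypothetical, the ten percent lock-wait threshold comes from the list above, and the tail-ratio threshold of 10 is an assumption chosen for the example.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def contention_signals(latencies_ms, lock_wait_ms, total_query_ms,
                       tail_ratio_threshold=10.0):
    """Summarise whether latency data has the shape of contention.

    tail_ratio_threshold is an assumed cut-off for "p95 far above median".
    """
    median = statistics.median(latencies_ms)
    p95 = percentile(latencies_ms, 95)
    return {
        "median_ms": median,
        "p95_ms": p95,
        # Flat median with a long tail: the shape of contention.
        "long_tail": p95 / median >= tail_ratio_threshold,
        # More than ten percent of query time spent waiting on locks.
        "lock_bound": lock_wait_ms / total_query_ms > 0.10,
    }


# Ninety fast queries and ten that waited four seconds each.
sample = [8] * 90 + [4000] * 10
signals = contention_signals(sample, lock_wait_ms=150, total_query_ms=1000)
```

In practice the inputs would come from your database's own statistics views or your APM tool; the point is that both signals are ratios, so they stay meaningful as traffic grows.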

The next move is to introduce a connection pooler in front of the database (PgBouncer, RDS Proxy, ProxySQL), to start separating read traffic from write traffic with a read replica, and to push long-running transactions out of the request path. Connection pooling and read replicas get a full chapter later in the course; the punch line is that you have probably been treating “the database” as one resource when it is actually three: the connection slots, the read capacity, and the write capacity. Each one can fail independently.

Symptom 2: a single-server outage stops being a small problem

When you had ten users, the hour-a-month maintenance window was fine. When you have ten thousand users, the same hour costs you real money, real refunds, and real conversations with executives.

The shift here is not technical. The architecture is the same. What changed is the cost of the outage. Two numbers start to matter that did not before:

  • Recovery time objective (RTO): how long you can be down before the business is in real trouble.
  • Recovery point objective (RPO): how much data you can afford to lose if the worst happens.

When the application was small, the RTO was “we get to it when we get to it” and the RPO was “the last nightly backup.” When the application is mid-sized, the RTO is measured in minutes and the RPO in seconds. The single-server architecture cannot meet those numbers, because there is exactly one machine and if it is down the system is down.

Things to look for:

  • Your incident postmortems start mentioning revenue impact in dollars, not in apologies.
  • Customer support starts having a “deploy day” playbook because they get tickets every time you ship.
  • Someone, probably in finance, asks “what is our uptime SLA” and you do not have a clean answer.
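To get to a clean answer on the SLA question, it helps to convert between downtime and uptime percentages. A minimal sketch, assuming a 30-day month and hypothetical function names:

```python
def uptime_percent(downtime_minutes: float, period_days: float = 30) -> float:
    """Uptime over the period, as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


def allowed_downtime_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Downtime budget implied by an uptime target over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# The hour-a-month maintenance window from earlier in this section.
monthly_window = uptime_percent(downtime_minutes=60)

# The budget a "three nines" target would leave in a 30-day month.
three_nines_budget = allowed_downtime_minutes(99.9)
```

Under these assumptions the one-hour window works out to roughly 99.86 percent uptime, and a 99.9 percent target leaves about 43 minutes a month, so the maintenance window alone would already break it.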

The next move is to stop having one machine. That can mean a hot standby database, an active-active app tier behind a load balancer, or, eventually, a multi-region setup. The trade-off is that the moment you have more than one machine, you have a distributed system, and distributed systems have their own failure modes. That is what Module 2 is about.

Symptom 3: deploys cause outages

The thirty-second restart that nobody noticed when the app was small now shows up in the support inbox. “We tried to check out at 3:14 PM and it failed.” You check the logs. 3:14 PM was when you rolled out the new build. The restart took 28 seconds, during which the server returned 502s and a user with cash in hand could not give it to you.

This is the moment a lot of teams decide they need zero-downtime deploys, and they are right. The fix is well understood: run more than one instance behind a load balancer, take instances out of rotation one at a time, drain in-flight requests, start the new version, health-check it, put it back in rotation. The orchestration layer, whatever it is, handles the choreography.
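The choreography in that paragraph can be sketched as a small simulation. Everything here is hypothetical: the Instance class, its healthy method, and the in-memory "drain" stand in for a real load balancer and real health checks.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    version: str
    in_rotation: bool = True
    in_flight: int = 0

    def healthy(self) -> bool:
        # Stand-in for a real HTTP health check against the new build.
        return True


def rolling_deploy(instances, new_version):
    """Upgrade one instance at a time.

    Returns the smallest number of instances that were in rotation at any
    point, which is the capacity floor during the deploy.
    """
    min_serving = len(instances)
    for inst in instances:
        inst.in_rotation = False      # 1. take it out of rotation
        inst.in_flight = 0            # 2. drain (simulated: requests finish)
        serving = sum(i.in_rotation for i in instances)
        min_serving = min(min_serving, serving)
        inst.version = new_version    # 3. start the new version
        if not inst.healthy():        # 4. health-check before re-adding
            raise RuntimeError(f"{inst.name} failed its health check")
        inst.in_rotation = True       # 5. put it back in rotation
    return min_serving


fleet = [Instance("app-1", "v1"), Instance("app-2", "v1")]
floor = rolling_deploy(fleet, "v2")
```

With two instances the capacity floor is one, so each instance has to be able to carry the full load on its own for the duration of a restart.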

Things to look for:

  • Deploy-related error spikes in your dashboards, even small ones.
  • Engineers preferring to ship at 2 AM “to be safe.”
  • A deploy frequency that has dropped because shipping is scary.

That last one is the real signal. If the team is shipping less often because deploys are risky, you are paying a tax on every feature you ship from now until you fix it. The fix is, again, to stop having one machine, but specifically to stop having one app process. Behind the same database, run two app instances. Then deploy them one at a time. The blast radius of a deploy goes from “everyone” to “the few requests in flight on the instance currently restarting.”

Symptom 4: backups grow past the window

You set up a nightly logical dump. pg_dump at 2 AM, ship to S3, done. For two years it has taken about thirty minutes and used a fraction of the disk and network. One day someone notices the backup did not finish before the morning batch job started. You look at the logs. The dump is now taking fourteen hours. At the rate it is growing, it will soon overlap with the next night’s run, and you will not be able to back up daily without two backups running simultaneously, fighting for IO.

This is the data-volume equivalent of the deploy problem. The technique that worked at small scale (one giant logical dump) does not work at large scale, and the fix is structural, not tactical. You move to physical backups (pg_basebackup, WAL archiving, or whatever your engine’s equivalent is). You take a full backup once a week, incremental backups daily, and continuous WAL streaming for point-in-time recovery. The full backup might still take fourteen hours, but it runs on a replica and does not block production.

Things to look for:

  • Backup wall-clock time as a percentage of 24 hours. Anything over 30 percent is a warning. Anything over 50 percent is an emergency.
  • Backup-related IO spikes that affect production query latency.
  • “We have not actually tested a restore in six months” said by anyone, ever. (This is its own emergency, separate from the size problem.)
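The first check in that list is simple enough to automate. A minimal sketch with a hypothetical function name, using the 30 and 50 percent thresholds stated above:

```python
def backup_window_status(duration_hours: float, interval_hours: float = 24.0) -> str:
    """Classify a backup by the share of the backup interval it consumes."""
    share = duration_hours / interval_hours
    if share > 0.50:
        return "emergency"
    if share > 0.30:
        return "warning"
    return "ok"


# The two data points from the story: thirty minutes, then fourteen hours.
early = backup_window_status(0.5)
now = backup_window_status(14)
```

Tracking this number over time is more useful than checking it once, because the trend tells you how long you have before the backup no longer fits.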

A backup you cannot restore in your RTO is not a backup. A backup that takes longer than the window between backups is not a backup either. Both problems show up at scale and force the same architectural shift: backups need to run somewhere other than the primary, and they need to be incremental.

Symptom 5: one slow endpoint blocks the rest

Sales operations runs a heavy report on /admin/sales once a day. It scans a year of order history, joins against products and customers, and returns a CSV. It used to take twelve seconds, which was annoying but acceptable. Now it takes three minutes, and during those three minutes, your /api/orders endpoint, which a hundred customers per minute hit to check out, gets slow. Sometimes it times out.

What is happening here is straightforward. The single-server architecture has one process, one connection pool, one set of database connections, and one CPU budget. The reporting query holds a worker and a connection for the full three minutes while it saturates the database’s CPU and IO. Customer-facing requests are queueing behind it. The fix is to separate the workloads: a different process pool for background and reporting work, a different database replica for read-heavy analytical queries, a different deploy lifecycle for the reporting code.

Things to look for:

  • Latency on customer-facing endpoints that correlates with the time of day a known heavy job runs.
  • A connection pool sized for normal load that gets exhausted whenever the report runs.
  • Database CPU charts that show a sustained one-process-pegged pattern instead of a noisy multi-process pattern.

The next move is to introduce a worker pool: a separate set of processes, ideally a separate set of machines, that handle background work, scheduled jobs, and heavy analytics. Reads can go to a replica. Writes still go to the primary. The customer-facing tier stays small and predictable.
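The isolation idea can be shown in miniature with two executors inside one program. This is only an illustration of the principle: threads in one process still share a CPU, and the real move described above uses separate processes or machines. The function names and the threading.Event used to simulate a slow report are assumptions for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Separate pools: a stuck report can only occupy report workers.
request_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="web")
report_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="report")


def handle_order(order_id: int) -> str:
    """Stand-in for the customer-facing /api/orders handler."""
    return f"order {order_id} ok"


def run_sales_report(release: threading.Event) -> str:
    """Stand-in for the heavy /admin/sales report; blocks until released."""
    release.wait()
    return "report.csv"


release = threading.Event()
report = report_pool.submit(run_sales_report, release)  # the slow job starts
order = request_pool.submit(handle_order, 42)           # a checkout arrives

order_result = order.result(timeout=5)  # completes while the report is blocked
report_still_running = not report.done()

release.set()                           # let the simulated report finish
report_result = report.result(timeout=5)

request_pool.shutdown()
report_pool.shutdown()
```

The checkout returns while the report is still running because the two never compete for the same workers. The same separation has to be repeated at the database, which is why reads for the report go to a replica.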

Symptom 6: the schema migration does not fit in the window

You have a 200 million row table. You need to add a column. You need to backfill it. The naive approach (a single ALTER TABLE followed by an UPDATE) will lock the table or rewrite it for hours. You have a Saturday-night maintenance window of two hours. The migration is going to take six. You cannot run it.

This is the schema equivalent of symptoms 4 and 5. The technique that worked when the largest table was a million rows does not work at two hundred million. The fix is to do migrations online, in pieces, with the running application still serving traffic. Add the column nullable. Backfill in batches of ten thousand rows with delays between batches. Add the NOT NULL constraint at the end as a separate step; on Postgres 12 and later, validating a CHECK (column IS NOT NULL) constraint first lets SET NOT NULL skip its full-table scan and stay fast. Use feature flags to swap reads from the old shape to the new shape.
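The add-nullable-then-backfill steps can be sketched end to end. To keep the example self-contained it uses Python's built-in sqlite3 as a stand-in for Postgres, so the SQL dialect differs slightly from what you would run in production. The orders table, the currency column, and the backfill value are all hypothetical.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=10_000, pause_seconds=0.0):
    """Fill the new column a batch at a time, committing after each batch.

    Short transactions keep locks brief; the pause gives other traffic
    room between batches. Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE orders SET currency = 'USD'
            WHERE id IN (
                SELECT id FROM orders WHERE currency IS NULL LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_seconds)
    return total


# Tiny demo table standing in for the 200 million row one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(i,) for i in range(25)])
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")  # step 1: nullable column
conn.commit()

updated = backfill_in_batches(conn, batch_size=10)           # step 2: batched backfill
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE currency IS NULL"
).fetchone()[0]
```

Because each batch selects only rows that are still NULL, the job can be stopped and restarted at any point without tracking its position.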

Things to look for:

  • A migration plan that has the words “downtime” and “weekend” in it.
  • Migrations being delayed quarter after quarter because they cannot fit.
  • A team that is afraid to evolve the schema, which over years becomes a team that is afraid to ship product.

The next move is to adopt online migration patterns and tooling. pg_repack and pt-online-schema-change exist for a reason. Stripe’s “Online Migrations” essay, which we will reference later in this course, lays out the playbook. The architectural shift is from “the schema is a thing we change in maintenance windows” to “the schema evolves continuously and the application has to tolerate both shapes for a while.”

The stages of growth

Each of the six symptoms above is a step on the same staircase. The staircase looks roughly like this:

flowchart TB
    A[Single server<br/>app + DB on one box<br/>up to ~50 req/s, 50 GB] --> B[Read replicas + pooler<br/>splits read from write<br/>up to ~500 req/s, 500 GB]
    B --> C[Separate worker pool<br/>background work off the request path<br/>up to ~2k req/s, 2 TB]
    C --> D[Service split<br/>independently deployable services<br/>up to ~10k req/s, 10 TB]
    D --> E[Distributed data layer<br/>sharding, queues, caches<br/>up to ~50k req/s, 100 TB]
    E --> F[Multi-region<br/>geographic redundancy<br/>global users, multi-PB]

The numbers on each step are deliberately vague. They depend on the workload (read-heavy versus write-heavy, OLTP versus OLAP, average request cost, peak-to-average ratio). What is not vague is the order. Almost every team that grows past the single server goes through these steps roughly in this sequence. Skipping a step is possible but expensive: a team that jumps from a single server to a microservices-on-Kubernetes architecture is paying for stages 2, 3, and 4 of operational complexity without having the load to justify it. The three real case studies in the next lesson show what teams gain by resisting that jump.

Diagram to create: an alternative “symptom-to-stage” map that puts each of the six symptoms next to the stage on the staircase that addresses it. Symptom 1 maps to stage B, symptom 5 to stage C, symptom 3 to stage B (and partially D), and so on. Useful as a one-page reference.

What the rest of the course is about

If you read the six symptoms and recognized at least one as something you are dealing with right now, the rest of the course is for you. Module 2 covers the fundamentals of distributed systems. Module 3 covers data layer scaling: replicas, pooling, sharding. Module 4 covers worker tiers and queues. Module 5 covers service decomposition. Module 6 covers operations: observability, deploys, schema evolution at scale.

The structure is deliberate. We do not start with microservices because most teams do not need them. We start with the symptoms, because the symptoms tell you what move to make next. An architecture is not a set of trendy components; it is a set of decisions you made, in order, in response to specific problems. If you do not have the problem yet, you do not need the decision yet.

The next lesson, a case-study lesson, is about three companies that resisted the temptation to over-engineer longer than anyone expected, and what they got out of it.
