Disaster recovery: RTO, RPO, the drill

Most architecture diagrams assume the building is still standing. Disaster recovery is the discipline of asking what happens when it isn’t. The primary region goes dark for six hours. The production database is corrupted by a bad migration that ran cleanly through review. A ransomware payload encrypts every file the production service account can write. A vendor that you depend on for identity has a global outage and your control plane is unreachable.

These events are rare, but their probability is not zero. A platform old enough to matter will see at least one of them, and the question on that day is whether the team has practiced the response or is improvising it. A 4 a.m. recovery with the CEO on a bridge call is the worst possible time to discover that the backups have been failing silently for eleven months.

Disaster recovery is a separate concern from the day-to-day reliability work that fills Module 8. SLOs and on-call rotations handle the failures that happen weekly: a pod crash, a noisy neighbour, a deploy that needs rolling back. DR handles the failures that happen rarely and matter enormously. The shape of the problem is different, the tools are different, and the discipline is different.

Two dials: RTO and RPO

Every DR conversation starts with two numbers, and arguments that skip them produce plans that satisfy nobody.

Recovery Time Objective, RTO, is how long the business is willing to be without the service after a disaster. RTO of one hour means the service must be back within an hour of the disaster being declared. RTO of one day means a full day of downtime is tolerable, painful but survivable. RTO is a business decision dressed in engineering language: it asks the executives how much downtime costs them, and it produces a budget the architecture has to respect.

Recovery Point Objective, RPO, is how much data loss is acceptable. RPO of zero means no transaction can be lost. RPO of one hour means you can lose up to one hour of data and the business will still function. RPO of a day means you can lose a day, which sounds extreme until you remember that a corporate finance system probably can.

The two dials are independent. A service can have a fast RTO and a loose RPO: the website is back in five minutes but it lost the last hour of orders, which the team will reconstruct from logs. A service can have a slow RTO and a tight RPO: the warehouse can be down for half a day during recovery but no row is allowed to disappear. The mix depends on what the system does. Payments and ledgers have tight RPO; analytics platforms often have loose RPO and moderate RTO; internal tools usually have generous tolerances on both.
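
To make the independence of the two dials concrete, here is a minimal sketch that scores a recovery against each objective separately. The class and function names (`Objectives`, `Recovery`, `evaluate`) are illustrative, not taken from any tool, and data loss is approximated as the gap between the newest surviving copy and the moment the disaster was declared.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerated downtime after the disaster is declared
    rpo: timedelta  # maximum tolerated window of lost data


@dataclass(frozen=True)
class Recovery:
    last_good_copy: datetime  # newest data that survived (backup or replica)
    declared: datetime        # when the disaster was declared
    restored: datetime        # when the service was serving again

    def downtime(self) -> timedelta:
        return self.restored - self.declared

    def data_loss(self) -> timedelta:
        # Simplification: measured from declaration, not from the first failure.
        return self.declared - self.last_good_copy


def evaluate(obj: Objectives, rec: Recovery) -> dict[str, bool]:
    """Report each dial on its own; meeting one says nothing about the other."""
    return {
        "rto_met": rec.downtime() <= obj.rto,
        "rpo_met": rec.data_loss() <= obj.rpo,
    }


if __name__ == "__main__":
    # Fast RTO, loose RPO: back in five minutes, last hour of orders lost.
    storefront = Objectives(rto=timedelta(minutes=15), rpo=timedelta(hours=2))
    incident = Recovery(
        last_good_copy=datetime(2024, 1, 1, 3, 0),
        declared=datetime(2024, 1, 1, 4, 0),
        restored=datetime(2024, 1, 1, 4, 5),
    )
    print(evaluate(storefront, incident))
```

The same incident scored against a ledger with an RPO of zero would meet the RTO and miss the RPO, which is the point of keeping the two checks apart.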

The honest version of an RTO and RPO conversation is uncomfortable, because the question “what’s the cost of downtime per hour?” rarely has a clean answer. The team picks a number that feels defensible, the executives push back, and a compromise emerges. That number then drives every subsequent architectural decision in this lesson. Without it, the team builds whatever feels prudent, which is usually too much for the small services and not enough for the critical ones.

The four DR tiers

The industry has a rough taxonomy of DR strategies, each landing at a different point on the cost versus RTO/RPO trade-off. Calling them “tiers” makes the ladder explicit: each step up is more expensive than the one below, and the question is which step matches the business’s tolerance.

Tier 1: backup and restore. The cheapest option. Periodic backups go to durable storage; when disaster strikes, the team rebuilds the system from scratch in a new region or account, restores the data, and points traffic at it. RTO is hours to days, depending on how much there is to rebuild. RPO is the backup interval, typically hours. This is the right tier for systems where prolonged downtime is annoying but not existential: internal dashboards, batch reporting, the wiki.

Tier 2: pilot light. A minimal version of the production stack runs continuously in a recovery region: the database is replicating, but the application servers are scaled to zero or near-zero. On disaster, the team scales the application up, points traffic at it, and runs. RTO drops to hours. RPO drops to minutes, because the database has been receiving updates the whole time. The cost is the standing replication infrastructure plus a small footprint of always-on resources.

Tier 3: warm standby. A scaled-down copy of the full stack runs in the recovery region, handling some real traffic or just sitting idle. On failover, the standby scales to full capacity and absorbs all traffic. RTO is minutes; RPO is seconds. The cost is significant: roughly half a production environment running continuously, plus the ongoing operational burden of keeping the two sides in sync (config, secrets, deploy pipelines, schema migrations).

Tier 4: active-active. Both regions are live and serving traffic. Failover is just a traffic-steering decision: cut DNS or load-balancer weights toward the surviving side. RTO is seconds; RPO is zero, modulo the consistency model the team has chosen for cross-region writes. This is the most expensive tier and also the most demanding: every service must be designed to run multi-master, every dataset must be replicated with conflict resolution, every deploy must roll out to both regions in step. Active-active is what banks and ad networks run because the cost of downtime is measured in millions per minute. For most companies, it is overkill.
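
The ladder can be read as a lookup: walk up from the cheapest tier and stop at the first one that meets both objectives. The sketch below does this under loudly stated assumptions. The durations attached to each tier are illustrative stand-ins for "hours", "minutes", and "seconds" in the descriptions above, not measured figures, and `cheapest_tier` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rto: timedelta  # illustrative achievable recovery time, not a benchmark
    rpo: timedelta  # illustrative achievable data-loss window


# Ordered cheapest first, mirroring the ladder in the text.
# The durations are assumptions chosen to match "hours", "minutes", "seconds".
TIERS = [
    Tier("backup and restore", timedelta(hours=24), timedelta(hours=12)),
    Tier("pilot light", timedelta(hours=4), timedelta(minutes=15)),
    Tier("warm standby", timedelta(minutes=15), timedelta(seconds=30)),
    Tier("active-active", timedelta(seconds=30), timedelta(0)),
]


def cheapest_tier(rto: timedelta, rpo: timedelta) -> Optional[Tier]:
    """Return the first (cheapest) tier that satisfies both objectives."""
    for tier in TIERS:
        if tier.rto <= rto and tier.rpo <= rpo:
            return tier
    return None  # targets are tighter than any tier modelled here


if __name__ == "__main__":
    wiki = cheapest_tier(rto=timedelta(days=2), rpo=timedelta(days=1))
    checkout = cheapest_tier(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
    print(wiki.name if wiki else "no tier fits")
    print(checkout.name if checkout else "no tier fits")
```

A real selection would replace the placeholder durations with numbers the team has measured in its own drills; the value of the sketch is only that it forces both objectives to be written down before a tier is chosen.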

flowchart LR
    BR[Backup and restore<br/>RTO hours-days<br/>RPO hours]
    PL[Pilot light<br/>RTO hours<br/>RPO minutes]
    WS[Warm standby<br/>RTO minutes<br/>RPO seconds]
    AA[Active-active<br/>RTO seconds<br/>RPO zero]
    BR --> PL --> WS --> AA
    BR -.increasing cost.-> AA

Most companies’ DR plan is “we have backups”, which is tier 1, and most of those companies have never tested the restore. The next section is about why this is the central problem.

The drill

Untested DR plans do not work. This sentence is the heart of the lesson and most teams resist it, because testing DR is expensive, disruptive, and rarely produces a deliverable that the next quarter’s roadmap rewards.

The pattern is reliable. Three years ago, a team wrote a DR runbook. The runbook says “in a disaster, restore from S3 backup prod-db-snapshot, restart the workers, point DNS at the recovery region.” Time passes. The S3 bucket gets renamed. The DNS provider gets migrated. The worker pool gets rewritten. The runbook still references the old names. Nobody on the on-call team has ever read it. The team that wrote it left two years ago. On the day of the disaster, the runbook is fiction.

The discipline that fixes this is the drill. A reasonable cadence is quarterly: every quarter, the team picks a scenario (region failure, database corruption, ransomware) and runs through the runbook end to end. Sometimes the drill is a tabletop exercise where the team walks through the steps verbally and confirms they exist. Sometimes it is a full simulation: the team actually fails over to the recovery region, actually restores from a real backup into a real environment, actually times how long it took. Tabletop drills find the worst gaps cheaply; full simulations find the gaps that only show up under load.

Every drill produces a list of things that did not work. The S3 bucket name was wrong. The IAM role required to run the restore had been deleted. The runbook said “page the database team” and the database team has been a different team for eight months. The list goes into the backlog and the next drill checks the fixes. Over time, the runbook starts to be a living document instead of an artefact.
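
The loop described here (find gaps, fix them, re-test the fixes at the next drill) can be tracked with very little machinery. The following is one possible sketch; `Gap` and `DrillLog` are hypothetical names, and a real team would more likely keep this in its issue tracker than in code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Gap:
    description: str
    found_on: date
    fixed: bool = False
    # Set only when a later drill has actually exercised the fix.
    verified_on: Optional[date] = None


@dataclass
class DrillLog:
    gaps: list[Gap] = field(default_factory=list)

    def record(self, description: str, found_on: date) -> Gap:
        gap = Gap(description, found_on)
        self.gaps.append(gap)
        return gap

    def still_open(self) -> list[Gap]:
        """Gaps nobody has fixed yet: these belong in the backlog."""
        return [g for g in self.gaps if not g.fixed]

    def to_recheck(self) -> list[Gap]:
        """Gaps fixed on paper but not yet re-tested: the next drill's agenda."""
        return [g for g in self.gaps if g.fixed and g.verified_on is None]


if __name__ == "__main__":
    log = DrillLog()
    bucket = log.record("S3 bucket name in runbook was wrong", date(2024, 3, 1))
    log.record("IAM role for the restore had been deleted", date(2024, 3, 1))
    bucket.fixed = True
    print([g.description for g in log.still_open()])
    print([g.description for g in log.to_recheck()])
```

Separating "fixed" from "verified" is the design choice that matters: a fix that has not survived a drill is still a claim, not a capability.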

The framing that helps the drill happen is simple: a DR plan that has not been drilled is approximately as good as no plan. The CFO would rather learn this in a quiet quarter than during a real disaster.

Recover the system, recover the data

A useful distinction. Backups protect data. Runbooks protect time.

A complete DR posture needs both. Backups alone leave the team with a tarball and no way to use it: nothing running, no DNS, no service accounts with the right permissions, no version of the application code that matches the backup’s schema. Runbooks alone, with no real backups behind them, leave the team with a beautiful playbook that ends at “and then the data is back somehow”.

The teams that do DR well treat the two artefacts as a matched pair. Every backup has a corresponding runbook describing how to restore it into a working system. Every runbook has been tested against a real backup. The pair is what gets drilled.

This connects to the backups discussion in lesson 31: a backup that has never been restored is not really a backup, it is an aspiration. The drill is the part that converts aspiration into capability.
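
An automated restore test is the smallest version of that conversion. The sketch below uses SQLite from the Python standard library purely as a stand-in for a production database: it takes a backup, loads it into a fresh database, and checks that the result is intact and contains data. The function names, the `orders` table, and the row threshold are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def take_backup(source: sqlite3.Connection, backup_path: str) -> None:
    """Write a consistent copy of the live database to backup_path."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()


def restore_test(backup_path: str, table: str, min_rows: int) -> bool:
    """Load the backup into a fresh database and check that it is usable."""
    restored = sqlite3.connect(":memory:")
    src = sqlite3.connect(backup_path)
    try:
        src.backup(restored)  # the actual restore step
        ok = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        # The table name comes from trusted configuration, never from user input.
        rows = restored.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return ok and rows >= min_rows
    except sqlite3.Error:
        return False  # an unreadable backup is a failed test, not a crash
    finally:
        src.close()
        restored.close()


if __name__ == "__main__":
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    live.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,)])
    live.commit()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "orders.bak")
        take_backup(live, path)
        print(restore_test(path, "orders", min_rows=2))
    live.close()
```

For a real system the restore target would be a scratch instance of the production database engine, and the checks would include a query the application actually depends on. The shape stays the same: restore, verify, and alert when the result is false.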

The common DR failures

A short catalogue of failures that show up in real recoveries, all of them preventable.

The backup that does not restore. The team has been running daily backups for years. On the day of the disaster, the restore fails: the backup format changed in a database upgrade, the encryption key has rotated, the bucket policy denies the recovery role. The team finds out on the worst possible day. Prevention: regular restore tests, ideally automated, that confirm the backups can actually be loaded into a working database.

The runbook that references services that no longer exist. Mentioned above. The fix is the drill. A runbook that has not been read in eighteen months is unread fiction.

Credentials nobody on the on-call team has. The recovery requires logging into a vendor console with a credential that lives in one person’s password manager, and that person is on holiday in a country with bad mobile coverage. Prevention: every credential needed for recovery is in a shared secret store with break-glass procedures, and the procedures are part of the drill.

DNS pointing at dead servers. The runbook says “fail over by updating the CNAME”. The CNAME has been updated. Nothing happens, because the TTL is 24 hours and clients are still resolving the old address. Prevention: low TTLs on DNS records that participate in failover, plus a real test of the CNAME swap during the drill.

The recovery region that lacks capacity. The team has a warm standby in another region. On failover, the standby cannot scale up, because the cloud provider has run out of capacity for that instance type in that region. Prevention: capacity reservations, or a standby already sized to carry the critical load, so that recovery does not depend on instant scaling at the worst possible time.

The pattern across all of these is the same: the failure is rarely in the design of the DR plan and almost always in the gap between the plan and reality. The drill closes the gap.
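
Some of these gaps can be caught by arithmetic before any drill runs. The DNS failure above is one: a resolver may keep serving the old answer for up to a full TTL after the record changes, so the TTL sets a floor on failover time. The sketch below flags records whose TTL would consume too much of the RTO. The record names and the 10% budget are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FailoverRecord:
    name: str        # hypothetical record name
    ttl: timedelta   # TTL currently configured on the record


def ttl_violations(records, rto: timedelta, budget: float = 0.1):
    """Return records whose TTL alone would use more than `budget` of the RTO.

    Caches may hold the old answer for up to one TTL after the change,
    so a long TTL delays failover no matter how fast the swap itself is.
    """
    limit = rto * budget
    return [r for r in records if r.ttl > limit]


if __name__ == "__main__":
    records = [
        FailoverRecord("api.example.com", timedelta(seconds=60)),
        FailoverRecord("www.example.com", timedelta(hours=24)),
    ]
    for bad in ttl_violations(records, rto=timedelta(hours=1)):
        print(f"{bad.name}: TTL {bad.ttl} is too long for a one-hour RTO")
```

A check like this does not replace testing the CNAME swap during the drill, since some clients ignore TTLs entirely. It only removes the version of the failure that is visible in configuration.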

How DR connects to multi-region

Lesson 75 covered multi-region as a deployment topology. Multi-region is one specific way to build a DR posture: tier 3 or tier 4, depending on how much of the stack is replicated and how active the standby is. A multi-region active-active deployment is, by construction, a tier 4 DR plan. A multi-region active-passive deployment is a warm standby.

The point worth keeping in mind: multi-region is one tool in the DR kit, not the whole kit. A team running in a single region can still have a serious DR plan, built on backups and pilot-light recovery in a different region or even a different cloud. A team running active-active can still be vulnerable to disasters that propagate across regions: a bad deploy that crashes both sides, a software bug that corrupts the replicated dataset, a ransomware payload that follows the replication path. Active-active does not protect against these; only backups separated from production credentials do. Geographic redundancy and time-separated backups are complementary controls, not substitutes.

What good looks like

A team with a healthy DR posture has, at minimum, the following. RTO and RPO numbers that the business has signed off on. A DR tier matched to those numbers. Backups that are encrypted, durable, and tested by automated restore at least monthly. A runbook that has been drilled within the last quarter. A list of credentials and break-glass procedures that the on-call team has actually used in the last drill. A documented failure scenario, walked through, with the gaps from the last drill in the backlog and the gaps from the previous drill closed.
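
The checklist above is mechanical enough to express as a small check that runs on a schedule. The following is a rough sketch, assuming the team records the relevant dates somewhere it can read them. The `Posture` fields and the 31-day and 92-day thresholds are illustrative interpretations of "monthly" and "within the last quarter".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Posture:
    objectives_signed_off: bool         # RTO/RPO approved by the business
    last_restore_test: date             # last successful automated restore
    last_drill: date                    # last end-to-end runbook drill
    open_gaps_from_previous_drill: int  # should be zero by the next drill


def posture_findings(p: Posture, today: date) -> list[str]:
    """Return a list of problems; an empty list means the minimum bar is met."""
    findings = []
    if not p.objectives_signed_off:
        findings.append("RTO/RPO not signed off by the business")
    if today - p.last_restore_test > timedelta(days=31):
        findings.append("no automated restore test in the last month")
    if today - p.last_drill > timedelta(days=92):
        findings.append("runbook not drilled in the last quarter")
    if p.open_gaps_from_previous_drill > 0:
        findings.append("gaps from the previous drill are still open")
    return findings


if __name__ == "__main__":
    team = Posture(
        objectives_signed_off=True,
        last_restore_test=date(2024, 1, 1),
        last_drill=date(2024, 5, 1),
        open_gaps_from_previous_drill=0,
    )
    for finding in posture_findings(team, today=date(2024, 6, 30)):
        print(finding)
```

The check covers only the items that reduce to dates and counts. Whether the tier matches the objectives and whether the backups are encrypted and durable still need a human review.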

None of this is glamorous. None of this ships features. All of it is the difference between a serious recovery and a chaotic one when the day comes.

The next lesson is about the security architecture that prevents most of the disasters that DR exists to recover from: the IAM, network segmentation, secret management, and audit logging that keep the threat surface small enough that DR remains the answer of last resort and is rarely needed.

Citations and further reading

AWS, “Disaster Recovery of Workloads on AWS: Recovery in the Cloud”, https://docs.aws.amazon.com/whitepapers/latest/disaster-recovery-workloads-on-aws/disaster-recovery-workloads-on-aws.html (retrieved 2026-05-01). The canonical write-up of the four-tier model used in this lesson, with concrete RTO/RPO numbers per tier.
Google Cloud, “Disaster recovery planning guide”, https://cloud.google.com/architecture/dr-scenarios-planning-guide (retrieved 2026-05-01). The same conceptual ladder, presented from the GCP perspective, with worked examples for each tier.
Microsoft Azure, “Business continuity and disaster recovery”, https://learn.microsoft.com/en-us/azure/reliability/concept-business-continuity-high-availability-disaster-recovery (retrieved 2026-05-01). Azure’s framing, including useful definitions of business continuity vs DR.
NIST SP 800-34 Rev. 1, “Contingency Planning Guide for Federal Information Systems”, https://csrc.nist.gov/pubs/sp/800/34/r1/upd1/final (retrieved 2026-05-01). The government-grade reference on contingency planning, including the discipline of testing.