Incident response: runbooks, postmortems, the blameless culture

Lesson 60 set the targets. Lesson 61 set the checks that produce the measurements. This lesson is about what happens when the checks fail. SLO burn rate is climbing. The freshness alert has been firing for ninety minutes. The reconciliation test failed with a four-figure variance against the source system. The on-call engineer’s pager just went off. What now?

Incident response is the third pillar of reliability practice, alongside SLOs and quality testing. It is the discipline that turns alerts into recovery and recovery into systemic improvement. Done well, it shrinks the blast radius of failures and captures organisational learning. Done badly, it produces hero culture, blame, postmortems nobody reads, and the slow attrition of on-call engineers.

The Google SRE framework provides the canonical structure (https://sre.google/sre-book/managing-incidents/, retrieved 2026-05-01), but the framework alone is not enough. The cultural and operational disciplines around it are what determine whether a team has real incident response or just paperwork.

What counts as an incident

An incident is anything that breaks (or threatens to break) a tier-1 SLO. Not every bug is an incident. Not every alert is an incident. The signal is “the agreed-upon reliability commitment is at risk if we do not act soon”.

The threshold matters because incidents have overhead. The on-call gets paged. Other engineers may get pulled in. A timeline gets recorded, a postmortem gets written, action items get tracked. That overhead is justified for events that matter and corrosive when applied to events that do not.

Two practical heuristics. If the current burn rate, sustained over the next hour, would consume more than 1% of the monthly error budget, declare. If the failure is visible to external customers, declare. A non-critical job that retried successfully, a warning-tier check, a dev environment break: these are routine operations, not incidents.
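The two heuristics reduce to a small amount of arithmetic. The sketch below assumes a 30-day SLO window and a burn rate defined so that 1.0 spends the budget exactly over the window; the function names and the default threshold parameter are illustrative, not part of any particular tool.

```python
HOURS_PER_WINDOW = 30 * 24  # assumption: the monthly budget covers a 30-day window


def budget_fraction_consumed(burn_rate: float, hours: float = 1.0,
                             window_hours: float = HOURS_PER_WINDOW) -> float:
    """Fraction of the window's error budget spent in `hours` at a constant burn rate.

    A burn rate of 1.0 spends the whole budget in exactly one window.
    """
    return burn_rate * hours / window_hours


def should_declare(burn_rate: float, customer_visible: bool,
                   threshold: float = 0.01) -> bool:
    """Apply both heuristics: more than 1% of budget in the next hour, or customer-visible."""
    return customer_visible or budget_fraction_consumed(burn_rate) > threshold
```

Under these assumptions the 1% threshold corresponds to a sustained burn rate of 7.2x, and a 14.4x burn rate spends 2% of the budget in an hour.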

The lifecycle

Following the Google SRE material, incident response can be described as a five-stage lifecycle. Each stage has a purpose, and skipping one is a common failure mode.

Detection. An alert fires. The source might be an SLO burn-rate alert, a data-quality test failure, a synthetic probe, or a customer ticket. The detection mechanism matters because the time between actual failure and alert is unrecoverable response time. A team that learns about incidents from customer tickets has a monitoring problem before it has an incident-response problem.

Triage. The on-call acknowledges the page. The first job is not to fix; it is to decide. Is this real or a false alert? What is the scope? Who needs to be involved? If the runbook covers it, the on-call proceeds. If not, escalation begins.

Mitigation. Stop the bleeding. Restore the SLO. Mitigation is rarely the proper fix; it is the action that gets the system back into the green band fastest. Roll back the deployment. Disable the feature flag. Redirect traffic to a healthy region. Promote a replica. Restart the failing job with reduced load. Mitigation prioritises restoring service over understanding cause.

Resolution. Fix the root cause so the mitigation can be removed. The fix for the schema-drift bug gets merged. The capacity issue gets a permanent increase. Resolution can take hours, days, or weeks. Some root causes are hard (the upstream system is owned by another team; the bug is in a third-party library), and the mitigation may stay in place indefinitely. That is acceptable, as long as the gap is tracked.

Postmortem. Write down what happened, why, and what changes prevent recurrence. The postmortem turns a single incident into organisational learning. Without it, every incident is local.

The lifecycle is sequential in spirit but overlapping in practice. Mitigation often starts before triage is complete. Postmortem investigation often runs in parallel with resolution. It is a checklist, not a strict order.
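One way to make the stages measurable is to record a timestamp at each transition and derive the intervals from them. The following is a minimal sketch; the class and field names are hypothetical, and a real incident tracker will have its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentTimeline:
    failure_started: datetime           # when the failure actually began (often known only afterwards)
    detected: datetime                  # when the alert fired
    acknowledged: datetime              # when the on-call acknowledged the page
    mitigated: datetime                 # when the SLO was back in the green band
    resolved: Optional[datetime] = None  # None while the root-cause fix is outstanding

    def time_to_detect(self) -> timedelta:
        """Response time lost before anyone could act."""
        return self.detected - self.failure_started

    def time_to_mitigate(self) -> timedelta:
        """Time from the alert to restored service."""
        return self.mitigated - self.detected

    def mitigation_still_in_place(self) -> bool:
        """True when the incident is mitigated but not resolved: a gap to track."""
        return self.resolved is None
```

Keeping `resolved` optional reflects the case where a mitigation stays in place while the root cause remains open.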

The runbook format

A runbook is the document the on-call reaches for when the alert fires. It has one job: get a sleepy engineer at 3am from “the page just went off” to “I have applied a mitigation and the SLO is recovering” as fast as possible. Format matters. A runbook that is a wall of prose is useless under pressure; a runbook that is a checklist with copy-pasteable commands is gold.

The format that works has five sections.

Symptom. What does the alert look like? What dashboard panels are red? What log patterns indicate this specific failure? Concrete examples, not abstractions. “The customer_etl_freshness SLO is at 95% with the 1-hour burn rate above 14x” is useful. “The pipeline is slow” is not. A good symptom section lets the on-call confirm in thirty seconds that they are on the right runbook.

Likely causes. An ordered list of the historical causes of this alert, with frequency. “1. Upstream Kafka lag (about 60% of incidents). 2. dbt model regression from a recent merge (about 25%). 3. Snowflake warehouse contention (about 10%). 4. Unknown (about 5%).” The ordering helps the on-call investigate the high-probability causes first instead of working through possibilities alphabetically.

Mitigation steps. For each likely cause, the action that mitigates. Copy-pasteable commands. Decision points where judgement is required, with the criteria spelled out. “If Kafka consumer lag exceeds 1 million messages, scale the consumer group: kubectl scale deploy customer-etl-consumer --replicas=8. Verify lag begins decreasing within 5 minutes.” A mitigation step that requires translating prose into a command at 3am has already failed.

Escalation paths. Who to page if the runbook does not resolve the incident, in what order, with what context. “If consumer scaling does not recover within 15 minutes, page the data-platform team lead. If the warehouse is the suspected cause, page the warehouse on-call.” Escalation paths are often missing from runbooks, and the result is on-call engineers struggling alone past the point where help would have arrived faster.

Last reviewed. A date and an owner. Stale runbooks are worse than no runbooks because they convince the on-call to follow advice that no longer applies. The discipline that keeps runbooks fresh is reviewing them after every incident that touched them, and on a quarterly schedule even if no incident did. A runbook last reviewed eighteen months ago is presumed wrong until verified.
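The five sections map naturally onto a small data structure, and the frequencies in the likely-causes section can be derived from incident history rather than estimated. This is an illustrative sketch: the `Runbook` class, its field names, and the cause labels are assumptions, not an established format.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Runbook:
    symptom: str
    likely_causes: list[tuple[str, float]]   # (cause, share of past incidents), most frequent first
    mitigation_steps: dict[str, list[str]]   # cause -> copy-pasteable commands
    escalation_paths: list[str]
    last_reviewed: date
    owner: str


def rank_causes(past_causes: list[str]) -> list[tuple[str, float]]:
    """Order historical causes by frequency, with each cause's share of incidents."""
    counts = Counter(past_causes)
    total = sum(counts.values())
    return [(cause, n / total) for cause, n in counts.most_common()]


def causes_without_mitigation(runbook: Runbook) -> list[str]:
    """Likely causes that have no mitigation steps written down yet."""
    return [cause for cause, _ in runbook.likely_causes
            if not runbook.mitigation_steps.get(cause)]
```

A history of twenty incidents with twelve Kafka-lag causes, five dbt regressions, two warehouse-contention cases, and one unknown reproduces the 60/25/10/5 ordering used in the example above.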

The runbook lives next to the code, in the same repository where the service is defined. Operations engineers reading the runbook should be able to read the code. Developers writing the code should update the runbook in the same PR. When runbooks live in a separate wiki, they drift; when they live next to the code, the drift is visible in code review.
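Because the runbook sits in the repository, its review date can be checked mechanically, for example as a CI step. The sketch below assumes each runbook file contains a line of the form `Last reviewed: YYYY-MM-DD`; that line format and the 92-day limit (roughly one quarter) are conventions chosen for illustration.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Assumed convention: a line such as "Last reviewed: 2026-01-10" somewhere in the runbook.
LAST_REVIEWED = re.compile(r"^Last reviewed:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE)
MAX_AGE = timedelta(days=92)  # roughly one quarter


def last_reviewed(text: str) -> Optional[date]:
    """Extract the review date, or None if it is missing or not a valid date."""
    match = LAST_REVIEWED.search(text)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        return None


def is_stale(text: str, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """A runbook with no valid review date is treated as stale."""
    reviewed = last_reviewed(text)
    return reviewed is None or today - reviewed > max_age
```

Treating a missing date as stale matches the presumption that an unreviewed runbook is wrong until verified.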

The blameless postmortem

The postmortem is where the cultural part of incident response lives, and where most organisations get it wrong. The principle is simple to state and hard to live: the root cause of an incident is never a person.

It is always tempting to identify the human in the chain. The engineer who wrote the bug. The reviewer who approved the PR. The on-call who took ninety minutes to mitigate. The product manager who insisted on the deploy before the holiday weekend. Pointing at any of these people as “the cause” is the path of least resistance, and it is the path that destroys reliability culture. The next person to make a mistake will hide it instead of reporting it, and hidden mistakes become bigger incidents.

The blameless postmortem reframes the question. The root cause is not “the engineer pushed bad code”. The root cause is “the system allowed bad code to reach production without being caught”. That reframing produces useful action items: better tests, faster CI, canary deploys, better code review tooling, smaller PRs. Pointing at the engineer produces no useful action items; it produces a quieter team and worse incidents.

The format that works for postmortems has six sections.

Summary. Two or three sentences that an executive could read. What happened, when, what was the user impact. The summary is the part most people read; the rest exists for the engineers and analysts who need details.

Timeline. Minute-by-minute or hour-by-hour log of what happened, with timestamps and the people or systems involved. The timeline is constructed from chat logs, alert timestamps, deployment records, and human memory; it is the most labour-intensive part of the postmortem and the most valuable, because it forces the analysis to be specific rather than vague.

Root cause analysis. What underlying conditions allowed the incident? The “five whys” technique works here, but the discipline is to keep asking why until the answer is a property of the system rather than of a person. Why did the bug ship? It passed code review. Why did review miss it? The diff was 2,000 lines. Why was the diff 2,000 lines? The team has no PR-size norm. The action item is the PR-size norm, not “be more careful in review”.

Impact. Who was affected, for how long, with what business consequence? Quantified where possible: the dashboard was wrong for three hours, affecting decision-making in two regional teams; the incident consumed 40% of the monthly error budget. Impact framing helps prioritise action items: a high-impact incident justifies more expensive prevention than a low-impact one.

Action items. Specific, owned, time-bound. “We will improve our processes” is not an action item. “Add a row-count drift alert with 10% threshold to the customer_etl pipeline by 2026-04-15, owned by Maria” is an action item. The actions that come out of postmortems are the actual product of the incident-response practice; without them, the postmortem is a writing exercise.

Lessons learned. What does the team know now that it did not know before the incident? Sometimes the lesson is operational (“the on-call rotation needs to include someone with warehouse expertise”). Sometimes architectural (“the coupling between these two services is tighter than we assumed”). Sometimes cultural (“we need to be willing to declare incidents earlier”). Lessons learned are the thread that connects this postmortem to future ones.
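The “specific, owned, time-bound” test for action items is simple enough to check automatically before a postmortem is published. The following is a rough sketch: the `ActionItem` type is hypothetical, and the word-count check is a deliberately crude stand-in for “specific” that a human reviewer would still need to confirm.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

MIN_WORDS = 6  # assumption: shorter descriptions are unlikely to be specific


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


def problems(item: ActionItem) -> list[str]:
    """Return the reasons an action item fails the specific/owned/time-bound test."""
    issues = []
    if len(item.description.split()) < MIN_WORDS:
        issues.append("description too short to be specific")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    return issues


def is_overdue(item: ActionItem, today: date) -> bool:
    """True when the item has a due date that has already passed."""
    return item.due is not None and today > item.due
```

Run against the two examples above, “We will improve our processes” fails all three checks, while the row-count drift alert owned by Maria with a 2026-04-15 deadline passes.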

The blameless culture cannot be declared. It has to be modelled, especially by leadership. The first time a senior engineer’s name appears in a timeline (and senior engineers trigger incidents too, because the underlying systems are fallible from anybody’s keyboard), the team watches what happens. If the name is treated as data, the same as anyone else’s, the culture holds. If the senior engineer is elided or excused, the culture is performative and everyone knows it.

Publication discipline matters too. Postmortems should be published where the whole organisation can read them. Other teams learn from the analysis. New engineers read past postmortems during onboarding to understand how the system fails. Private postmortems capture local learning; published ones capture collective learning.

On-call disciplines

Incident response sits on top of an on-call rotation, and the rotation has its own disciplines that determine whether the engineers stay healthy.

Pager hygiene. Every alert that fires must be actionable. Alerts with no action (“the queue is sometimes deep, no idea what to do”) train the on-call to ignore the pager, and the next real incident is delayed because the on-call has learned not to trust alerts. Review every page after the fact: was it actionable? If not, tune the alert or delete it.

Hand-off. Outgoing on-call briefs incoming on-call: open issues, recent incidents, ongoing investigations, runbook changes. Fifteen minutes. Without it, the incoming on-call walks in cold and the first incident takes longer because they are still loading context.

Compensation. On-call is real work, including the work of being available outside business hours. Teams that pretend on-call is a free addition watch their on-call engineers leave or burn out. Compensation can be time off, additional pay, or both, but it has to be explicit.

Rotation length and frequency. A week is the common shift length, with two weeks the practical maximum before fatigue sets in. The pool should be large enough that each engineer is on-call no more than once every four to six weeks. Smaller teams have to either accept more frequent shifts or grow the pool (by hiring or by bringing more engineers into the rotation).
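The relationship between pool size, shift length, and on-call frequency is plain arithmetic, sketched here under the assumption of a single rotation in which every engineer takes equal, consecutive shifts. The function names are illustrative.

```python
import math


def weeks_between_shift_starts(pool_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of the same engineer's next shift."""
    if pool_size < 1 or shift_weeks < 1:
        raise ValueError("pool_size and shift_weeks must be at least 1")
    return pool_size * shift_weeks


def on_call_fraction(pool_size: int) -> float:
    """Share of the time each engineer spends on-call; independent of shift length."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return 1 / pool_size


def min_pool_size(target_interval_weeks: int, shift_weeks: int = 1) -> int:
    """Smallest pool that keeps shift starts at least `target_interval_weeks` apart."""
    if target_interval_weeks < 1 or shift_weeks < 1:
        raise ValueError("arguments must be at least 1")
    return math.ceil(target_interval_weeks / shift_weeks)
```

With one-week shifts, a four-to-six-week interval needs four to six engineers. Lengthening the shift stretches the interval for a small team but leaves each engineer's share of on-call time unchanged, which is why pool size is the real constraint.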

The lifecycle in a diagram

flowchart TD
    A[Alert fires] --> B[Detection]
    B --> C[Triage: real or false?]
    C -->|False| D[Tune alert, log]
    C -->|Real| E[Mitigation: stop the bleeding]
    E --> F[SLO recovering?]
    F -->|No| G[Escalate]
    G --> E
    F -->|Yes| H[Resolution: fix root cause]
    H --> I[Postmortem]
    I --> J[Action items tracked]
    J --> K[Runbook and tests updated]
    K --> L[Next time: faster response or no incident]

The arrow from K back to detection (omitted for clarity) is the loop that makes incident response a learning system. Each incident produces updates: better alerts, better runbooks, better tests, better architecture. The next similar failure either does not happen, or happens with a faster response. Incidents do not stop, but the cost per incident decreases.

The three lessons of this module form a complete loop. SLOs define what reliable means. Quality testing measures it continuously. Incident response handles the moments when measurement says the SLO is at risk. Module 8 continues with monitoring infrastructure and the security and privacy disciplines that overlap with reliability. A team with SLOs, quality tests, and a real incident-response practice has the basic apparatus of a production data platform; everything else is layering.