Event-driven architecture: saga, choreography, orchestration

Lesson 73 ended on the modular monolith, and on the case that for most teams the default backend shape is one deployable with strong internal boundaries. This lesson takes up the question that comes after that: when you do have multiple services (whether two or two hundred), how should they talk to each other? The synchronous-RPC answer (service A calls service B over HTTP, waits for the response, returns to the caller) is the default that ships first, and often the one that comes back to the team six months later with a list of failure modes. Event-driven architecture is the alternative pattern, and the industry has spent a decade learning where it works and where it does not.

The shorthand: in event-driven architecture, services do not call each other. They emit events. Other services subscribe and react. The event bus (often Kafka, often a managed equivalent) sits between them and decouples producers from consumers.

The benefits are real. The trade-offs are different from synchronous RPC, not strictly better, and the choice between the two ways to coordinate cross-service workflows (choreography and orchestration) is the second decision the team has to make once they have committed to events. This lesson covers the pattern, the two flavours, the saga pattern that is often implemented on top, and the 2026 toolset that has converged around it.

Synchronous RPC and what it costs

Start with the failure mode the event-driven pattern is reacting against. In a synchronous-RPC backend, a request from the user hits service A, which calls service B, which calls service C, which calls service D. Each call is an HTTP or gRPC request. Each call blocks until the next service responds. The whole call chain holds open while the request walks through the system.

The problems compound:

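Latency adds up. The user-visible latency is the sum of every hop. Four sequential hops at a typical 50ms each come to roughly 200ms end to end, before the network and the database. (Percentiles do not add exactly, but the sum is a fair first estimate.)

Tail latency dominates. If service D has a 99th-percentile latency of 500ms, and your request fans out to many such calls, your end-to-end p99 is much worse than any one service’s p99, because the chance that at least one call lands in its slow tail grows with the number of calls.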

Failure cascades. When service D is down, service C times out. When service C times out, service B times out. The whole chain is down because the deepest link is down. Circuit breakers and retries help; they do not eliminate the coupling.

Coupling at deploy time. Service A knows the address of service B and the shape of B’s API. When B changes, A has to be updated. Multiply by N services.

These are not bugs in synchronous RPC; they are the consequences of the model. The pattern has its place, and most user-facing read paths are easier to reason about as RPCs. But for write paths and for cross-service workflows, the cost adds up.
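The tail-latency point can be made concrete with a little arithmetic. The sketch below assumes each call independently has a 1% chance of landing in its own slow tail, which is an idealisation (real services are often correlated), and the function name is illustrative.

```python
def chance_of_slow_request(calls: int, per_call_slow_probability: float = 0.01) -> float:
    """Probability that at least one of `calls` independent calls is slow."""
    return 1.0 - (1.0 - per_call_slow_probability) ** calls


# How often a request touches at least one p99-slow call, by fan-out size.
for calls in (1, 4, 10, 100):
    print(f"{calls:>3} calls: {chance_of_slow_request(calls):.1%} of requests hit a slow tail")
```

Under these assumptions, a request that makes 100 such calls hits at least one slow tail about 63% of the time, which is why a per-service p99 says little about the end-to-end p99.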

The event-driven shape

The alternative: service A emits an event (“OrderPlaced”, with the order details in the payload) and returns to its caller. Services B, C, and D have subscribed to OrderPlaced events. They each consume the event and do their work asynchronously. The user-visible request finishes at A; the rest happens in the background.

The substrate is usually a durable log or a managed messaging service: Kafka (lesson 42), Pulsar, Amazon Kinesis, Google Cloud Pub/Sub, Azure Event Hubs. The producer writes once. Many consumers read independently. The log persists the events so that a slow or temporarily unavailable consumer can catch up later. Delivery semantics (lesson 16) dictate the rest of the contract.

The benefits invert the synchronous-RPC failure modes:

Latency for the user is just service A’s latency. The downstream work happens out of band.

Failure isolation. Service D being down is service D’s problem. Service A returns to the user normally; the events queue up in the log; service D catches up when it comes back.

Loose coupling. Service A does not know who consumes OrderPlaced. New consumers can be added without changing A. Old consumers can be removed without changing A.

Auditability. The event log is a record of everything that happened, in order (on Kafka, the ordering guarantee holds within a partition, not across the whole topic). Replaying the log reconstructs state. Debugging walks back through the events.
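The failure-isolation and loose-coupling points can be sketched with a toy in-memory log. This is only an illustration: `EventLog` and `Consumer` are hypothetical names, the list stands in for a single-partition durable log, and nothing here is persisted.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only log; stands in for a durable topic with one partition."""
    events: list[dict] = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the new event

    def read_from(self, offset: int) -> list[dict]:
        return self.events[offset:]


@dataclass
class Consumer:
    """Tracks its own offset, so it can fall behind and catch up independently."""
    name: str
    offset: int = 0
    seen: list[str] = field(default_factory=list)

    def poll(self, log: EventLog) -> None:
        for event in log.read_from(self.offset):
            self.seen.append(event["order_id"])
            self.offset += 1


log = EventLog()
billing = Consumer("billing")
shipping = Consumer("shipping")  # pretend this consumer is down for a while

log.append({"type": "OrderPlaced", "order_id": "o-1"})
billing.poll(log)  # billing keeps up
log.append({"type": "OrderPlaced", "order_id": "o-2"})
billing.poll(log)

shipping.poll(log)  # shipping comes back and catches up from offset 0
print(billing.seen, shipping.seen)
```

The producer never references either consumer, and the consumer that was down ends up with the same events as the one that kept up.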

The trade-offs are also real:

Eventual consistency. The state across the system is consistent eventually, not immediately. The user sees their order placed; the email confirmation arrives a second later. For most workflows this is fine. For some it is not.

Harder debugging. “What happens when an order is placed?” is no longer answerable by reading service A’s code. The answer is spread across every consumer of OrderPlaced.

Schema evolution. The event schema is now a contract between services. Changing it is a coordinated change.

Operational substrate. Running a Kafka cluster (or its managed equivalent) is its own operational discipline.

The pattern is worth the cost when the workflow is naturally asynchronous, when downstream work can fail and retry independently, and when the team is willing to invest in the substrate. It is overkill for the simple case where a request just needs a database write and a response.
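One common way to soften the schema-evolution cost listed above is to keep changes additive and have consumers ignore fields they do not recognise (a "tolerant reader"). The sketch assumes JSON payloads; the `OrderPlaced` fields and the `"USD"` default are hypothetical, chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class OrderPlaced:
    order_id: str
    total_cents: int
    currency: str = "USD"  # assumed later addition; the default keeps old events readable


def parse_order_placed(raw: str) -> OrderPlaced:
    """Read only the fields this consumer needs and ignore everything else."""
    data = json.loads(raw)
    return OrderPlaced(
        order_id=data["order_id"],
        total_cents=data["total_cents"],
        currency=data.get("currency", "USD"),
    )


# An event written before `currency` existed, and a newer one with extra fields.
old_event = '{"order_id": "o-1", "total_cents": 1200}'
new_event = '{"order_id": "o-2", "total_cents": 900, "currency": "EUR", "coupon": "X1"}'

print(parse_order_placed(old_event))
print(parse_order_placed(new_event))
```

Renaming or removing a field is still a coordinated change; this approach only covers additions.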

Choreography: services react

Once events are the substrate, you have two ways to express a multi-step workflow. The first is choreography. Each service knows its own job. When it sees an event it cares about, it does its work and emits its own event. There is no central coordinator. The workflow is the emergent behaviour of services reacting to each other.

A canonical example: an order-placement workflow.

sequenceDiagram
    participant U as User
    participant O as OrderService
    participant K as Kafka
    participant P as PaymentService
    participant I as InventoryService
    participant N as NotificationService

    U->>O: POST /orders
    O->>K: emit OrderPlaced
    O-->>U: 201 Created
    K->>P: OrderPlaced
    P->>K: emit PaymentAuthorized
    K->>I: PaymentAuthorized
    I->>K: emit InventoryReserved
    K->>N: InventoryReserved
    N->>N: send confirmation email

The order service knows nothing about payments, inventory, or notifications. It emits OrderPlaced and returns. The downstream services subscribe and react. Each service is small, focused, and independently deployable.
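The flow in the diagram can be sketched with a toy in-memory bus. This is a minimal illustration, not a Kafka client: the `Bus` class is hypothetical, and each lambda stands in for a whole service.

```python
from collections import defaultdict, deque
from typing import Callable

Handler = Callable[[dict], None]


class Bus:
    """Toy in-memory bus; a real system would use a durable log such as Kafka."""

    def __init__(self) -> None:
        self.subscribers: dict[str, list[Handler]] = defaultdict(list)
        self.queue: deque[tuple[str, dict]] = deque()
        self.history: list[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        self.queue.append((event_type, payload))

    def run(self) -> None:
        # Deliver queued events until no service has anything left to emit.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.subscribers[event_type]:
                handler(payload)


bus = Bus()
sent_emails: list[str] = []

# Each service knows only the event it listens for and the event it emits.
bus.subscribe("OrderPlaced", lambda e: bus.emit("PaymentAuthorized", e))
bus.subscribe("PaymentAuthorized", lambda e: bus.emit("InventoryReserved", e))
bus.subscribe("InventoryReserved", lambda e: sent_emails.append(e["order_id"]))

bus.emit("OrderPlaced", {"order_id": "o-123"})  # the order service's only job
bus.run()
print(bus.history, sent_emails)
```

No line of this code states the workflow as a whole; the sequence exists only in the chain of subscriptions.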

The strengths of choreography:

Loose coupling. Adding a new participant (analytics, fraud detection, partner notifications) is a matter of subscribing to the relevant events. No existing service changes.

No single point of failure for the workflow. Each service is independent.

Natural fit for genuinely independent reactions. When N services need to react to the same event in parallel and do not depend on each other, choreography is the obvious shape.

The weaknesses:

No single source of truth for the workflow. Asking “what happens when an order is placed” requires reading the source of every consumer.

Hard to see “where is order #123 stuck?”. The order is whatever the events say it is. Debugging a stuck order means reading multiple service logs and reconstructing the state from the event sequence.

Implicit dependencies. In the example, InventoryService depends on PaymentService emitting PaymentAuthorized. If PaymentService changes the event name or payload, InventoryService breaks. The dependency is real but invisible in the code.

Compensation logic is scattered. When the workflow needs to roll back (payment authorised but inventory not available), the compensation has to be coordinated across services that do not know about each other.
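To show what "reconstructing the state from the event sequence" involves, here is a small sketch that folds an event stream into the furthest stage each order has reached. It assumes every event carries an `order_id` and a `type` drawn from the three events in the example; both the field names and the helper functions are illustrative.

```python
# Expected order of events for one order in the choreographed example.
EXPECTED_SEQUENCE = ["OrderPlaced", "PaymentAuthorized", "InventoryReserved"]


def last_stage_per_order(events: list[dict]) -> dict[str, str]:
    """Fold the event stream into the furthest stage each order has reached."""
    stage: dict[str, str] = {}
    for event in events:
        order_id, kind = event["order_id"], event["type"]
        current = stage.get(order_id)
        if current is None or EXPECTED_SEQUENCE.index(kind) > EXPECTED_SEQUENCE.index(current):
            stage[order_id] = kind
    return stage


def stuck_orders(events: list[dict]) -> dict[str, str]:
    """Orders whose furthest stage is not the final one in the sequence."""
    final = EXPECTED_SEQUENCE[-1]
    return {o: s for o, s in last_stage_per_order(events).items() if s != final}


events = [
    {"order_id": "o-122", "type": "OrderPlaced"},
    {"order_id": "o-123", "type": "OrderPlaced"},
    {"order_id": "o-122", "type": "PaymentAuthorized"},
    {"order_id": "o-123", "type": "PaymentAuthorized"},
    {"order_id": "o-122", "type": "InventoryReserved"},
]
print(stuck_orders(events))
```

Even in this toy, the expected sequence has to be written down somewhere outside the services, which is the knowledge choreography otherwise leaves implicit.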

Orchestration: a coordinator drives

The alternative is orchestration. A central coordinator (an orchestrator, a saga manager, a workflow engine) holds the workflow definition and tells each service what to do, in order, with explicit handling of failures and compensations.

The same order-placement workflow as an orchestrated flow:

sequenceDiagram
    participant U as User
    participant O as OrderService
    participant W as WorkflowEngine
    participant P as PaymentService
    participant I as InventoryService
    participant N as NotificationService

    U->>O: POST /orders
    O->>W: start OrderWorkflow
    O-->>U: 201 Created
    W->>P: AuthorizePayment
    P-->>W: success
    W->>I: ReserveInventory
    I-->>W: success
    W->>N: SendConfirmation
    N-->>W: success
    W->>W: workflow complete

The workflow engine owns the state of the workflow. It knows the current step, the next step, and what to do on failure. The participating services are passive: they expose actions, and the orchestrator invokes them.
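A minimal sketch of the same idea, assuming in-process function calls in place of real service calls: the `Orchestrator` class and the three step functions are hypothetical stand-ins, and the status map is held in memory rather than in durable storage.

```python
from dataclasses import dataclass, field
from typing import Callable

Step = tuple[str, Callable[[str], None]]  # (step name, action taking an order id)


@dataclass
class Orchestrator:
    """Toy coordinator: owns the step list and the status of every workflow."""
    steps: list[Step]
    status: dict[str, str] = field(default_factory=dict)

    def start(self, order_id: str) -> None:
        for name, action in self.steps:
            self.status[order_id] = f"running {name}"
            try:
                action(order_id)
            except Exception as exc:
                self.status[order_id] = f"failed at {name}: {exc}"
                return
        self.status[order_id] = "complete"


def authorize_payment(order_id: str) -> None:
    pass  # stand-in for a call to the payment service


def reserve_inventory(order_id: str) -> None:
    if order_id == "o-123":
        raise RuntimeError("out of stock")  # simulated failure for one order


def send_confirmation(order_id: str) -> None:
    pass  # stand-in for a call to the notification service


engine = Orchestrator(steps=[
    ("AuthorizePayment", authorize_payment),
    ("ReserveInventory", reserve_inventory),
    ("SendConfirmation", send_confirmation),
])
engine.start("o-122")
engine.start("o-123")
print(engine.status)
```

The whole flow is readable in one place (the `steps` list), and the status of any order is a dictionary lookup.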

The strengths of orchestration:

Single source of truth. The workflow definition is one piece of code or configuration that captures the whole flow.

Visibility. The orchestrator’s UI shows where every workflow is. “Where is order #123 stuck?” is a query against the orchestrator, not a forensic exercise.

Explicit error handling and compensation. The workflow definition states what happens on failure. Retry this step. Compensate that one. Escalate to a human after three failures.

Easier reasoning about complex workflows. When the workflow has ten steps with conditional branches and timeouts, orchestration makes the structure visible. Choreography hides it.

The weaknesses:

The orchestrator is a critical service. If it goes down, no workflows progress. Production deployments make it highly available, but the operational responsibility is real.

Tighter coupling between the orchestrator and the participants. The orchestrator knows the participants by name and address. Adding a new participant means changing the workflow definition.

A new piece of infrastructure to run. Choreography needs only the event bus. Orchestration needs the orchestrator on top.

The choice between choreography and orchestration is not absolute. Most mature systems use both: choreography for the loose-coupling cases (analytics, audit, broadcast notifications), orchestration for the complex multi-step business workflows (order processing, account onboarding, claim adjudication).

The saga pattern

The saga pattern is the canonical solution to the problem that two-phase commit (2PC, lesson 15) tries, and fails, to solve at scale: how do you coordinate a transaction that spans multiple services, each with its own database, when distributed locking and 2PC are too expensive?

A saga expresses the long-running transaction as a sequence of local transactions, one per service, with a compensation defined for each. If step three fails, the saga runs the compensations for steps one and two in reverse, leaving the system in a consistent (if not identical) state.

The saga in the order example:

AuthorizePayment. Compensation: VoidAuthorization.
ReserveInventory. Compensation: ReleaseReservation.
SendConfirmation. (No compensation needed; an extra email is harmless.)

If ReserveInventory fails, the saga runs VoidAuthorization. The user sees an order rejection; the payment is not charged; the system is consistent.
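The rule "on failure, compensate the completed steps in reverse" can be sketched as a small runner. This is an illustration under simplifying assumptions: `run_saga` is a hypothetical helper, the steps are in-process callables, and a compensation that itself fails is not handled.

```python
from typing import Callable, Optional

Action = Callable[[], None]


def run_saga(steps: list[tuple[str, Action, Optional[Action]]], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[tuple[str, Optional[Action]]] = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            log.append(f"{name} failed")
            for done_name, undo in reversed(completed):
                if undo is not None:
                    undo()
                    log.append(f"compensated {done_name}")
            return False
        log.append(f"{name} ok")
        completed.append((name, compensation))
    return True


def fail_reservation() -> None:
    raise RuntimeError("out of stock")  # simulated failure at step two


log: list[str] = []
ok = run_saga(
    [
        ("AuthorizePayment", lambda: None, lambda: log.append("VoidAuthorization")),
        ("ReserveInventory", fail_reservation, lambda: log.append("ReleaseReservation")),
        ("SendConfirmation", lambda: None, None),
    ],
    log,
)
print(ok, log)
```

Only steps that actually completed are compensated: ReleaseReservation never runs, because the reservation never succeeded.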

The saga can be implemented either way. As choreography: each service emits a success event when its step succeeds and a failure event when it fails; services that have already completed their own step compensate when they see a failure. As orchestration: the orchestrator runs the steps in order and explicitly invokes the compensations on failure.
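A sketch of the choreographed variant, using a toy in-memory queue in place of a real bus. The failure and compensation event names (`InventoryReservationFailed`, `PaymentVoided`) are hypothetical, and the inventory failure is hard-coded for illustration.

```python
from collections import defaultdict, deque

subscribers = defaultdict(list)  # event type -> handlers
queue = deque()                  # pending (event type, payload) pairs
trace = []                       # event types in delivery order
authorized = set()               # the payment service's local state


def emit(event_type, payload):
    queue.append((event_type, payload))


def on_order_placed(event):
    # Payment service: local transaction, then a success event.
    authorized.add(event["order_id"])
    emit("PaymentAuthorized", event)


def on_payment_authorized(event):
    # Inventory service: simulate a failed reservation for this sketch.
    emit("InventoryReservationFailed", event)


def on_reservation_failed(event):
    # Payment service again: compensate its own earlier step.
    authorized.discard(event["order_id"])
    emit("PaymentVoided", event)


subscribers["OrderPlaced"].append(on_order_placed)
subscribers["PaymentAuthorized"].append(on_payment_authorized)
subscribers["InventoryReservationFailed"].append(on_reservation_failed)

emit("OrderPlaced", {"order_id": "o-123"})
while queue:
    event_type, payload = queue.popleft()
    trace.append(event_type)
    for handler in subscribers[event_type]:
        handler(payload)

print(trace, authorized)
```

The payment service has to know that an inventory failure event exists and what it means for its own state, which is the scattered compensation logic the choreography section warned about.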

The orchestrated saga is the dominant pattern in 2026 because the orchestrator’s visibility makes the saga’s state debuggable, which is the part of the saga that hurts most in production.

The 2026 toolset

The orchestration tooling has consolidated. The four serious contenders for workflow orchestration, only the first of which defines workflows in ordinary code:

Temporal is the dominant choice for code-defined workflows. Workflows are written as ordinary code (Go, Java, TypeScript, and Python, among other SDKs) and Temporal makes them durable: the result of each step is recorded in a persisted event history, a restarted worker replays that history to resume where the workflow left off, and retries and timeouts are first-class. The mental model is “write the workflow as a function, and Temporal handles the failures”. Temporal Cloud is the managed offering.

AWS Step Functions is the AWS-native answer. Workflows are defined in Amazon States Language (a JSON DSL) and run inside the AWS account. Tight integration with the rest of AWS (Lambda, ECS, SQS, DynamoDB) is the main draw. The trade is lock-in to AWS.

Camunda is the business-process flavour. Workflows are defined in BPMN (Business Process Model and Notation), the XML standard from the business-process world. The audience is teams where business analysts edit the workflow alongside engineers. Camunda 8 (the cloud-native rewrite) replaced the older Camunda 7 (Activiti-based) starting around 2022.

Argo Workflows is the Kubernetes-native option. Workflows are defined as Kubernetes custom resources, each step is a pod, and the orchestrator is itself a Kubernetes controller. The fit is data and ML pipelines that already run on Kubernetes; the lesson 57 framing covered Argo from the data-orchestration angle.

The choice usually falls out of the team’s existing infrastructure. Heavy AWS shop with engineers comfortable in JSON: Step Functions. Greenfield team that wants the workflow logic to be ordinary code: Temporal. Enterprise with BPMN traditions: Camunda. Kubernetes-native data team: Argo.
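The durable-execution idea described for Temporal above (a workflow resumes after a crash instead of starting over) can be illustrated with a toy record-and-replay loop. This is not Temporal's API: `DurableContext` and the step functions are hypothetical, and the history is a plain list where a real engine would use durable storage.

```python
from typing import Any, Callable


class DurableContext:
    """Records each step's result so a restarted workflow can skip finished steps."""

    def __init__(self, history: list[Any]) -> None:
        self.history = history  # would live in durable storage in a real engine
        self.position = 0

    def step(self, fn: Callable[[], Any]) -> Any:
        if self.position < len(self.history):
            result = self.history[self.position]  # replay: reuse the recorded result
        else:
            result = fn()  # first execution: run the step and record its result
            self.history.append(result)
        self.position += 1
        return result


calls: list[str] = []


def charge() -> str:
    calls.append("charge")
    return "auth-1"


def reserve() -> str:
    calls.append("reserve")
    return "res-1"


def order_workflow(ctx: DurableContext, crash_after_payment: bool) -> str:
    payment = ctx.step(charge)
    if crash_after_payment:
        raise RuntimeError("worker crashed")
    reservation = ctx.step(reserve)
    return f"{payment}/{reservation}"


history: list[Any] = []
try:
    order_workflow(DurableContext(history), crash_after_payment=True)
except RuntimeError:
    pass  # the worker died; only the history survives

result = order_workflow(DurableContext(history), crash_after_payment=False)
print(result, calls)
```

The second run reuses the recorded payment result, so `charge` executes only once even though the workflow function runs twice. The sketch only works if the workflow code is deterministic between runs.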

Cross-references and the next step

The pattern connects to several earlier lessons. Lesson 15 introduced 2PC and the saga as the alternative for cross-service consistency. Lesson 16 laid out the delivery semantics that the event bus has to honour. Lesson 42 went deep on Kafka, the most common spine for event-driven systems. Lesson 57 covered orchestrators from the data-pipeline angle; this lesson is the same framing applied to cross-service business workflows.

Lesson 75 closes the module’s opening by zooming out one layer further: once the services and their workflows are settled, where do they run, geographically? Multi-region deployments turn the same architectural questions on their side, and the failure modes of cross-region replication are the next thing the team has to think about.

Citations

Temporal documentation, https://docs.temporal.io/, retrieved 2026-05-01. The dominant code-driven workflow engine.
AWS Step Functions documentation, https://docs.aws.amazon.com/step-functions/, retrieved 2026-05-01.
Camunda 8 documentation, https://docs.camunda.io/, retrieved 2026-05-01.
Argo Workflows documentation, https://argo-workflows.readthedocs.io/, retrieved 2026-05-01.
Chris Richardson, “Microservices Patterns”, Manning, 2018. Standard reference for the saga pattern, choreography vs orchestration, and event-driven design.
Sam Newman, “Building Microservices, 2nd Edition”, O’Reilly, 2021. Chapter on inter-service communication.
Hector Garcia-Molina and Kenneth Salem, “Sagas”, ACM SIGMOD 1987. The original saga paper.