Capstone: design a complete architecture for a fictional company at three scales

Eighty lessons. Ten modules. Around 150,000 words. The reader who has followed from lesson 1 has, by now, a working vocabulary and a decision framework for system architecture. That is the bar this course aimed at, and the rest of this lesson is the exercise that makes the framework concrete.

The exercise is a single fictional company, designed three times at three scales. The company is a B2B SaaS platform that ingests events from customer applications, runs analytics on those events, and exposes both dashboards and an API. Pick whatever shape lets you fill in the blanks: a product analytics tool, an observability vendor, a customer data platform, a feature flagging service. The exact business does not matter. What matters is that the workload is recognisable: ingest, store, query, serve.

The three scales are 100 users at week one, 100,000 users at series A, and 100 million users at the mature SaaS phase. At each scale the architecture is fully designed: components, infrastructure, team shape, monthly cost. The interesting part is not any single design. The interesting part is what changes between scales and why. Each shift is an architectural decision the course covered. The capstone is the chance to see them stacked.

Scale 1: 100 users, week one of the startup

flowchart LR
    User[Customer browser] --> CDN[CDN]
    CDN --> Static[Static frontend]
    User --> API[Single VM<br/>API + worker]
    API --> DB[(Postgres on the same VM)]
    API --> Stripe[Stripe<br/>billing]
    API --> Sendgrid[SendGrid<br/>email]

Components:

One virtual machine running the API server and a background worker process side by side. Could be a 4-core EC2 instance, a Hetzner box, or a DigitalOcean droplet. Cost: 50 to 100 EUR per month.
Postgres, running on the same VM, with daily backups to object storage.
A static frontend (Next.js or Astro static export, lesson 6’s preferred shape) on a CDN.
Stripe for billing. SendGrid or Postmark for transactional email.
Sentry for error tracking, free tier. A free Grafana Cloud account with basic Prometheus metrics, or even just nothing yet.

Total monthly cost: 50 to 100 EUR for infrastructure, plus whatever Stripe and SendGrid charge based on volume. Team: two or three founders, no dedicated operations.

What is deliberately not there. No Kubernetes. No microservices. No Kafka. No data warehouse. No read replicas. No cache. No CI/CD pipeline beyond a GitHub Actions deploy script. No observability stack beyond Sentry and the VM’s own metrics. No on-call rotation, because there are three people and they are all on call by default.

The architectural argument for this shape is the lesson 6 and lesson 7 argument. A single relational primary on a single machine serves real businesses up to surprisingly high traffic (the GitLab pre-2018 architecture on Postgres, the Stack Overflow architecture on SQL Server as documented in their famous 2016 post). The premature-optimisation tax for putting Kubernetes and Kafka in front of the first hundred users is enormous, and the optionality it preserves is illusory because the business will pivot before any of it matters. The shape is what lesson 8 called started-simpler. It is also what survives contact with reality at this stage, because reality at this stage is that the founders need to ship features, not operate infrastructure.
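As a rough illustration of the "API + worker side by side" component, here is a minimal sketch of an API handler handing work to a background worker in the same process. The names `handle_ingest`, `worker`, and `run_demo` are hypothetical, and the in-memory queue is an assumption made to keep the example self-contained; a durable queue (for example a jobs table in the same Postgres) would be the more realistic choice, since an in-memory queue loses its contents on restart.

```python
import queue
import threading


def handle_ingest(jobs: queue.Queue, event: dict) -> dict:
    """Stand-in for an API handler: accept the event, defer the heavy work."""
    jobs.put(event)
    return {"status": "accepted", "id": event["id"]}


def worker(jobs: queue.Queue, processed: list) -> None:
    """Background worker loop. A None job is the shutdown signal."""
    while True:
        job = jobs.get()
        if job is None:
            return
        # Real work (aggregation, sending email) would happen here.
        processed.append(job["id"])


def run_demo() -> list:
    """Run three events through the handler and a single worker thread."""
    jobs = queue.Queue()
    processed = []
    thread = threading.Thread(target=worker, args=(jobs, processed))
    thread.start()
    for i in range(3):
        handle_ingest(jobs, {"id": f"evt-{i}"})
    jobs.put(None)  # ask the worker to stop once the queue is drained
    thread.join()
    return processed
```

With one worker and a FIFO queue, events are processed in arrival order. Nothing here needs a broker or a second machine, which is the point of the scale 1 shape.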

Scale 2: 100,000 users, series A

flowchart LR
    User[Customer browser] --> CDN[CDN]
    CDN --> Static[Static frontend]
    User --> LB[Load balancer<br/>multi-AZ]
    LB --> API1[API pod]
    LB --> API2[API pod]
    API1 --> Cache[(Redis<br/>cache + sessions)]
    API2 --> Cache
    API1 --> Primary[(Postgres<br/>primary)]
    API2 --> Primary
    API1 --> Replica[(Postgres<br/>read replica)]
    API2 --> Replica
    Primary -.replication.-> Replica
    API1 --> Kafka[Kafka<br/>event spine]
    Kafka --> Worker[Worker pool]
    Worker --> Primary
    Worker --> S3[(Object storage<br/>files + backups)]
    Worker --> WH[(Warehouse<br/>BigQuery / Snowflake)]
    API1 --> Obs[Observability stack<br/>Datadog / Grafana Cloud]

Components:

Postgres, now a managed service (RDS or Aurora), with one or two read replicas. The application has been refactored to route reads to replicas where eventual consistency is acceptable (lesson 25 and lesson 26).
A separate worker pool for background jobs, running on the same Kubernetes cluster as the API pods or on a serverless offering like ECS Fargate.
Redis for cache and session storage (lesson 18). Hot reads that used to hit Postgres now hit Redis. Cache invalidation is the team’s first hard problem.
Kafka as the event spine (lesson 42). Customer events flow into Kafka and are consumed by the worker pool, the warehouse loader, and any other downstream system. The event-driven shape of lesson 47 starts to pay off.
Object storage for user-uploaded files and database backups. CDN in front of the static frontend and any user-facing media.
Multi-AZ deployment in one region (lesson 76). The Postgres primary, replica, and Kubernetes nodes are spread across three availability zones. A single AZ outage does not take down the platform.
A warehouse (BigQuery, Snowflake, or Redshift) with data loaded from the operational database via CDC (lesson 46) or scheduled ELT (lesson 33) plus events from Kafka.
A CI/CD pipeline for the application code (lessons 51 and 52) and Terraform or Pulumi for the infrastructure (lesson 53).
A basic observability stack: Datadog or Grafana Cloud, Prometheus metrics, structured logging, distributed tracing on the critical request paths (lesson 59). SLOs defined for the top three user-facing flows (lesson 60).
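The SLO item above implies an error budget, and the arithmetic is worth seeing once. The sketch below assumes an availability-style SLO measured over a rolling window; the function names and the 30-day default are illustrative choices, not something the course prescribes.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

For a 99.9 percent target over 30 days, the budget works out to roughly 43 minutes. A single 20-minute incident therefore spends close to half of it, which is the kind of number that makes the SLO conversation concrete for a team of this size.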

Total monthly cost: 10,000 to 30,000 EUR, depending on traffic shape and how much of it has been over-engineered. Team: 10 to 30 engineers, one dedicated operations or SRE engineer who probably also has product responsibilities.
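The cost lines for the first two scales can be put on a per-user basis with a few lines of arithmetic. The figures are the ones quoted in this lesson; the `Scale` class is a hypothetical helper, and the scale 1 numbers exclude the volume-based Stripe and email fees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    name: str
    users: int
    cost_low_eur: float   # lower bound of monthly infrastructure cost
    cost_high_eur: float  # upper bound of monthly infrastructure cost

    def cost_per_user(self) -> tuple[float, float]:
        """Monthly infrastructure cost per user as a (low, high) range in EUR."""
        return (self.cost_low_eur / self.users, self.cost_high_eur / self.users)


# Figures taken from the two cost lines in this lesson.
SCALE_1 = Scale("week one", users=100, cost_low_eur=50, cost_high_eur=100)
SCALE_2 = Scale("series A", users=100_000, cost_low_eur=10_000, cost_high_eur=30_000)

for scale in (SCALE_1, SCALE_2):
    low, high = scale.cost_per_user()
    print(f"{scale.name}: {low:.2f} to {high:.2f} EUR per user per month")
```

The total bill grows by two orders of magnitude between the scales while the user count grows by three, so the per-user figure falls even though the architecture has many more parts.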

The architectural shifts from scale 1 are exactly the ones the course covered. Read replicas because the read traffic now justifies them. Redis cache because the latency budget for the most common queries does not allow another database round-trip. Kafka because the integration count has crossed the threshold where point-to-point connections become a tangle. Multi-AZ because the SLA the contracts now mention requires it. A warehouse because the analytics queries are starting to interfere with the operational database. A CI/CD pipeline because the team is large enough that manual deploys are now an outage source.
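The read-replica shift is mostly an application change: each query has to declare whether it can tolerate stale data. A minimal sketch of that routing decision follows, assuming a hypothetical `ReadWriteRouter` that hands back connection names rather than real connections, and round-robin as the replica selection policy.

```python
import itertools


class ReadWriteRouter:
    """Chooses a database target per query. Targets are plain names here."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        # Round-robin over replicas; None when no replica is configured.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def for_write(self) -> str:
        """Writes always go to the primary."""
        return self.primary

    def for_read(self, tolerate_stale: bool) -> str:
        """Reads go to a replica only when the caller accepts replication lag."""
        if tolerate_stale and self._replicas is not None:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
dashboard_target = router.for_read(tolerate_stale=True)    # a replica
billing_target = router.for_read(tolerate_stale=False)     # the primary
```

The design choice worth noticing is that staleness tolerance is an explicit argument. A read that must see the caller's own recent write (a settings page right after saving, say) passes `tolerate_stale=False` and pays the cost of hitting the primary.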

What is still not there. No multi-region. No sharding. No microservices, beyond a couple of small services that earned their independence by genuinely needing different scaling characteristics. No dedicated platform team, dedicated SRE team, or dedicated security team. No SOC 2, though the company is starting the process. No GPU fleet. The architecture has more parts than scale 1, but every part earned its place by solving a problem the team was actually having.

Scale 3: 100 million users, mature SaaS

flowchart TB
    subgraph Edge[Edge layer]
        CDN[Global CDN]
        WAF[WAF + DDoS protection]
    end
    subgraph US[US region]
        US_LB[Regional load balancer]
        US_API[API services<br/>Kubernetes cluster]
        US_DB[(Sharded Postgres / Citus)]
        US_Cache[(Redis cluster)]
        US_Kafka[Kafka cluster]
    end
    subgraph EU[EU region]
        EU_LB[Regional load balancer]
        EU_API[API services<br/>Kubernetes cluster]
        EU_DB[(Sharded Postgres / Citus)]
        EU_Cache[(Redis cluster)]
        EU_Kafka[Kafka cluster]
    end
    subgraph Data[Data platform]
        Lake[(Lakehouse<br/>Iceberg / Delta)]
        WH[(Warehouse)]
        ML[ML platform]
    end
    Edge --> US_LB
    Edge --> EU_LB
    US_Kafka -.MirrorMaker.-> EU_Kafka
    US_Kafka --> Lake
    EU_Kafka --> Lake
    Lake --> WH
    Lake --> ML

Components:

Sharded Postgres (via Citus, or an application-level sharding layer of the kind Vitess provides for MySQL) or a migration to a distributed database (Spanner, CockroachDB, Aurora). The decision between sharding the existing Postgres and migrating to a distributed database was the multi-quarter project lesson 29 described, and either outcome is defensible.
Multi-region. Active-active for the read-heavy tier, active-passive for tiers that cannot tolerate the conflict-resolution complexity (lesson 75). Customers are routed to their nearest region by GeoDNS, and the data is replicated across regions with whatever consistency model the team has decided to live with.
Microservice boundaries that match team boundaries (lesson 73). The architecture is not “microservices” because microservices was the goal; it is microservices because the team is now hundreds of engineers, and the social cost of a single deployable shared by hundreds of engineers is greater than the operational cost of a hundred deployables coordinated through a service mesh.
A full data platform: lakehouse (lesson 37) on object storage with Iceberg or Delta, a warehouse for BI workloads, a feature store, a model registry, an experiment tracker, and the serving infrastructure from lesson 79. Streaming pipelines (lessons 41 to 48) feed everything in near real time.
A full observability stack with custom dashboards per team, distributed tracing across every service, SLOs and error budgets per critical user journey, and log retention sized to meet the regulatory requirements the company now has.
Dedicated platform teams: a developer experience team owning CI/CD and developer tooling, a data platform team owning the warehouse and lakehouse, an SRE team owning incident response and the on-call programme, an infrastructure team owning Kubernetes and the underlying cloud accounts, a security team owning the compliance certifications and vulnerability management.
Compliance certifications: SOC 2 Type II, ISO 27001, PCI DSS if the company touches card data directly, GDPR and CCPA programmes with the data residency, deletion, and consent infrastructure those require (lesson 78). The compliance programme is itself a multi-million-euro line item.
A formal cost optimisation programme (Module 9), with a FinOps function, monthly review of unit economics, reserved-instance and savings-plan management, storage tiering, and explicit gates on any architecture change that materially affects the bill.
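The sharding and multi-region items above both come down to a placement question: given a tenant, which region and which shard hold its data? The sketch below is one simple way to answer it, assuming a hypothetical directory of tenant home regions and hash-modulo shard assignment. It is not a description of how Citus or any particular product assigns shards, and modulo hashing in particular moves most tenants whenever the shard count changes, which is why production systems tend to use a lookup directory or consistent hashing instead.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    region: str
    shard: int


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Stable shard index for a tenant.

    Uses SHA-256 rather than the built-in hash(), which is salted per process
    and would route the same tenant differently after a restart.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def place(tenant_id: str, home_regions: dict, shards_per_region: dict) -> Placement:
    """Resolve a tenant to its home region (for data residency) and shard."""
    region = home_regions[tenant_id]
    return Placement(region, shard_for(tenant_id, shards_per_region[region]))


# Hypothetical tenants and shard counts, for illustration only.
home_regions = {"acme": "eu", "globex": "us"}
shards_per_region = {"eu": 16, "us": 32}
placement = place("acme", home_regions, shards_per_region)
```

Keeping the region lookup separate from the shard hash means a tenant can be pinned to a region for residency reasons without affecting how tenants spread across shards inside that region.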

Total monthly cost: a fraction of revenue. Well-run shops keep infrastructure cost under five percent of revenue at this scale; less well-run shops are at fifteen percent and the CFO is asking pointed questions. Team: hundreds of engineers, with the dedicated platform teams above accounting for ten to twenty percent of headcount.
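The five and fifteen percent markers above are easy to turn into a check that can sit in a monthly FinOps review. This is a sketch only: the function names and the revenue figure in the usage lines are hypothetical, and the two thresholds are the ones quoted in this lesson rather than an industry standard.

```python
def infra_cost_share(monthly_infra_cost: float, monthly_revenue: float) -> float:
    """Infrastructure cost as a fraction of revenue for the same month."""
    if monthly_revenue <= 0:
        raise ValueError("monthly_revenue must be positive")
    return monthly_infra_cost / monthly_revenue


def classify(share: float) -> str:
    """Bucket a cost share using the two thresholds quoted in this lesson."""
    if share < 0.05:
        return "under five percent"
    if share >= 0.15:
        return "fifteen percent or more"
    return "in between"


# Hypothetical month: 400,000 EUR of infrastructure against 10,000,000 EUR of revenue.
share = infra_cost_share(400_000, 10_000_000)
print(f"{share:.1%} of revenue: {classify(share)}")
```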

The architectural shifts from scale 2 are again the ones the course covered. Sharding because the single-primary Postgres pattern hit its ceiling (lesson 29). Multi-region because the largest customers will not sign without it (lesson 75). Microservices because the team boundary (lesson 73) requires it. Compliance certifications because the enterprise sales motion requires them (lesson 78). The cost optimisation programme because at this scale, half a percent of revenue is real money.

What is hopefully still recognisable from scale 1: Postgres is still the database, even if it is sharded behind Citus or a routing layer, or replaced with a Postgres-compatible distributed system. The static frontend is still served from a CDN. Kafka, which arrived at scale 2, is still the spine. The shape of the system is the same; the constants are different. That is the thread the course has tried to pull on from lesson 1.

What you now know

A paragraph per module, in plain English, on what the reader can now reason about.

Module 1, lessons 1 to 8: foundations. What architecture is, functional versus non-functional requirements, the C4 model, architectural decision records, the language of trade-offs, the first architecture, and when it stops working. You can look at a system and say what its quality attributes are, where the trade-offs were made, and what kind of system it is. You can write an ADR.

Module 2, lessons 9 to 16: distributed systems theory. The eight fallacies, CAP, PACELC, consistency models, time and clocks, consensus, two-phase commit, delivery semantics. You can read a database’s documentation and understand what it is promising and what it is not. You can argue with a vendor about their CAP characterisation without bluffing.

Module 3, lessons 17 to 24: storage primitives. Relational, key-value, document, wide-column, time-series, graph, vector, polyglot persistence. You know what each shape is good at, what it is bad at, and why a real system usually uses three of them.

Module 4, lessons 25 to 32: replication and partitioning. Replication patterns, replication lag, partitioning, hot keys, sharding strategies, split-brain, cross-shard queries, and the Discord case study. You can reason about a database scaling roadmap.

Module 5, lessons 33 to 40: batch. ETL versus ELT, batch fundamentals, Spark, the medallion architecture, the lakehouse, idempotent batch, backfills, and the Netflix case study. You can design a batch pipeline that does not corrupt the warehouse on its third retry.

Module 6, lessons 41 to 48: streaming. Why streaming, Kafka, stream processors, event-time and watermarks, exactly-once, CDC and the outbox, lambda versus kappa, and the Uber case study. You can read a streaming architecture diagram and understand the trade-offs in every box.

Module 7, lessons 49 to 56: deployment. Git branching, trunk-based development, CI and CD for data, IaC, Docker, Kubernetes, and the Stripe case study. You can take a team deploying by hand on Fridays and walk them through the path to deploying twenty times a day.

Module 8, lessons 57 to 64: operations. Orchestration, asset-oriented orchestration, observability, SLOs and error budgets, data quality, incident response, on-call, and the Airbnb case study. You can take an existing platform and turn it from “a pile of pipelines” into “something a team can run”.

Module 9, lessons 65 to 72: cost. The cost iceberg, storage costs, network costs, compute right-sizing, query cost, GPU costs, the data egress problem, and the rebuild-versus-buy calculus. You can read a cloud bill and know what to ask.

Module 10, lessons 73 to 80: scaling-out concerns. Microservice boundaries, Conway’s Law, multi-region patterns, high availability beyond multi-AZ, regulatory and compliance considerations, the ML platform, and this capstone. You can join a senior architecture conversation and follow it.

That is the working vocabulary. Whether it is enough depends on what the reader does with it next.

Where to go next

Five concrete recommendations for what to read after this course.

Designing Data-Intensive Applications, by Martin Kleppmann. Read it cover to cover. The course was a tour; the book is the depth. Every topic in Modules 2 through 6 has a chapter that goes deeper, and many references in this course point back to it. The 2017 edition is still the canonical text in 2026.

The SRE Book and the SRE Workbook, both free at https://sre.google/books/. The Google SRE team’s published practices on operating large systems. The course covered the SRE concepts (SLOs, error budgets, on-call, incident response) at the architecture level; these books are the practitioner depth. The Workbook in particular is full of templates and worked examples that translate directly into runbooks.

Engineering blogs. Subscribe to the blogs of the engineering teams whose case studies the course drew from: Netflix Tech Blog, Uber Engineering, Stripe Engineering, Discord Engineering, Pinterest Engineering, Airbnb Engineering, Cloudflare Blog, Figma Engineering. The signal-to-noise is high; a year of reading will refresh and extend most of the content from this course.

The other courses on this site. This course is the architecture lens; the SQL Server course covers the database layer at depth, the PySpark course covers the modern batch and streaming engine, and the Python course covers the application and ML layers. Together they are the full data-engineering toolbox.

A real workload. The strongest learning loop is the one where the reader takes a real system at their job, applies the framework from this course, finds the trade-offs that were made implicitly, and either documents them as ADRs or proposes changes that would justify a different trade-off. The course content goes from passive vocabulary to active practice the moment it gets used.

Closing

You now have the vocabulary. You can read a job posting, a vendor pitch, an engineering blog post, an architecture diagram, or a senior engineer’s whiteboard sketch, and you can follow along. You can ask the question that exposes the assumption. You can name the trade-off that was made implicitly. You can sketch your own architecture and argue for the choices you made. That is what eighty lessons of system architecture were for.

The remainder is practice. The systems that move the world are the ones that survived contact with reality, and reality is what you will be working with starting tomorrow. Constants differ. Shapes do not. Go build something.

Citations and further reading

Martin Kleppmann, “Designing Data-Intensive Applications”, O’Reilly Media, 2017, https://dataintensive.net/ (retrieved 2026-05-01). The depth reference for nearly every topic in this course.
Beyer, Jones, Petoff, Murphy (eds.), “Site Reliability Engineering: How Google Runs Production Systems”, O’Reilly Media, 2016, free at https://sre.google/sre-book/table-of-contents/ (retrieved 2026-05-01). The foundational SRE text.
Beyer, Murphy, Rensin, Kawahara, Thorne (eds.), “The Site Reliability Workbook: Practical Ways to Implement SRE”, O’Reilly Media, 2018, free at https://sre.google/workbook/table-of-contents/ (retrieved 2026-05-01). The hands-on companion volume.
Netflix Tech Blog, https://netflixtechblog.com/ (retrieved 2026-05-01).
Uber Engineering, https://www.uber.com/blog/engineering/ (retrieved 2026-05-01).
Stripe Engineering, https://stripe.com/blog/engineering (retrieved 2026-05-01).
Discord Engineering, https://discord.com/category/engineering (retrieved 2026-05-01).
Pinterest Engineering, https://medium.com/pinterest-engineering (retrieved 2026-05-01).
Airbnb Engineering, https://medium.com/airbnb-engineering (retrieved 2026-05-01).
Cloudflare Blog, https://blog.cloudflare.com/ (retrieved 2026-05-01).
Figma Engineering, https://www.figma.com/blog/engineering/ (retrieved 2026-05-01).