Data & System Architecture, from the ground up Lesson 77 / 80

Security architecture: least privilege, defense in depth

The security principles every system needs as load-bearing architecture. Least privilege, defense in depth, zero trust, and the IAM and network controls that turn principles into reality.

Security is the part of architecture that most teams treat as somebody else’s job until the day it isn’t. The cloud security team writes the policies, the platform team runs the IAM, the data engineers wire pipelines together and assume that whatever permissions they need will be granted on request. The result is a system where no one person has an end-to-end mental model of who can read what, and a small misconfiguration somewhere produces a story in the news.

The framing this lesson takes is that security is load-bearing architecture, the same way consistency, partitioning, and replication are load-bearing. It is not a layer that gets bolted on at the end; it is a set of constraints that shape every component. A data platform built on the wrong assumptions about identity and permissions will spend years retrofitting controls that should have been there at the start.

The good news is that the principles are few and the patterns are well understood. The work is in applying them consistently, not in inventing them.

Three principles

Three principles cover most of what a working architect needs to internalise. They reinforce each other, and a system missing any one of them has a brittle posture.

Least privilege. Every actor in the system, whether a human user, a service account, or a role assumed by a workload, gets exactly the permissions it needs to do its job, and no more. The default is nothing; permissions are granted explicitly. The phrase that matters is “default deny”: if a permission has not been explicitly granted, the action is forbidden. This sounds bureaucratic until the day a service account gets compromised and the blast radius is exactly the small set of actions it was authorised to do, rather than the entire AWS account.

The practical version is that every IAM role starts empty and grows by accretion as the team identifies what the workload actually does. Wildcard permissions like Action: * are tempting because they remove a class of small frictions, but they replace those frictions with a single enormous risk. A role that can do anything is a role that, when stolen, can do anything.
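Default deny is easy to state as code. The sketch below is a minimal illustration, not any cloud provider's policy engine: the `Allow` shape, the glob matching, and the `raw-events` bucket name are all assumptions made for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Allow:
    """One explicit grant: glob patterns for actions and resources (hypothetical shape)."""
    actions: tuple
    resources: tuple


def is_allowed(grants, action: str, resource: str) -> bool:
    """Default deny: return True only if some grant explicitly matches."""
    for grant in grants:
        action_ok = any(fnmatchcase(action, pattern) for pattern in grant.actions)
        resource_ok = any(fnmatchcase(resource, pattern) for pattern in grant.resources)
        if action_ok and resource_ok:
            return True
    return False  # nothing matched, so the action is forbidden


# A role for a pipeline that only reads one prefix of one (hypothetical) bucket.
reader_role = [Allow(actions=("s3:GetObject",),
                     resources=("arn:aws:s3:::raw-events/2026/*",))]

# A wildcard role: convenient, and unbounded when stolen.
wildcard_role = [Allow(actions=("*",), resources=("*",))]
```

With `reader_role`, a stolen credential can read objects under one prefix and do nothing else. With `wildcard_role`, the same function returns `True` for every action on every resource, which is the blast radius the paragraph above warns about.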

Defense in depth. Assume any single layer of the security architecture can be breached. The system stays safe because there are multiple layers between the attacker and the crown jewels. A perimeter firewall, plus authentication on every internal request, plus authorisation on every operation, plus encryption of data at rest, plus audit logging that records who did what. An attacker who gets past the firewall still has to authenticate; an attacker who steals a credential still has to find a way to use it that the audit log will not flag; an attacker who exfiltrates data still finds it encrypted with keys they cannot reach.

The principle is honest about human limits. Every layer will have bugs. Every layer will be misconfigured occasionally. The question is whether a single bug or misconfiguration is enough to compromise the system, or whether several would have to align. Defense in depth is the design choice that makes it the second.
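Rough arithmetic shows why aligned failures are so much rarer than single ones. As an idealised assumption, suppose each of n layers fails independently with probability p_i; the figures below are illustrative, not measured.

```latex
P(\text{all layers fail}) = \prod_{i=1}^{n} p_i,
\qquad \text{e.g. } n = 5,\ p_i = 0.1 \;\Rightarrow\; P = 0.1^{5} = 10^{-5}.
```

Real layers are rarely fully independent: one stolen administrator credential can defeat several at once, so the true figure is higher than the product suggests. The direction of the effect still holds, and it is the reason to keep layers from sharing a single point of failure.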

Zero trust. The classic security model assumed that requests originating inside the corporate network were trustworthy and requests from outside were not, and the firewall was the line that separated them. Zero trust rejects this. Network location is not a credential. Every request, whether it comes from a laptop on the office Wi-Fi or from a mobile phone on a coffee-shop network, must be authenticated and authorised on its own merits. This becomes especially important once workloads are spread across cloud regions, on-prem data centres, vendor SaaS, and contractor laptops, all of which have different network postures.

The architectural consequence is mTLS between services, identity-aware proxies in front of internal applications, and conditional access policies that consider device posture, location, and behaviour rather than just network membership.
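A minimal sketch of what "network location is not a credential" means for an access decision. The field names and the specific checks are assumptions chosen for illustration; a real conditional access policy would weigh more signals, including location and behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Signals gathered for one request (field names are illustrative)."""
    identity_verified: bool   # e.g. a valid SSO token or mTLS client certificate
    mfa_passed: bool
    device_compliant: bool    # device posture, e.g. disk encryption and patch level
    source_network: str       # kept for audit, deliberately ignored for trust


def decide(ctx: RequestContext, granted: frozenset, permission: str) -> bool:
    """Authenticate and authorise each request on its own merits."""
    if not (ctx.identity_verified and ctx.mfa_passed):
        return False
    if not ctx.device_compliant:
        return False
    # Note what is absent: there is no branch on ctx.source_network.
    return permission in granted


office = RequestContext(True, True, True, source_network="office-wifi")
cafe = RequestContext(True, True, True, source_network="coffee-shop")
no_mfa_in_office = RequestContext(True, False, True, source_network="office-wifi")
```

The office and coffee-shop requests get the same answer because they carry the same identity and device signals. The request from inside the office without MFA is refused, which is the opposite of what a perimeter model would decide.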

The architectural pieces

Principles need implementations. The components below are the pieces every serious data platform has, in some form, and a platform missing any of them is exposed in a way the team should be honest about.

Identity and access management. IAM is the system of record for who exists and what they can do. Every cloud has its own (AWS IAM, Microsoft Entra ID, GCP IAM); most enterprises put a central identity provider on top, often Okta or Microsoft Entra ID, that issues tokens used across cloud and SaaS systems via SAML or OIDC. The architectural goal is one source of truth for identity, federated everywhere, with single sign-on for humans and short-lived credentials for workloads.

The “short-lived credentials for workloads” part deserves emphasis. Long-lived access keys baked into config files are a perennial source of breach. The modern pattern is workloads assuming roles by virtue of where they run: an EC2 instance assumes a role through the instance metadata service; a Kubernetes pod assumes a role through IRSA or Workload Identity; a CI runner assumes a role through OIDC federation with the source-control vendor. The credential exists only inside the workload, only for the duration of its task, and never appears in any persisted form.
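The property that matters is that a credential carries its own expiry and stops working without anyone having to revoke it. The sketch below illustrates that property with the standard library only. It is not how the instance metadata service, IRSA, or OIDC federation work internally; the token format, `SIGNING_KEY`, and both function names are assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Assumption for the demo only: a real issuer keeps its signing key out of code.
SIGNING_KEY = b"demo-only-key"


def _sign(payload: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest().encode()


def issue_token(role: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return a signed token naming a role and an expiry time."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"role": role, "exp": issued_at + ttl_seconds}).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload).decode()


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the role if the signature is valid and the token is unexpired."""
    encoded, _, signature = token.partition(".")
    try:
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:
        return None  # not valid base64, so not a token we issued
    if not hmac.compare_digest(signature.encode(), _sign(payload)):
        return None  # tampered or signed with a different key
    claims = json.loads(payload)
    current = time.time() if now is None else now
    return claims["role"] if current < claims["exp"] else None
```

A token issued with `ttl_seconds=900` verifies for fifteen minutes and returns `None` afterwards. A copy that leaks into a log or a backup is worthless once that window has passed, which is the advantage over a long-lived access key.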

Network segmentation. Even with zero trust as a principle, network controls are a useful additional layer. Production VPCs separated from development. Subnets segmented by tier, with security groups that allow only the traffic each tier needs. Kubernetes network policies that enforce default-deny between namespaces and explicit allow rules between services that need to talk. The goal is not to make the network the security boundary; it is to ensure that an attacker who reaches one part of the network has not automatically reached every other part.

Secret management. Secrets are credentials, API keys, signing keys, encryption material. They do not belong in source control, in environment variables baked into Docker images, or in shared spreadsheets. They belong in a dedicated secret store: AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, HashiCorp Vault, or one of the SaaS options like Doppler or Infisical. The store provides versioning, rotation, audit logging, and fine-grained access control. Workloads pull secrets at startup, never write them to disk, and rotate them on a schedule the security team can verify.

The classic anti-pattern is the production password committed to a private repository. Private repos leak. Backups of private repos leak. Forks made for debugging leak. Anything in a git history is, on a long enough timeline, public.

Audit log. Every administrative action and every sensitive data access writes to a log that is append-only, immutable, and stored separately from the production environment. CloudTrail, GCP Audit Logs, Azure Monitor activity logs, plus application-level audit events for the data plane. The log answers questions the team will eventually need to answer: who deleted this resource, when did this user access this customer’s data, which IAM role made this API call. A platform that cannot answer these is not in compliance with most regulatory frameworks, and is also unable to investigate its own incidents.

The “stored separately” part matters. An attacker who reaches production should not also reach the audit log, or they will erase the trail behind them. The standard pattern is shipping audit logs to a different account, often with write-only credentials from production into the log account.
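Separate storage keeps an attacker away from the log; a complementary technique makes tampering detectable if they reach it anyway. The sketch below uses a hash chain, which is an assumption introduced here for illustration rather than something the managed audit services named above are claimed to expose in this form. All names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that anchors the first entry


def _digest(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append(log: list, event: dict) -> None:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain; an edited, reordered, or cut-out entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

The chain does not detect truncation from the end on its own. Catching that requires recording the latest hash somewhere the attacker cannot write, which is one more reason to ship logs to a separate account.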

The shared-responsibility model

Cloud changes the security boundary in ways teams sometimes get wrong. The shared-responsibility model is the explicit contract between cloud provider and customer: the provider secures the parts they control; the customer secures the parts they control; the boundary depends on the service.

For raw infrastructure (IaaS, like EC2): the provider secures the physical data centre, the hypervisor, the network fabric. The customer secures the operating system, the application code, the data on the volumes, the IAM that grants access to all of it.

For managed platforms (PaaS, like RDS or BigQuery): the provider also secures the database engine, the operating system underneath it, the patching and backups of the engine. The customer still secures the data inside the database, the IAM that grants access, the network controls in front of it, the application code that issues queries.

For SaaS (like Snowflake, GitHub, Datadog): the provider secures most of the stack. The customer is responsible for IAM (who in their org can log in and as what), data classification (what they upload), and integration controls (how the SaaS connects to their other systems).

The mistake teams make is assuming the cloud provider’s responsibility extends further than it does. A provider’s “encrypted by default” usually means encryption at rest using their key, which protects against a stolen disk, not against an attacker who has stolen IAM credentials and can ask the service to decrypt. A provider’s network security controls do not stop misconfigured public buckets. The customer’s responsibilities exist, and the team has to know which ones apply to which service.

Defense in depth, drawn

flowchart LR
    A[Attacker]
    P[Public surface<br/>WAF, DDoS, rate limiting]
    AU[Authentication<br/>SSO, MFA, mTLS]
    AZ[Authorization<br/>IAM, RBAC, ABAC]
    N[Network controls<br/>VPC, subnets, security groups]
    E[Encryption<br/>at rest, in transit]
    S[Sensitive resource<br/>customer data]
    L[Audit log]
    A --> P --> AU --> AZ --> N --> E --> S
    P -.events.-> L
    AU -.events.-> L
    AZ -.events.-> L
    N -.events.-> L
    E -.events.-> L
    S -.events.-> L

The visual point is that the sensitive resource sits behind several controls, each independent of the others. An attacker has to defeat each in turn. Every control emits events to an audit log, stored separately from the resources it protects, that an investigator can use to reconstruct what happened. A breach of any single layer does not compromise the resource on its own.

Architectural patterns to internalise

A short list of patterns that should be the default in 2026, not exceptions reserved for the most regulated systems.

Encryption at rest and in transit, by default. Every database encrypted on disk. Every connection between services using TLS. Every backup encrypted in the bucket. The cost of this is approximately zero in modern cloud services and the benefit, on the day a disk is stolen or a network tap is installed, is enormous.

Mutual TLS between services. Every service authenticates every other service with a client certificate, not just an API key. Service meshes (Istio, Linkerd, Consul) provide mTLS automatically. The mesh removes the temptation to skip the work because it was hard to configure manually.

Short-lived credentials, never baked in. No access keys in environment variables baked into images. Workloads assume roles at runtime; tokens expire in minutes to hours. Rotation happens automatically, not as a quarterly project.

The audit log backed up separately. Production cannot reach the log archive. The log archive cannot be modified, only appended to. Retention matches the longest regulatory requirement the business is subject to.

Common architectural mistakes

The mirror image of the patterns. Each of these has produced a public breach in the last few years; none is rare; all are preventable.

Wildcard IAM. A role with Action: * on Resource: * because someone needed to ship and the smaller permission set was hard to compute. The role gets stolen and the attacker has the entire account. Prevention: tooling that infers needed permissions from real usage, plus review gates on broad policies.
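A review gate for this can start very small. The sketch below checks an AWS-style policy document for Allow statements that pair `"Action": "*"` with `"Resource": "*"`. The function name and the example policy are illustrative, and it deliberately ignores subtler cases such as `s3:*` or `NotAction`, which a real linter would also need to handle.

```python
import json


def _as_list(value):
    """Policy JSON allows either a single value or a list in these fields."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy_json: str) -> list:
    """Return indexes of Allow statements with Action "*" on Resource "*"."""
    policy = json.loads(policy_json)
    flagged = []
    for index, statement in enumerate(_as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(index)
    return flagged


# Hypothetical policy: one narrow statement and one wildcard statement.
example_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::raw-events/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
})
```

Run in CI against every policy file in the repository, a check like this makes the wildcard the choice that needs justifying in review, not the narrow policy.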

Public S3 buckets. A bucket marked publicly readable because it was the easiest way to share a file with a contractor. The bucket also contains, three years later, a copy of a backup nobody remembered. Prevention: account-wide block-public-access settings and tooling that scans for buckets that bypass them.

Default credentials. A database admin password set to something memorable when the cluster was provisioned and never rotated. An admin panel reachable on the public internet with admin / admin. Prevention: provisioning automation that generates random passwords stored in the secret manager, and human-readable defaults made impossible by configuration.
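The generation half of that prevention fits in a few lines. This is a minimal sketch using Python's `secrets` module; the alphabet, the default length, and the minimum of 16 characters are arbitrary choices for the example, and writing the result to the secret manager is left out.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 32) -> str:
    """Return a random password drawn from a cryptographically secure source."""
    if length < 16:
        # Illustrative floor: refuse to produce something a human might memorise.
        raise ValueError("refusing to generate a short password")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Because provisioning calls the generator and never accepts a password as input, there is no place for a memorable default to enter.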

Secrets in git. API keys in .env files committed accidentally. SSH private keys checked in for “convenience”. Prevention: pre-commit hooks that scan for secret patterns, repository scanning by the source-control vendor, and rotation procedures for any secret that has ever appeared in a commit.
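A pre-commit scan is, at its core, pattern matching over the text about to be committed. The sketch below checks for two well-known shapes: AWS access key IDs, which begin with `AKIA` or `ASIA` followed by 16 uppercase letters or digits, and PEM private key headers. The rule names and the function are illustrative, and production scanners carry far more rules plus entropy checks.

```python
import re

# Two widely documented shapes; a real scanner carries many more rules.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text: str) -> list:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings
```

A hook would call `scan_text` on each staged file and refuse the commit if the result is non-empty. A match that is found later, in history, still means the secret must be rotated, since removing the line does not remove it from earlier commits.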

The pattern in all four mistakes is the same: a small short-cut taken under deadline pressure, in a context where the consequence was abstract. The architectural defence is to make the short-cuts harder to take than the right path.

What good looks like

A team with a healthy security posture has IAM as the system of record for identity, federated through a central provider, with workloads using short-lived credentials and humans using SSO with MFA. Network segmentation between environments, with default-deny rules and explicit allow lists. A secret manager with rotation enforced. An audit log shipped to a separate account, backed up, and queried regularly. Encryption on by default for every data store and every connection. A documented threat model that the team revisits when the architecture changes.

This is not a security team’s job alone. It is a property of the platform, maintained by the platform’s owners as part of running it. Lesson 53’s infrastructure-as-code discussion connects directly: every IAM policy, every network rule, every secret-store configuration belongs in code that is reviewed and version-controlled. Security configuration that lives only in the cloud console is security configuration that drifts.

The next lesson is about the regulatory frameworks that turn many of these patterns from “good practice” into legal obligation: GDPR, CCPA, the privacy laws that have architectural implications most teams underestimate until the first regulator email arrives.

Citations and further reading

  • NIST SP 800-207, “Zero Trust Architecture”, https://csrc.nist.gov/publications/detail/sp/800-207/final (retrieved 2026-05-01). The reference document for zero trust as a formal architectural model.
  • NIST Cybersecurity Framework 2.0, https://www.nist.gov/cyberframework (retrieved 2026-05-01). The high-level framework that ties identity, protection, detection, response, and recovery together.
  • CIS Controls v8, https://www.cisecurity.org/controls/v8 (retrieved 2026-05-01). The practical, prioritised list of controls every organisation should implement, in roughly the order they should implement them.
  • AWS, “Best practices for IAM”, https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html (retrieved 2026-05-01). The canonical AWS IAM advice, including the wildcard-permission warning and the workload-credential patterns referenced above.
  • OWASP Top 10, https://owasp.org/www-project-top-ten/ (retrieved 2026-05-01). The application-layer mirror of the platform-layer concerns covered here.