Data & System Architecture, from the ground up Lesson 53 / 80

Infrastructure as code: Terraform, Pulumi, CDK

Declarative infrastructure, the state-file problem, the GitOps workflow. Three tools and where each fits.

The difference between a team that can stand up a fresh staging environment in thirty minutes and a team that needs three days is almost always how their infrastructure is defined. The thirty-minute team has it in code: a repository of declarative files that describes every cloud resource, network, database, IAM role, DNS record, and secret. They run a command, the cloud reconciles to the description, and a working environment exists. The three-day team has it in someone’s memory and a half-finished wiki page that was last updated when the AWS console looked different.

Infrastructure as code (IaC) is the discipline of defining infrastructure the same way you define application code: in version-controlled, reviewed, tested files. This lesson walks through the three tools that dominate the space (Terraform, Pulumi, AWS CDK), the state-file problem they all share, the GitOps workflow that wraps them, and why this matters more for data work than for most other kinds of engineering.

The pre-IaC alternative

To make the case for IaC, look at what it replaces.

Pre-IaC, infrastructure happens in a cloud console. Someone clicks “create database” and types in the parameters. Six months later, nobody remembers what the parameters were. The database needs to be reproduced in staging; the engineer doing the reproduction guesses at the parameters and gets some of them wrong. A year in, the production database and the staging database have drifted enough that bugs in staging do not reproduce in production, and bugs in production do not reproduce in staging. Each environment is a snowflake. Each change is a manual ritual.

The compounding effect is worse than the per-incident pain. Every cloud account accumulates debt: orphan resources nobody owns, security groups with rules whose purpose is forgotten, IAM policies that are too permissive because nobody knows what minimum permission the service actually needs. Audit becomes a forensic exercise. Disaster recovery becomes a hope.

IaC makes infrastructure reviewable, reproducible, and rollback-able. The trade is that you have to learn the tooling and accept the constraints of declarative configuration. For anything beyond a one-VM setup, the trade pays for itself within months.

Terraform: the default

Terraform, released by HashiCorp in 2014, is the tool that most teams reach for. It uses a declarative DSL called HCL (HashiCorp Configuration Language) and a provider model that lets it talk to nearly any cloud or service: AWS, GCP, Azure, Kubernetes, Cloudflare, Datadog, GitHub, and a long tail of others. You write HCL files describing the desired state, run terraform plan to see what would change, run terraform apply to make the changes happen.

A small example for an S3 bucket:

resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-prod"
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

The strengths are real. The provider ecosystem is enormous. The community has more than a decade of patterns and shared modules. The plan-then-apply workflow gives reviewers a concrete diff of what will change in cloud reality. The same Terraform code works for dev, staging, and prod with different variables, which is what makes the thirty-minute fresh environment possible.
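To make the plan step concrete, here is a minimal sketch of the idea in TypeScript: compare the desired configuration with the recorded state and list the actions needed. The flat attribute model and the names `ResourceMap`, `PlanAction`, and `computePlan` are assumptions for illustration; Terraform's real planner also handles dependencies, computed values, and provider schemas.

```typescript
// Hypothetical, simplified model: each resource is an address plus flat string attributes.
type ResourceAttrs = Record<string, string>;
type ResourceMap = Record<string, ResourceAttrs>;

type PlanAction =
  | { kind: "create"; address: string }
  | { kind: "update"; address: string; changed: string[] }
  | { kind: "delete"; address: string };

// Compare desired configuration with recorded state and list the actions needed.
function computePlan(desired: ResourceMap, recorded: ResourceMap): PlanAction[] {
  const actions: PlanAction[] = [];

  for (const address of Object.keys(desired).sort()) {
    const want = desired[address] ?? {};
    const have = recorded[address];
    if (have === undefined) {
      // In the code but not in state: it has to be created.
      actions.push({ kind: "create", address });
      continue;
    }
    // In both: report only the attributes whose values differ.
    const changed = Object.keys({ ...want, ...have })
      .filter((key) => want[key] !== have[key])
      .sort();
    if (changed.length > 0) {
      actions.push({ kind: "update", address, changed });
    }
  }

  for (const address of Object.keys(recorded).sort()) {
    if (desired[address] === undefined) {
      // In state but no longer in the code: it would be destroyed.
      actions.push({ kind: "delete", address });
    }
  }
  return actions;
}

const examplePlan = computePlan(
  { "aws_s3_bucket.data_lake": { versioning: "Enabled" } },
  { "aws_s3_bucket.data_lake": { versioning: "Suspended" } },
);
console.log(examplePlan);
```

The reviewer-facing value is that the output is a list of concrete actions, not a description of intent, which is what makes a plan reviewable in a pull request.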

The licensing situation is worth knowing. In 2023, HashiCorp re-licensed Terraform from the Mozilla Public License 2.0 to the Business Source License 1.1, which restricts competing commercial use. The community responded by forking the last MPL-licensed code into OpenTofu, now a Linux Foundation project. OpenTofu aims to be a drop-in replacement and is the open-source path forward; new teams choosing today often default to it. Existing HCL files generally work in both, although the two projects have started adding features independently.

For most teams most of the time, Terraform or OpenTofu is the right starting point. The provider coverage and community make it the path of least resistance.

Pulumi: infrastructure in real languages

Pulumi, founded in 2017 and open-sourced in 2018, took a different bet. Instead of a custom DSL, it lets you define infrastructure in real programming languages: TypeScript, Python, Go, C#, Java. The same declarative model underneath, the same plan-then-apply workflow (pulumi preview, then pulumi up), but the input is code in a language your team already knows.

The same S3 bucket in TypeScript:

import * as aws from "@pulumi/aws";

const dataLake = new aws.s3.Bucket("data-lake", {
    bucket: "company-data-lake-prod",
    versioning: { enabled: true },
});

The advantages show up when the infrastructure has any complexity. Loops, conditionals, abstractions, type checking, IDE autocomplete, real testing frameworks: all of it free, because it is just code. Teams that have outgrown Terraform’s for_each and dynamic blocks often find Pulumi clearer for the same logic.
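A rough illustration of what "just code" buys you, written as plain TypeScript with no Pulumi dependency so it runs on its own. The `BucketSpec` shape, the zone names, and the 90-day expiry are assumptions made up for this sketch; in a real Pulumi program each spec would become the arguments to a resource constructor.

```typescript
// Hypothetical stand-in for a resource declaration.
interface BucketSpec {
  logicalName: string;
  bucket: string;
  versioned: boolean;
  expireAfterDays?: number;
}

type Environment = "dev" | "staging" | "prod";

// Illustrative data-lake zones; not taken from the lesson.
const LAKE_ZONES: readonly string[] = ["raw", "staged", "curated"];

// One loop and two conditionals replace three near-identical resource blocks.
function dataLakeBuckets(env: Environment): BucketSpec[] {
  const specs: BucketSpec[] = [];
  for (const zone of LAKE_ZONES) {
    const spec: BucketSpec = {
      logicalName: `data-lake-${zone}`,
      bucket: `company-data-lake-${zone}-${env}`,
      versioned: env === "prod", // only production keeps object versions
    };
    if (zone === "raw") {
      spec.expireAfterDays = 90; // assumed retention for raw landing data
    }
    specs.push(spec);
  }
  return specs;
}

for (const spec of dataLakeBuckets("prod")) {
  console.log(spec.logicalName, spec.bucket);
}
```

Because the function is ordinary code, it can be unit-tested and type-checked like any other module, which is the advantage the paragraph above describes.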

The cost is that the abstraction is leakier. With HCL, the file is the truth: read the file, you know what infrastructure exists. With Pulumi, the code is a program that produces the truth, and a sufficiently clever program can be hard to read. The discipline that keeps Pulumi clean is the discipline of any codebase: simple is better than clever, abstractions earn their keep, the new hire should be able to read what is happening.

Pulumi is the right pick for teams whose engineers are stronger in code than in DSLs and whose infrastructure logic has grown enough complexity that HCL is fighting back. For a small team with simple needs, the extra power is overkill.

AWS CDK: AWS-only with code-native ergonomics

The AWS Cloud Development Kit, released in 2019, is Amazon’s answer to the same “real languages” pitch as Pulumi, with a tighter scope. CDK code in TypeScript, Python, Java, C#, or Go compiles down to CloudFormation templates that AWS deploys.

The same bucket in CDK:

import * as s3 from "aws-cdk-lib/aws-s3";

// Runs inside the constructor of a Stack (or other Construct), where `this` is the scope.
new s3.Bucket(this, "DataLake", {
    bucketName: "company-data-lake-prod",
    versioned: true,
});

CDK is AWS-only. That is the whole point. If your infrastructure is AWS top to bottom and you want the highest-level abstractions for AWS resources, CDK gives you constructs that bundle common patterns (an ECS service with a load balancer and a CloudWatch dashboard, all configured sensibly) into single objects. The integration with AWS-specific features is often tighter than what Terraform’s AWS provider offers, because CDK is built by AWS, although it is bounded by what CloudFormation itself supports.

The trade is lock-in. CDK is not portable. If the company moves to multi-cloud, the CDK code does not come along. For a team that is committed to AWS for the long term and wants the developer ergonomics, that lock-in is a feature, not a bug. For a team that wants to keep options open, Terraform or Pulumi is the safer bet.

The state-file problem

Every IaC tool has to track “what currently exists in the cloud” so it knows what to change on the next apply. Terraform writes a state file: a JSON document mapping the resources in your code to the resources that exist in reality, with their IDs, their attributes, their dependencies. Pulumi keeps equivalent per-stack state in a backend (Pulumi Cloud by default, or self-managed object storage). CDK delegates to CloudFormation, which keeps its own state.

The state file is where most IaC operational pain comes from.

It is a critical piece of state. Lose it and you lose the link between code and reality. The next apply will try to create everything fresh, fail wherever a resource with the same name already exists (or create duplicates where names do not collide), and leave you in a broken state. Treat the state file like a database backup: stored remotely (S3 with versioning is the canonical Terraform setup), encrypted, and not in the git repo.

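It must be locked. Two engineers running terraform apply against the same state at the same time can corrupt it. Both Terraform and OpenTofu lock state through the backend: with the S3 backend the long-standing setup is a DynamoDB table (recent releases of both tools can also use a lock file in the bucket itself), and backends such as GCS and HTTP provide their own locking. The lock is essential for any team setup; the moment two people are running apply, the lock is what keeps you sane.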
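The locking idea can be sketched as an acquire-or-fail operation keyed by the state path. This is an in-memory stand-in, not the actual backend protocol: `StateLockTable`, `applyWithLock`, and the state path are hypothetical names, and a real backend relies on an atomic conditional write in the storage service instead of a local object.

```typescript
interface LockRecord {
  holder: string;
  operation: string;
}

// In-memory stand-in for a lock table. A real backend would use an atomic
// "write only if absent" operation offered by the storage service.
class StateLockTable {
  private readonly locks: Record<string, LockRecord> = {};

  // Succeeds only if nobody currently holds the lock for this state path.
  acquire(statePath: string, record: LockRecord): boolean {
    if (this.locks[statePath] !== undefined) {
      return false;
    }
    this.locks[statePath] = record;
    return true;
  }

  // Only the current holder may release the lock.
  release(statePath: string, holder: string): boolean {
    const current = this.locks[statePath];
    if (current === undefined || current.holder !== holder) {
      return false;
    }
    delete this.locks[statePath];
    return true;
  }
}

// Run an apply only while holding the lock; always release afterwards.
function applyWithLock(
  table: StateLockTable,
  statePath: string,
  holder: string,
  apply: () => void,
): "applied" | "locked" {
  if (!table.acquire(statePath, { holder, operation: "apply" })) {
    return "locked";
  }
  try {
    apply();
  } finally {
    table.release(statePath, holder);
  }
  return "applied";
}
```

The second caller fails fast instead of waiting, which matches the behavior you want from an apply: a refused run is recoverable, a half-interleaved one is not.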

It can drift from reality. Someone clicks in the console and changes a security group rule. Terraform’s view of the world is now stale. The next plan reports unexpected changes; the apply either reverts the manual change or, worse, fights it. The discipline that prevents drift is “all changes go through IaC, no exceptions”. The pragmatic reality is that drift happens, and commands like terraform plan -refresh-only (the replacement for the deprecated terraform refresh) and pulumi refresh let you reconcile.
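Drift detection boils down to comparing what the state recorded with what the cloud reports now. The sketch below assumes flat string attributes and uses hypothetical names (`detectDrift`, `DriftFinding`); in practice the "actual" side comes from provider API calls made during a refresh.

```typescript
type CloudAttrs = Record<string, string>;

interface DriftFinding {
  address: string;
  attribute: string;
  recorded: string | null;
  actual: string | null;
}

// Compare the attributes recorded in state with what the cloud reports now.
function detectDrift(
  recorded: Record<string, CloudAttrs>,
  actual: Record<string, CloudAttrs>,
): DriftFinding[] {
  const findings: DriftFinding[] = [];
  for (const address of Object.keys(recorded).sort()) {
    const was = recorded[address] ?? {};
    const now = actual[address];
    if (now === undefined) {
      // The resource was deleted outside of IaC.
      findings.push({ address, attribute: "(resource)", recorded: "present", actual: null });
      continue;
    }
    for (const attribute of Object.keys({ ...was, ...now }).sort()) {
      const before = was[attribute] ?? null;
      const after = now[attribute] ?? null;
      if (before !== after) {
        findings.push({ address, attribute, recorded: before, actual: after });
      }
    }
  }
  return findings;
}

// Someone widened an ingress rule in the console.
console.log(
  detectDrift(
    { "aws_security_group.web": { ingress_cidr: "10.0.0.0/8" } },
    { "aws_security_group.web": { ingress_cidr: "0.0.0.0/0" } },
  ),
);
```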

The state file is also where secrets can leak. Some resource attributes contain sensitive values: database passwords, generated tokens, TLS keys. These end up in the state file. Encrypt the state at rest. Restrict access to it. Treat the state bucket the way you treat a secrets manager.
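To see why the state bucket deserves secrets-manager treatment, the sketch below walks a state-like JSON document and lists the attribute paths whose keys look sensitive. The key pattern and the trimmed sample document are assumptions for illustration (a real state file has more fields), and a key-name heuristic will miss secrets stored under innocuous names.

```typescript
type JsonValue = string | number | boolean | null | JsonValue[] | { [key: string]: JsonValue };

// Heuristic only: attribute names that commonly hold secret values.
const SENSITIVE_KEY = /password|secret|token|private_key/i;

// Return the path of every non-empty string value stored under a sensitive-looking key.
function findSensitivePaths(value: JsonValue, path = "", key = ""): string[] {
  if (typeof value === "string") {
    return SENSITIVE_KEY.test(key) && value !== "" ? [path] : [];
  }
  if (Array.isArray(value)) {
    let found: string[] = [];
    value.forEach((item, index) => {
      found = found.concat(findSensitivePaths(item, `${path}[${index}]`, key));
    });
    return found;
  }
  if (value !== null && typeof value === "object") {
    let found: string[] = [];
    for (const childKey of Object.keys(value).sort()) {
      const childPath = path === "" ? childKey : `${path}.${childKey}`;
      found = found.concat(findSensitivePaths(value[childKey] ?? null, childPath, childKey));
    }
    return found;
  }
  return []; // numbers, booleans, null
}

// Trimmed, made-up sample in the general shape of a Terraform state document.
const sampleState: JsonValue = {
  version: 4,
  resources: [
    {
      type: "aws_db_instance",
      name: "warehouse",
      instances: [{ attributes: { username: "admin", password: "example-only", port: 5432 } }],
    },
  ],
};

console.log(findSensitivePaths(sampleState));
```

A scan like this is an audit aid, not a control: the values are still in the state, so encryption and access restriction remain the actual protection.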

GitOps: the workflow that ties it together

GitOps is the operational pattern that wraps IaC into something a team can use without stepping on each other. The idea is simple: the git repository is the single source of truth for infrastructure. Every change is a pull request. A controller continuously reconciles cloud state to git state. If reality drifts from git, the controller either alerts or corrects.

For Terraform, Atlantis is the canonical tool. A pull request that touches Terraform files triggers Atlantis to run terraform plan and post the output as a PR comment. Reviewers see the exact diff that will be applied. Once the PR is approved, a comment of atlantis apply makes Atlantis run terraform apply from the PR branch, by default before the merge. (Plain CI pipelines often apply after the merge instead, which is the variant the diagram below shows.) The trail is in the git history; the lock is held during apply; nobody runs Terraform from their laptop.

For Kubernetes, Flux and Argo CD play the same role. They watch a git repo of Kubernetes manifests and continuously reconcile the cluster to match. Push a change to git, the cluster picks it up. Drift in the cluster gets corrected back toward git.

The flow looks like this:

flowchart TB
    DEV[Developer writes IaC change] --> PR[Open pull request]
    PR --> PLAN[CI runs terraform plan]
    PLAN --> REVIEW[Reviewer reads plan diff]
    REVIEW --> MERGE[Merge to main]
    MERGE --> APPLY[CI runs terraform apply]
    APPLY --> CLOUD[Cloud reaches desired state]
    CLOUD --> DRIFT[Periodic drift detection]
    DRIFT --> ALERT[Alert if drift found]
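The routing logic behind that flow is small enough to sketch as a function from CI trigger to pipeline step. The trigger shapes, the `touchesIac` flag, and the step names are assumptions for illustration and do not correspond to any particular CI system's event format.

```typescript
// Hypothetical CI triggers; real CI systems expose richer event payloads.
type CiTrigger =
  | { type: "pull_request"; touchesIac: boolean }
  | { type: "merge"; branch: string; touchesIac: boolean }
  | { type: "schedule" };

type PipelineStep = "plan" | "apply" | "drift-check" | "skip";

// Decide which step the pipeline runs for a given trigger.
function stepFor(trigger: CiTrigger, mainBranch = "main"): PipelineStep {
  switch (trigger.type) {
    case "pull_request":
      // Pull requests only ever produce a plan for reviewers to read.
      return trigger.touchesIac ? "plan" : "skip";
    case "merge":
      // Only merges to the main branch are allowed to change the cloud.
      return trigger.touchesIac && trigger.branch === mainBranch ? "apply" : "skip";
    case "schedule":
      // Periodic runs look for drift and alert instead of applying.
      return "drift-check";
  }
}

console.log(stepFor({ type: "pull_request", touchesIac: true }));
```

Keeping apply reachable from exactly one branch of this function is the point: there is a single, reviewable path by which the cloud changes.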

The cultural shift GitOps enforces is real. No more “I just changed it in the console”. Every change is a commit. Every commit is reviewed. Every state of the world is reachable from a git history.

When IaC is overkill

The honest list of times to skip IaC is short.

A solo project with one VM and one database. A throwaway prototype that will not survive the week. A learning exercise where the goal is to understand the cloud primitives, not to ship them.

Almost everything else benefits. The threshold is low: two engineers and three resources is enough to make manual management a worse choice than IaC. The argument that “we are too small for that” is the argument that produces snowflake infrastructure six months later.

The data-engineering angle

Data work has its own particular benefit from IaC. The infrastructure for a data platform is a lot of resources: the warehouse (with its datasets, schemas, IAM grants), the orchestrator (Airflow or Dagster or Prefect, with their secrets, connections, deployments), the compute (Spark cluster configs, Kubernetes node pools), the storage (S3 buckets with lifecycle policies and replication rules), the monitoring (Datadog dashboards, alerts, SLOs).

Provisioning a fresh staging environment that mirrors production is the use case where the investment pays the most. Without IaC, the staging environment never quite matches production, and the gap is where bugs hide. With IaC, staging is a clone of production with different variables, and the gap closes to almost zero.
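"A clone of production with different variables" can be sketched as one platform definition called with two sets of inputs. Everything here is illustrative: the `EnvVars` fields, the resource addresses, and the sizes are made up, and a real definition would declare cloud resources instead of returning plain objects.

```typescript
interface EnvVars {
  env: string;
  warehouseNodes: number;
  workerCount: number;
  retentionDays: number;
}

interface PlannedResource {
  address: string;
  settings: Record<string, string | number>;
}

// One definition of the platform; only the variables differ per environment.
function dataPlatform(vars: EnvVars): PlannedResource[] {
  return [
    {
      address: "storage.data_lake",
      settings: { bucket: `company-data-lake-${vars.env}`, retentionDays: vars.retentionDays },
    },
    { address: "warehouse.main", settings: { nodes: vars.warehouseNodes } },
    { address: "orchestrator.workers", settings: { count: vars.workerCount } },
  ];
}

// Two environments have the same shape when they contain the same resource addresses.
function sameShape(a: PlannedResource[], b: PlannedResource[]): boolean {
  const addresses = (resources: PlannedResource[]): string =>
    resources.map((r) => r.address).sort().join(",");
  return addresses(a) === addresses(b);
}

const prodPlatform = dataPlatform({ env: "prod", warehouseNodes: 8, workerCount: 20, retentionDays: 365 });
const stagingPlatform = dataPlatform({ env: "staging", warehouseNodes: 2, workerCount: 4, retentionDays: 30 });
console.log(sameShape(prodPlatform, stagingPlatform));
```

Staging is smaller and cheaper, but it cannot be missing a resource that production has, because both come out of the same function.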

The other use case is disaster recovery. If the production region goes down, IaC plus a recent backup is the path back. Click-ops plus a wiki page is not.

What this means in practice

For a team starting today, the realistic path is:

  1. Pick Terraform or OpenTofu unless there is a strong reason for Pulumi or CDK.
  2. Store state remotely from day one (S3 plus DynamoDB lock for AWS, GCS for GCP).
  3. Wrap the workflow with Atlantis or a similar PR-driven runner before two people are running Terraform from their laptops.
  4. Set up drift detection as a periodic job; investigate every drift event.
  5. Treat IaC code with the same review discipline as application code. The blast radius of a bad apply is large.

Lesson 54 covers containers, the unit most modern data jobs ship as. Lesson 55 covers Kubernetes, the runtime most container-based data platforms run on. Both layers sit on top of IaC: the cluster, the registries, the IAM roles for the jobs, all defined in Terraform or its peers.

Citations

  • Terraform documentation (https://developer.hashicorp.com/terraform/docs, retrieved 2026-05-01).
  • OpenTofu documentation (https://opentofu.org/docs/, retrieved 2026-05-01).
  • Pulumi documentation (https://www.pulumi.com/docs/, retrieved 2026-05-01).
  • AWS CDK documentation (https://docs.aws.amazon.com/cdk/v2/guide/home.html, retrieved 2026-05-01).
  • Atlantis documentation (https://www.runatlantis.io/docs/, retrieved 2026-05-01).
  • “GitOps principles” on the OpenGitOps site (https://opengitops.dev/, retrieved 2026-05-01).