Containers: Docker for data jobs

A data job in 2026 almost certainly runs in a container. The Spark executor on Kubernetes, the Airflow task that does the heavy lifting, the Dagster asset materialization, the dbt run from a CI pipeline: all of them are containers, started from images, scheduled by something. Even managed platforms (Databricks, Snowflake’s Snowpark, AWS Glue) are containers underneath; you just do not see the seams.

This lesson covers the container fundamentals every data engineer should know: the mental model, the Dockerfile patterns that matter, the things that go wrong on the way to a small and reproducible image, the registries where images live, and the data-specific use cases that put it all together.

Containers and what they are not

A container is not a virtual machine. The distinction matters because it shapes both the costs and the failure modes.

A virtual machine has its own kernel, its own boot process, its own memory allocation, and often gigabytes of disk. The hypervisor pretends the VM is a real computer. Starting a VM takes tens of seconds to minutes; running ten VMs on one machine costs ten times the kernel overhead.

A container shares the host kernel. It is a process (or a small group of processes) running in isolated namespaces, with its own view of the filesystem and CPU and memory limits enforced by cgroups. Starting a container usually takes well under a second. Running fifty containers on one machine is normal. The isolation is weaker than a VM's (a kernel exploit can escape any container on the host, whereas inside a VM it stays confined to that guest's kernel), but for the use case of “I want to run my code in a reproducible environment”, containers are the right primitive.
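The cgroup limit is visible from inside the container as an ordinary file, which is worth knowing when a data job sizes its own buffers. A minimal sketch, assuming a Linux host on cgroup v2 where the limit appears at /sys/fs/cgroup/memory.max (on other setups the function simply returns None); the function name is illustrative:

```python
from pathlib import Path
from typing import Optional


def read_memory_limit_bytes(path: str = "/sys/fs/cgroup/memory.max") -> Optional[int]:
    """Return the cgroup v2 memory limit in bytes, or None if unlimited or unreadable."""
    try:
        raw = Path(path).read_text().strip()
    except OSError:
        # Not Linux, not cgroup v2, or the file is not visible to this process.
        return None
    if raw == "max":
        # cgroup v2 writes the literal string "max" when no limit is set.
        return None
    return int(raw)


if __name__ == "__main__":
    limit = read_memory_limit_bytes()
    if limit is None:
        print("no memory limit visible")
    else:
        print(f"memory limit: {limit / 2**20:.0f} MiB")
```

Run inside a container started with a memory limit, this should report that limit rather than the host's total memory.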

The two key concepts to keep separate are the image and the container. The image is the immutable artefact: a layered filesystem with a metadata file describing how to start a process inside it. The container is a running instance of an image, with its own writable layer on top. You build images. You run containers. A new container from the same image starts in the same clean state.

Docker did not invent containers (the kernel features predate it), but it packaged these ideas into usable tooling and has been the dominant implementation since its release in 2013. The Open Container Initiative standardised the formats so that images built with Docker run anywhere that speaks OCI: containerd, podman, CRI-O, every cloud’s container service. When data engineers say “Docker image” they usually mean “OCI image”; the formats are the same.

The Dockerfile

A Dockerfile is the recipe for building an image. Each instruction that changes the filesystem (RUN, COPY, ADD) creates a layer; the other instructions only record metadata. Layers are cached, so if you change one instruction, only the layers from that point down get rebuilt.

A naive Dockerfile for a Python data job:

FROM python:3.13
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "job.py"]

This works, and it is wrong in three different ways. The base image is bigger than it needs to be. The layer ordering means a code change invalidates the dependency-install layer. There is no version pinning. We will fix all of these.

Best practices that earn their keep

Use specific, minimal base images. python:3.13-slim instead of python:latest. The slim variants drop the build tools and most of the system packages, keeping only what is needed to run a Python interpreter. The latest tag is a moving target; pin a specific version so a rebuild months from now produces the same artefact. For even smaller images, distroless or Alpine variants exist, with a trade-off: Alpine uses musl rather than glibc, so some Python packages have no prebuilt wheels for it and pip falls back to building from source.

The payoff from choosing a smaller base image compounds. A 50 MB image pulls faster, scans for vulnerabilities faster, has a smaller attack surface, and costs less in registry storage. For a job that runs on a hundred Spark executors, that 50 MB can be pulled up to a hundred times every time the cluster spins up.

Order layers by change frequency. Things that change rarely (OS packages, system dependencies, Python interpreter) go near the top of the Dockerfile. Things that change often (your code) go near the bottom. The Docker build cache invalidates a layer and everything below it when its inputs change. If you copy your code in before installing dependencies, every code change invalidates the dependency install, and your build takes ten minutes when it could take ten seconds.

The conventional ordering for Python:

FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "job.py"]

The requirements.txt copy is its own step; the pip install is its own step. The code copy comes after. Editing a Python file reruns only the final COPY and CMD steps, which takes seconds rather than minutes.
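The cache rule can be modelled in a few lines. This is a toy simulation, not Docker's implementation: it assumes each step is described by the context files it reads, and that the first step whose inputs changed forces itself and every later step to rerun. The function name and the two step lists are illustrative.

```python
from typing import List, Set, Tuple

# (instruction, files the instruction reads from the build context)
Step = Tuple[str, Set[str]]


def rebuilt_steps(steps: List[Step], changed_files: Set[str]) -> List[str]:
    """Return the instructions that must rerun after changed_files change."""
    for index, (_, inputs) in enumerate(steps):
        if inputs & changed_files:
            # First cache miss: this step and every later one is rebuilt.
            return [instruction for instruction, _ in steps[index:]]
    return []  # every step is a cache hit


# Naive ordering: code is copied before dependencies are installed.
code_first: List[Step] = [
    ("COPY . .", {"job.py", "requirements.txt"}),
    ("RUN pip install -r requirements.txt", set()),
]

# Conventional ordering: dependencies first, code last.
deps_first: List[Step] = [
    ("COPY requirements.txt .", {"requirements.txt"}),
    ("RUN pip install --no-cache-dir -r requirements.txt", set()),
    ("COPY . .", {"job.py", "requirements.txt"}),
]

print(rebuilt_steps(code_first, {"job.py"}))
print(rebuilt_steps(deps_first, {"job.py"}))
```

With the naive ordering, editing job.py reruns the pip install; with the conventional ordering, only the final copy is rebuilt.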

Pin dependency versions. requirements.txt with hashes, or a uv lockfile, or Poetry’s poetry.lock, or pip-tools. Whatever the tool, the lockfile is the contract: this exact tree of dependencies, every time. An unpinned requests in your requirements becomes a different version next month, and your image is suddenly different in ways nobody asked for.
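A check along these lines can run in CI before the image is built. It is a minimal sketch, assuming a plain requirements.txt: the function name and the sample input are made up for illustration, and it only looks for == pins, not for hashes.

```python
import re
from typing import List

# Matches "name==version", optionally with extras such as "name[extra]==version".
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned with '=='."""
    offenders = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            # Skip blanks and pip options such as -r or --hash continuation lines.
            continue
        if not PINNED.match(line):
            offenders.append(line)
    return offenders


# Illustrative input, not taken from a real project.
sample = """\
requests
pandas==2.2.3
numpy>=1.26  # a floor is not a pin
"""

print(unpinned_requirements(sample))
```

Failing the build when the list is non-empty keeps an unpinned dependency from reaching an image.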

Do not run as root. The default in most base images is root, and it is a footgun. A compromised process running as root inside a container has a much better chance of escaping or moving laterally than the same process running as an unprivileged user. The two extra lines:

RUN useradd --create-home --shell /bin/bash appuser
USER appuser

near the end of the Dockerfile shift the running process to a non-root user. For most data jobs this costs nothing and removes a category of risk.

Use .dockerignore. The build context is everything in the directory you ran docker build from; without a .dockerignore, a stray node_modules or .venv or data/ directory can balloon the build context to gigabytes. The file works like .gitignore. Add the obvious entries: .git, __pycache__, *.pyc, .venv, node_modules, any local data directories.
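To see how much a stray directory adds, the build context can be sized before and after exclusions. This is a rough sketch, not Docker's matcher: it assumes exclusions are plain directory names skipped at any depth, whereas .dockerignore supports richer patterns. The function name is illustrative.

```python
import os
from typing import Iterable


def context_size_bytes(root: str, ignored_dirs: Iterable[str] = ()) -> int:
    """Sum file sizes under root, skipping directories whose name is in ignored_dirs."""
    ignored = set(ignored_dirs)
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    everything = context_size_bytes(".")
    trimmed = context_size_bytes(".", [".git", ".venv", "node_modules", "__pycache__", "data"])
    print(f"without exclusions: {everything / 2**20:.1f} MiB")
    print(f"with exclusions:    {trimmed / 2**20:.1f} MiB")
```

A large gap between the two numbers is a sign the .dockerignore is missing or incomplete.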

Multi-stage builds

The biggest single improvement for image size is multi-stage builds. The idea: one Dockerfile defines two (or more) images, where the later stages copy artefacts from the earlier ones, and only the final stage becomes the published image.

A typical pattern for a Python job that has a build step (compiling C extensions, generating files):

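# Stage 1: full image with compilers; builds a wheel for every dependency.
FROM python:3.13 AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime; installs the prebuilt wheels, no compilers needed.
FROM python:3.13-slim AS runtime
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/*.whl
# Create the user before copying code so a code change does not rebuild this layer.
RUN useradd --create-home appuser
COPY . .
USER appuser
CMD ["python", "job.py"]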

The builder stage has the full Python image with compilers. It builds the wheels. The runtime stage starts fresh from python:3.13-slim and only copies the wheels in. The final image has no compilers, no build tools, none of the cache. Smaller, faster to pull, less to scan.

Multi-stage shines for languages with explicit build steps: Go, Java, Rust, TypeScript that compiles to JavaScript. For pure-Python work the savings are smaller but real, because pip install from prebuilt wheels in the runtime stage is faster and lighter than full installation.

Image registries

An image has to live somewhere for the cluster to pull it. A registry is the storage and distribution layer: Docker Hub, GitHub Container Registry, GitLab Registry, AWS ECR, Google Artifact Registry, Azure ACR, JFrog, Harbor.

The choice mostly depends on where the rest of the infrastructure lives. AWS workloads pull from ECR. GCP from Artifact Registry. GitHub-centric teams use GitHub Container Registry, which authenticates with the same tokens as the rest of GitHub. Docker Hub is fine for open-source images, but its pull rate limits hit production clusters quickly; serve production images from your cloud’s own registry.

Tag strategy matters more than people realise. The latest tag is mutable; do not use it in production. Use immutable tags: a git commit SHA, a semver, a build number. The cluster pulls a specific tag, the cluster runs a specific image, the deploy is reproducible. If you ever need to roll back, you redeploy the previous tag and you know exactly what runs.
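One way to make the tag rule hard to break is to build the image reference from the commit SHA in a single place. A minimal sketch; the function name, the registry host registry.example.com, and the repository name are placeholders, and truncating the SHA to twelve characters is an arbitrary choice.

```python
import re

# A full or abbreviated git commit SHA: 7 to 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{7,40}$")


def image_ref(registry: str, repository: str, git_sha: str) -> str:
    """Build an image reference tagged with a git commit SHA."""
    if not SHA_RE.match(git_sha):
        # Rejects mutable names such as "latest" or "main".
        raise ValueError(f"not a git commit SHA: {git_sha!r}")
    return f"{registry}/{repository}:{git_sha[:12]}"


# Placeholder values for illustration only.
print(image_ref("registry.example.com", "data/job", "0123456789abcdef0123456789abcdef01234567"))
```

In CI the SHA would come from the checkout (for example the output of git rev-parse HEAD), so the tag always points back to the exact commit that produced the image.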

Sign your images and scan them. Cosign for signing, Trivy or Grype for scanning. The signature gives the cluster a way to refuse images that did not come from your CI; the scanner gives you a list of known vulnerabilities in your dependencies. Both belong in CI: scan before the image is pushed, and sign the pushed image by digest before anything deploys it.

What this means for data jobs

Docker shows up in data engineering in three main places.

Spark on Kubernetes is the cleanest example. The Spark driver and executors run as pods, started from a custom image that contains your PySpark code, your dependencies, and the Spark distribution. The image is built in CI, pushed to a registry, referenced in the SparkApplication CR or the spark-submit command. The same image runs in dev, in CI, and in prod, with different configurations. The image is the deploy unit. Lesson 55 covers the Kubernetes side in detail.

Airflow operators that run user code in containers are the second place. The KubernetesPodOperator is the canonical example: the Airflow task asks Kubernetes to run a pod from a specified image, with specified arguments, and waits for it to finish. The DockerOperator does the same for plain Docker. The benefit is full isolation: the dependencies of one task do not interfere with the dependencies of another. The cost is the per-task overhead of scheduling a pod and, on a node that has not cached it, pulling the image.

Reproducible local development is the third place, and it is the one most underused. The same image you run in production can run on a developer’s laptop, with the same Python version, the same dependencies, the same code. Bugs that only show up in production become far rarer, because the dev environment is built from the same image as production. A docker run command that mounts the code directory and starts the job locally is the kind of small thing that changes how a team works.
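Such a command is easy to wrap in a small helper so the whole team runs it the same way. A minimal sketch, assuming the image keeps its code in /app and starts with python job.py as in the Dockerfiles above; the function name, image reference, paths, and the --date argument are placeholders.

```python
import shlex
from typing import List, Sequence


def docker_run_command(image: str, code_dir: str, job_args: Sequence[str] = ()) -> List[str]:
    """Build the argv for running the job locally with the code directory mounted."""
    return [
        "docker", "run",
        "--rm",                          # remove the container when the job exits
        "--volume", f"{code_dir}:/app",  # mount local code over the image's /app
        "--workdir", "/app",
        image,
        "python", "job.py", *job_args,
    ]


# Placeholder values for illustration only.
command = docker_run_command(
    "registry.example.com/data/job:0123456789ab",
    "/home/me/job",
    ["--date", "2026-01-01"],
)
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would start the job; building the argv as a list avoids shell-quoting problems with paths that contain spaces.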

The build-and-deploy cycle ties it all together:

flowchart TB
    CODE[Dockerfile plus code] --> BUILD[docker build]
    BUILD --> SCAN[Scan]
    SCAN --> PUSH[Push to registry]
    PUSH --> SIGN[Sign pushed digest]
    SIGN --> PULL[Cluster pulls image]
    PULL --> RUN[Job runs]
    RUN --> METRICS[Metrics and logs]

A small Dockerfile checklist

Before merging a Dockerfile, the things to look for:

Specific, slim base image with a pinned version.
Multi-stage build if there are compile steps.
Layers ordered from rare-changes to often-changes.
Dependency lockfile copied before code, installed before code is copied.
Non-root user for the running process.
.dockerignore excluding the obvious junk.
Immutable tag in the published name.
Health check or known-good entrypoint.
No secrets baked into the image (env vars at runtime, not build args).

Each item is small. Together they are the difference between an image that is forty megabytes and pulls in a second and an image that is two gigabytes and takes a minute, multiplied by however many copies your cluster runs.
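A few of the checklist items can be checked mechanically. The following is a crude, line-based sketch rather than a real Dockerfile parser: it covers only the base-image tag, the copy-before-install ordering, and the final USER, and the function name and messages are illustrative.

```python
from typing import List


def lint_dockerfile(text: str) -> List[str]:
    """Return checklist violations found in a Dockerfile's text."""
    instructions = [
        line.strip() for line in text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    ]
    problems: List[str] = []
    stages = set()

    # Base image: every FROM should name an explicit tag other than "latest".
    for line in instructions:
        if not line.upper().startswith("FROM "):
            continue
        tokens = [t for t in line.split()[1:] if not t.startswith("--")]
        image = tokens[0]
        if image not in stages:  # names of earlier build stages are fine
            name = image.rsplit("/", 1)[-1]
            tag = name.rsplit(":", 1)[1] if ":" in name else ""
            if tag in ("", "latest"):
                problems.append(f"unpinned base image: {image}")
        if len(tokens) == 3 and tokens[1].upper() == "AS":
            stages.add(tokens[2])

    # Layer order: copying all code before installing dependencies defeats the cache.
    copy_all = next((i for i, l in enumerate(instructions) if l.split() == ["COPY", ".", "."]), None)
    install = next((i for i, l in enumerate(instructions) if l.startswith("RUN pip install")), None)
    if copy_all is not None and install is not None and copy_all < install:
        problems.append("COPY . . appears before pip install")

    # Non-root: the last USER instruction decides who runs the process.
    users = [l.split()[1] for l in instructions if l.upper().startswith("USER ")]
    if not users or users[-1] in ("root", "0"):
        problems.append("final USER is root or missing")

    return problems


NAIVE = """\
FROM python:3.13
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "job.py"]
"""

print(lint_dockerfile(NAIVE))
```

Run against the naive Dockerfile from earlier in the lesson, it flags the layer ordering and the missing non-root user.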

What this means in practice

Containers are the deploy unit. The Dockerfile is the contract between the code you write and the runtime that runs it. A small, well-formed Dockerfile is one of the highest-leverage pieces of code in a data team’s repo: every job inherits its properties.

The good news is that the patterns are stable. A Dockerfile written today for a Python job will keep working in five years; the cloud underneath will change, the orchestrator will change, the registries will change, but the Dockerfile and the OCI image stay the same. Of all the layers in the modern data stack, this is one of the most boring, in the best sense.

Lesson 55 builds on this with Kubernetes, the runtime most teams now run their containers on. Lesson 56 closes the module with Stripe’s deployment story, which threads CI, CD, IaC, and containers into one operational picture.

Citations

Docker documentation, “Best practices for writing Dockerfiles” (https://docs.docker.com/build/building/best-practices/, retrieved 2026-05-01).
Open Container Initiative specifications (https://opencontainers.org/, retrieved 2026-05-01).
Docker documentation, “Multi-stage builds” (https://docs.docker.com/build/building/multi-stage/, retrieved 2026-05-01).
Spark on Kubernetes documentation (https://spark.apache.org/docs/latest/running-on-kubernetes.html, retrieved 2026-05-01).
Cosign project (https://github.com/sigstore/cosign, retrieved 2026-05-01).