Kubernetes for data: the good, the bad, the necessary

Module 7 has been about the practices that turn architecture into running systems. Branching strategies, trunk-based development, CI, CD, infrastructure as code: each lesson added a layer to the picture of how a team actually delivers software. This lesson adds the runtime layer that most modern data platforms now live on, whether the team chose it deliberately or inherited it from the rest of the company. Kubernetes.

The framing matters. Kubernetes is not a data tool. It is a general-purpose distributed orchestrator that schedules containers across a fleet of machines. Data engineers end up using it because the rest of the organisation already does, because the cloud-native ecosystem has converged on it, and because the alternatives (manage your own VMs, glue together cloud-specific services, run on a managed Spark platform like Databricks or EMR) have their own costs. The right question is rarely “should we use Kubernetes”. The right question is “given that Kubernetes is the substrate, how do we run data workloads on it well, and what are we signing up for”.

This lesson walks through that question in three parts: what Kubernetes is good at, what it is bad at, and what makes it the necessary choice for most mid-to-large data teams. Along the way it covers the operator pattern and the two integrations that matter most in practice, Spark on Kubernetes and Airflow on Kubernetes.

What Kubernetes actually is

Strip away the marketing and Kubernetes is a few primitives put together carefully.

A pod is one or more containers scheduled together on the same node. Pods are the smallest unit of deployment. The containers in a pod share a network namespace (one IP address, one port space) and can share volumes, including short-lived scratch storage.

A node is a machine (a VM, usually) that runs pods. A cluster is a set of nodes plus a control plane that decides where pods get scheduled, restarts them when they fail, and exposes the APIs the rest of the system talks to.

A deployment is a desired-state declaration: “I want three replicas of this pod, with this image, this CPU and memory allocation, this environment variable”. You hand the deployment to the cluster. The cluster’s job is to make reality match the declaration. If a pod crashes, the cluster starts a new one. If you change the image, the cluster rolls out the new version while keeping the old one alive long enough to avoid downtime.

A service is a stable network address in front of a set of pods. Pods come and go; the service stays. Clients talk to the service, the service load-balances across the pods.

The model is declarative all the way down. You describe what you want; the cluster reconciles toward it. That reconciliation loop is the core of Kubernetes, and it is also the core of the operator pattern, which is where data tooling enters the picture.
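The reconciliation loop is easier to see in miniature. The sketch below is a toy, in-memory model and not the Kubernetes API: `ToyCluster` and `reconcile` are hypothetical names, and a real controller watches the API server instead of being called by hand. It only shows the core idea of comparing desired state with observed state and acting on the difference.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    """In-memory stand-in for a cluster: just a set of running pod names."""
    pods: set = field(default_factory=set)
    _ids: count = field(default_factory=count, repr=False)

    def start_pod(self, prefix: str) -> str:
        name = f"{prefix}-{next(self._ids)}"
        self.pods.add(name)
        return name

    def stop_pod(self, name: str) -> None:
        self.pods.discard(name)


def reconcile(cluster: ToyCluster, prefix: str, desired_replicas: int) -> list[str]:
    """Run one reconciliation pass and return the actions it took."""
    actions = []
    owned = sorted(p for p in cluster.pods if p.startswith(prefix + "-"))
    # Too few pods: start replacements until the count matches.
    for _ in range(desired_replicas - len(owned)):
        actions.append("start " + cluster.start_pod(prefix))
    # Too many pods: stop the surplus.
    for name in owned[desired_replicas:]:
        cluster.stop_pod(name)
        actions.append("stop " + name)
    return actions


cluster = ToyCluster()
print(reconcile(cluster, "web", 3))  # ['start web-0', 'start web-1', 'start web-2']
cluster.stop_pod("web-1")            # simulate a crashed pod
print(reconcile(cluster, "web", 3))  # ['start web-3']
print(reconcile(cluster, "web", 3))  # [] (steady state, nothing to do)
```

The loop never asks what happened, only what differs from the declaration. That is why a crash, a manual deletion, and a scale-up all take the same code path.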

The good

Standardisation across environments. A Kubernetes cluster on AWS (EKS), Google Cloud (GKE), Azure (AKS), or on-premises looks the same from the deployment side. Largely the same YAML works in all of them; the cloud-specific differences are concentrated in load balancers, storage classes, and IAM. Teams that have lived through a multi-cloud or cloud-to-on-prem migration know how rare this kind of portability is.

Auto-scaling. Kubernetes can scale pods based on CPU, memory, or custom metrics through the Horizontal Pod Autoscaler. The cluster itself can scale by adding nodes through the Cluster Autoscaler or Karpenter. For bursty data workloads (a Spark job that needs two hundred executors for ten minutes and then nothing) this elasticity is the operational story that makes per-job clusters affordable.
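The scaling rule documented for the Horizontal Pod Autoscaler is a simple ratio: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The function below is a sketch of that rule only. The name `desired_replicas` and its defaults are illustrative, the 0.1 tolerance is the documented default but is configurable per cluster, and the real autoscaler also handles missing metrics, unready pods, and stabilisation windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Apply the HPA ratio rule, then clamp to the configured bounds."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    proposed = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, proposed))


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 4 pods at 62% against 60%: inside the tolerance band, no change.
print(desired_replicas(4, current_metric=62, target_metric=60))  # 4
# 5 pods at 20% against 60%: scale in.
print(desired_replicas(5, current_metric=20, target_metric=60))  # 2
```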

Self-healing. A container that crashes is restarted in place, and a pod owned by a controller (a deployment, a job) is replaced if it is lost. A node that becomes unreachable has its pods evicted and recreated elsewhere. The system absorbs a class of failures that, in a hand-rolled VM setup, would page a human at three in the morning.

The operator pattern. Kubernetes lets you define your own custom resources and write a controller (an operator) that watches them and reconciles toward the desired state. This is what turns Kubernetes from a container scheduler into a platform for building higher-level abstractions. A SparkApplication resource, a PostgresCluster resource, a KafkaTopic resource: each is a custom kind that an operator knows how to materialise into pods, services, and config. From the user’s perspective, you create a SparkApplication; the operator handles the driver pod, the executor pods, the service accounts, the logging configuration, the cleanup on completion.

Ecosystem. Every modern infrastructure tool ships with a Kubernetes integration, usually a Helm chart. Prometheus, Grafana, Jaeger, ArgoCD, Flux, Vault, cert-manager, external-dns: the standard cloud-native stack composes on top of Kubernetes. A team that adopts Kubernetes inherits a mature toolchain by default.

The bad

Operational complexity. Running a healthy Kubernetes cluster is a job. The control plane has its own database (etcd), its own networking layer, its own certificate rotation, its own upgrade story. Even teams on managed offerings (EKS, GKE, AKS) spend serious time on cluster operations: node group sizing, version upgrades, network plugin choices, IAM integration. A data team that shares a platform team with the rest of the company is fine. A data team running its own dedicated cluster needs to budget for the operations cost.

Networking. Pod-to-pod, pod-to-service, ingress, egress, network policies, service mesh. Each is its own subject. Debugging a connection that should work but does not means understanding which layer is responsible: the CNI plugin, the service definition, the network policy, the ingress controller, the cloud load balancer, the DNS resolution. Most data engineers learn the parts they hit and avoid the rest.

YAML proliferation. A single application might require a deployment, a service, a config map, a secret, an ingress, a horizontal pod autoscaler, a service account, and a role binding. That is eight YAML files for a hello-world app. Helm charts and Kustomize exist to manage the proliferation, and they help, but the underlying truth is that Kubernetes is verbose. Teams that write their own raw YAML drown. Teams that adopt a templating tool early stay afloat.
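The idea behind templating tools can be sketched in a few lines: a handful of parameters go in, several consistent manifests come out. This is not how Helm or Kustomize are implemented; `render_app` is a hypothetical helper, and the image name is a placeholder. The output is JSON rather than YAML because `kubectl apply -f` accepts both and JSON needs no third-party library.

```python
import json


def render_app(name: str, image: str, port: int, replicas: int = 1) -> list[dict]:
    """Render a Deployment and a matching Service from one set of parameters."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": image,
                    "ports": [{"containerPort": port}],
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        # The selector must match the pod labels, or the Service routes to nothing.
        "spec": {"selector": labels, "ports": [{"port": port, "targetPort": port}]},
    }
    return [deployment, service]


manifests = render_app("hello", "registry.example.com/hello:1.0", port=8080, replicas=3)
# A v1 List wraps several objects into one document.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The value is less in the saved typing than in the guarantee that the label, the selector, and the port agree across files, which is the class of mistake that hand-written YAML invites.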

Resource accounting. Each pod declares CPU and memory requests (the amount the scheduler reserves) and limits (the ceiling: a container that exceeds its CPU limit is throttled, and one that exceeds its memory limit is killed). Setting these correctly is a discipline. Set requests too low and pods get packed onto nodes that are busier than the scheduler believes, where they become noisy neighbours. Set them too high and the cluster runs out of schedulable room while the nodes themselves sit mostly idle. Set memory limits too low and the kernel kills your Spark executor mid-shuffle with an out-of-memory error that, from the application’s perspective, looks like a node failure. The right values come from observation and iteration; the defaults are almost always wrong.
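Because the scheduler places pods by summing requests, not by looking at actual usage, the arithmetic is worth doing by hand at least once. The sketch below parses the common Kubernetes quantity suffixes and checks whether a set of pods fits on a node. The helper names are hypothetical, only the suffixes listed are handled, and the real scheduler also weighs taints, affinity, and other constraints that are ignored here.

```python
from decimal import Decimal

# Binary suffixes first so that "Gi" is matched before a bare "G" could be.
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
                 "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '2Gi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[:-len(suffix)]) * factor)
    return int(quantity)


def fits(allocatable: dict, requests: list[dict]) -> bool:
    """True if the summed requests fit within the node's allocatable resources."""
    cpu = sum(parse_cpu(r["cpu"]) for r in requests)
    memory = sum(parse_memory(r["memory"]) for r in requests)
    return (cpu <= parse_cpu(allocatable["cpu"])
            and memory <= parse_memory(allocatable["memory"]))


node = {"cpu": "4", "memory": "16Gi"}
executor = {"cpu": "1500m", "memory": "5Gi"}
print(fits(node, [executor] * 2))  # True: 3000m and 10Gi requested
print(fits(node, [executor] * 3))  # False: 4500m exceeds the 4000m available
```

In the second case the third executor stays pending even if the first two are idle, which is the "cluster runs out of room while nodes sit idle" failure in concrete form.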

The necessary: data tooling on Kubernetes

The reason most data teams end up on Kubernetes is not enthusiasm for the platform. It is that the data tools they want to run have first-class Kubernetes integrations, and those integrations are now better than the alternatives.

Spark on Kubernetes. The Kubernetes Operator for Spark (originally from Google, now community-maintained under the Kubeflow project) is the standard declarative way to run Spark on Kubernetes. Each Spark job becomes a SparkApplication custom resource. The operator creates a driver pod, the driver requests executor pods from the cluster, the executors run the work, and on completion everything is cleaned up. The job ran on an ephemeral cluster of pods, the pods are gone, and the cluster reclaims the capacity. This is the “per-job cluster” model that was awkward to achieve with YARN and is now the default with Kubernetes.

The advantages compound. Resource isolation is per-pod. Different Spark versions can run side by side. The same observability stack that watches the rest of the cluster watches the Spark jobs. Cost attribution is per-namespace. Failures are pod failures, with the same restart and rescheduling story as everything else.
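For a sense of what a SparkApplication looks like, here is a minimal sketch that builds one as a Python dictionary and prints it as JSON. The field names follow the operator's `sparkoperator.k8s.io/v1beta2` API as commonly documented, but they should be checked against the operator version in use. The image, file path, service account, namespace, and the `spark_application` helper itself are placeholders invented for illustration.

```python
import json


def spark_application(name: str, namespace: str, image: str,
                      main_file: str, executors: int) -> dict:
    """Build a SparkApplication manifest for a PySpark job (illustrative values)."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "3.5.0",  # assumed; must match the image
            "restartPolicy": {"type": "Never"},
            # One driver pod; it asks the cluster for the executors below.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 2, "memory": "4g"},
        },
    }


app = spark_application(
    name="daily-aggregates",
    namespace="data-jobs",
    image="registry.example.com/spark-jobs:1.0",     # hypothetical image
    main_file="local:///opt/jobs/daily_aggregates.py",  # hypothetical path
    executors=20,
)
print(json.dumps(app, indent=2))
```

Pinning the image and Spark version per job is what lets different Spark versions run side by side: the version lives in the resource, not in the cluster.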

Airflow on Kubernetes. Airflow has two relevant integrations. The KubernetesExecutor runs each task in its own pod, instead of in a worker process inside a long-lived Celery worker. This gives per-task resource isolation, per-task scaling, and per-task image flexibility (a task can specify the container image it needs, which decouples Airflow upgrades from task code upgrades). The KubernetesPodOperator is more general: it is an Airflow operator that launches an arbitrary container, which lets a DAG orchestrate any tool with a Docker image, not just things with native Airflow operators.

The combination is powerful. An Airflow DAG running on the KubernetesExecutor can use the KubernetesPodOperator to launch a Spark job (via the Spark Operator), a dbt run, a Python script, and a Flink job, each in its own container with its own resource budget, all orchestrated as a normal Airflow workflow.
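What "each task in its own container with its own resource budget" amounts to can be modelled without Airflow at all. The sketch below is a conceptual toy, not Airflow's API: `Task` and `pod_for_task` are hypothetical, as are the image names and the `data-jobs` namespace. It maps each task to the kind of pod manifest that a per-task executor would ask the cluster to run.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    image: str
    command: tuple
    cpu: str = "500m"
    memory: str = "512Mi"


def pod_for_task(dag_id: str, task: Task, namespace: str = "data-jobs") -> dict:
    """Return a one-shot pod manifest carrying the task's own image and resources."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "generateName": f"{dag_id}-{task.task_id}-",
            "namespace": namespace,
            "labels": {"dag_id": dag_id, "task_id": task.task_id},
        },
        "spec": {
            "restartPolicy": "Never",  # retries belong to the orchestrator
            "containers": [{
                "name": "base",
                "image": task.image,
                "command": list(task.command),
                "resources": {
                    "requests": {"cpu": task.cpu, "memory": task.memory},
                    "limits": {"memory": task.memory},
                },
            }],
        },
    }


tasks = [
    Task("dbt-run", "registry.example.com/dbt:1.7", ("dbt", "run"), memory="1Gi"),
    Task("score-model", "registry.example.com/scoring:2.3",
         ("python", "score.py"), cpu="2", memory="8Gi"),
]
for task in tasks:
    print(json.dumps(pod_for_task("nightly", task), indent=2))
```

The two tasks share nothing but a DAG name: different images, different dependencies, different budgets. That separation is what decouples an orchestrator upgrade from a task code upgrade.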

Argo Workflows. A Kubernetes-native workflow engine, an alternative to Airflow for teams that want their orchestrator to be as native to Kubernetes as the rest of their stack. Each step in an Argo workflow is a pod. The DAG is a custom resource. The whole thing reconciles like everything else in the cluster. Argo is the choice for teams that find Airflow’s Python-first model unnecessary and want their workflows defined in YAML alongside their other Kubernetes resources.
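To make "the DAG is a custom resource" concrete, the sketch below builds a small Argo Workflow manifest from a dictionary of steps. It assumes the `argoproj.io/v1alpha1` Workflow kind with a `dag` template, as described in the Argo documentation. The `argo_workflow` helper, the image, and the step commands are hypothetical, and the output is JSON where a team would normally keep YAML.

```python
import json


def argo_workflow(prefix: str, image: str, steps: dict) -> dict:
    """Build a Workflow; `steps` maps task name -> (shell command, dependencies)."""
    tasks = [
        {
            "name": name,
            "template": "run",
            "dependencies": list(deps),
            "arguments": {"parameters": [{"name": "cmd", "value": cmd}]},
        }
        for name, (cmd, deps) in steps.items()
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": prefix + "-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                # The DAG itself: task names and their dependencies.
                {"name": "main", "dag": {"tasks": tasks}},
                # One reusable container template; each task runs it as a pod.
                {
                    "name": "run",
                    "inputs": {"parameters": [{"name": "cmd"}]},
                    "container": {
                        "image": image,
                        "command": ["sh", "-c"],
                        "args": ["{{inputs.parameters.cmd}}"],
                    },
                },
            ],
        },
    }


workflow = argo_workflow(
    prefix="nightly",
    image="registry.example.com/data-tools:1.0",  # hypothetical image
    steps={
        "extract": ("python extract.py", []),
        "transform": ("dbt run", ["extract"]),
        "publish": ("python publish.py", ["transform"]),
    },
)
print(json.dumps(workflow, indent=2))
```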

Flink on Kubernetes. Same pattern as Spark. The Flink Kubernetes operator manages FlinkDeployment resources. Each deployment is a Flink cluster of pods. Long-running streaming jobs map onto long-lived deployments; ad-hoc batch jobs map onto short-lived ones.

The pattern is consistent. The data tools have agreed that Kubernetes is the substrate, and they have built operators that turn Kubernetes-native primitives into the workload abstractions data engineers care about. The architectural win is not Kubernetes itself; it is that the tools have converged on a common runtime.

A typical data Kubernetes cluster

flowchart TB
    subgraph cluster[Kubernetes cluster]
        subgraph ns_airflow[Namespace: airflow]
            AF_WEB[Webserver]
            AF_SCHED[Scheduler]
            AF_TASK[Task pods<br/>per DAG task]
        end
        subgraph ns_spark[Namespace: spark-operator]
            SO[Spark Operator]
            SD[Driver pods]
            SE[Executor pods]
        end
        subgraph ns_mon[Namespace: monitoring]
            PROM[Prometheus]
            GRAF[Grafana]
            LOKI[Loki]
        end
        subgraph ns_jobs[Namespace: data-jobs]
            DBT[dbt run pods]
            PY[Python job pods]
        end
    end
    AF_SCHED --> AF_TASK
    AF_TASK --> SO
    AF_TASK --> DBT
    AF_TASK --> PY
    SO --> SD
    SD --> SE
    PROM -.scrapes.-> AF_SCHED
    PROM -.scrapes.-> SD
    PROM -.scrapes.-> SE
    GRAF --> PROM

The diagram shows four namespaces, the control flow from Airflow to the workload pods, the Spark Operator materialising driver and executor pods, and the monitoring namespace scraping metrics from the Airflow scheduler and the Spark pods. The point to take away is that namespaces are the unit of separation and that the operator pattern is what makes the workload pods appear and disappear.

The shape generalises. Most data Kubernetes clusters look like this: a few namespaces for the platform components (Airflow, Spark Operator, monitoring, sometimes Argo or Flink), one or more namespaces for the actual workloads, and a control flow that goes from Airflow (or Argo) through the operators, which in turn create the workload pods. The platform team owns the platform namespaces. The data engineers work in the workload namespaces.

When NOT to use Kubernetes

Three cases where Kubernetes is the wrong answer.

Single-VM workloads. A small ETL job that runs once a day on a single machine does not need an orchestrator that schedules containers across a fleet. A cron entry on a VM, or a serverless function, is simpler and cheaper. Kubernetes adds operational complexity that pays off only when you have enough workloads and enough scale to amortise it.

Tiny teams without dedicated infra. A two-person data team with no platform support can easily spend a third of its time on cluster operations if it runs its own Kubernetes. The same team on a managed data platform (Databricks, Snowflake, BigQuery with Dataform) trades a higher cost per job for almost zero operations work. That trade is usually correct at small scale.

Workloads with serverless equivalents. AWS Lambda, Google Cloud Run, AWS Batch, and Azure Container Instances exist for a reason. A workload that fits within their constraints (size, runtime, cold-start tolerance) is often easier to run on them than on Kubernetes. The architectural rule of thumb: Kubernetes is for workloads that benefit from a cluster; serverless is for workloads that do not.

The honest summary

Most data teams in mid-to-large companies end up on Kubernetes whether they want to or not. The rest of the company runs there. The platform team supports it. The cloud-native ecosystem assumes it. The data tools have first-class integrations for it. Fighting that current is rarely worth it.

The right move, given that, is to lean into the standard tools rather than reinvent. Use the Spark Operator instead of writing your own Spark-on-k8s submission glue. Use the KubernetesExecutor for Airflow instead of the CeleryExecutor. Use a managed offering (EKS, GKE, AKS) instead of running your own control plane. Adopt a templating tool (Helm, Kustomize) early. Get the resource requests and limits right, and revisit them when the workload changes.

The goal is the same goal Module 7 has been building toward all along: make the right thing the easy thing. Kubernetes adds a steep up-front cost and pays back a long flat tail of operational benefits. For most data teams at scale, the trade is worth it; for small teams, it usually is not. Knowing which side of that line you are on is the architectural decision the rest of the lesson has been preparing for.

The next lesson closes Module 7 with a case study: Stripe’s deployment pipeline, and what their published engineering practices reveal about CI/CD when the system handles real money at scale. Module 8 will start with a deeper look at orchestration, with Airflow as the running example and Kubernetes as the substrate underneath.