You write a Python script that pulls from an API, transforms with pandas, loads to Postgres. It works. You schedule it with cron. For a few weeks everything is fine. Then a job fails at 3am and you find out Tuesday afternoon, because cron doesn’t tell you. Then a job depends on another job, and cron doesn’t know about dependencies, so you cross your fingers about timing. Then you have ten jobs and you can’t remember what runs when. Then someone asks “did Tuesday’s load actually complete?” and you have nothing to show them.
That’s the moment you need an orchestrator. This lesson walks the 2026 landscape — Airflow, Prefect, Dagster, the managed-vs-self-hosted question — and ends with a simple decision tree.
What an orchestrator actually does
Cron does one thing: run a command at a scheduled time. Everything else, you build yourself, badly. An orchestrator does the rest:
- Scheduling with calendars, intervals, cron expressions, event triggers.
- Dependencies: “run B only after A succeeds.”
- Retries: “if it fails, try again 3 times with exponential backoff.”
- State and history: every run logged, every result inspectable.
- Observability: a UI showing what ran, what failed, how long it took, with logs you can click into.
- Alerting: tell me on Slack when something fails.
- Backfills: “re-run this pipeline for every day in March.”
- Resource management: pools, queues, concurrency limits per task.
You can build all of that on top of cron and bash and a Postgres metadata table. People do, every year, and every year they regret it. The orchestrator is the thing that exists so you don’t have to build that.
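To get a feel for what "build it yourself" means, here is a minimal sketch of just one item from the list, retries with exponential backoff. The names run_with_retries, base_delay, and sleep are my own for illustration and are not taken from any orchestrator; the sleep parameter exists only so the waiting can be swapped out in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call job(); on failure, retry up to `retries` times with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return job()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the failure surface
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

That is one bullet out of eight, and it still has no logging, no history, and no alerting. Multiply by the rest of the list and the appeal of an off-the-shelf orchestrator becomes clearer.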
When you need one
The honest threshold: somewhere around 5-10 jobs with dependencies, or a single job important enough that you need to know within minutes when it breaks. Below that, cron + a script that emails you on failure is enough. Above that, the home-grown approach starts to bite.
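For the below-the-threshold case, a wrapper along these lines is usually all it takes. This is a sketch under assumptions: run_and_alert and notify are names I made up, and the default notifier just writes to stderr. Cron typically mails a job's output to the crontab owner when mail is configured on the host, so that alone may be enough; otherwise you could pass in a function that sends email or posts to chat.

```python
import subprocess
import sys
from typing import Callable, Sequence


def _to_stderr(message: str) -> None:
    # Default notifier: write to stderr and let cron's mail handling pick it up.
    print(message, file=sys.stderr)


def run_and_alert(
    command: Sequence[str],
    notify: Callable[[str], None] = _to_stderr,
) -> int:
    """Run a command; call notify() with details if it exits non-zero."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        notify(
            f"Job failed: {' '.join(command)}\n"
            f"exit code: {result.returncode}\n"
            f"stderr (last 2000 chars):\n{result.stderr[-2000:]}"
        )
    return result.returncode
```

Note what this does not give you: no dependencies between jobs, no history, no retries. It only closes the "found out Tuesday afternoon" gap.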
Signs you’ve crossed the line:
- You can’t tell at a glance which jobs ran today.
- You’re chaining jobs with && in cron and praying about runtime.
- Backfilling a date range means writing one-off scripts.
- Failures are noticed by humans, not systems.
- “Did this run?” is a hard question to answer.
If two of those are true, you need an orchestrator.
Airflow — the elder statesman
Airflow started at Airbnb in 2014 and was open-sourced the following year. It’s the dominant open-source data orchestrator, full stop. If you join an established data team, the odds are heavy you’ll meet Airflow in week one.
The core concept is the DAG — directed acyclic graph — a Python file that declares tasks and their dependencies:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # pull from API, write to S3
    ...


def transform():
    # read S3, transform, write back
    ...


def load():
    # read S3, load to Postgres
    ...


with DAG(
    dag_id="daily_etl",
    start_date=datetime(2026, 1, 1),
    schedule="0 3 * * *",  # 3am daily
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    e = PythonOperator(task_id="extract", python_callable=extract)
    t = PythonOperator(task_id="transform", python_callable=transform)
    l = PythonOperator(task_id="load", python_callable=load)

    e >> t >> l
The >> operator declares the dependency: extract before transform before load. Airflow parses the DAG, schedules each task, runs them in order, retries on failure, and shows you everything in a web UI.
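If the >> syntax looks like magic, it is ordinary Python operator overloading. The toy below is not Airflow's implementation, just a sketch of the idea: Task and run_order are hypothetical names, each task remembers what must run before it, and the standard library's graphlib works out a valid order.

```python
from graphlib import TopologicalSorter


class Task:
    """Toy task that records its upstream tasks when chained with >>."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.upstream: set[str] = set()

    def __rshift__(self, other: "Task") -> "Task":
        # a >> b means "b runs after a"; return b so chains keep going.
        other.upstream.add(self.task_id)
        return other


def run_order(tasks: list[Task]) -> list[str]:
    """Return one valid execution order that respects every dependency."""
    graph = {t.task_id: t.upstream for t in tasks}
    return list(TopologicalSorter(graph).static_order())


e, t, l = Task("extract"), Task("transform"), Task("load")
e >> t >> l
print(run_order([l, t, e]))  # ['extract', 'transform', 'load']
```

The order you hand the tasks over in does not matter; only the declared edges do. A real scheduler adds state, retries, and parallelism on top, but the graph underneath is this simple.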
Pros: enormous ecosystem. Every cloud provider, database, SaaS tool, and message broker you’ve ever heard of has an Airflow operator or hook. Battle-tested at huge scale. Plenty of job postings ask for it. Astronomer’s managed offering removes most of the operational pain.
Cons: heavy install. The metadata DB, scheduler, webserver, executor, workers — even the minimal setup is several moving parts. The “DAG file is Python code” model means orchestration logic and business logic tend to mush together. The UI improved in Airflow 2 and 3 but still feels last-decade. Dynamic DAGs — where the structure depends on runtime data — were retrofitted and feel it.
Use Airflow when: you’re at a place that already uses it, or you need the breadth of integrations, or you want the safest hiring story.
Prefect — the pythonic alternative
Prefect launched in 2018 with a simple pitch: take what Airflow got right (scheduling, retries, observability) and rewrite it for people who actually like Python.
from prefect import flow, task
import httpx
import pandas as pd


@task(retries=3, retry_delay_seconds=60)
def extract(url: str) -> dict:
    response = httpx.get(url)
    response.raise_for_status()  # raise on 4xx/5xx so the retries above kick in
    return response.json()


@task
def transform(data: dict) -> pd.DataFrame:
    return pd.DataFrame(data["records"])


@task
def load(df: pd.DataFrame) -> None:
    df.to_sql("records", "postgresql://...", if_exists="append")


@flow(name="daily_etl")
def daily_etl(url: str):
    raw = extract(url)
    clean = transform(raw)
    load(clean)


if __name__ == "__main__":
    # url has no default, so scheduled runs need it supplied as a parameter
    daily_etl.serve(
        name="daily",
        cron="0 3 * * *",
        parameters={"url": "https://api.example.com/data"},
    )
Notice the difference. Tasks are decorated functions. The flow is a regular Python function that calls them — and the dependency graph is whatever the calls imply. If extract returns a value that transform consumes, Prefect knows transform depends on extract. No >> operators, no set_upstream calls. It just reads.
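To make "the dependency graph is whatever the calls imply" concrete, here is a toy version of the idea. It is emphatically not how Prefect works internally; toy_task, Tracked, and EDGES are names invented for this sketch, and it only handles positional arguments. Each task's return value is tagged with its producer, and receiving a tagged value records an edge.

```python
import functools
from dataclasses import dataclass
from typing import Any, Callable

EDGES: list[tuple[str, str]] = []  # (upstream, downstream) pairs seen so far


@dataclass
class Tracked:
    """A task result tagged with the name of the task that produced it."""
    value: Any
    producer: str


def toy_task(fn: Callable[..., Any]) -> Callable[..., Tracked]:
    @functools.wraps(fn)
    def wrapper(*args: Any) -> Tracked:
        unwrapped = []
        for arg in args:
            if isinstance(arg, Tracked):
                # Receiving another task's result implies a dependency.
                EDGES.append((arg.producer, fn.__name__))
                unwrapped.append(arg.value)
            else:
                unwrapped.append(arg)
        return Tracked(fn(*unwrapped), producer=fn.__name__)
    return wrapper


@toy_task
def extract(url: str) -> dict:
    return {"records": [1, 2, 3]}  # stand-in for a real API call


@toy_task
def transform(data: dict) -> list:
    return [r * 10 for r in data["records"]]


@toy_task
def load(rows: list) -> int:
    return len(rows)


def daily_etl(url: str) -> Tracked:
    return load(transform(extract(url)))
```

Calling daily_etl once leaves EDGES holding extract to transform and transform to load, without a single explicit dependency declaration. The graph fell out of ordinary function calls, which is the point of the design.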
Pros: cleaner API, dynamic DAGs are first-class, the UI is genuinely modern, the local development loop is fast. Prefect Cloud handles the scheduler so you only run workers. Easier to learn if you’re coming in fresh.
Cons: smaller ecosystem than Airflow. Fewer pre-built integrations means more “just write a function” — which is fine, until you wanted a battle-tested Snowflake operator. Less common in job postings, though that gap is closing.
Use Prefect when: you’re starting fresh, you want a friendlier API, you don’t need a specific Airflow integration, your team values developer experience.
Dagster — the data-aware contender
Dagster (2019+) takes a different angle. Airflow and Prefect orchestrate tasks. Dagster orchestrates assets — the tables, files, models, and reports your tasks produce.
from dagster import asset, Definitions
import httpx
import pandas as pd


@asset
def raw_records() -> dict:
    return httpx.get("https://api.example.com/data").json()


@asset
def clean_records(raw_records: dict) -> pd.DataFrame:
    return pd.DataFrame(raw_records["records"])


@asset
def loaded_records(clean_records: pd.DataFrame) -> None:
    clean_records.to_sql("records", "postgresql://...", if_exists="append")


defs = Definitions(assets=[raw_records, clean_records, loaded_records])
The function arguments are the dependencies. clean_records takes raw_records as an argument, so Dagster knows to materialize raw_records first. The unit of work isn’t “run this task” — it’s “make sure this asset is fresh.” Run a job, and Dagster figures out the minimal set of assets to rebuild.
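The "arguments are the dependencies" trick can be sketched in a few lines of plain Python. This is a toy, not Dagster's machinery: materialize_all is a hypothetical helper that reads each function's parameter names with inspect, treats them as upstream asset names, and builds everything upstream-first.

```python
import inspect
from graphlib import TopologicalSorter
from typing import Any, Callable


def materialize_all(assets: list[Callable[..., Any]]) -> dict[str, Any]:
    """Build each asset once, upstream first, wiring inputs by parameter name."""
    by_name = {fn.__name__: fn for fn in assets}
    # An asset's dependencies are simply the names of its parameters.
    graph = {
        name: set(inspect.signature(fn).parameters)
        for name, fn in by_name.items()
    }
    unknown = set().union(*graph.values()) - by_name.keys()
    if unknown:
        raise ValueError(f"no asset defined for: {sorted(unknown)}")

    results: dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = by_name[name](**inputs)
    return results


def raw_records() -> dict:
    return {"records": [{"id": 1}, {"id": 2}]}  # stand-in for the API call


def clean_records(raw_records: dict) -> list:
    return [r["id"] for r in raw_records["records"]]


def loaded_records(clean_records: list) -> int:
    return len(clean_records)  # stand-in for the database write
```

Passing the three functions in any order yields the same build order, because the order comes from the signatures rather than from the list.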
This shift sounds small but reframes everything. You get lineage for free — Dagster knows that loaded_records depends on clean_records depends on raw_records, and shows it in the UI as a graph. You get typing — assets have types, and Dagster checks them. You get partial runs — “rebuild just the downstream of this one asset” is a built-in. You can express partitions (one asset per day, per region, per tenant) as a first-class concept rather than a runtime hack.
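Once the tool holds the asset graph, "rebuild just the downstream of this one asset" reduces to a graph walk. Here is a rough sketch of that walk, assuming the graph is available as a plain dict; downstream_of is a made-up helper and daily_report is a hypothetical extra asset added to make the example branch.

```python
from collections import deque


def downstream_of(asset: str, deps: dict[str, set[str]]) -> set[str]:
    """Return every asset that directly or transitively depends on `asset`.

    `deps` maps each asset name to the set of assets it reads from.
    """
    # Invert the mapping: upstream name -> assets that consume it.
    consumers: dict[str, set[str]] = {}
    for name, upstreams in deps.items():
        for up in upstreams:
            consumers.setdefault(up, set()).add(name)

    found: set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in consumers.get(current, set()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found


deps = {
    "raw_records": set(),
    "clean_records": {"raw_records"},
    "loaded_records": {"clean_records"},
    "daily_report": {"clean_records"},  # hypothetical extra asset
}
print(sorted(downstream_of("clean_records", deps)))  # ['daily_report', 'loaded_records']
```

With a task-based orchestrator you can do the same thing, but you have to know which tasks write which tables. With assets, the graph already encodes it.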
Pros: data-native model, software-engineering-grade types and tests, lineage out of the box, integrates beautifully with dbt and Spark, Dagster+ (formerly Dagster Cloud) is a clean managed offering.
Cons: opinionated. The asset model is a different way to think, and if your work doesn’t fit it (long-running ML training, message-driven event handlers, anything that doesn’t really produce a “table”) it can feel forced. Smaller community than Airflow, and fewer hireable engineers know it.
Use Dagster when: your work is fundamentally about producing and refreshing data assets, you care about lineage and types, your team is on board with the model.
Self-hosted vs managed
For all three, the managed offering removes most of the operational pain. In rough 2026 terms:
- Astronomer: managed Airflow. The default if you’ve picked Airflow and don’t want to run it yourself.
- Prefect Cloud: managed Prefect scheduler. You run workers (often serverless) and Prefect runs the brain.
- Dagster+ (formerly Dagster Cloud): managed Dagster, with hybrid (your code, their control plane) and serverless options.
- Databricks Workflows / Snowflake Tasks: built-in orchestrators inside data platforms. Worth it if you live there entirely; limiting if you don’t.
- Cloud-native: AWS Step Functions, GCP Cloud Composer (which is just managed Airflow), Azure Data Factory. Worth knowing they exist; usually not the best Python developer experience.
Self-hosting gives you control, costs money in engineering hours, and is fine if you have a platform team. Managed gives you back those hours in exchange for a monthly bill. For a team smaller than ten, managed almost always wins.
The decision tree
A short version that’s served me well:
- A few jobs, no dependencies, a friendly team that reads emails. Cron + scripts that email on failure. Don’t over-engineer.
- Tens of jobs, real dependencies, you’ve been bitten by silent failures. Pick a managed orchestrator. If you have free choice and a small team: Prefect for ease, Dagster if your work is asset-shaped, Airflow if you want the safest hiring story.
- Hundreds of jobs, complex dependencies, multiple teams, regulatory needs. Self-host one of the three based on team skills. Airflow wins on integrations and hiring; Dagster wins on data semantics; Prefect wins on operational simplicity.
- You’re already on a platform. If your data lives in Databricks, use Workflows. If it lives in Snowflake, Tasks plus dbt is often enough. Don’t add an orchestrator just because you can.
The honest meta-point: the orchestrator is rarely the most important decision. Pipeline quality — idempotency, observability, testing, sensible boundaries between extract / transform / load — matters far more than the tool wrapping it. A clean pipeline on Airflow beats a messy one on Dagster, every time.
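Idempotency is the item on that list most worth seeing in code, because every orchestrator's retries and backfills assume it. One common pattern is to replace a whole day's rows inside a single transaction, so running the same day twice leaves the table unchanged. The sketch below uses sqlite3 as a stand-in for Postgres so it runs anywhere; load_day and the records schema are assumptions for illustration.

```python
import sqlite3


def load_day(
    conn: sqlite3.Connection,
    day: str,
    rows: list[tuple[int, float]],
) -> None:
    """Replace all rows for `day` in one transaction, so reruns never duplicate."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM records WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO records (day, id, amount) VALUES (?, ?, ?)",
            [(day, rec_id, amount) for rec_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (day TEXT, id INTEGER, amount REAL)")

march_first = [(1, 9.5), (2, 3.0)]
load_day(conn, "2026-03-01", march_first)
load_day(conn, "2026-03-01", march_first)  # a retry or backfill of the same day

print(conn.execute("SELECT COUNT(*) FROM records").fetchone()[0])  # 2
```

Compare that with a bare append, where the second run would silently double the day's rows. No orchestrator can fix that for you; it can only make the rerun easier to trigger.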
Where we go next
In the next and final lesson of this module, we build a real pipeline end to end. Public API, transform with pandas, load to a database, scheduled, monitored, idempotent. We’ll start with cron because it’s the right answer for the size, and we’ll point at where you’d graduate to one of the orchestrators above.
The orchestrator world is bigger than this lesson: Airbyte, Mage, Argo Workflows, Temporal, Kestra, and Windmill all deserve a closer look and don’t get one here. The three I’ve covered are the data-engineering mainstream in 2026. Pick one, learn it well, and you’ll recognize the shape of the rest.
Citations: Airflow docs, Prefect docs, Dagster docs. Retrieved 2026-05-01.