Batch processing fundamentals: Hadoop's lessons

Lesson 33 ended with the observation that modern cloud warehouses make batch processing invisible: you write SQL, and somewhere underneath, a distributed engine reads files from object storage and produces the answer. The invisible engine is not magic. It is the descendant of a specific paper Google published in 2004, which an open-source community then spent fifteen years catching up to, then surpassing, then quietly retiring. This lesson walks the rise and fall of Hadoop, because the parts of it that survived are still the load-bearing ideas underneath every batch system you will use.

The reason to spend a lesson on technology that nobody is enthusiastic about deploying in 2026 is that the technology was right in ways the marketing has obscured, and wrong in ways that are equally instructive. Modern batch engines did not replace Hadoop’s ideas; they replaced its implementation.

The 2003 and 2004 papers

In October 2003, Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung published “The Google File System” at SOSP. The paper described a distributed filesystem that ran on cheap commodity machines, assumed disks and networks would fail constantly, and was optimised for the access pattern Google’s batch workloads actually had: large sequential reads and large appends, with very few small random writes. GFS was not a general-purpose filesystem. It was a filesystem shaped to a specific shape of work, and the shape mattered more than the elegance.

In December 2004, Jeffrey Dean and Sanjay Ghemawat followed with “MapReduce: Simplified Data Processing on Large Clusters” at OSDI. The paper described a programming model and a runtime: write a map function that processes one record at a time and a reduce function that aggregates records by a key, and the runtime would distribute the computation across thousands of machines, handle failures, restart dead tasks, and produce a result. The pair of papers (storage layer plus execution layer) defined a complete system for running batch jobs at the scale Google was already running them internally.

Doug Cutting and Mike Cafarella, working on the Nutch search engine, started a clean-room reimplementation of GFS and MapReduce in Java in 2005. By 2006 the project moved to Yahoo, was renamed Hadoop, and became the open-source standard for distributed batch processing. Yahoo, Facebook, LinkedIn, Twitter, and most of the internet companies that grew up around 2008 to 2012 ran on Hadoop clusters at one point. By 2012, “big data” and “Hadoop” were near-synonyms in the trade press, and there were three commercial distributions (Cloudera, Hortonworks, MapR) competing for the enterprise market.

That is the story arc. The technical content underneath is what matters.

The MapReduce model

A MapReduce job has three phases. Map runs a user-supplied function on each input record independently, in parallel, on many machines. The function emits zero or more key-value pairs. Shuffle is the runtime’s job: redistribute all the emitted pairs across the cluster so that all pairs sharing a key end up on the same machine. Reduce runs another user-supplied function once per key, with all the values for that key as input, and emits the final output.

The classic example is word count. The map function reads each line of a document and emits (word, 1) for every word it finds. The shuffle groups all the (word, 1) pairs by word. The reduce function receives (word, [1, 1, 1, ...]) and emits (word, total). The whole computation is two short user functions plus a runtime that handles distribution.

The model is simple enough to fit on a slide and powerful enough to express most aggregation, joining, and filtering work. SQL queries can be compiled into MapReduce jobs (Hive did this from 2008 onwards). Graph algorithms can be expressed as iterated MapReduce passes. ETL jobs map naturally onto it.

flowchart LR
    I[(Input split<br/>on GFS/HDFS)] --> M1[Map task 1]
    I --> M2[Map task 2]
    I --> M3[Map task 3]
    M1 --> SH[Shuffle<br/>partition by key]
    M2 --> SH
    M3 --> SH
    SH --> R1[Reduce task 1]
    SH --> R2[Reduce task 2]
    R1 --> O[(Output<br/>on GFS/HDFS)]
    R2 --> O

What MapReduce got right

Four ideas survived from the MapReduce era and are now load-bearing assumptions of every modern batch engine.

Bring code to data, not data to code. The map tasks run on the same machines that hold the input data, or as close as the network topology allows. The framework reads the data layout, schedules each task on a node that has the corresponding block locally, and only falls back to a remote read if no local node is available. At the scale of petabytes, moving the code (a few kilobytes of compiled JAR) to the data (terabytes per node) is dramatically cheaper than moving the data to the code. This principle, called data locality, is how every distributed batch system schedules work in 2026.

Fault tolerance via replication and re-execution. A thousand-machine cluster has at least one machine failing at any given moment. MapReduce assumed this and built failure handling into the runtime: input data is replicated across multiple nodes (three replicas in default HDFS), and if a task dies the master schedules it to run again on a different node. Job authors do not write retry logic; the framework retries on their behalf. The model only works because map and reduce functions are pure: re-running them produces the same output. This constraint is the price of cheap fault tolerance, and it is a price every batch framework still pays.

Schema-on-read. Data lands on HDFS in whatever shape it arrived in: log lines, JSON blobs, custom binary records. The schema is not enforced by the storage layer; it is interpreted by the job that reads the data. This is the opposite of the warehouse model lesson 33 contrasted ETL with: in a Hadoop world the raw is durable and authoritative, and structure is applied by code. The pattern persists in every “lake” and “lakehouse” architecture today.

Commodity hardware at scale. Before MapReduce, scaling analytical workloads meant buying a bigger machine: a Teradata appliance, a high-end Oracle box, an IBM mainframe partition. MapReduce demonstrated that a thousand cheap machines, each individually unreliable, could run faster and cheaper than one expensive one if you wrote the software to assume failure. Every cloud-scale data system is built on the same assumption now.

What MapReduce got wrong

Three weaknesses, in increasing order of severity.

Disk-heavy intermediate state. Between the map phase and the reduce phase, MapReduce wrote the entire shuffle output to disk. Every map task wrote its output to local disk, the shuffle service read it back over the network, and the reduce task wrote its results to HDFS. For a job with a single map and reduce, that was acceptable. For an iterative algorithm (machine-learning training, graph algorithms like PageRank, anything that loops over the same data multiple times), every iteration was a fresh round-trip through disk. Spark would later show that the same iterative jobs could run a hundred times faster by keeping intermediate state in memory.

Verbose, hard-to-debug APIs. Writing a Hadoop job in Java meant writing two classes (a Mapper and a Reducer), wiring them up in a driver class, configuring serialisation, and submitting the JAR. A SQL-equivalent of select count(*) from t group by x was sixty lines of boilerplate. Hive and Pig wrapped this in higher-level languages, but the underlying API was painful enough that “MapReduce expertise” was a job title for several years. Debugging a job that failed on machine 487 of 1000 with a stack trace involving Java serialisation was its own special kind of suffering.

Operational complexity. The Hadoop ecosystem grew explosively between 2008 and 2014, and the result was a stack that took a small team to operate. A typical cluster ran HDFS for storage, MapReduce for batch, YARN for resource scheduling, Hive for SQL, HBase for low-latency reads, ZooKeeper for coordination, Oozie for job scheduling, Sqoop for relational ingest, Flume for log ingest, Kafka for streaming (eventually), Ranger for access control, and Kerberos for authentication. Every component had its own configuration, its own failure modes, its own metrics, its own daemons to keep alive. The commercial Hadoop distributions made money by employing the people who could actually keep all of this running.

The combination of the three weaknesses is why “Hadoop” as a brand has effectively retired. Cloudera and Hortonworks merged in 2019 to consolidate a shrinking market, then went private in 2021. MapR sold its IP to HPE in 2019. The cloud providers offered managed object storage that was cheaper and easier than running HDFS, managed Spark that was faster than MapReduce, and managed warehouses that hid the cluster entirely.

What survived

The implementation faded. The ideas did not.

Distributed scheduling and data-locality awareness. Every modern batch engine, from Spark to Trino to BigQuery’s internal Dremel runtime, schedules tasks with awareness of where the data physically lives. The Hadoop YARN scheduler is gone; the principle it enforced is in Kubernetes scheduling extensions, in Spark’s task planner, in every serverless query engine.

Fault-tolerant batch as a primitive. The assumption that any task can die and the framework will re-run it is built into Spark, into Beam, into Flink’s batch mode, into every batch warehouse. The cost (functional purity) and the benefit (cheap reliability at scale) carried over wholesale.

Columnar storage formats. Apache Parquet (originated at Twitter and Cloudera, 2013) and Apache ORC (Hortonworks, 2013) emerged from the Hadoop community as columnar on-disk formats designed for the analytical access patterns MapReduce had popularised. They are now the default storage format for analytical data outside the warehouse, including for every modern lakehouse table format that builds on them (Delta, Iceberg, Hudi, all of which are in lesson 37). HDFS faded; the file formats it incubated are everywhere.

Object storage as the data lake. HDFS was a distributed filesystem on cluster-attached disks. Once Amazon S3 (2006), Google Cloud Storage (2010), and Azure Blob Storage (2010) offered durable, cheap, internet-scale storage with a similar enough API, the case for running your own HDFS cluster collapsed. The “data lake” of 2026 is a bucket on S3 or GCS or Azure Blob, with Parquet files in folder hierarchies, queried by Spark, Trino, DuckDB, or the warehouse engine. The shape is exactly what HDFS plus Hive looked like in 2014, and the operational cost is one connection string instead of a six-person team.

MapReduce as a primitive inside other engines. Spark’s RDD execution model is a generalisation of MapReduce: tasks are scheduled on partitions, shuffles redistribute by key, fault tolerance is by re-execution. The user does not write map and reduce directly any more (they write DataFrame operations or SQL), but the runtime underneath is doing what MapReduce did, with better optimisation and in-memory intermediate state.

The shape of modern batch

Pulling the surviving ideas together, the modern batch architecture has the same shape MapReduce had, with replaced parts.

The storage layer is object storage (S3, GCS, Azure Blob), holding immutable files in a columnar format (Parquet, sometimes ORC), often organised by an open table format (Delta, Iceberg, Hudi) that adds transactions and versioning on top.

The compute layer is decoupled from storage and scales independently. Spark, Trino, DuckDB, and the proprietary engines inside cloud warehouses all read the same files. You can run multiple compute engines against the same lake at the same time, which would have been a configuration nightmare in the Hadoop era.

The execution model is still distributed-task scheduling with data locality, fault tolerance via re-execution, and a shuffle phase between stages. The user-facing API is SQL or DataFrames, not raw map and reduce functions, but the runtime underneath is doing the same work the 2004 paper described.

The operational model is dramatically simpler. There is no cluster to keep alive between jobs in the warehouse case; there is one Kubernetes operator and one Helm chart in the Spark-on-Kubernetes case. The thousand-line core-site.xml is gone.

Where this leads

The dominant general-purpose batch engine of the modern era, the one that bridged the Hadoop world to the cloud-native one, is Apache Spark. Spark is what most teams reach for when they need batch processing that does not fit into a warehouse SQL query, and it is the engine underneath Databricks, the largest commercial platform built on the lessons of MapReduce. Lesson 35 covers Spark, the rest of the modern batch stack of 2026, and the decision tree of when each tool earns its place.

Citations and further reading

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, “The Google File System”, SOSP 2003, https://research.google/pubs/the-google-file-system/ (retrieved 2026-05-01).
Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI 2004, https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/ (retrieved 2026-05-01). The original paper. Short and clear; worth reading even today.
Apache Hadoop documentation, https://hadoop.apache.org/docs/stable/ (retrieved 2026-05-01).
Apache Parquet documentation, https://parquet.apache.org/docs/ (retrieved 2026-05-01).
“Hadoop: The Definitive Guide” (Tom White, O’Reilly, 4th edition, 2015). The standard reference for the Hadoop ecosystem at its peak; useful for historical context even if you never plan to run a cluster.
“Designing Data-Intensive Applications” (Martin Kleppmann, O’Reilly, 2017), chapter 10. The cleanest modern summary of the MapReduce model and its limitations.