Reading data: CSV, JSON, Parquet, and the schema-on-read tradeoff

A Spark job that doesn’t read any data is a Spark job that doesn’t do any work. Today we load data into Spark from the three formats you’ll meet in most real jobs: CSV, JSON, and Parquet. Each one has different defaults, different gotchas, and a very different cost per row. Picking the right format and the right read options can be the difference between a 10-second job and a 10-minute job.

We’ll also touch on the schema-on-read tradeoff: do you let Spark guess the column types, or do you tell it? The wrong choice on a 200GB dataset will burn an hour of your morning before you notice.

Setup

Same SparkSession as last lesson. If spark isn’t already running, create it:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ReadingData")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

spark.sparkContext.setLogLevel("WARN")

We need some sample data. Save this snippet and run it once to generate three little files:

import json
from pathlib import Path

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)

# CSV
(data_dir / "orders.csv").write_text(
    "OrderId,CustomerId,Total,Country,OrderDate\n"
    "1001,1,59.00,NL,2026-03-05\n"
    "1002,1,29.00,NL,2026-03-18\n"
    "1003,2,149.00,IT,2026-02-15\n"
    "1004,2,89.50,IT,2026-03-22\n"
    "1005,3,199.00,DE,2026-03-10\n"
    "1006,4,42.42,RO,2026-03-28\n"
)

# JSON Lines (one record per line — the format Spark prefers)
(data_dir / "customers.jsonl").write_text(
    '{"CustomerId": 1, "Name": "Anna",  "Country": "NL"}\n'
    '{"CustomerId": 2, "Name": "Marco", "Country": "IT"}\n'
    '{"CustomerId": 3, "Name": "Dieter","Country": "DE"}\n'
    '{"CustomerId": 4, "Name": "Ioana", "Country": "RO"}\n'
)

# Multi-line JSON (one big array — spans multiple lines per "record")
(data_dir / "customers_pretty.json").write_text(json.dumps([
    {"CustomerId": 1, "Name": "Anna",   "Country": "NL"},
    {"CustomerId": 2, "Name": "Marco",  "Country": "IT"},
    {"CustomerId": 3, "Name": "Dieter", "Country": "DE"},
    {"CustomerId": 4, "Name": "Ioana",  "Country": "RO"},
], indent=2))

Now we have something to read.

Two ways to spell the same read

PySpark gives you two ways to write a read: a shorthand method per format, and the generic builder. They do exactly the same thing.

# Shorthand
df = spark.read.csv("./data/orders.csv", header=True)

# Builder
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("./data/orders.csv"))

Use the shorthand for one-off reads. Use the builder when you have many options or when the format is parameterised (e.g. read from a config that says csv today and parquet tomorrow).
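The config-driven case could look something like the sketch below. The read_source helper and the shape of the source dict are made up for this illustration and are not part of PySpark; only format, options, and load are real DataFrameReader methods.

```python
def read_source(spark, source):
    """Read a dataset whose format, path, and options come from config.

    `source` is a hypothetical dict, for example:
    {"format": "csv", "path": "./data/orders.csv", "options": {"header": "true"}}
    """
    reader = spark.read.format(source["format"])
    # options(**kwargs) applies several reader options in one call
    reader = reader.options(**source.get("options", {}))
    return reader.load(source["path"])


# Hypothetical config entries: switching formats is a config change, not a code change
ORDERS_TODAY = {"format": "csv", "path": "./data/orders.csv",
                "options": {"header": "true"}}
ORDERS_TOMORROW = {"format": "parquet", "path": "./data/orders.parquet"}
```

With the session from the setup step, df = read_source(spark, ORDERS_TODAY) would behave like the builder form above.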

CSV: the format that lies about being simple

df = spark.read.csv("./data/orders.csv")
df.show()
df.printSchema()

Run that and watch:

+-------+----------+------+-------+----------+
|    _c0|       _c1|   _c2|    _c3|       _c4|
+-------+----------+------+-------+----------+
|OrderId|CustomerId| Total|Country| OrderDate|
|   1001|         1| 59.00|     NL|2026-03-05|
|   1002|         1| 29.00|     NL|2026-03-18|
...

Two problems already. The header row was treated as data, and every column is string. Spark’s CSV defaults are pessimistic — without you saying otherwise, it assumes:

No header row.
Every column is a string.
Comma is the delimiter.
Double-quote is the quote character.

You almost always need to override at least the first two:

df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("./data/orders.csv"))

df.show()
df.printSchema()

+-------+----------+-----+-------+----------+
|OrderId|CustomerId|Total|Country| OrderDate|
+-------+----------+-----+-------+----------+
|   1001|         1| 59.0|     NL|2026-03-05|
...

root
 |-- OrderId: integer (nullable = true)
 |-- CustomerId: integer (nullable = true)
 |-- Total: double (nullable = true)
 |-- Country: string (nullable = true)
 |-- OrderDate: date (nullable = true)

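Better. But notice what inferSchema=True actually does: Spark reads the entire file once to figure out the types, then reads it again to load the data. On a 6-row CSV that’s free. On a 200GB CSV that’s an extra 200GB scan you didn’t need to do. The result also depends on your Spark version: recent releases infer OrderDate as date, as shown here, while older ones may give you string or timestamp for the same column. For anything bigger than “ad-hoc exploration,” inference is the wrong tool.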

The right answer in production is to declare the schema explicitly:

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType, DateType

orders_schema = StructType([
    StructField("OrderId",    IntegerType(), nullable=False),
    StructField("CustomerId", IntegerType(), nullable=False),
    StructField("Total",      DoubleType(),  nullable=False),
    StructField("Country",    StringType(),  nullable=False),
    StructField("OrderDate",  DateType(),    nullable=False),
])

df = (spark.read
      .option("header", "true")
      .schema(orders_schema)
      .csv("./data/orders.csv"))

df.show()
df.printSchema()

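One pass over the file. Strict types. Type errors fail fast (or get nulled, depending on mode). One caveat: when reading files, Spark treats every column as nullable, so nullable=False here documents your intent rather than enforcing it, and printSchema() will still report nullable = true.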
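If the StructType feels heavy, the same schema can be written as a DDL-style string, which DataFrameReader.schema() also accepts. A minimal sketch; the read_orders helper name is just for illustration:

```python
# Same five columns as orders_schema, written as a DDL string
ORDERS_DDL = "OrderId INT, CustomerId INT, Total DOUBLE, Country STRING, OrderDate DATE"


def read_orders(spark, path="./data/orders.csv"):
    """Read the orders CSV in a single pass using the DDL schema above."""
    return (spark.read
            .option("header", "true")
            .schema(ORDERS_DDL)   # a string works here as well as a StructType
            .csv(path))
```

The DDL form has no slot for nullability, which is a reasonable trade when you only care about names and types.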

CSV options you’ll actually use

There are roughly 30 CSV options. The handful that matter:

df = (spark.read
      .option("header", "true")            # First row is column names
      .option("inferSchema", "false")      # Don't guess — supply schema explicitly
      .option("delimiter", ",")            # Or ";", "\t", "|"...
      .option("quote", '"')                # Field-quoting character
      .option("escape", '"')               # Escape character inside quoted fields (Spark's default is backslash)
      .option("nullValue", "")             # What string means NULL
      .option("dateFormat", "yyyy-MM-dd")  # ISO-8601 by default
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .option("mode", "PERMISSIVE")        # PERMISSIVE | DROPMALFORMED | FAILFAST
      .schema(orders_schema)
      .csv("./data/orders.csv"))

The mode is the one nobody talks about and everyone needs:

PERMISSIVE (default): fields that fail to parse become null. The raw malformed line is kept in a column called _corrupt_record, but only if your schema includes that column; otherwise it is silently lost.
DROPMALFORMED: bad rows are dropped silently. Don’t use this. You’ll never know how many rows you lost.
FAILFAST: the first bad row raises an exception. Because reads are lazy, that happens when an action such as show() or a write actually parses the row, not when you call spark.read. Use this in production. Loud failures are better than silent corruption.
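When you do want PERMISSIVE, for example to quarantine bad rows instead of failing the job, the raw lines can be kept by adding the corrupt-record column to the schema. The sketch below assumes the default column name _corrupt_record; the split_good_and_bad helper is a name invented for this example.

```python
# Same columns as orders_schema, plus a string column that receives the raw bad line.
# "_corrupt_record" is Spark's default name (see the columnNameOfCorruptRecord option).
ORDERS_WITH_CORRUPT = (
    "OrderId INT, CustomerId INT, Total DOUBLE, Country STRING, "
    "OrderDate DATE, _corrupt_record STRING"
)


def split_good_and_bad(spark, path):
    """Return (good_rows, bad_rows) for a CSV read in PERMISSIVE mode."""
    df = (spark.read
          .option("header", "true")
          .option("mode", "PERMISSIVE")
          .schema(ORDERS_WITH_CORRUPT)
          .csv(path)
          # Spark rejects queries that reference only the corrupt-record column
          # on a raw file read; caching first is the documented workaround.
          .cache())
    bad = df.filter("_corrupt_record IS NOT NULL")
    good = df.filter("_corrupt_record IS NULL").drop("_corrupt_record")
    return good, bad
```

Calling bad.count() after this gives you the number DROPMALFORMED would have hidden, and bad itself can be written somewhere for inspection.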

JSON: two formats with the same extension

Spark’s JSON reader expects JSON Lines by default — one JSON object per line, no surrounding array, no commas between records:

df = spark.read.json("./data/customers.jsonl")
df.show()
df.printSchema()

+-------+----------+------+
|Country|CustomerId|  Name|
+-------+----------+------+
|     NL|         1|  Anna|
|     IT|         2| Marco|
|     DE|         3|Dieter|
|     RO|         4| Ioana|
+-------+----------+------+

root
 |-- Country: string (nullable = true)
 |-- CustomerId: long (nullable = true)
 |-- Name: string (nullable = true)

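Notice JSON gives you most types for free. The format encodes numbers, booleans, and strings differently, so Spark doesn’t have to guess between them; integers come through as long. (It does still scan the file once to discover the schema, i.e. which fields exist.) The exception is dates and timestamps: JSON can only carry them as strings, so they stay strings unless you supply a schema.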

The other JSON shape you’ll see is a single large array spanning multiple lines:

[
  {"CustomerId": 1, "Name": "Anna", ...},
  {"CustomerId": 2, "Name": "Marco", ...},
  ...
]

This is what most APIs return. Spark’s default JSON reader will mangle it — each line is parsed independently, and the bracket / comma lines fail to parse. To read this format, set multiLine=True:

df = (spark.read
      .option("multiLine", "true")
      .json("./data/customers_pretty.json"))

df.show()

Performance note: multiLine=True means Spark can’t split the file across executors — the JSON parser needs to see the whole file as one stream. On a 50GB pretty-printed JSON, you get one task and one CPU core. Convert to JSON Lines (jq -c '.[]' input.json > output.jsonl) the moment the file gets big.
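If jq isn’t available, the same conversion can be done with the Python standard library. This is a minimal sketch that assumes the whole array fits in memory; the function name is made up for the example.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src, dst):
    """Rewrite a file holding one JSON array as JSON Lines; returns the record count.

    Assumption: the array fits in memory, because json.load parses it in one go.
    """
    with Path(src).open(encoding="utf-8") as f:
        records = json.load(f)
    with Path(dst).open("w", encoding="utf-8") as out:
        for record in records:
            # Without indent, json.dumps keeps each record on exactly one line
            out.write(json.dumps(record, separators=(",", ":")) + "\n")
    return len(records)
```

For example, json_array_to_jsonl("./data/customers_pretty.json", "./data/customers_from_array.jsonl") produces a file the default spark.read.json can split across tasks. For arrays too large for memory you would need a streaming parser instead.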

JSON also handles nested fields naturally:

nested = spark.read.json(spark.sparkContext.parallelize([
    '{"order": 1001, "customer": {"id": 1, "name": "Anna"}}',
    '{"order": 1002, "customer": {"id": 2, "name": "Marco"}}',
]))

nested.printSchema()
nested.select("order", "customer.name").show()

root
 |-- customer: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |-- order: long (nullable = true)

+-----+-----+
|order| name|
+-----+-----+
| 1001| Anna|
| 1002|Marco|
+-----+-----+

The customer column comes through as a struct. We’ll get to nested-data handling properly in Module 4.

Parquet: the one Spark actually likes

Parquet is the format Spark is built for. It’s a columnar binary format with a baked-in schema, per-column compression, and row-group statistics that let Spark skip whole chunks of a file when filtering.

Let’s write the orders DataFrame to Parquet, then read it back:

df = (spark.read
      .option("header", "true")
      .schema(orders_schema)
      .csv("./data/orders.csv"))

df.write.mode("overwrite").parquet("./data/orders.parquet")

That writes a folder, not a single file. Inside ./data/orders.parquet/ you’ll see something like:

_SUCCESS
part-00000-abc123.snappy.parquet
part-00001-abc123.snappy.parquet
...

One file per DataFrame partition (that is, per write task), each individually compressed (Snappy by default); on our 6-row sample you’ll most likely see a single part file. The _SUCCESS marker tells Spark and Hadoop tools “this write completed cleanly.” If you see no _SUCCESS, the writer probably crashed mid-write and you should treat the output as bad.
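A downstream script can check for the marker before trusting the folder. The sketch below assumes the output lives on the local filesystem and that the marker has not been switched off in your environment; on HDFS or object storage you would use that store’s own client instead. The helper name is invented for this example.

```python
from pathlib import Path


def write_completed(output_dir):
    """True if a Spark output folder on the local filesystem has its _SUCCESS marker."""
    return (Path(output_dir) / "_SUCCESS").is_file()
```

A typical guard would be: if not write_completed("./data/orders.parquet"), raise an error rather than read a half-written dataset.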

Read it back:

df2 = spark.read.parquet("./data/orders.parquet")
df2.show()
df2.printSchema()

+-------+----------+-----+-------+----------+
|OrderId|CustomerId|Total|Country| OrderDate|
+-------+----------+-----+-------+----------+
|   1001|         1| 59.0|     NL|2026-03-05|
...

No options needed. No schema inference. Same types as the source. Faster. Smaller. Strict. This is why Parquet wins.

Why Parquet is fast: column pruning

A row-store like CSV stores 1001,1,59.00,NL,2026-03-05 as one chunk. To read just the Total column, you have to scan every byte of every row.

Parquet stores all the OrderId values together, then all the CustomerId values, then all the Total values. To read just Total, Spark reads only that column off disk. On a wide table (50+ columns), this is often 10-50× faster than CSV.

Run this to feel the difference:

# Read Parquet, project one column
spark.read.parquet("./data/orders.parquet").select("Total").show()

# Read CSV, project one column — Spark still has to parse every row
(spark.read
 .option("header", "true")
 .schema(orders_schema)
 .csv("./data/orders.csv")
 .select("Total")
 .show())

On 6 rows you can’t see the difference. On 6 million rows, the Parquet read is over before the CSV read warms up.
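To make the layout argument concrete, here is a toy model in plain Python using the six sample orders. It only counts how many values a reader has to pass over to total one column; real Parquet adds row groups, encodings, and compression, none of which are modelled here.

```python
# Toy model only: this is not how Parquet is actually laid out on disk.
NAMES = ("OrderId", "CustomerId", "Total", "Country", "OrderDate")
ROWS = [
    (1001, 1, 59.00, "NL", "2026-03-05"),
    (1002, 1, 29.00, "NL", "2026-03-18"),
    (1003, 2, 149.00, "IT", "2026-02-15"),
    (1004, 2, 89.50, "IT", "2026-03-22"),
    (1005, 3, 199.00, "DE", "2026-03-10"),
    (1006, 4, 42.42, "RO", "2026-03-28"),
]

# Row layout: the values of one record sit together, so a scan walks every field
touched_row_layout = 0
total_from_rows = 0.0
for row in ROWS:
    touched_row_layout += len(row)              # all 5 fields pass under the reader
    total_from_rows += row[NAMES.index("Total")]

# Column layout: the values of one column sit together, so a scan reads just that list
COLUMNS = {name: [row[i] for row in ROWS] for i, name in enumerate(NAMES)}
touched_column_layout = len(COLUMNS["Total"])
total_from_columns = sum(COLUMNS["Total"])

print(touched_row_layout, touched_column_layout)  # 30 6
```

Five columns means five times the values touched; on a 50-column table the same ratio is 50 to 1, before compression even enters the picture.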

Parquet write modes

df.write.mode("overwrite").parquet(path)   # Wipe and rewrite
df.write.mode("append").parquet(path)      # Add new files alongside old ones
df.write.mode("ignore").parquet(path)      # If path exists, do nothing
df.write.mode("error").parquet(path)       # Default — raise if path exists

error is the default and is the right default — it stops you accidentally clobbering data. In production scripts I usually use overwrite only with date-based path partitioning, so each day writes to its own folder and “overwrite” only affects today’s slot.
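The date-based pattern might be sketched like this. The dt=YYYY-MM-DD folder naming and both helper names are assumptions for the illustration, not a Spark requirement.

```python
from datetime import date


def daily_path(base, run_date=None):
    """Build a per-day output folder such as <base>/dt=2026-03-28."""
    run_date = run_date or date.today()
    return f"{base.rstrip('/')}/dt={run_date.isoformat()}"


def write_daily(df, base, run_date=None):
    """Overwrite only the folder for run_date; other days are left untouched."""
    # "overwrite" is scoped to the path it is given, and the path is per-day
    df.write.mode("overwrite").parquet(daily_path(base, run_date))
```

Re-running a day is then safe: write_daily(df, "./data/orders_daily", date(2026, 3, 28)) replaces that one folder. Because the folder name uses the key=value form, reading the base path should also surface dt as a column.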

Partitioning on write

df.write.mode("overwrite").partitionBy("Country").parquet("./data/orders_part.parquet")

This writes one subdirectory per Country value:

orders_part.parquet/
  Country=NL/
    part-00000-...snappy.parquet
  Country=IT/
    part-00000-...snappy.parquet
  Country=DE/
    part-00000-...snappy.parquet
  Country=RO/
    part-00000-...snappy.parquet

Now spark.read.parquet(...).filter("Country = 'IT'") only reads the Country=IT folder. Skipping data is the cheapest performance win there is.
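To see what partitionBy actually wrote, you can list the key=value folders directly. A small sketch for a local output folder; the helper name is invented here, and values containing special characters may appear escaped in the folder names.

```python
from pathlib import Path


def partition_values(root, column):
    """List the values partitionBy(column) wrote under a local output folder."""
    prefix = f"{column}="
    return sorted(
        p.name[len(prefix):]
        for p in Path(root).iterdir()
        if p.is_dir() and p.name.startswith(prefix)
    )
```

For the sample data, partition_values("./data/orders_part.parquet", "Country") should return ['DE', 'IT', 'NL', 'RO'].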

But, and this matters, partition by low-cardinality columns only. Partitioning by OrderId would create one folder, holding at least one tiny file, per order. Partitioning by Country (4–50 values) is great. Partitioning by OrderDate (one folder per day) is the standard pattern for time-series data. We’ll come back to this in lesson 35.
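A cheap guard is to count distinct values before committing to a partition column. This is a sketch only: the helper name is made up, and the default limit of 1000 is an arbitrary illustrative number, not a Spark rule.

```python
def check_partition_column(df, column, max_values=1000):
    """Count distinct values of `column`; raise if partitioning on it looks unwise.

    max_values is an arbitrary illustrative limit; pick one that suits your storage.
    """
    distinct_values = df.select(column).distinct().count()  # runs a Spark job
    if distinct_values > max_values:
        raise ValueError(
            f"{column} has {distinct_values} distinct values; "
            f"partitionBy would create that many folders"
        )
    return distinct_values
```

On the sample orders, check_partition_column(df, "Country") returns 4, while a tight limit on OrderId would raise. The count itself costs a pass over the data, so run it once while designing the layout rather than on every write.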

The schema-on-read tradeoff

Three positions you can take when reading data:

1. Inferred schema (inferSchema=True for CSV; default for JSON).

Pros: zero ceremony, Spark figures it out.
Cons: extra full-file pass for CSV. Fragile — if Friday’s file has a string in a number column, your Monday job breaks. No type guarantees in your code.
When to use: ad-hoc exploration. Single-developer notebooks. Files small enough that the extra pass is free.

2. Explicit schema (.schema(StructType(...))).

Pros: one pass. Strict types. Type errors caught early. Self-documenting.
Cons: more code. Has to be kept in sync with the source if columns are added.
When to use: production. Anything that runs on a schedule. Anything where wrong types would cause data quality issues downstream.

3. Schema in the file (Parquet, ORC, Avro, Delta).

Pros: no schema to maintain. Strict types. Schema evolves with the data.
Cons: requires controlling the writing side too — you can’t do this with CSVs that arrive from a partner.
When to use: any data your own pipeline produces. Move from CSV to Parquet at the first internal hop, then never touch CSV again.

In a typical data lake the pattern is: CSV (with explicit schema) at the ingestion edge, Parquet (or Delta) for everything internal. We’ll spend lesson 13 on schema definition itself — every PySpark type, when to use what, how to handle nullability, schema merging on append.

A complete read-and-show script

Pulling it all together:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               IntegerType, DoubleType, StringType, DateType)

spark = (SparkSession.builder
         .appName("ReadingData")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")

orders_schema = StructType([
    StructField("OrderId",    IntegerType(), False),
    StructField("CustomerId", IntegerType(), False),
    StructField("Total",      DoubleType(),  False),
    StructField("Country",    StringType(),  False),
    StructField("OrderDate",  DateType(),    False),
])

# CSV with explicit schema — production pattern
orders = (spark.read
          .option("header", "true")
          .option("mode", "FAILFAST")
          .schema(orders_schema)
          .csv("./data/orders.csv"))

# JSON Lines — types come for free
customers = spark.read.json("./data/customers.jsonl")

# Write to Parquet for downstream jobs
orders.write.mode("overwrite").partitionBy("Country").parquet("./data/orders.parquet")

# Read back — fast, schema baked in
parq = spark.read.parquet("./data/orders.parquet")

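# Quick join. Both sides carry a Country column, so drop the customers copy;
# otherwise select("Country") fails as an ambiguous reference.
joined = (parq.join(customers.drop("Country"), on="CustomerId", how="inner")
              .select("OrderId", "Name", "Country", "Total", "OrderDate"))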
joined.show()

spark.stop()

If that runs end-to-end and prints a 6-row joined table with Italian, Dutch, German, and Romanian customers, you’ve got the read pipeline working. Everything else in PySpark builds on this.

Next module: actually working with DataFrames. Selecting columns, filtering rows, the difference between select and selectExpr, and why .show() lies to you about what your job is doing.