Python, from the ground up Lesson 25 / 60

Pandas in 2026: what to know, what to skip, Polars rising

Where pandas sits in the 2026 data ecosystem, the PyArrow backend, and when Polars genuinely wins.

Welcome to Module 5. We’ve spent the first four modules on the language itself and the tooling around it; now we shift gears and spend the next ten lessons on what most working Python developers actually do with the language all day, which is push tabular data through transformations. Roughly half the Python jobs I’ve seen in the last decade were, fundamentally, “read a file, change some columns, write another file, send a Slack message when it breaks.” That work runs on pandas. Module 5 is pandas mastery; Module 6 is the harder stuff (window functions, time series, performance). Lesson 35 is a dedicated chapter on Polars, which we’ll meet briefly today.

But before we touch a read_csv, we need to talk about where pandas sits in 2026, what changed in the last three years, and why your old StackOverflow answers might be giving you advice that the library no longer agrees with.

A short history, because context matters

Pandas was started by Wes McKinney in 2008 at the hedge fund AQR. He needed something between Excel and a SQL database for time-series-heavy financial data, found that what existed in Python at the time was either NumPy (too low-level) or R-via-rpy2 (too clumsy), and wrote his own. He open-sourced it the next year, left finance, and has spent the better part of two decades shepherding the data ecosystem. (He also kicked off the Apache Arrow project in 2016 — keep that in mind, it’ll matter in five paragraphs.)

For roughly fifteen years, pandas was the answer to “I have a table in Python.” Not the only answer — there were always alternatives, from dask for out-of-core data to PySpark for cluster-scale — but the default, the thing every notebook on Kaggle and every junior data scientist’s laptop reached for. Every adjacent library was built to accept a pandas DataFrame: matplotlib’s df.plot(), scikit-learn’s fit(X, y) where X is usually a DataFrame, statsmodels, seaborn, plotly, the SQL-via-DataFrame wrappers, the database connectors. The ecosystem moat is enormous, and that matters for what we’ll discuss in a moment.

The problems were always the same. Pandas was built on top of NumPy, which had no native concept of missing values for integer or boolean columns (so an integer column with a single missing value was silently upcast to float64, with the gap stored as NaN, which is surprising and throws away the integer type). String columns were stored as Python objects in the NumPy array, which meant slow string operations and bloated memory. The view-versus-copy semantics of indexing were inconsistent: sometimes you got a view, sometimes a copy, and the famous SettingWithCopyWarning haunted everyone. None of these were fatal, but they accumulated.
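To make the missing-integer problem concrete, here is a small sketch. The values are made up for illustration. It puts the NumPy-backed default next to the nullable Int64 extension dtype that pandas later added to work around the upcast.

```python
import pandas as pd

# Made-up data: an integer column with one missing value.
raw = [1, 2, None]

# NumPy-backed default: there is no integer NaN, so the column becomes float64.
legacy = pd.Series(raw)
print(legacy.dtype)

# Nullable extension dtype: the column stays integer and the gap is pd.NA.
nullable = pd.Series(raw, dtype="Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```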

Pandas 2.0 (2023) — the quiet rewrite

Pandas 2.0 came out in April 2023 and is the most consequential release since 1.0. The headline change: a PyArrow backend as an alternative to NumPy under the hood.

What does that mean in practice? Apache Arrow is a columnar memory format designed for analytical workloads. It has first-class support for missing values (every type is nullable), efficient string storage, fast zero-copy interchange between languages and tools, and is the lingua franca that DuckDB, Polars, Spark, and modern data warehouses all speak. By giving pandas an Arrow backend, the maintainers got nullable integer columns, string operations that can be several times faster (up to an order of magnitude on some workloads), and a shared memory format with the rest of the data world, all without breaking the millions of lines of existing pandas code.

Three things to know about it as of May 2026:

It’s opt-in. Reading a CSV with pd.read_csv("file.csv") still gives you the NumPy backend. You ask for Arrow explicitly:

import pandas as pd

df = pd.read_csv("sales.csv", dtype_backend="pyarrow")
df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")

# Strings only: infer Arrow-backed string columns (opt-in on 2.x, the default from 3.0)
pd.options.future.infer_string = True

It’s mature now. The first year of the Arrow backend (2023) had rough edges: some operations fell back to NumPy and copied data, and some didn’t work at all. By pandas 2.2 (early 2024) most of those gaps were closed, and by the 2.3 series in 2025 the backend was stable for the operations a normal analyst uses. If you’re starting a new project in 2026, Arrow-backed pandas is the right default.

Copy-on-write is the new default, and it’s a real improvement. The old “is this a view or a copy?” footgun is gone in pandas 3.0 (released in January 2026), where copy-on-write is always on. If you are still on 2.x, you can opt in with pd.options.mode.copy_on_write = True. New code, new defaults: turn it on.
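Here is a minimal sketch of what copy-on-write changes, on a made-up two-row frame. Without copy-on-write, writing into a column you pulled out of a DataFrame could silently change the parent. With it, the write only touches the object you wrote to. The warning filter is there because, as far as I know, pandas 3.x treats the option as deprecated and may warn when you set it.

```python
import warnings

import pandas as pd

# On pandas 2.x this opts in. On 3.x copy-on-write is already on, and
# setting the option may emit a deprecation warning, silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    pd.options.mode.copy_on_write = True

df = pd.DataFrame({"region": ["north", "south"], "amount": [100, 200]})

amounts = df["amount"]  # shares memory with df until someone writes
amounts.iloc[0] = -1    # the write triggers a copy, so df is left alone

print(df.loc[0, "amount"])  # parent value is unchanged
print(amounts.iloc[0])      # only the extracted Series changed
```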

Polars — the rising alternative

While pandas was rewriting its plumbing, Polars was being written from scratch in Rust by Ritchie Vink. The first version landed in 2020; version 1.0 in mid-2024; by May 2026 it’s at 1.x and stable enough for production. It’s the most credible challenger pandas has faced.

The pitch is easy: Polars is built on Arrow from day one (no legacy NumPy carry-overs), is multi-threaded by default, has a query optimizer (lazy mode plans your whole pipeline before executing it), and is written in a systems language so the inner loops are not paying Python’s interpreter tax. On common operations such as filtering, joining, group-by, and aggregation, Polars is often two to ten times faster than pandas with the Arrow backend and uses meaningfully less memory, though the exact gap depends on the data and the query.

It also has a more consistent API. Pandas accumulated fifteen years of “we’ll add this method too” decisions, leading to three ways to filter rows, four ways to add a column, and an index system that is genuinely confusing to newcomers. Polars made the API uniform: everything is an expression on a column, executed in the context of a DataFrame.

Quick syntax comparison. Group some sales by region and compute the mean and total, filtering to high-value transactions first:

# pandas
import pandas as pd

df = pd.read_parquet("sales.parquet", dtype_backend="pyarrow")
result = (
    df[df["amount"] > 1000]
    .groupby("region", as_index=False)
    .agg(mean_amount=("amount", "mean"), total=("amount", "sum"))
)

# Polars
import polars as pl

df = pl.read_parquet("sales.parquet")
result = (
    df.filter(pl.col("amount") > 1000)
      .group_by("region")
      .agg(
          mean_amount=pl.col("amount").mean(),
          total=pl.col("amount").sum(),
      )
)

Lazy mode is where Polars really pulls ahead — pl.scan_parquet instead of pl.read_parquet, and the engine plans the whole computation, pushes filters down to the file reader, prunes columns you don’t use, and only executes when you call .collect(). We’ll see it in lesson 35.

So which one should you learn?

Both. But if you only have time for one first, it’s still pandas, and here’s why.

Choose pandas when:

  • The rest of your stack expects a DataFrame. Scikit-learn, statsmodels, matplotlib, seaborn, plotly, every tutorial, every Kaggle notebook, every textbook on quantitative finance written before 2025 — they assume df is a pandas.DataFrame. The ecosystem moat is the deciding factor most days.
  • You’re doing interactive analysis in a notebook. Pandas’ richer printing, better Jupyter integration, and the muscle memory of df.head(), df.describe(), df["col"].value_counts() is hard to give up.
  • Your data fits comfortably on one machine, which usually means under 50 GB of in-memory data. Pandas with the Arrow backend is fine in that range.
  • You work with a team that already knows pandas. Switching costs are real.

Choose Polars when:

  • You’re building a performance-critical batch pipeline that runs on a schedule. The two-to-ten times speedup compounds.
  • Your data is at the awkward size — too big for pandas to be comfortable, too small to justify Spark. Polars handles tens of millions of rows on a laptop in a way that pandas struggles with.
  • It’s a fresh project with no legacy and no team-knowledge constraint.
  • You want lazy evaluation, query optimization, and out-of-core support out of the box.

The “use both” pattern is genuinely common in 2026. Ingest and clean with Polars (because it’s fast and the operations are clean), then .to_pandas() at the boundary where you hand the data to scikit-learn or statsmodels. With use_pyarrow_extension_array=True, to_pandas() returns Arrow-backed pandas columns and can hand over the underlying buffers without copying, so the cost is usually small. It does require pyarrow to be installed.

import polars as pl

# Heavy lifting in Polars
clean = (
    pl.scan_parquet("raw/*.parquet")
      .filter(pl.col("status") == "completed")
      .group_by("user_id")
      .agg(pl.col("amount").sum())
      .collect()
)

# Hand off to pandas for sklearn
df = clean.to_pandas(use_pyarrow_extension_array=True)

from sklearn.linear_model import LinearRegression
# ... fit a model on df ...

What we’re installing

For the rest of Modules 5 and 6, you’ll need:

uv add pandas pyarrow numpy
uv add polars  # for lesson 35 (and for trying things alongside pandas)

PyArrow is a hard dependency for the Arrow backend, so add it now even if you’re not sure you’ll use it — half the modern read functions accept it and several pandas features warn or fail without it.
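If you want to confirm the environment before the next lesson, a quick check like the following works. It is only a sketch: has_module is a hypothetical helper name I made up, and it tests whether a package can be found without actually importing it.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# The packages this module relies on.
for package in ("pandas", "pyarrow", "numpy", "polars"):
    status = "ok" if has_module(package) else "MISSING"
    print(f"{package:10s} {status}")
```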

You do not need Numba, Cython, Dask, or any of the other “make pandas faster” packages. We’ll touch on Dask in Module 6 when we discuss out-of-core, but for the next ten lessons it’s pure pandas, plus the Arrow backend, plus a pinch of Polars to keep your perspective honest.

What this module covers

The next ten lessons:

  • 26 Series and DataFrame — the data model, finally explained.
  • 27 Reading data — CSV, Parquet, Excel, JSON, SQL, the gotchas.
  • 28 Selection and filtering — loc, iloc, boolean masks, query.
  • 29 Adding and transforming columns — assign, apply, vectorization.
  • 30 GroupBy — split-apply-combine, the heart of pandas.
  • 31 Joins and merges — what every SQL developer expects, and the index trap.
  • 32 Reshaping — pivot, melt, stack, unstack, the long/wide dance.
  • 33 Missing data — NaN, pd.NA, the Arrow null story.
  • 34 Strings and categoricals — text columns done well.
  • 35 Polars — same problems, different library.

Lesson 26 is Friday: the data model that everything in the next nine lessons sits on top of. Skip it at your peril; people who hand-wave through the Series/DataFrame/Index distinction spend the rest of their pandas career confused.

See you Friday.
