Python, from the ground up Lesson 32 / 60

Time series: resample, rolling, the date-time gotchas

DatetimeIndex, frequency conversion, rolling windows, and the timezone bugs that bite production.

Pandas was originally written at AQR, a quantitative hedge fund, to handle financial time series. It shows. The time-series API is one of the most polished parts of the library — once you give it a DataFrame with a DatetimeIndex, a whole machinery opens up for you: frequency conversion, rolling windows, gap filling, timezone arithmetic. Today we cover the parts that matter for normal analytical work, and the timezone bug that shows up in every production pipeline at least once.

DatetimeIndex: the unlock

A DataFrame with a DatetimeIndex is a time series — pandas knows how to work with it. A DataFrame with a created_at column of type datetime is just a DataFrame with a datetime column; you can filter on it but you don’t get the time-series machinery. Setting one is the first step:

import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["ts"], date_format="ISO8601")
df = df.set_index("ts").sort_index()

Two things matter here. One, set the index after parsing — parse_dates= makes the column a real datetime, then set_index promotes it to the index. Two, sort_index() — most time-series operations assume the index is monotonically increasing, and you’ll get either wrong answers or warnings if it isn’t. Sort once at load time.

Once you have a DatetimeIndex, partial-string indexing just works:

df["2026"]                     # all rows in 2026
df["2026-03"]                  # all rows in March 2026
df["2026-03-15":"2026-03-20"]  # a slice

This is one of those pandas conveniences that feels magical the first time it works.

Frequency strings

Almost every time-series function takes a frequency string — pandas’ shorthand for “what bucket size do you want.” The ones you’ll actually use:

  • "D" — daily
  • "H" or "h" — hourly (the lowercase form is the modern one in pandas 2.2+; both still work)
  • "min" — minutes (older "T" still accepted but deprecated)
  • "W" — weekly (defaults to week-ending Sunday; use "W-MON", "W-FRI", etc. for other anchors)
  • "M" / "ME" — month-end. In pandas 2.2+, "ME" is the explicit alias and "M" will eventually warn.
  • "MS" — month-start
  • "Q" / "QE" — quarter-end ("QS" for quarter-start)
  • "Y" / "YE" — year-end ("YS" for year-start)

You can also combine them: "5min", "15min", "4H", "2W". Pandas accepts almost any sensible composition.

Resampling: resample()

resample is groupby for time. It buckets your rows by a frequency and lets you aggregate.

Downsampling — going from higher to lower frequency, e.g. minute data to daily totals. This is the common case:

daily = df.resample("D").sum()
hourly_avg = df.resample("h").mean()
weekly_max = df.resample("W").max()

You can apply multiple aggregations at once with .agg:

df.resample("D").agg({"sales": "sum", "visitors": "mean", "orders": "count"})

Upsampling — going from lower to higher frequency. This creates rows that don’t exist in the source, and you need to tell pandas how to fill them:

df.resample("h").ffill()   # forward-fill: carry the last known value forward
df.resample("h").bfill()   # back-fill: pull the next known value backward
df.resample("h").interpolate()  # linear interpolation between known points

Forward-fill is the safe default for things like sensor readings where “the last value is still true until a new one arrives.” Interpolation is right for genuinely continuous quantities (temperature, prices). Don’t interpolate counts.

asfreq — for changing the index frequency without aggregating. Useful when you want a regular grid but no fancy aggregation:

df.asfreq("D", method="ffill")   # one row per day, forward-filled

This is also how you align a sparse, irregular series onto a regular calendar — common when you have one row per event and you want one row per day.

Rolling windows: rolling()

A rolling window is a moving average’s bigger cousin: it lets you compute any aggregation over a sliding window of the data.

df["sales"].rolling(window=7).mean()        # 7-row trailing mean
df["sales"].rolling(window=7).std()         # 7-row trailing standard deviation
df["sales"].rolling(window="7D").mean()     # last 7 days, regardless of row count

The window argument is the killer feature. Pass an integer and you get a row-count window; pass a frequency string and you get a time-based window. The latter is robust to gaps in your data: a window="7D" rolling mean over a sparse time series gives you “the mean of all data in the last 7 days,” whatever the row count happens to be. Row-based windows give you something nonsensical when there are gaps.

min_periods — by default, rolling outputs NaN until the window is full. The first six days of a 7-day rolling average are NaN. To allow partial windows:

df["sales"].rolling(window=7, min_periods=1).mean()

Now day 1 returns the mean of itself, day 2 of the first two, and so on. Useful when you want a series with no leading NaN, but be aware that the early values are based on less data — don’t compare them to the mature ones without context.

rolling has the full aggregation menu — mean, sum, std, min, max, median, count, quantile, plus apply for arbitrary functions. The arbitrary-function path is slower; if you can express what you want with the built-ins, do.

Expanding windows

expanding() is rolling with a window that grows from the start instead of sliding. Day 1’s value is just day 1; day 2’s is the aggregate over days 1-2; day 100’s is the aggregate over days 1-100.

df["sales"].expanding().mean()   # cumulative average from the start
df["sales"].expanding().sum()    # equivalent to .cumsum()
df["sales"].expanding().max()    # running maximum

It’s the right tool for “running total since the start of the period” — running revenue, all-time high, lifetime average — anywhere you’d otherwise reach for cumsum or cummax and want a more flexible aggregator.

Exponentially weighted: ewm()

ewm (exponentially weighted) is rolling’s smarter cousin. Instead of treating every point in the window equally, it gives more weight to recent points and less to older ones, smoothly:

df["sales"].ewm(span=7).mean()      # 7-period span (analogous to a 7-day average)
df["sales"].ewm(halflife=7).mean()  # weight halves every 7 periods
df["sales"].ewm(alpha=0.3).mean()   # explicit smoothing factor

Use ewm when you want responsiveness — an EWMA reacts to a level change faster than a simple moving average of the same span. It’s the standard for financial volatility, anomaly detection, and any dashboard where “what’s happening recently” matters more than “what happened over the whole window.”

shift(): lag and lead

shift(n) moves your data n rows down (positive n) or up (negative n). With a regular DatetimeIndex and frequency, it lags by that many periods:

df["sales_yesterday"] = df["sales"].shift(1)
df["sales_next_week"] = df["sales"].shift(-7)
df["wow_growth"] = df["sales"] / df["sales"].shift(7) - 1   # week-on-week

This is the building block for any “compared to last period” calculation. The first n (or last, for negative shifts) values are NaN — there is no yesterday for the first day.

Timezones: the production trap

Now the part that bites everyone. Datetimes in pandas come in two flavours: naive (no timezone attached, just a wall-clock string) and aware (timezone attached). They don’t mix — you cannot compare a naive timestamp to an aware one without an error.

This was lesson 8’s territory in the language; here it shows up in DataFrame indexes. The two operations:

# the index is naive; you know it's actually UTC
df.index = df.index.tz_localize("UTC")

# now convert to a display timezone
df.index = df.index.tz_convert("Europe/Rome")

tz_localize attaches a timezone (no value change). tz_convert changes the timezone (the wall-clock value updates accordingly). Localising to the wrong zone, then converting, is one of the classic ways to ship “timestamps off by 6 hours” to production.

The deeper trap: DST and resample. Suppose your index is Europe/Rome and you resample("D").sum(). On the spring-forward day, you have 23 hours in your “day”; on autumn-back, 25. Pandas handles this correctly if the index is timezone-aware, and gets it subtly wrong if you’ve stored localtime as a naive index. The rule: store data in UTC, convert to local for display only. Set tz_localize("UTC") at ingestion if your source is UTC, tz_convert("UTC") if it’s another zone, and then never reason in localtime within your pipeline again. Convert at the edge.

A real example: e-commerce daily sales with rolling and weekly views

Putting it all together — typical first-pass analysis of an e-commerce orders table:

import pandas as pd

orders = pd.read_parquet("orders.parquet")
orders["created_at"] = pd.to_datetime(orders["created_at"], utc=True)
orders = orders.set_index("created_at").sort_index()

# Daily sales totals in UTC
daily = orders["amount"].resample("D").sum().rename("daily_sales")

# 7-day rolling average for the dashboard
daily_smooth = daily.rolling("7D", min_periods=1).mean().rename("rolling_7d")

# Weekly totals for the email
weekly = daily.resample("W-MON").sum().rename("weekly_sales")

# Week-on-week growth
weekly_growth = (weekly / weekly.shift(1) - 1).rename("wow_growth")

report = pd.concat([daily, daily_smooth], axis=1)
report.tail(30).to_csv("daily_dashboard.csv")
weekly_growth.to_csv("weekly_growth.csv")

Three views of the same data, each appropriate for a different consumer: a 30-day daily dashboard with a smoothed line, a weekly summary, a week-on-week growth series. None of it is more than a few lines because the index is doing the work.

What’s next

Lesson 33 is dtype optimisation — category and string[pyarrow], the two dtypes that turn a “this dataset doesn’t fit in memory” problem into “this dataset is fine.” If you’ve ever watched a pandas job swap to disk and die, you’ll want to read it.

Further reading

See you Tuesday.

Search