Python, from the ground up Lesson 48 / 60

Numerical project: a real analysis

Take a public dataset and run a complete numerical analysis: descriptive stats, fits, hypothesis tests, plots.

Module 8 has been a tour: NumPy in lesson 41, pandas in 42, matplotlib and seaborn in 43, statistics with SciPy in 44, scientific computing with SciPy in 45, then a detour through features-I-learned-too-late and notebooks. We’ve got pieces. This lesson glues them together into a real numerical analysis end to end.

The goal isn’t to teach a new library. It’s to show what a complete analysis looks like when you sit down at a notebook with a CSV, a question, and an afternoon.

The dataset

Pick a public dataset. I’ll use the Olympic athletes dataset from Kaggle (120 years of summer and winter Olympics) because it has a clean mix of categorical, ordinal, and continuous variables. If you want different data, the World Bank’s GDP/population CSVs work, or any of the public health datasets on data.gov.

Download athlete_events.csv. Roughly 270k rows, columns including Sex, Age, Height, Weight, Sport, Year, Medal.

The question I’ll chase: have male marathon medalists changed in body composition over the 120-year history of the modern Olympics?

That’s a real question. It has a measurable answer. It’s the kind of thing you might do as a warmup before a more serious project.

Step 1: load and inspect

from pathlib import Path
import pandas as pd
import numpy as np

DATA = Path("athlete_events.csv")
df = pd.read_csv(DATA)

print(f"{df.shape=}")
print(df.dtypes)
df.head()

Always inspect first. Notice missingness, types, obvious junk. Age, Height, Weight will have NaNs (early Olympics didn’t always record them). Medal is NaN for non-medalists.

Step 2: clean and subset

marathons = (
    df[(df["Sport"] == "Athletics")
       & (df["Event"].str.contains("Marathon", na=False))
       & (df["Sex"] == "M")
       & df["Medal"].notna()]
    .dropna(subset=["Age", "Height", "Weight"])
    .copy()
)

marathons["BMI"] = marathons["Weight"] / (marathons["Height"] / 100) ** 2
print(f"{len(marathons)} medalist-events after cleaning")
marathons[["Year", "Age", "Height", "Weight", "BMI"]].describe()

Three things matter here. First, I made a .copy() to avoid the SettingWithCopyWarning when adding the BMI column. Second, I dropped only rows missing the variables I’ll actually use. Third, I’m computing BMI as a derived feature because it’s a more meaningful “body composition” number than weight or height alone.

Always print sample sizes after a filter. If you have 18 rows when you expected 200, your filter is wrong.

Step 3: descriptive statistics

This is the step most tutorials skim over. Don’t.

from scipy import stats

def describe(series: pd.Series) -> dict:
    return {
        "n": len(series),
        "mean": series.mean(),
        "median": series.median(),
        "std": series.std(),
        "p25": series.quantile(0.25),
        "p75": series.quantile(0.75),
        "iqr": series.quantile(0.75) - series.quantile(0.25),
        "skew": stats.skew(series),
        "kurtosis": stats.kurtosis(series),
    }

for col in ["Age", "Height", "Weight", "BMI"]:
    s = describe(marathons[col])
    print(f"{col}: " + ", ".join(f"{k}={v:.2f}" for k, v in s.items()))

Skewness measures asymmetry; kurtosis measures tail-weight relative to a normal. For BMI of elite marathoners, you’d expect a tight, symmetric distribution. If skewness is far from zero or kurtosis is high, you have outliers worth investigating before any test.

Step 4: visualize

A summary statistic is a lie until you see the distribution.

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
fig, axes = plt.subplots(2, 2, figsize=(11, 8))
for ax, col in zip(axes.flat, ["Age", "Height", "Weight", "BMI"]):
    sns.histplot(marathons[col], kde=True, ax=ax, bins=30)
    ax.set_title(col)
fig.tight_layout()
plt.savefig("marathon_distributions.png", dpi=150)

You’ll typically see Age skewed slightly right (a few older medalists), Height roughly normal, Weight right-skewed, BMI tight around 20-21.

Now the change-over-time plot:

marathons["Era"] = pd.cut(
    marathons["Year"],
    bins=[1895, 1948, 1980, 2025],
    labels=["1896-1948", "1952-1980", "1984-2024"],
)

fig, ax = plt.subplots(figsize=(9, 5))
sns.boxplot(data=marathons, x="Era", y="BMI", ax=ax)
ax.set_title("BMI of male marathon medalists by era")
fig.tight_layout()
plt.savefig("bmi_by_era.png", dpi=150)

If the boxes look identical, your hypothesis is dead before you test it. If the median visibly moves, you have something worth a formal test.

Step 5: hypothesis test

Suppose the boxplot shows the modern era has lower BMI. Formal question: is the difference statistically significant?

First check normality. If both groups are roughly normal, use a t-test. If not, use Mann-Whitney U.

old = marathons.loc[marathons["Era"] == "1896-1948", "BMI"].to_numpy()
new = marathons.loc[marathons["Era"] == "1984-2024", "BMI"].to_numpy()

print(f"old: n={len(old)}, mean={old.mean():.2f}")
print(f"new: n={len(new)}, mean={new.mean():.2f}")

# Shapiro-Wilk normality
print("Shapiro old:", stats.shapiro(old))
print("Shapiro new:", stats.shapiro(new))

# Welch's t-test (unequal variance)
t = stats.ttest_ind(old, new, equal_var=False)
print(f"Welch t: t={t.statistic:.3f}, p={t.pvalue:.4f}")

# Mann-Whitney U as a non-parametric backup
u = stats.mannwhitneyu(old, new, alternative="two-sided")
print(f"Mann-Whitney: U={u.statistic:.1f}, p={u.pvalue:.4f}")

This is exactly the place AI assistants are surprisingly useful. If you describe your data — “two independent groups, continuous outcome, possibly non-normal, n=30 vs n=80” — Claude or GPT will recommend the right test and explain its assumptions. They’re far better at statistical-test selection than you might expect, because the decision tree is well-documented training material. Verify the recommendation against the SciPy docs, but treat it as a strong starting point.

Step 6: curve fitting

For variety, let’s fit a curve. Pick a different question: how has the world record marathon time changed over the years? You’d need a separate dataset for that, but the technique is the same. Here I’ll fit a logistic curve to the cumulative count of unique marathon medalists by year as a stand-in for “growth of the event”:

from scipy.optimize import curve_fit

yearly = (
    marathons.groupby("Year").size()
    .cumsum()
    .reset_index(name="cumulative")
)

def logistic(x, L, k, x0):
    return L / (1 + np.exp(-k * (x - x0)))

x = yearly["Year"].to_numpy(dtype=float)
y = yearly["cumulative"].to_numpy(dtype=float)

p0 = [y.max(), 0.05, x.mean()]
popt, pcov = curve_fit(logistic, x, y, p0=p0, maxfev=5000)
L_fit, k_fit, x0_fit = popt

x_smooth = np.linspace(x.min(), x.max(), 200)
y_smooth = logistic(x_smooth, *popt)

fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(x, y, label="observed", alpha=0.6)
ax.plot(x_smooth, y_smooth, color="red", label=f"logistic (L={L_fit:.0f})")
ax.set_xlabel("Year")
ax.set_ylabel("Cumulative medalists")
ax.legend()
fig.tight_layout()
plt.savefig("marathon_growth.png", dpi=150)

curve_fit returns the fitted parameters and their covariance matrix. The square root of the diagonal of pcov gives standard errors:

perr = np.sqrt(np.diag(pcov))
print(f"L = {L_fit:.1f} +/- {perr[0]:.1f}")
print(f"k = {k_fit:.4f} +/- {perr[1]:.4f}")
print(f"x0 = {x0_fit:.1f} +/- {perr[2]:.1f}")

Always report uncertainties. A fit without error bars is half a result.

Step 7: write up

The step everyone skips. Format the final result so a human can read it without looking at code:

def summary(old: np.ndarray, new: np.ndarray, t_result, u_result) -> str:
    return (
        f"Marathon medalist BMI: era 1896-1948 (n={len(old)}) "
        f"mean {old.mean():.2f} (SD {old.std():.2f}); "
        f"era 1984-2024 (n={len(new)}) "
        f"mean {new.mean():.2f} (SD {new.std():.2f}). "
        f"Welch t = {t_result.statistic:.2f}, p = {t_result.pvalue:.4f}. "
        f"Mann-Whitney U = {u_result.statistic:.0f}, p = {u_result.pvalue:.4f}."
    )

print(summary(old, new, t, u))

The interpretation step is the one most tutorials drop and the one that matters most. A p-value under 0.05 doesn’t mean “modern marathoners are slimmer.” It means “if there were no real difference, the chance of seeing data at least this different would be under 5%.” Then ask: is the effect size practically meaningful? A 0.3-point BMI difference at p=0.001 is statistically significant and physically irrelevant. Always report the effect alongside the p-value.

The runnable script

The whole analysis above, factored into a single .py file, is around 150 lines. Here’s the skeleton:

from pathlib import Path
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.optimize import curve_fit

def load(path: Path) -> pd.DataFrame:
    df = pd.read_csv(path)
    return df

def select_marathons(df: pd.DataFrame) -> pd.DataFrame:
    mask = (
        (df["Sport"] == "Athletics")
        & df["Event"].str.contains("Marathon", na=False)
        & (df["Sex"] == "M")
        & df["Medal"].notna()
    )
    out = df[mask].dropna(subset=["Age", "Height", "Weight"]).copy()
    out["BMI"] = out["Weight"] / (out["Height"] / 100) ** 2
    return out

def by_era(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["Era"] = pd.cut(
        df["Year"],
        bins=[1895, 1948, 1980, 2025],
        labels=["1896-1948", "1952-1980", "1984-2024"],
    )
    return df

def test_bmi(df: pd.DataFrame) -> tuple:
    old = df.loc[df["Era"] == "1896-1948", "BMI"].to_numpy()
    new = df.loc[df["Era"] == "1984-2024", "BMI"].to_numpy()
    t = stats.ttest_ind(old, new, equal_var=False)
    u = stats.mannwhitneyu(old, new, alternative="two-sided")
    return old, new, t, u

if __name__ == "__main__":
    df = load(Path("athlete_events.csv"))
    marathons = by_era(select_marathons(df))
    old, new, t, u = test_bmi(marathons)
    print(f"old n={len(old)} mean={old.mean():.2f}")
    print(f"new n={len(new)} mean={new.mean():.2f}")
    print(f"t={t.statistic:.2f}, p={t.pvalue:.4f}")
    print(f"U={u.statistic:.0f}, p={u.pvalue:.4f}")

Notice the structure: each step is a function with a clear input and output. The if __name__ == "__main__": block is the recipe. Now the analysis is testable, schedulable, and importable — exactly the qualities lesson 47 said notebooks lack.

A few traps worth naming

Multiple comparisons. The moment you start running tests on more than one outcome — BMI, age, height, weight, by era, by region — your p-values stop meaning what you think they mean. With twenty independent tests at p<0.05, you expect one false positive on average. Apply Bonferroni correction (divide your alpha by the number of tests) or, less brutally, the Benjamini-Hochberg procedure. SciPy has stats.false_discovery_control for the latter.

Confounding. If modern marathoners have lower BMI and are also mostly East African — which they are — and East Africans on average have different body composition than the early 20th-century European medalists, you’re partly measuring “where the medalists come from” and partly measuring “what era they’re from.” A cleaner analysis would stratify by region or use a regression with both era and region as predictors.

Missing not at random. The early Olympics didn’t always record height and weight. The athletes for whom they were recorded may not be a representative sample. A finding that depends on early-Olympic data needs a sensitivity analysis: re-run excluding pre-1948 and see if the conclusion holds.

These aren’t reasons to abandon the analysis. They’re reasons to caveat the result and think hard about what you’d need to address them. A good write-up names its limitations explicitly.

A quick checklist for your own analysis

Before you call an analysis done:

  • Do I know the sample size at every filtering step?
  • Did I plot every variable I tested before I tested it?
  • Did I report effect sizes, not just p-values?
  • Did I name the assumptions of the test I used?
  • Could I re-run this from the raw CSV with one command?
  • Could a colleague reproduce my numbers from the script alone?

If any answer is “no,” the analysis isn’t finished. It might still be useful — a quick exploration is fine — but don’t ship a finding until the checklist is green.

What you’ve built

You loaded a real dataset, cleaned it, computed descriptive statistics, visualized distributions, ran a hypothesis test, fit a curve, and wrote a script that does it all reproducibly. That’s a complete numerical workflow. Every analysis you’ll do for the rest of your career has roughly this shape: load, clean, summarize, visualize, test, model, write up.

The exact libraries change. The shape doesn’t. Whether you’re working with Olympic records, climate data, financial returns, or A/B test outcomes, the steps are the same and the discipline is the same. The hard part isn’t memorizing function signatures — those are an autocomplete away. The hard part is asking the right question, choosing the right test, and being honest about what your numbers do and don’t say.

This concludes Module 8 — numerical Python. You now have NumPy, pandas, matplotlib, seaborn, and SciPy in your toolkit, plus the discipline to use them in a structured way. Module 9 picks up with scikit-learn and the move from descriptive analysis into predictive machine learning. Same data, different question: instead of “what does it look like?” we ask “what can we predict?”

Search