scikit-learn: the standard ML library tour

We’re now at module 9, which is about machine learning. If you’ve read every lesson up to this point, you have the data engineering chops — you know how to wrangle a DataFrame, build an ETL, and visualize the result. ML is what some people do with that data afterwards. It’s not a replacement for clean pipelines and good SQL; it’s a tool that, once in a while, lets you predict something useful from the patterns in your data.

The first three lessons of this module are the opener: the standard library (this one), the part that actually moves the needle (lesson 50, feature engineering), and the model family that wins on tabular data (lesson 51, trees). Everything else — neural networks, NLP, recommendation systems — comes later or is its own course. We start with scikit-learn because it’s where almost every Python ML career begins, and where most of them stay for tabular work.

What scikit-learn is, in 2026

scikit-learn has been around since 2007. It’s on the 1.x line, which means the API is stable and the company you’re working at probably has it pinned somewhere in a requirements.txt. It’s not the fastest, it’s not the deepest, and it doesn’t do GPUs. It is, however, the most consistent API in all of data science, and it’s the library that taught a generation of practitioners what an estimator is supposed to look like.

You install it the usual way:

uv add scikit-learn

And the import is sklearn, not scikit-learn. Yes, that’s annoying. It’s been that way for nearly two decades and nobody is going to change it.

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

The fit/predict pattern

Here is the entire mental model for scikit-learn. Every model — and there are dozens — exposes the same two methods:

model.fit(X, y)         # learn from data
model.predict(X_new)    # predict on new data

That’s it. A linear regression, a random forest, a support vector machine, a k-nearest neighbors classifier — all of them look like this. Once you’ve used one estimator, you’ve used all of them. Switching from logistic regression to gradient boosting is a one-line change.
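To make the "one-line change" concrete, here is a minimal sketch. The synthetic data from make_classification is an assumption standing in for whatever dataset you actually have; the point is only that both models go through identical calls.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

fitted = {}
for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier(random_state=0)):
    model.fit(X, y)                 # same call for every model family
    preds = model.predict(X[:5])    # same call again
    fitted[type(model).__name__] = preds
    print(type(model).__name__, preds)
```

Swapping the model means changing the constructor and nothing else.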

Classifiers usually also expose .predict_proba(X_new), which returns class probabilities instead of hard class labels. Most of the time, the probabilities are what you actually want:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)              # array of 0/1
probabilities = model.predict_proba(X_test)       # array of (P(0), P(1)) pairs

The shape convention is just as consistent: X is two-dimensional with shape (n_samples, n_features) — a NumPy array or a pandas DataFrame, both work. y is one-dimensional with length n_samples. If you have a single feature, you still need to reshape it to a column: X.reshape(-1, 1). Forgetting this gives you the famous “Reshape your data either using array.reshape(-1, 1) if your data has a single feature” error, which every beginner sees at least once.
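A small sketch of the single-feature case, using made-up numbers where y is exactly 2x. It shows the reshape that works and catches the error you get without it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])    # one feature, shape (4,)
y = np.array([2.0, 4.0, 6.0, 8.0])    # toy target: y = 2x

X = x.reshape(-1, 1)                  # shape (4, 1): one column, four rows
model = LinearRegression().fit(X, y)
print(X.shape, model.coef_)

try:
    LinearRegression().fit(x, y)      # 1-D X is rejected
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])
```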

Transformers — things that change X rather than predict from it — follow a slightly different pattern:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                # learn the means and standard deviations
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Or, equivalently:
X_train_scaled = scaler.fit_transform(X_train)

The crucial detail: you call .fit on the training data only, then .transform everything. Fitting on test data is a leak. We’ll come back to this in lesson 50.
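If you want to see what "fit on training data only" means in practice, here is a tiny sketch with hand-picked numbers (assumed for illustration). The scaler's learned mean comes from the training rows, and the test row is scaled with those training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler().fit(X_train)      # statistics come from X_train only
X_test_scaled = scaler.transform(X_test)    # test row reuses the training mean and std

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```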

The model families

Let me give you the map. The taxonomy isn’t strict, but knowing which model lives in which submodule speeds up the docs hunt enormously.

Linear models (sklearn.linear_model). The classics: LinearRegression for regression, LogisticRegression for classification (despite the name). Ridge and Lasso add L2 and L1 regularization respectively, which is what you want when you have lots of features or correlated features. Linear models are fast, interpretable, and shockingly competitive on many real datasets — especially after good feature engineering.

from sklearn.linear_model import Ridge, Lasso, LogisticRegression

ridge = Ridge(alpha=1.0)        # alpha controls regularization strength
lasso = Lasso(alpha=0.1)        # Lasso also does feature selection (sets some weights to zero)
logreg = LogisticRegression(C=1.0, max_iter=1000)  # C is inverse regularization, watch out
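The comment about Lasso zeroing out weights is easy to check. This is a sketch on synthetic regression data (an assumption: 10 features, of which only 3 carry signal), comparing how many coefficients each model sets exactly to zero.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 3 of the 10 features actually influence y.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=1.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

n_zero_ridge = int((ridge.coef_ == 0).sum())
n_zero_lasso = int((lasso.coef_ == 0).sum())
print("ridge zeros:", n_zero_ridge, "lasso zeros:", n_zero_lasso)
```

Ridge shrinks weights toward zero but rarely lands on it; Lasso drops features outright.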

Tree-based models (sklearn.tree, sklearn.ensemble). DecisionTreeClassifier is the building block. RandomForestClassifier and GradientBoostingClassifier are the workhorses. In practice, for serious gradient boosting, you reach for XGBoost or LightGBM instead of scikit-learn’s classic GradientBoostingClassifier: they’re much faster on large datasets and usually at least as accurate. We’ll do that in lesson 51. (scikit-learn’s newer HistGradientBoostingClassifier closes much of that gap if you’d rather not add a dependency.) But scikit-learn’s random forest is genuinely good, and the API is identical, so it’s a reasonable default when you want a strong baseline in three lines.

Instance-based (sklearn.neighbors). KNeighborsClassifier predicts based on the closest k training points. Simple, sometimes surprisingly effective, slow at predict time on large datasets.

Kernel methods (sklearn.svm). SVC and SVR for classification and regression. Beautiful theory, used to dominate, now mostly displaced by trees and neural nets in tabular and image work respectively. Still useful for small-to-medium datasets and when you want a smooth decision boundary.

Neural networks (sklearn.neural_network). MLPClassifier exists but is limited — no GPU, no real architecture flexibility. If you actually need a neural net, use PyTorch. The MLP in scikit-learn is fine for a first sketch.

Clustering, dimensionality reduction, and the rest live in sklearn.cluster (KMeans, DBSCAN), sklearn.decomposition (PCA, NMF), and sklearn.manifold (TSNE and other embedding methods; UMAP itself is a separate package, umap-learn). They follow the same .fit / .transform / .fit_predict pattern.
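A quick sketch of that pattern on unsupervised estimators, using synthetic blobs (an assumption for illustration). PCA is a transformer, so it gets fit_transform; KMeans assigns cluster labels, so it gets fit_predict.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data: 150 points in 4 dimensions around 3 centers.
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

X_2d = PCA(n_components=2).fit_transform(X)                              # shape (150, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # shape (150,)

print(X_2d.shape, labels.shape)
```

Note there is no y anywhere: these estimators learn from X alone.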

Preprocessing

Most models expect numbers in a sensible range. scikit-learn’s preprocessing module has the standard tools:

from sklearn.preprocessing import (
    StandardScaler,    # mean 0, std 1 — the default for most things
    MinMaxScaler,      # rescale to [0, 1] — for bounded inputs
    RobustScaler,      # uses median and IQR — for data with outliers
    OneHotEncoder,     # turn categories into 0/1 columns
    OrdinalEncoder,    # turn categories into integer codes (use only if there's order)
    KBinsDiscretizer,  # bin continuous values into intervals
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

The thing to know: tree-based models don’t need scaling. They split on thresholds, and multiplying a feature by 1000 just multiplies the chosen threshold by 1000 without changing which rows land on each side. Regularized linear models, neural nets, KNN, and SVMs all need scaling. Without it, the feature with the biggest numeric range dominates everything.
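You can check the claim about trees yourself. The sketch below uses made-up integer features and a made-up labeling rule (both assumptions), trains one tree on the raw data and one on a copy where the first column is multiplied by 1000, and compares predictions on new rows.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(300, 3)).astype(float)
y = (X[:, 0] + X[:, 1] > 100).astype(int)      # toy rule, for illustration only

X_big = X.copy()
X_big[:, 0] *= 1000                            # blow up the range of one feature

X_new = rng.integers(0, 100, size=(50, 3)).astype(float)
X_new_big = X_new.copy()
X_new_big[:, 0] *= 1000

tree_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_big, y)

same_predictions = np.array_equal(tree_a.predict(X_new), tree_b.predict(X_new_big))
print(same_predictions)
```

Try the same experiment with KNeighborsClassifier and the predictions will generally diverge, because distances are dominated by the inflated column.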

For categorical features, the rule of thumb is: low cardinality (under ~20 categories) gets OneHotEncoder; high cardinality needs target encoding or hashing tricks, which we’ll cover in lesson 50. Never use OrdinalEncoder on a feature that has no natural order — ["red", "green", "blue"] becoming [0, 1, 2] will lie to your model about distances between categories.
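Here is what the two encoders actually produce for the color example. One detail worth knowing: by default, OrdinalEncoder assigns codes in sorted category order, so the integers are alphabetical rather than in order of appearance.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"]])

# Categories are sorted: blue=0, green=1, red=2. The order carries no meaning.
ordinal = OrdinalEncoder().fit_transform(colors)

# One 0/1 column per category; every pair of colors is equally far apart.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

print(ordinal.ravel())
print(one_hot)
```

With the ordinal codes, a linear model or KNN sees "red" as twice as far from "blue" as "green" is, which is the lie the paragraph above warns about.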

Train/test split

You don’t evaluate a model on the data it learned from. Doing so gives you the training error, which always looks great and tells you nothing about the future. The standard split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,        # set this for reproducibility
    stratify=y,             # preserve class proportions in both splits — for classification
)

The random_state=42 thing is a meme but it’s also the right move: you want the same split every time you run the script, otherwise debugging is impossible. The stratify=y argument matters more than people realize — without it, on imbalanced datasets you can end up with a test set that has, say, 2% of the minority class while your training set has 5%, and your evaluation will be off.
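A sketch of what stratify=y buys you, on a made-up label vector with exactly 5% positives (an assumption for illustration). Both splits keep that 5% rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 50 + [0] * 950)     # 5% minority class
X = np.arange(1000).reshape(-1, 1)     # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```

Drop stratify=y and rerun with a few different random_state values to see how much the test-set rate wanders.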

For time-series data, don’t use this. Use TimeSeriesSplit or just slice by date — random splits leak the future into the past.
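A minimal sketch of TimeSeriesSplit, assuming the rows are already sorted by time (here, 12 placeholder rows). In every fold, the training indices come strictly before the test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)       # 12 rows, assumed sorted oldest to newest

folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in folds:
    # The training window grows; the test window always lies after it.
    print("train:", train_idx, "test:", test_idx)
```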

Cross-validation

A single train/test split is noisy. Cross-validation does it k times, averaging the result:

from sklearn.model_selection import cross_val_score

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores)            # array of 5 scores
print(scores.mean(), "+/-", scores.std())

cv=5 is the standard (and scikit-learn’s default). When you pass an integer and the estimator is a classifier, you get stratified k-fold automatically. For more control, such as shuffling, use StratifiedKFold directly and pass it as cv.
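Passing a splitter object as cv looks like this. It is a sketch on synthetic imbalanced data (an assumption); the shuffle and random_state arguments are the kind of control you don't get from a bare integer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly an 80/20 class balance.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(scores.mean(), "+/-", scores.std())
```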

For final hyperparameter tuning, GridSearchCV and RandomizedSearchCV wrap cross-validation around a grid or random sample of parameter combinations:

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
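RandomizedSearchCV has the same shape, but samples from distributions instead of walking a grid. The sketch below assumes synthetic data and a log-uniform range for C between 0.001 and 100; both are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=10,              # number of sampled parameter settings
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The cost is fixed by n_iter rather than by the size of the grid, which matters once you tune more than one or two parameters.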

In 2026, for serious hyperparameter tuning, most teams use Optuna instead, which does Bayesian search and handles the early-stopping logic better. But GridSearchCV is fine for small problems.

Pipelines: the production pattern

Here’s the part that makes scikit-learn worth learning even if you eventually move to other libraries. A Pipeline chains preprocessing steps with a model into a single object that supports .fit and .predict:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
pipe.predict(X_test)

Why this matters: the pipeline guarantees that the scaler is fit on training data only, then applied to test data. You can’t accidentally leak. It also means the whole thing serializes as one object — pickle.dump(pipe, ...) and you have a deployable artifact.

For different preprocessing on different columns, ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ["age", "income", "tenure"]
categorical_features = ["country", "plan"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)

handle_unknown="ignore" is important — it lets your pipeline survive new categories appearing in production data. Without it, an unseen value crashes predict.
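To see the difference, here is a sketch with made-up country codes: the encoder is fit on three values and then asked to transform one it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["US"], ["DE"], ["FR"]])
new = np.array([["BR"]])                       # category not present during fit

tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
row = tolerant.transform(new).toarray()        # unseen category becomes all zeros
print(row)

strict = OneHotEncoder().fit(train)            # default is handle_unknown="error"
try:
    strict.transform(new)
    crashed = False
except ValueError as err:
    crashed = True
    print("ValueError:", err)
```

An all-zero row means "none of the categories I know", which the downstream model can at least handle.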

The pipeline-plus-cross-validation combo is the production pattern. Wrap your preprocessing and your model in a Pipeline, pass the pipeline to cross_val_score or GridSearchCV, and you’ve got an honest evaluation that won’t lie about what your model can do on unseen data.

scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
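Tuning through a pipeline works the same way, with one naming rule: parameters are addressed as step name, two underscores, parameter name. A sketch, assuming synthetic data and the step names "scale" and "clf":

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # "clf__C" = parameter C of step "clf"
    cv=5,
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Because the scaler lives inside the pipeline, it is refit on the training portion of every fold, so the search itself stays leak-free.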

Picking the right metric

A note before we close: cross_val_score defaults to accuracy for classifiers, and accuracy is almost never what you want on real data. If 5% of your users churn and 95% don’t, the model that always predicts “won’t churn” gets 95% accuracy and is useless. Pick the metric that matches your problem:

roc_auc — when you care about ranking probabilities, not specific thresholds. A sensible first choice for most binary classification.
average_precision — better than ROC AUC when classes are heavily imbalanced. Focuses on the precision-recall trade-off.
f1 — when you need a single number balancing precision and recall.
neg_log_loss — when probabilities themselves matter, not just rankings (e.g. for downstream calibrated decisions).
neg_mean_squared_error, neg_mean_absolute_error, r2 — for regression. The negation is a scikit-learn convention so that “higher is better” works uniformly.
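The negated regression scorers trip people up, so here is a sketch of turning them back into an ordinary error. The synthetic data from make_regression and the Ridge model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

neg_mse = cross_val_score(
    Ridge(alpha=1.0), X, y, cv=5, scoring="neg_mean_squared_error"
)
rmse = np.sqrt(-neg_mse)      # flip the sign back, then take the root

print(neg_mse)                # all values are <= 0
print(rmse.mean())
```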

from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipe, X, y, cv=5, scoring="average_precision")

For classification with imbalanced classes — the common case in fraud, churn, conversion — start with average_precision rather than accuracy. Your model will look worse on paper, and that’s the point: now your numbers reflect reality.
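The churn example can be reproduced with DummyClassifier, scikit-learn's built-in "always predict the majority class" baseline. This sketch uses a made-up label vector with 5% positives and a placeholder feature.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, average_precision_score

y = np.array([1] * 50 + [0] * 950)     # 5% churners
X = np.zeros((1000, 1))                # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

acc = accuracy_score(y, baseline.predict(X))
ap = average_precision_score(y, baseline.predict_proba(X)[:, 1])

print("accuracy:", acc)                # looks great
print("average precision:", ap)        # equals the 5% base rate
```

Any real model has to beat this baseline's average precision before it is worth discussing.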

Persisting and reloading

Once you have a pipeline that works, save it. The conventional tool is joblib, which handles NumPy arrays more efficiently than vanilla pickle:

import joblib
joblib.dump(pipe, "model.pkl")

# Later, in a serving process:
pipe = joblib.load("model.pkl")
predictions = pipe.predict(X_new)

Pin your scikit-learn version in pyproject.toml: pickled models are not guaranteed to load, or to behave the same way, across scikit-learn versions, even minor ones. In production, the standard pattern is to ship the model file together with the environment that trained it: same Python version, same library versions, no surprises.
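One way to make a version mismatch fail loudly is to store the library version next to the model. This is a sketch, not a scikit-learn convention: the bundle dictionary and its keys are hypothetical, and a temporary directory stands in for wherever you keep artifacts.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Hypothetical bundle layout: the keys are illustrative.
bundle = {"model": model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(bundle, path)
    loaded = joblib.load(path)

# In a serving process, refuse to run if the versions differ.
if loaded["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("model was trained with a different scikit-learn version")

same_predictions = bool((loaded["model"].predict(X) == model.predict(X)).all())
print(same_predictions)
```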

That’s the lesson. Estimators are .fit + .predict. Transformers are .fit + .transform. Wrap both in a pipeline so you don’t leak. Use cross-validation to know what you actually have, and pick a metric that reflects your problem. Next lesson — feature engineering — is where most of the actual gains hide. Lesson 51 — tree-based models, properly — is where most tabular ML problems get solved.