There’s a recurring scene in data work: a junior engineer is asked to model some tabular dataset, they reach for XGBoost, they tune for an afternoon, the model gets 0.847 AUC, and they ship. Then a senior glances at the problem, fits a logistic regression with a couple of well-chosen features, gets 0.831 AUC, and asks: are those sixteen extra thousandths of AUC worth a model whose decisions you can’t explain to the compliance team?
The answer is sometimes yes. Often it’s no. And the only way you’ll know is if you actually have the linear baseline to compare against. This lesson is about the family of models — linear regression, logistic regression, and their regularized cousins — that should be the first thing you fit on any tabular problem, and that surprisingly often end up being the thing you ship.
The shape of a linear model
Linear models predict by taking a weighted sum of features. That’s it. For a continuous target y, linear regression says:
y_hat = w0 + w1*x1 + w2*x2 + ... + wn*xn
Training is finding the weights w0..wn that minimize squared error on the training data. There’s a closed-form solution (the normal equations, which involve a matrix inversion), and scikit-learn’s LinearRegression solves the equivalent least-squares problem directly, so on a well-behaved feature matrix it returns almost instantly. No iterations, no learning rates.
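As a rough check on the closed-form claim, the sketch below solves the same least-squares problem with NumPy and compares the weights to LinearRegression. The synthetic data, the seed, and the variable names are assumptions made for illustration. np.linalg.lstsq is used instead of an explicit matrix inverse because it is the numerically safer route to the same solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): 3 features, known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept w0 is learned like any other weight.
X1 = np.column_stack([np.ones(len(X)), X])

# Direct least-squares solve; no iterations, no learning rate.
w_closed, *_ = np.linalg.lstsq(X1, y, rcond=None)

model = LinearRegression().fit(X, y)
w_sklearn = np.concatenate([[model.intercept_], model.coef_])

print("direct solve :", w_closed.round(3))
print("scikit-learn :", w_sklearn.round(3))
print("match:", np.allclose(w_closed, w_sklearn))
```

Both routes should agree to floating-point precision, since they minimize the same squared-error objective.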
For binary classification, logistic regression wraps that same weighted sum in a sigmoid:
p(y=1 | x) = sigmoid(w0 + w1*x1 + ... + wn*xn)
Training minimizes log-loss (cross-entropy). No closed form; it iterates, but on tabular data of any reasonable size it converges in seconds. Despite the name, logistic regression is a classification algorithm: the “regression” is on the log-odds of the positive class, which the sigmoid then maps to a probability, not on a continuous target.
Both models share the same flavor: a vector of coefficients you can read, one per feature.
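To make the “weighted sum wrapped in a sigmoid” concrete, here is a minimal sketch that rebuilds the predicted probabilities by hand from the fitted coefficients. The dataset comes from make_classification and the names are illustrative assumptions, not part of the lesson’s running example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
clf = LogisticRegression().fit(X, y)


def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# The weighted sum w0 + w1*x1 + ... + wn*xn, computed for every row.
z = clf.intercept_[0] + X @ clf.coef_[0]
p_manual = sigmoid(z)

# scikit-learn's own probability for the positive class.
p_sklearn = clf.predict_proba(X)[:, 1]

print("match:", np.allclose(p_manual, p_sklearn))
```

The point of the exercise is that nothing is hidden: one intercept, one coefficient per feature, and a fixed squashing function.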
The interpretability advantage
Here’s what nothing else gives you. After fitting, you can read off model.coef_ and say:
A 1-unit increase in feature X is associated with a coef_X-unit change in the target (linear regression), or with a coef_X change in the log-odds of the positive class (logistic regression), holding all other features constant.
That sentence — “holding all other features constant” — is doing a lot of work, but it’s still a meaningful statement. Try saying anything that direct about a random forest. Or about a neural network. You can compute SHAP values, sure, but those are post-hoc explanations of a model whose internal decisions are still opaque.
In regulated industries — credit scoring, medical risk, anywhere a regulator might ask “why did your model deny this person” — interpretability isn’t a nice-to-have. It’s the deal. Logistic regression is still the dominant model in credit underwriting in 2026, and not because the data scientists haven’t heard of XGBoost.
A regression example, three ways
Let’s set up a small regression problem and fit three flavors of linear model:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def fit_and_report(name, model):
    # Scale inside the pipeline so the scaler is fit on the training split only.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    # Because of the scaler, each coefficient is per standard deviation of its feature.
    coefs = dict(zip(X.columns, pipe.named_steps["model"].coef_.round(3)))
    print(f"{name:8s} RMSE={rmse:.3f} R2={r2:.3f}")
    print(f"  coefs: {coefs}")
fit_and_report("Linear", LinearRegression())
fit_and_report("Ridge", Ridge(alpha=1.0))
fit_and_report("Lasso", Lasso(alpha=0.1))
Note the StandardScaler in front. Regularization penalizes coefficient magnitudes, which only makes sense if features are on comparable scales. Forgetting to scale before Ridge or Lasso is a top-three beginner mistake.
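When you run this, expect Linear and Ridge to land almost on top of each other: with roughly 16,000 training rows, alpha=1.0 barely moves the coefficients. Lasso at alpha=0.1 is a much heavier penalty on standardized features, so expect its RMSE to be somewhat worse and several of its coefficients to be exactly zero; it’s done feature selection for you. The general pattern is what matters. LinearRegression uses whatever weights minimize training MSE, no matter how large. Ridge shrinks them all toward zero. Lasso drives some to exactly zero.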
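One caveat when reading those coefficients: because the pipeline standardizes first, each coefficient is per standard deviation of its feature, not per raw unit. A minimal sketch of converting back to raw units follows. It uses synthetic data and hypothetical variable names rather than the housing set, so it runs without a download.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features, deliberately put on very different scales.
X, y = make_regression(n_samples=300, n_features=3, noise=5.0, random_state=0)
X = X * np.array([1.0, 100.0, 0.01])

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipe.fit(X, y)
scaler = pipe.named_steps["scale"]
model = pipe.named_steps["model"]

# Coefficients as fitted: change in y per one standard deviation of each feature.
coef_per_sd = model.coef_

# Divide by each feature's standard deviation to get change in y per raw unit.
coef_per_unit = coef_per_sd / scaler.scale_

# The intercept shifts too, because the scaler also subtracted the means.
intercept_raw = model.intercept_ - np.sum(coef_per_unit * scaler.mean_)

# Sanity check: raw-unit weights reproduce the pipeline's predictions.
pred_manual = intercept_raw + X @ coef_per_unit
print("match:", np.allclose(pred_manual, pipe.predict(X)))
```

Per-standard-deviation coefficients are the better choice for comparing features against each other; per-unit coefficients are the better choice for the “a 1-unit increase in X” sentence.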
Regularization, in one paragraph
Plain linear regression on noisy or correlated features overfits. The coefficients grow large to chase noise. Regularization adds a penalty on coefficient magnitudes to the loss function. Two penalties matter, plus a blend of the two:
- L2 (Ridge): penalty is the sum of squared coefficients. Shrinks all coefficients smoothly. Good when you believe many features matter a little. Closed-form solution; very fast.
- L1 (Lasso): penalty is the sum of absolute coefficients. Drives some coefficients to exactly zero, performing feature selection. Good when you believe most features are noise. No closed form; coordinate descent.
- ElasticNet: a weighted mix of L1 and L2. A sane default when you don’t know which family fits your data.
The strength is controlled by the alpha hyperparameter. Higher alpha = more shrinkage = simpler model. You tune it, usually with cross-validation. scikit-learn ships RidgeCV, LassoCV, and ElasticNetCV that do the CV inside the fit:
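from sklearn.linear_model import RidgeCV, LassoCV
# Scale inside the pipeline so the penalty treats every feature alike.
ridge_cv = Pipeline([
    ("scale", StandardScaler()),
    ("model", RidgeCV(alphas=np.logspace(-3, 3, 50))),
])
ridge_cv.fit(X_train, y_train)
print(f"best alpha = {ridge_cv.named_steps['model'].alpha_}")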
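To see the difference between the penalties rather than take it on faith, the sketch below fits all three on a synthetic problem where only 5 of 20 features carry signal and counts how many coefficients each model sets to exactly zero. The data, the alpha values, and the l1_ratio are illustrative assumptions, not tuned choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# 20 features, only 5 of which actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    # Count coefficients that are exactly zero, not merely small.
    zero_counts[name] = int(np.sum(model.coef_ == 0.0))
    print(f"{name:10s} zeroed {zero_counts[name]:2d} of {X.shape[1]} coefficients")
```

Ridge shrinks but does not eliminate, so its count should be zero; the L1-based models are the ones that drop features outright.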
The bias-variance tradeoff in one paragraph
Underfit models have high bias: they’re too simple to capture the signal, so they’re systematically wrong in the same direction across datasets. Overfit models have high variance: they’re so flexible they latch onto noise, so the same model trained on a different sample looks very different. Regularization shifts a model from low-bias-high-variance toward higher-bias-lower-variance. The sweet spot — the alpha that minimizes total error on held-out data — is what RidgeCV is hunting for.
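The tradeoff is easiest to see as two error curves over alpha. The sketch below uses scikit-learn’s validation_curve on a deliberately awkward synthetic problem (80 rows, 60 features); the dataset shape and the alpha grid are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Few rows, many features: plenty of room to overfit.
X, y = make_regression(
    n_samples=80, n_features=60, n_informative=10, noise=20.0, random_state=0
)

alphas = np.logspace(-3, 4, 15)
pipe = make_pipeline(StandardScaler(), Ridge())

# Cross-validated train and validation scores for every alpha on the grid.
train_scores, val_scores = validation_curve(
    pipe, X, y,
    param_name="ridge__alpha",
    param_range=alphas,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE; flip the sign and average over the folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_alpha = alphas[np.argmin(val_mse)]

for a, tr, va in zip(alphas, train_mse, val_mse):
    print(f"alpha={a:10.3f}  train MSE={tr:10.1f}  validation MSE={va:10.1f}")
print(f"alpha with lowest validation MSE: {best_alpha:.3f}")
```

Training error can only get worse as alpha grows, since the penalty pulls the fit away from the training data. Validation error is the curve to watch: its minimum marks the sweet spot.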
When linear wins
A few situations where you should genuinely expect a regularized linear model to be competitive or better than a tree ensemble:
- Small datasets. With a few hundred rows, trees overfit and you can’t tune your way out. Linear models, with a strong regularization prior, are stable.
- Need interpretability. Medical, legal, financial regulation, anywhere the model’s decision must be defensible. Coefficients are the cleanest explanation tool we have.
- Approximately linear relationships. Rare but not extinct. Some physical processes, some economic relationships, some sensor responses. If your scatter plots look like blobs around a line, a linear model is the honest fit.
- Feature count >> sample count. Genomics is the canonical example: 30,000 features, 200 samples. Trees can’t survive that. Lasso can — the L1 penalty is essentially built for the case “most features are noise.”
- Latency-sensitive serving. A linear model’s inference is one dot product. Microseconds. Trees and ensembles are slower, neural nets slower still.
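The “feature count >> sample count” case from the list above can be sketched on synthetic data. This is a scaled-down stand-in for the genomics example (500 features and 100 training rows instead of 30,000 and 200), and every name and number in it is an assumption made so the block runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 500 features, only 5 informative, and far fewer rows than features.
X, y = make_regression(
    n_samples=150, n_features=500, n_informative=5, noise=5.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0
)

# Unregularized least squares can fit the training rows exactly when p > n.
ols = LinearRegression().fit(X_train, y_train)

# Lasso with a cross-validated alpha, on standardized features.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso.fit(X_train, y_train)

ols_train_r2 = ols.score(X_train, y_train)
ols_test_r2 = ols.score(X_test, y_test)
lasso_test_r2 = lasso.score(X_test, y_test)
n_kept = int(np.sum(lasso.named_steps["lassocv"].coef_ != 0))

print(f"OLS   train R2={ols_train_r2:.3f}  test R2={ols_test_r2:.3f}")
print(f"Lasso test R2={lasso_test_r2:.3f}  features kept={n_kept} of {X.shape[1]}")
```

The unregularized fit memorizes the training rows and generalizes poorly, while the L1 penalty keeps only a small subset of features.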
When linear loses
- Tabular data with strong feature interactions. “Income matters more if you’re under 30” — a single coefficient on income can’t capture that. Trees split on age first, then on income, and get the interaction for free. This is why XGBoost dominates Kaggle tabular competitions.
- Strong nonlinearities. Sigmoid-shaped, sawtooth, or bimodal relationships. You can sometimes fix this with polynomial features (PolynomialFeatures(degree=2) in scikit-learn) or kernel methods, but at that point you’re working harder than just fitting a tree.
- Image, audio, text raw data. Don’t even try linear models on raw pixels or word counts beyond a baseline. That’s what neural networks and transformer embeddings are for. Though linear models on top of pre-computed embeddings? That’s still alive and well in 2026.
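The interaction problem, and the PolynomialFeatures workaround, can be shown with a minimal sketch. The target here is built as a pure product of two features, an extreme case chosen on purpose; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Target depends only on the product x1 * x2, plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)

# One coefficient per feature: no way to express "x1 matters more when x2 is large".
plain = LinearRegression().fit(X, y)

# Adding the x1*x2 column by hand gives the linear model what it was missing.
with_interaction = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X, y)

r2_plain = plain.score(X, y)
r2_interaction = with_interaction.score(X, y)
print(f"plain linear     R2={r2_plain:.3f}")
print(f"with interaction R2={r2_interaction:.3f}")
```

With two features this is cheap. With fifty, the number of pairwise terms grows quadratically, which is the “working harder than just fitting a tree” cost the list mentions.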
Generalized linear models for non-Gaussian targets
Linear regression assumes residuals are roughly normal, with constant variance. For some target types that’s wrong, and forcing it produces silly predictions (negative counts, probabilities outside [0,1]). Generalized linear models (GLMs) extend the linear-weighted-sum to other distributions via a link function:
- Logistic regression: Bernoulli target, logit link. (Already covered.)
- Poisson regression: count target (number of events), log link. Use PoissonRegressor in scikit-learn.
- Gamma regression: strictly positive continuous target with right-skewed errors (insurance claim sizes, time-to-event), log link. GammaRegressor.
- Tweedie regression: for targets that mix a point mass at zero with a continuous tail (insurance pure premium). TweedieRegressor.
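from sklearn.linear_model import PoissonRegressor
# Target is number of website visits per day.
# visit_counts_train is a placeholder for your own non-negative count target,
# aligned row-for-row with X_train.
model = PoissonRegressor(alpha=0.1)
model.fit(X_train, visit_counts_train)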
These keep the interpretability of linear models while respecting the actual statistics of your target. If your target is a count, use Poisson, not least-squares. The predictions will be non-negative by construction, the loss will respect that variance grows with the mean, and the coefficients will be on the log scale (a 1-unit feature increase multiplies the predicted count by exp(coef)).
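A self-contained sketch of the exp(coef) reading, on simulated counts where the true coefficients are known. The simulation setup and names are assumptions; alpha=0.0 turns the penalty off so the fitted values can be compared directly with the truth.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Simulate counts whose log-mean is linear in the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
true_coef = np.array([0.5, -0.3])
y = rng.poisson(np.exp(1.0 + X @ true_coef))

model = PoissonRegressor(alpha=0.0).fit(X, y)

# On the log scale a coefficient adds; on the count scale it multiplies.
multipliers = np.exp(model.coef_)
print("fitted coefficients:", model.coef_.round(3))
print("count multiplier per 1-unit increase:", multipliers.round(3))

# Check the multiplicative reading: move feature 0 by one unit, hold feature 1 fixed.
base = model.predict(np.array([[0.0, 0.0]]))[0]
bumped = model.predict(np.array([[1.0, 0.0]]))[0]
ratio = bumped / base
print("predicted ratio:", round(ratio, 3))

# Predictions are positive by construction.
pred = model.predict(X)
print("smallest prediction:", pred.min().round(4))
```

The ratio of the two predictions equals exp of the first coefficient, which is the sentence you would hand to a stakeholder: “each extra unit of this feature multiplies expected visits by about that factor.”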
A classification example, briefly
For binary classification the same pattern holds with LogisticRegression. The C parameter is the inverse of regularization strength — small C means heavy regularization. By default scikit-learn uses L2 regularization; pass penalty="l1" (with solver="liblinear" or "saga") for Lasso-style logistic, or penalty="elasticnet" (with solver="saga") for the mixed version.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
clf = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegressionCV(Cs=20, cv=5, max_iter=5000)),
])
clf.fit(X_train, y_train)
print(f"Test accuracy = {clf.score(X_test, y_test):.3f}")
LogisticRegressionCV does cross-validated tuning of C automatically. On Wisconsin breast cancer — 30 features, 569 samples — this routinely scores 0.97+ accuracy and is fully interpretable. Trees do not meaningfully beat it on this dataset, and a linear model with explicit coefficients is what you’d actually want a clinician to inspect.
Multiclass via the same machinery
LogisticRegression extends to multiclass without any change to your code. Modern scikit-learn fits a single softmax-style (multinomial) model by default; the older multi_class parameter is deprecated, and if you specifically want one binary classifier per class (one-vs-rest) you wrap the estimator in OneVsRestClassifier. Either way the fit and predict calls look the same as in the binary case: pass a y with more than two unique values and let it figure out the rest. The interpretability story degrades a little: you now have one coefficient per feature, per class, so the explanation becomes “feature X pushes class A up, class B down, class C up” rather than a single number. Still readable, just denser.
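A short sketch of what “one coefficient per feature, per class” looks like in practice, using the iris dataset bundled with scikit-learn. The choice of dataset and the printing format are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target  # 4 features, 3 classes

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# coef_ has one row per class and one column per feature.
coefs = clf.named_steps["logisticregression"].coef_
print("coef_ shape:", coefs.shape)

for class_name, row in zip(iris.target_names, coefs):
    weights = ", ".join(
        f"{feat}={w:+.2f}" for feat, w in zip(iris.feature_names, row)
    )
    print(f"{class_name:12s} {weights}")

# Each row of predict_proba is a distribution over the three classes.
probs = clf.predict_proba(X)
print("rows sum to 1:", np.allclose(probs.sum(axis=1), 1.0))
```

Reading across a row tells you what pushes one class up; reading down a column tells you how a single feature trades the classes off against each other.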
The practical pattern
Here’s the workflow I’d argue for on any new tabular problem:
- Fit a regularized linear baseline (RidgeCV for regression, LogisticRegressionCV for classification) on properly scaled features. Note the cross-validated score.
- Fit a default-settings tree ensemble (XGBoost, LightGBM, or RandomForest) on the same data. Note the score.
- Compare. If the tree beats the linear model by 1-2%, ship the linear model. The interpretability, the inference speed, the smaller failure surface — those are worth two points of accuracy in almost any business context outside Kaggle.
- If the tree beats the linear model by 10%+, the problem genuinely has nonlinearities or interactions and you should invest in the tree path: tune it, evaluate it carefully, plan for the operational cost.
The bias here is on purpose. Defaulting to the simpler model is a hedge against the model failure modes you can’t see in your offline evaluation: train/serve skew, distribution drift, feature pipeline bugs that change behavior in ways that complex models hide. A logistic regression that’s wrong is wrong in ways you can read off the coefficients. A 1000-tree gradient-boosted ensemble that’s wrong is a debugging session that ruins your week.
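The workflow above can be sketched end to end on the breast cancer data. RandomForestClassifier stands in for the tree ensemble so the block needs nothing beyond scikit-learn. The recommend helper is hypothetical: it reads the lesson’s “1-2%” and “10%+” thresholds as absolute accuracy points, which is an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Step 1: regularized linear baseline on scaled features.
linear = make_pipeline(
    StandardScaler(), LogisticRegressionCV(Cs=10, cv=5, max_iter=5000)
)
# Step 2: default-settings tree ensemble on the same data.
tree = RandomForestClassifier(random_state=0)

linear_score = cross_val_score(linear, X, y, cv=5).mean()
tree_score = cross_val_score(tree, X, y, cv=5).mean()


def recommend(linear_score, tree_score):
    """Hypothetical helper: the rule of thumb, with gaps in absolute accuracy."""
    gap = tree_score - linear_score
    if gap <= 0.02:
        return "ship the linear model"
    if gap >= 0.10:
        return "invest in the tree ensemble"
    return "judgment call"


# Step 3: compare.
print(f"linear baseline accuracy = {linear_score:.3f}")
print(f"tree ensemble accuracy   = {tree_score:.3f}")
print("recommendation:", recommend(linear_score, tree_score))
```

The lesson leaves the middle band between two and ten points open, so the helper returns “judgment call” there rather than inventing a rule for it.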
Next lesson: how to systematically search the hyperparameter space when you do reach for the heavier model.
References: scikit-learn linear models documentation (scikit-learn.org/stable/modules/linear_model.html), retrieved 2026-05-01.