Pre-trained models + transfer learning + Hugging Face

In 2017 a deep learning project meant downloading ImageNet, renting four GPUs for a week, and training a ResNet from scratch. In 2026 it means typing from_pretrained("...") and being done in twenty minutes. The shift isn’t a small optimization. It’s the entire shape of the field.

The reason is transfer learning. Somebody else already burned a million GPU-hours teaching a model to understand images, or English, or code. You take that model, swap a few layers near the output, and fine-tune on your tiny dataset. The lower layers of the network — the ones that learned to detect edges or syntax tokens — already know things that transfer for free. You’re not teaching the model to see; you’re teaching it which categories you care about.

Lesson 57 walked through training a small network end-to-end. This lesson is the version that actually maps to your day job.

The intuition: layers learn at different abstraction levels

A trained convolutional network’s early layers respond to edges and color blobs. The middle layers respond to textures and parts of objects. The late layers respond to whole concepts — “this looks like a dog, this looks like a car.” If you visualize the filters of layer 1 of any image classifier trained on a real-world dataset, they look basically the same: oriented edges, color gradients. Generic visual features.

The same is true for language models. The early transformer layers handle local syntax, common bigrams, morphology. Middle layers track sentence structure. The top layers carry task-specific signal — for a model trained on Wikipedia, that’s “is this an entity, what kind of entity, how does it relate to other entities.”

Transfer learning exploits the asymmetry: generic-feature layers don’t need re-training when you switch tasks, but task-specific layers do. So you keep the body of the model and replace the head.

In code, the pattern is roughly:

Load a pre-trained model.
Optionally freeze the lower layers (requires_grad = False).
Replace the final classifier with one sized for your number of classes.
Train at a low learning rate so you don’t blow away what’s already there.
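
The four steps above can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the tiny body and head below are hypothetical stand-ins for a real pre-trained network, the data is random, and the 3-class task is made up.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 1. "Load" a pre-trained model. Hypothetical stand-in: a small body plus a
#    1000-class head, playing the role of a real checkpoint.
body = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
model = nn.Sequential(body, nn.Linear(64, 1000))

# 2. Freeze the body so its weights receive no gradient updates.
for p in body.parameters():
    p.requires_grad = False

# 3. Replace the final classifier with one sized for our (made-up) 3 classes.
model[1] = nn.Linear(64, 3)

# 4. Train at a low learning rate, handing the optimizer only trainable weights.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)

# One training step on random toy data.
x = torch.randn(8, 32)
y = torch.randint(0, 3, (8,))
body_before = [p.clone() for p in body.parameters()]

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()

# The frozen body is untouched; only the new head (64 * 3 + 3 weights) can move.
body_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(body_before, body.parameters())
)
n_trainable = sum(p.numel() for p in trainable)
print(body_unchanged, n_trainable)
```

Freezing is optional, as the list says: with a low enough learning rate you can leave the whole body trainable and let it drift slightly toward the new task.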

Hugging Face: the model hub

In 2026, when somebody says “the model” they almost always mean “a checkpoint on Hugging Face.” The Hub hosts well over a million models: every flavor of BERT, every Llama variant, every fine-tune of Stable Diffusion, plus audio and vision models. The main Python entry point is the transformers library, together with its companions datasets, tokenizers, accelerate, and peft.

The boilerplate is identical across model families:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2,
)

That’s it. The Auto* classes look at the config of the checkpoint and instantiate the right concrete class. You can swap distilbert-base-uncased for roberta-base or bert-base-multilingual-cased and not change a line of training code. This is the gift of the library.
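
To make the dispatch idea concrete, here is a toy picture of it. Everything in this sketch is hypothetical (the Toy* classes, the registry, the fake configs). It is not the transformers implementation, only the shape of the idea: read the model_type field from the checkpoint's config, then look up the matching class.

```python
# Hypothetical stand-ins for concrete model classes (not the real library classes).
class ToyDistilBert:
    def __init__(self, num_labels):
        self.num_labels = num_labels


class ToyRoberta:
    def __init__(self, num_labels):
        self.num_labels = num_labels


# Registry keyed by the "model_type" field that a checkpoint's config carries.
REGISTRY = {"distilbert": ToyDistilBert, "roberta": ToyRoberta}

# Pretend configs; a real checkpoint ships this information in its config.json.
FAKE_CONFIGS = {
    "distilbert-base-uncased": {"model_type": "distilbert"},
    "roberta-base": {"model_type": "roberta"},
}


def toy_from_pretrained(model_id, num_labels):
    """Pick the concrete class from the config, then instantiate it."""
    config = FAKE_CONFIGS[model_id]
    cls = REGISTRY[config["model_type"]]
    return cls(num_labels=num_labels)


# Same calling code, different concrete class depending on the checkpoint.
for model_id in ("distilbert-base-uncased", "roberta-base"):
    model = toy_from_pretrained(model_id, num_labels=2)
    print(model_id, "->", type(model).__name__)
```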

A real fine-tune: DistilBERT on a sentiment dataset

Let’s fine-tune a sentiment classifier on the IMDB reviews dataset. Two classes, ~50,000 reviews. The full dance, end to end.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
import numpy as np
import evaluate

ds = load_dataset("imdb")
print(ds)
# DatasetDict({ train: 25000, test: 25000, unsupervised: 50000 })

# Sub-sample so the example runs in minutes, not hours
ds["train"] = ds["train"].shuffle(seed=42).select(range(2000))
ds["test"]  = ds["test"].shuffle(seed=42).select(range(500))

model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds_tok = ds.map(tokenize, batched=True)
collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2,
)

accuracy = evaluate.load("accuracy")

def metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return accuracy.compute(predictions=preds, references=labels)

args = TrainingArguments(
    output_dir="./distilbert-imdb",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tok["train"],
    eval_dataset=ds_tok["test"],
    processing_class=tokenizer,  # this argument was named tokenizer= before transformers 4.46
    data_collator=collator,
    compute_metrics=metrics,
)

trainer.train()
trainer.evaluate()

On a free Colab T4 GPU this trains in maybe four minutes. You’ll get accuracy in the high 80s out of the box. To see what’s happening, walk through the parts:

load_dataset("imdb") downloads the dataset from the Hub. The first call caches it locally; subsequent calls load from that cache.
tokenizer(...) turns text into input_ids and attention_mask. max_length=256 truncates long reviews — you can go to 512 for better accuracy at higher cost.
DataCollatorWithPadding pads each batch to the longest sequence in that batch rather than the global max. Cheap speedup.
AutoModelForSequenceClassification swaps the pre-trained DistilBERT’s masked-language-modeling head for a randomly initialized 2-class classifier head. The body is pre-trained; only the head is fresh.
Trainer wraps the boilerplate from lesson 57 — the optimizer, learning-rate scheduler, gradient accumulation, mixed precision, checkpointing, logging. You configure the knobs through TrainingArguments and stop writing the training loop yourself.
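
The saving from per-batch padding is easy to see in plain Python. The sketch below is not DataCollatorWithPadding itself, only a hand-written version of the same idea; the token ids are made up, and pad_id=0 is an assumption for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Made-up token ids for four "reviews" with lengths 3, 5, 2 and 12.
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101] + [5] * 10 + [102],
]

# Two batches of two sequences each, padded independently.
ids_a, mask_a = pad_batch(sequences[:2])
ids_b, mask_b = pad_batch(sequences[2:])

# Tokens processed with per-batch padding vs. padding everything to the global max.
dynamic_tokens = sum(len(row) for row in ids_a + ids_b)
static_tokens = max(len(seq) for seq in sequences) * len(sequences)

print(ids_a[0], mask_a[0])
print(dynamic_tokens, static_tokens)
```

The first batch only pads to length 5 instead of 12, so the model processes 34 token positions rather than 48. The attention mask tells the model which positions are padding.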

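Two epochs at lr=2e-5 is a common starting recipe for fine-tuning BERT-family models; the original BERT paper suggests 2 to 4 epochs with learning rates between 2e-5 and 5e-5. Going much higher tends to wreck the pre-trained weights; going much lower leaves the new head undertrained.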
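
It helps to know that 2e-5 is only the starting value. Assuming Trainer's defaults (a linear schedule with no warmup), a single device and no gradient accumulation, the learning rate decays to zero over the run. The arithmetic for the sub-sampled example above can be sketched like this; linear_decay_lr is a hand-written helper for illustration, not a library function.

```python
import math

train_examples = 2000   # size of the sub-sampled training set
batch_size = 16         # per_device_train_batch_size, single device assumed
epochs = 2
base_lr = 2e-5

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_decay_lr(step, warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup/decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
for step in (0, steps_per_epoch, total_steps):
    print(step, f"{linear_decay_lr(step):.2e}")
```

With these numbers the run is 250 optimizer steps, and the rate is down to half its starting value by the end of the first epoch.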

Saving and reloading

trainer.save_model("./distilbert-imdb-final")
tokenizer.save_pretrained("./distilbert-imdb-final")

# Use it later:
from transformers import pipeline
clf = pipeline("sentiment-analysis", model="./distilbert-imdb-final")
clf("This movie was beautifully shot but the script dragged.")
# e.g. [{'label': 'LABEL_1', 'score': 0.78}]  (label 1 is "positive" in IMDB)

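You can also push it directly to the Hub. Trainer.push_to_hub() takes a commit message as its first argument, not a repository name, so push the model and tokenizer objects themselves (this requires being logged in to the Hub):

trainer.model.push_to_hub("your-username/distilbert-imdb")
tokenizer.push_to_hub("your-username/distilbert-imdb")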

That makes the checkpoint shareable. Six months later, anyone (including you) can re-load the exact model in three lines.

PEFT and LoRA: fine-tune cheaply

Full fine-tuning of a 7-billion-parameter model needs serious hardware: every weight gets a gradient plus optimizer state (two extra values per weight with Adam), roughly quadrupling the memory the weights alone need. Parameter-efficient fine-tuning (PEFT) sidesteps the issue: instead of updating all the weights, you train small adapters and freeze the rest.
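
A back-of-the-envelope estimate shows the scale. The sketch assumes everything is stored in 32-bit floats and that the optimizer is Adam (two extra values per weight); it ignores activations, so real usage is higher. Mixed-precision setups change the per-value byte counts but not the overall picture.

```python
params = 7e9          # 7-billion-parameter model
bytes_per_value = 4   # assumption: fp32 for weights, gradients and optimizer state

weights_gb = params * bytes_per_value / 1e9
grads_gb = params * bytes_per_value / 1e9
adam_gb = params * bytes_per_value * 2 / 1e9   # first and second moment estimates

total_gb = weights_gb + grads_gb + adam_gb
print(f"weights {weights_gb:.0f} GB, gradients {grads_gb:.0f} GB, "
      f"Adam state {adam_gb:.0f} GB, total {total_gb:.0f} GB")
```

Under these assumptions the weights alone take 28 GB, and training state brings the total to about 112 GB before a single activation is stored.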

The dominant technique is LoRA (Low-Rank Adaptation). For every weight matrix W you want to adapt, you add a pair of low-rank matrices A and B such that W' = W + B @ A. If W is d × k, then B is d × r and A is r × k, with the rank r (typically 8 or 16) far smaller than d or k. You only train A and B. The original weights stay frozen.
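
The update rule is small enough to write out in NumPy. This is a sketch of the idea, not the PEFT implementation: the shapes are chosen to resemble one 768 × 768 projection, the values are random, and the alpha / r scaling factor follows the LoRA paper (the formula above omits it for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 768, 768, 8, 16

W = rng.normal(size=(d_out, d_in))           # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init: W' == W at the start


def lora_forward(x, W, A, B, scale=alpha / r):
    """Compute x @ (W + scale * B @ A).T without materializing the full update."""
    return x @ W.T + scale * (x @ A.T) @ B.T


x = rng.normal(size=(4, d_in))

# Because B starts at zero, the adapted layer initially behaves like the base layer.
start_matches_base = np.allclose(lora_forward(x, W, A, B), x @ W.T)

# Pretend training has moved B away from zero, then merge the adapter into W.
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W + (alpha / r) * B @ A
merged_matches = np.allclose(lora_forward(x, W, A, B), x @ W_merged.T)

full_params = W.size
lora_params = A.size + B.size
print(start_matches_base, merged_matches, full_params, lora_params)
```

For this one matrix, the adapter trains 12,288 values instead of 589,824. The merge step at the end is why a LoRA fine-tune can be folded back into the base weights and add no inference cost.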

The numbers are dramatic. Fine-tuning a 7B model with LoRA might train ~10 million parameters instead of 7 billion. Memory drops by 5-10x. Quality stays competitive.
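
The parameter count can be checked with simple arithmetic. The sketch assumes a model shaped like a typical 7B decoder (hidden size 4096, 32 layers), square attention projections, and adapters on two matrices per layer; lora_param_count is a hypothetical helper written for this example.

```python
def lora_param_count(hidden_size, num_layers, rank, matrices_per_layer=2):
    """Trainable LoRA weights when adapting square hidden x hidden projections."""
    # Each adapted matrix gets A (rank x hidden) and B (hidden x rank).
    per_matrix = rank * (hidden_size + hidden_size)
    return num_layers * matrices_per_layer * per_matrix


for rank in (8, 16):
    n = lora_param_count(hidden_size=4096, num_layers=32, rank=rank)
    print(f"rank {rank}: {n:,} trainable LoRA parameters")
```

That gives roughly 4.2 million parameters at rank 8 and 8.4 million at rank 16. Adapting more matrices per layer, or using a higher rank, pushes the count up from there.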

The Hugging Face PEFT library makes this a few-line change:

from peft import LoraConfig, get_peft_model, TaskType

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_lin", "v_lin"],  # which matrices to adapt
)

base = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
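# Prints trainable vs. total parameter counts. The LoRA matrices add
# 6 layers * 2 matrices * 8 * (768 + 768) = 147,456 weights; the classification
# head also stays trainable, so the exact total depends on the PEFT version.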

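Drop that model into the same Trainer as before. LoRA adapters usually want a higher learning rate than full fine-tuning (values around 1e-4 are a common starting point), so raise learning_rate when you do. You’re fine-tuning about 1% of the parameters or fewer and getting most of the accuracy. For LLMs this is the default workflow now: a 24 GB consumer GPU can fine-tune a 7B model with LoRA + 4-bit quantization, something that would have required a small data center a few years ago.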

Vision and multimodal: same pattern

The pattern is identical for other modalities, just with different Auto* classes:

# Computer vision — fine-tune a ViT for image classification
from transformers import AutoImageProcessor, AutoModelForImageClassification
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=10, ignore_mismatched_sizes=True,
)

For pure-vision projects, the timm library (now part of Hugging Face) is the gold standard — every interesting vision backbone, swappable behind one API. For multimodal (image + text) tasks, models like CLIP, SigLIP, and the LLaVA family are all from_pretrained away.

Should you fine-tune at all?

In 2026 the question often isn’t “fine-tune or train from scratch.” It’s “fine-tune, or just call an LLM with a good prompt?” That’s the next lesson. Spoiler: for a lot of NLP tasks, prompting a hosted model gets you 90% of the way there in zero training time, and only when prompting + retrieval clearly fall short do you reach for fine-tuning.

The mental model:

Prompt a hosted LLM. Lowest cost to experiment, no training data needed, expensive at scale.
Fine-tune an open-source model with LoRA. Mid-cost, mid-effort, you own the weights, fits a known task tightly.
Train from scratch. Reserved for embeddings, time series, recommender systems, novel modalities, or “we have a billion examples and a research budget.”

The middle option used to be the default for any non-trivial project. In 2026 it’s the second-line option, picked when prompting clearly isn’t enough.

What’s next

You’ve now covered the practical 2026 deep-learning stack: tensors and autograd (lesson 56), a hand-rolled training loop (lesson 57), and Hugging Face transfer learning (this one). The next lesson zooms out and asks the bigger question: when is the right tool an LLM call, when is it a fine-tuned model, and when is it a classical ML pipeline? It’s the most consequential design question of the field right now, and the cost of getting it wrong is real.

References: Hugging Face transformers documentation (https://huggingface.co/docs/transformers), datasets documentation (https://huggingface.co/docs/datasets), PEFT documentation (https://huggingface.co/docs/peft), Hu et al. “LoRA: Low-Rank Adaptation of Large Language Models” (2021), timm library (https://huggingface.co/docs/timm). Retrieved 2026-05-01.