The training loop, in code · Narcis Miclaus

A neural network straight out of nn.Module is a random function. Its weights are initialized to small random numbers; its predictions are noise. The training loop is what turns that random function into a useful one. The conceptual core is five lines of code. The production version is more like 80, because real training requires checkpointing, validation, metric tracking, mixed precision, gradient clipping, and a learning rate schedule. By the end of this lesson you will have written both, and you’ll know when each is appropriate.

The five lines

Nearly every PyTorch training step boils down to these five lines:

optimizer.zero_grad()                 # 1. clear gradients from last step
outputs = model(inputs)               # 2. forward pass: compute predictions
loss = criterion(outputs, targets)    # 3. compute the loss
loss.backward()                       # 4. backward pass: compute gradients
optimizer.step()                      # 5. update parameters

That is the entire training algorithm. Everything else is iteration and bookkeeping. Wrap those five lines in a loop over batches from your DataLoader, then wrap that in another loop over epochs (one full pass through the dataset), and you have a complete training script.

for epoch in range(n_epochs):
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

That’s it. That is a working training loop. It will train your model. It is also missing every piece of bookkeeping you need to actually trust the result, ship it, and recover from disasters. We’ll add those next.
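To see the loop run end to end, here is a minimal sketch on synthetic data. The toy dataset, the small two-layer model, and the hyperparameters are all assumptions, picked only so the script finishes in about a second on CPU.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy problem: classify 2-D points by the sign of x0 + x1.
X = torch.randn(512, 2)
y = (X.sum(dim=1) > 0).long()
train_loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(10):
    running, n_seen = 0.0, 0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # accumulate a sample-weighted mean loss for the epoch
        running += loss.item() * inputs.size(0)
        n_seen += inputs.size(0)
    epoch_losses.append(running / n_seen)

print(f"first epoch {epoch_losses[0]:.4f} | last epoch {epoch_losses[-1]:.4f}")
```

The only addition to the five lines is the running average, which is the first bit of bookkeeping the next sections build on.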

Validation

You cannot judge a model by its training loss. The training loss tells you how well the model has memorized the training set, which is not what you care about. You care about how well it generalizes to data it has never seen. So at the end of every epoch (or every N batches, for very long epochs), you evaluate on a held-out validation set.

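Two things change in evaluation. First, you call model.eval(), which switches layers like Dropout and BatchNorm into evaluation mode: Dropout stops zeroing activations, and BatchNorm normalizes with its stored running statistics instead of the statistics of the current batch. Second, you wrap the loop in torch.no_grad() to disable gradient tracking. Gradients are not needed here, and skipping them saves memory and time.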

def evaluate(model, loader, criterion, device):
    model.eval()
    total_loss = 0.0
    n_correct = 0
    n_total = 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)
            total_loss += loss.item() * inputs.size(0)
            preds = outputs.argmax(dim=1)
            n_correct += (preds == targets).sum().item()
            n_total += inputs.size(0)
    model.train()
    return total_loss / n_total, n_correct / n_total

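Always call model.train() after evaluating to put the model back into training mode, as evaluate() does on its last line before returning. Forgetting this is a classic bug: dropout stays disabled and BatchNorm stops updating its running statistics for the rest of training, so the model loses its regularization and tends to overfit.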
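The mode switch is easy to observe directly. This small sketch, using a standalone nn.Dropout layer on a vector of ones (an illustrative setup, not part of the lesson's model), shows what train() and eval() actually change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # about half the entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)    # evaluation mode: dropout is the identity

n_zeroed = int((train_out == 0).sum())
print(n_zeroed, torch.equal(eval_out, x), drop.training)
```

Calling model.eval() or model.train() on a parent module sets the same training flag on every submodule, which is why one call at the top of evaluate() is enough.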

Checkpointing

Training runs fail. The cluster reboots. You discover a bug after epoch 47 of 100 and want to roll back to epoch 30. You realize you need to compare two models from different points in training. None of this is possible if you don’t save model state. Save checkpoints.

import torch
from pathlib import Path

CKPT_DIR = Path("checkpoints")
CKPT_DIR.mkdir(exist_ok=True)

def save_checkpoint(model, optimizer, epoch, val_loss, path):
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "val_loss": val_loss,
    }, path)

def load_checkpoint(path, model, optimizer=None):
    ckpt = torch.load(path, map_location="cpu", weights_only=True)
    model.load_state_dict(ckpt["model_state_dict"])
    if optimizer is not None:
        optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    return ckpt["epoch"], ckpt["val_loss"]

Two patterns I always use: save the latest checkpoint at the end of every epoch (so I can resume after a crash), and save the best checkpoint whenever validation loss improves (so I have the model that performed best, not just the one from the last epoch). Don’t keep all checkpoints — they’re huge and you don’t need them.
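The two patterns can be sketched as follows. The validation losses are made-up numbers standing in for real evaluation results, and the temporary directory and file names are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

ckpt_dir = Path(tempfile.mkdtemp())
best_val_loss = float("inf")

# Hypothetical validation losses for three epochs (epoch 1 is the best).
fake_val_losses = [0.9, 0.6, 0.7]

for epoch, val_loss in enumerate(fake_val_losses):
    state = {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "val_loss": val_loss,
    }
    torch.save(state, ckpt_dir / "latest.pt")      # overwritten every epoch
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(state, ckpt_dir / "best.pt")    # overwritten only on improvement

latest = torch.load(ckpt_dir / "latest.pt", map_location="cpu", weights_only=True)
best = torch.load(ckpt_dir / "best.pt", map_location="cpu", weights_only=True)
start_epoch = latest["epoch"] + 1                  # where a resumed run would continue
print(latest["epoch"], best["epoch"], start_epoch)
```

Only two files ever exist on disk, so storage stays constant no matter how long the run is.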

Metric tracking

Print statements work for tiny experiments. For anything serious, use a real tracker. The two dominant tools in 2026:

TensorBoard is built into PyTorch via torch.utils.tensorboard.SummaryWriter. Free, local, low-friction.
Weights & Biases (wandb) is the cloud-hosted standard. Tracks metrics, system stats, hyperparameters, code versions. The free tier is generous; the team plan is what most labs use.

I use wandb for collaborative work and TensorBoard for solo runs. The interface is similar:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/exp1")

# inside training loop:
# global_step counts optimizer steps across epochs; increment it once per batch
writer.add_scalar("train/loss", loss.item(), global_step)
writer.add_scalar("val/loss", val_loss, epoch)
writer.add_scalar("val/accuracy", val_acc, epoch)

Even if you never use a UI, dumping metrics to a JSON file as you go is non-negotiable. You will want to plot training curves later. You will not remember the exact numbers. Save them.
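One low-tech way to do this is to append one JSON object per line as training proceeds, so a crash never loses earlier records. The helper name log_metrics, the file name, and the numbers below are hypothetical.

```python
import json
import tempfile
from pathlib import Path

log_path = Path(tempfile.mkdtemp()) / "metrics.jsonl"

def log_metrics(path, **metrics):
    """Append one JSON record per line; earlier lines are never rewritten."""
    with open(path, "a") as f:
        f.write(json.dumps(metrics) + "\n")

# Hypothetical per-epoch results standing in for real training output.
log_metrics(log_path, epoch=0, train_loss=0.91, val_loss=0.88)
log_metrics(log_path, epoch=1, train_loss=0.64, val_loss=0.70)

# Reading the log back later, for example to plot curves:
records = [json.loads(line) for line in log_path.read_text().splitlines()]
print(records[-1]["val_loss"])
```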

Mixed precision

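Modern NVIDIA GPUs compute much faster in reduced precision than in float32: float16 is accelerated from the V100 onwards, bfloat16 from the A100 onwards. Mixed-precision training keeps the weights in float32 for stability but runs most of the forward and backward pass in lower precision. With float16 you also need a “gradient scaler” to keep small gradients from underflowing to zero; bfloat16 has the same exponent range as float32 and usually does not need one. The benefit is often up to a 2x training speedup on matmul-heavy models for very little code, plus a substantial cut in activation memory. Total memory is not halved, because the float32 weights and optimizer state are still there.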

# torch.cuda.amp is deprecated, and its autocast does not accept device_type
from torch.amp import GradScaler, autocast

scaler = GradScaler("cuda")

for inputs, targets in train_loader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

There is almost no reason not to use mixed precision in 2026. The PyTorch API is stable, the speedup is real, and modern GPUs are increasingly designed around it.
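For comparison, here is a sketch of the bfloat16 variant, which usually runs without a GradScaler because bfloat16 keeps the exponent range of float32. It is pinned to CPU so it runs anywhere; on a GPU with bfloat16 support you would pass device_type="cuda" instead. The tiny linear model and random data are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device_type = "cpu"   # assumption for portability; use "cuda" on a bfloat16-capable GPU
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(16, 8)
targets = torch.randn(16, 4)

optimizer.zero_grad()
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    outputs = model(inputs)                                  # linear layer runs in bfloat16
    loss = nn.functional.mse_loss(outputs.float(), targets)  # loss computed in float32
loss.backward()       # plain backward: no scaler.scale(...) wrapper
optimizer.step()

print(outputs.dtype, model.weight.dtype, model.weight.grad.dtype)
```

The weights and their gradients stay in float32; only the activations inside the autocast block drop to bfloat16.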

Gradient clipping and learning rate scheduling

Two more pieces you’ll see in any production training script.

Gradient clipping prevents the occasional huge gradient from blowing up your weights. Right after loss.backward() and before optimizer.step():

torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

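If the combined L2 norm of all gradients exceeds 1.0, this rescales them so the norm is 1.0; smaller gradients are left untouched. With a GradScaler, call scaler.unscale_(optimizer) first so the threshold applies to the true gradients rather than the scaled ones. Essential for training transformers; useful insurance for everything else.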
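A quick way to convince yourself of what clipping does is to measure the gradient norm before and after. In this sketch the oversized inputs are a deliberate assumption to provoke large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
inputs = torch.randn(32, 10) * 100.0   # deliberately large inputs -> large gradients
targets = torch.randn(32, 1)

loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# clip_grad_norm_ returns the total norm measured *before* clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the combined L2 norm over all parameter gradients after clipping.
norm_after = torch.cat([p.grad.flatten() for p in model.parameters()]).norm(2)

print(f"before {norm_before.item():.1f} | after {norm_after.item():.4f}")
```

Logging the returned pre-clip norm during training is a cheap way to spot instability early.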

Learning rate scheduling: a constant learning rate is rarely optimal. The standard recipe in 2026 is a brief linear warmup followed by cosine decay. PyTorch ships several schedulers in torch.optim.lr_scheduler:

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=n_epochs
)
# at the end of each epoch:
scheduler.step()

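CosineAnnealingLR decays from your initial LR down to eta_min (zero by default) following half a cosine wave over T_max calls to scheduler.step(). It has no warmup of its own; to get the full recipe, combine it with a warmup scheduler such as LinearLR via SequentialLR. OneCycleLR and ReduceLROnPlateau are the other common choices.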
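The warmup-then-cosine recipe can be sketched with LambdaLR, which multiplies the base LR by whatever a function of the epoch returns. The function name warmup_cosine and the specific epoch counts are assumptions; writing the curve by hand just makes its shape explicit.

```python
import math

import torch
import torch.nn as nn

n_epochs, warmup_epochs, base_lr = 20, 5, 0.1

def warmup_cosine(epoch):
    """Multiplier on base_lr: linear ramp up, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (n_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)

lrs = []
for epoch in range(n_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])   # LR used during this epoch
    optimizer.step()                              # stands in for a full epoch of training
    scheduler.step()

print([round(lr, 4) for lr in lrs])
```

For long runs the same function is often applied per optimizer step rather than per epoch, with the counts expressed in steps.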

Putting it together: a real training loop

Here is the production-grade version. About 80 lines. This is the kind of file every deep learning project has.

import json
import torch
import torch.nn as nn
from pathlib import Path
from torch.amp import GradScaler, autocast  # torch.cuda.amp's autocast rejects device_type

def train(
    model: nn.Module,
    train_loader,
    val_loader,
    criterion,
    optimizer,
    scheduler,
    n_epochs: int,
    device,
    ckpt_dir: Path,
    use_amp: bool = True,
    grad_clip: float = 1.0,
):
    ckpt_dir.mkdir(exist_ok=True, parents=True)
    scaler = GradScaler(enabled=use_amp)
    best_val_loss = float("inf")
    history = []

    for epoch in range(n_epochs):
        # --- training ---
        model.train()
        train_loss = 0.0
        n_seen = 0
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            with autocast(device_type=device.type, enabled=use_amp):
                outputs = model(inputs)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            scaler.step(optimizer)
            scaler.update()
            train_loss += loss.item() * inputs.size(0)
            n_seen += inputs.size(0)
        train_loss /= n_seen
        scheduler.step()

        # --- validation ---
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        lr = optimizer.param_groups[0]["lr"]
        history.append({
            "epoch": epoch,
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_acc": val_acc,
            "lr": lr,
        })
        print(f"epoch {epoch:3d} | lr {lr:.2e} | "
              f"train {train_loss:.4f} | val {val_loss:.4f} | acc {val_acc:.4f}")

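        # --- checkpoint ---
        # latest.pt holds what is needed to resume; it matches load_checkpoint()
        torch.save({
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "scheduler_state_dict": scheduler.state_dict(),
            "val_loss": val_loss,
        }, ckpt_dir / "latest.pt")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # best.pt is a bare state_dict: load it with model.load_state_dict()
            torch.save(model.state_dict(), ckpt_dir / "best.pt")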

        # rewrite the history file every epoch so a crash does not lose it
        with open(ckpt_dir / "history.json", "w") as f:
            json.dump(history, f, indent=2)
    return history

Boilerplate-heavy but every line earns its place. Validation is non-negotiable. Checkpointing is non-negotiable. The history JSON lets you plot training curves after the fact even if your TensorBoard logs got nuked.

Distributed training, briefly

If your model and data fit on one GPU, single-GPU training is the right choice. If they don’t, you need data-parallel or model-parallel training across multiple GPUs.

The standard mechanism in PyTorch is torch.nn.parallel.DistributedDataParallel (DDP). Each GPU runs an identical copy of the model on a different shard of each batch; gradients are averaged across GPUs during the backward pass, so every copy takes the same optimizer step. You launch the script with torchrun --nproc_per_node=N script.py and PyTorch handles the synchronization. The changes to your training code are small: initialize the process group, wrap your model in DDP(...), use a DistributedSampler on your DataLoader (calling sampler.set_epoch(epoch) each epoch so the shuffle changes), and only save checkpoints on rank 0.
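The sharding part can be seen without any GPUs. This sketch builds one DistributedSampler per simulated rank in a single process; the world size of 2 and the eight-element dataset are assumptions. Under torchrun you would omit num_replicas and rank and let the sampler read them from the process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
world_size = 2   # hypothetical: two GPUs

shards = []
for rank in range(world_size):
    # Passed explicitly so the sketch runs without an initialized process group.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    # Collect the sample values this rank would see over one epoch.
    shards.append([int(v) for (batch,) in loader for v in batch])

print(shards)
```

Each rank sees a disjoint, interleaved slice of the dataset, which is what lets the averaged gradients behave like one large batch.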

For models too big to fit on a single GPU even at batch size 1 — frontier transformers, mostly — you need fully sharded data parallel (FSDP) or pipeline parallelism. That is its own course. Don’t go there until you have to.

PyTorch Lightning: skip the boilerplate

PyTorch Lightning wraps the training loop pattern above into a class-based API. You define a LightningModule with training_step, validation_step, and configure_optimizers methods, hand it to a Trainer, and Lightning handles checkpointing, AMP, distributed training, and metric logging for you.

import torch
import torch.nn as nn
import lightning as L

class LitMLP(L.LightningModule):
    def __init__(self, model, lr=1e-3):
        super().__init__()
        self.model = model
        self.criterion = nn.CrossEntropyLoss()
        self.lr = lr

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self.model(inputs)
        loss = self.criterion(outputs, targets)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self.model(inputs)
        loss = self.criterion(outputs, targets)
        acc = (outputs.argmax(dim=1) == targets).float().mean()
        self.log("val_loss", loss)
        self.log("val_acc", acc)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

trainer = L.Trainer(
    max_epochs=20,
    precision="16-mixed",
    accelerator="auto",
    devices="auto",
)
trainer.fit(LitMLP(model), train_loader, val_loader)

About 30 lines, replacing 80 of raw PyTorch. For straightforward training problems, Lightning is a real productivity win. The cost is opacity — when something weird happens, you have to dig through Lightning’s internals to understand it.

Hugging Face Trainer: for transformers

For fine-tuning transformer models from Hugging Face Hub, the transformers.Trainer class is the standard. You tokenize the dataset yourself, usually with the map method of Hugging Face’s datasets library; Trainer then handles batching with dynamic padding through a data collator, mixed precision, distributed training, checkpointing, and logging. If you’re fine-tuning a BERT, a Llama, or any model from the Hub, you almost always use Trainer instead of writing your own loop.

Roll-your-own vs framework

The honest decision rule:

Roll your own when you’re learning (you cannot understand deep learning by writing only LightningModule.training_step), when you have unusual training requirements, or when you’re doing research that needs full control.
Use Lightning for production training of standard architectures where the boilerplate is pure overhead.
Use Hugging Face Trainer for fine-tuning models from Hugging Face Hub.

I write a from-scratch training loop for almost every new project at first, because it forces me to think about every piece. Then I refactor to Lightning if the project sticks around and the code is becoming a maintenance burden. The five lines at the start of this lesson are the core; the rest is choosing how much you want to write yourself.

End of Module 10

That’s the foundation. Lesson 55 was the intuition: a network is a function, backprop is just the chain rule, deep learning wins where feature engineering is impossible. Lesson 56 was PyTorch: tensors, autograd, the nn.Module API. Lesson 57, this one, was the loop: five lines of core, plus the bookkeeping. With these three lessons in your head, you can read any modern deep learning paper’s reference implementation and recognize what every block is doing. Module 11 builds on this with a small image classification project, and from there we move into the more specialized territory of CNNs, transformers, and using pretrained models from Hugging Face.

The first deep learning project is always slow. The second is fast. By the third you’ll be tweaking learning rate schedules over coffee. Welcome.