A few years ago “doing ML” had a clear shape. You collected labeled data, you cleaned it, you trained a model, you evaluated it, you deployed it behind an API. The whole loop took months, and most of the budget went to the data work. That world still exists, but it’s now a subset of a much messier landscape.
In 2026 the first question on a new project isn’t “what model do I train?” It’s “do I need to train anything at all?” For an enormous and growing class of problems, the right answer is: point a hosted LLM at the input and ask politely. For another class, it’s: fine-tune an open-source model on cheap hardware. For a third, the classical pipeline you learned in module 9 is still correct. Picking among them is the most consequential design decision of the field right now.
This lesson is a decision framework, with code. No new algorithms — just hard-won judgment about which tool comes off the shelf.
The decision tree
Three boxes. Pick the first one your problem fits.
Call a hosted LLM (Claude, GPT-5, Gemini) when:
- The task is something a smart human could do given the input as text, with no special training. Drafting, summarizing, extraction, classification on novel categories, reformatting, light reasoning.
- Latency tolerates 1-3 seconds.
- You don’t need bit-for-bit determinism or strict explainability.
- Volume is low-to-medium — say, < 1M API calls per day, or unpredictable bursty traffic.
- You don’t have or can’t get enough labeled training data to fine-tune meaningfully.
- Cost-per-call (currently somewhere between $0.001 and $0.05 depending on model and length) fits your unit economics.
Fine-tune an open-source model (a Llama, Mistral, Qwen, or domain-specific base) when:
- You have a specialized recurring task with at least a few hundred labeled examples.
- Hosted-LLM costs at your volume make a fine-tuned 7B model on your own GPU obviously cheaper.
- Latency requirements are tight (< 200 ms) — local inference on your hardware beats round-tripping to an API.
- Privacy, data residency, or regulation forbids sending the data off-site.
- The task is narrow enough that a small specialist beats a giant generalist.
Train from scratch (usually a non-LLM model) when:
- You’re building embedding models for retrieval — a smaller, faster, domain-tuned encoder usually wins.
- The task is time-series forecasting, recommendation, anomaly detection, or another tabular/sequence problem where LLMs aren’t the right shape at all. Module 9 still applies here, untouched.
- You’re working in a novel modality (proprietary sensor data, domain-specific images, biological sequences) where no useful pre-trained model exists.
The huge majority of “ML projects” in 2026 land in the first two boxes. A meaningful but smaller fraction stays in the third — and recognizing when classical ML still wins is part of the skill.
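The three boxes can be written down as a small function, which makes the order of the checks explicit. This is only a sketch: `TaskProfile`, its field names, and the `revisit_constraints` outcome are hypothetical, and the thresholds (1 second, 1M calls per day, 300 labeled examples standing in for "a few hundred") are read off the lists above rather than measured.

```python
from dataclasses import dataclass

MIN_LABELS_FOR_FINE_TUNE = 300  # assumption: one reading of "a few hundred"


@dataclass
class TaskProfile:
    """Hypothetical summary of a project; field names are illustrative."""
    text_shaped: bool             # a person could do it from the input as text
    labeled_examples: int         # labeled examples available for fine-tuning
    calls_per_day: int            # expected request volume
    latency_budget_ms: int        # latency the product tolerates
    data_must_stay_onsite: bool   # privacy, residency, or regulatory constraint
    pretrained_model_exists: bool = True  # False for novel modalities


def choose_approach(p: TaskProfile) -> str:
    """Return the first box the profile fits, checked in the lesson's order."""
    if not p.text_shaped or not p.pretrained_model_exists:
        # Tabular, time-series, or novel-modality work: the classical pipeline.
        return "train_from_scratch"
    hosted_fits = (
        p.latency_budget_ms >= 1000
        and p.calls_per_day < 1_000_000
        and not p.data_must_stay_onsite
    )
    if hosted_fits:
        return "hosted_llm"
    if p.labeled_examples >= MIN_LABELS_FOR_FINE_TUNE:
        return "fine_tune"
    # Text-shaped, but neither box fits yet: relax a constraint or label more data.
    return "revisit_constraints"


if __name__ == "__main__":
    support_triage = TaskProfile(True, 0, 5_000, 3_000, False)
    print(choose_approach(support_triage))
```

A real decision also weighs unit economics and task narrowness, which do not reduce to a boolean; treat the function as a checklist, not an oracle.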
The hybrid pattern: LLMs as adapters
The most robust production architectures don’t treat the LLM as the whole system. They use it as a shape-changing adapter that sits at the boundary, taking in messy unstructured input (free-text emails, voice transcripts, scanned documents) and emitting tidy structured data that a classical pipeline downstream can handle.
from anthropic import Anthropic
import json

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = """You will receive a customer support email. Extract:
- intent: one of [refund, technical_issue, account_question, other]
- urgency: one of [low, medium, high]
- mentions_competitor: boolean
- contains_legal_threat: boolean
Reply with ONLY a JSON object, no prose."""

def classify_email(body: str) -> dict:
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        system=PROMPT,
        messages=[{"role": "user", "content": body}],
    )
    # The reply should be a bare JSON object; json.loads raises if it is not.
    return json.loads(msg.content[0].text)
That output then feeds a deterministic routing system, a SQL warehouse, dashboards, alerts. The unpredictable part — natural-language understanding — is contained behind a typed boundary. The downstream stays the boring, testable, auditable code you’ve spent decades getting good at.
This is the pattern that’s actually winning in production: LLM at the edges, classical engineering in the middle. You get the magic where you need it and predictability where you need that.
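One way to make the "typed boundary" concrete is to validate the model's reply against the schema in the prompt before anything downstream sees it. The sketch below assumes the four fields from the extraction prompt; `EmailLabels`, `parse_labels`, `route`, and the queue names are hypothetical and exist only for illustration.

```python
import json
from dataclasses import dataclass

INTENTS = {"refund", "technical_issue", "account_question", "other"}
URGENCIES = {"low", "medium", "high"}


@dataclass(frozen=True)
class EmailLabels:
    intent: str
    urgency: str
    mentions_competitor: bool
    contains_legal_threat: bool


def parse_labels(raw: str) -> EmailLabels:
    """Parse the model's reply and reject anything outside the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent, urgency = data.get("intent"), data.get("urgency")
    if not isinstance(intent, str) or intent not in INTENTS:
        raise ValueError(f"unexpected intent: {intent!r}")
    if not isinstance(urgency, str) or urgency not in URGENCIES:
        raise ValueError(f"unexpected urgency: {urgency!r}")
    for key in ("mentions_competitor", "contains_legal_threat"):
        if not isinstance(data.get(key), bool):
            raise ValueError(f"{key} must be a boolean")
    return EmailLabels(
        intent, urgency, data["mentions_competitor"], data["contains_legal_threat"]
    )


def route(labels: EmailLabels) -> str:
    """Deterministic routing on validated labels; queue names are made up."""
    if labels.contains_legal_threat:
        return "legal_review"
    if labels.urgency == "high":
        return "priority_queue"
    if labels.intent == "refund":
        return "billing_queue"
    return "general_queue"


if __name__ == "__main__":
    reply = (
        '{"intent": "refund", "urgency": "low", '
        '"mentions_competitor": false, "contains_legal_threat": false}'
    )
    print(route(parse_labels(reply)))
```

Everything after `parse_labels` is ordinary code that can be unit-tested without calling a model; a reply that fails validation can be retried or sent to a human rather than silently routed.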
RAG: the dominant pattern for “answer questions about my data”
Almost every “build me a chatbot for our docs” request is really a request for retrieval-augmented generation. The LLM doesn’t know your private data; you can’t fit all your data in the context window; fine-tuning isn’t a great way to add knowledge anyway. The fix:
- Chunk your documents.
- Compute an embedding for each chunk.
- Store embeddings in a vector index.
- At query time, embed the user’s question, find the top-k most similar chunks, and stuff them into the prompt.
That’s it. The LLM answers from the retrieved context. Fifty lines of Python:
from pathlib import Path

from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

# 1. Embedding model and vector store
embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
chroma = chromadb.PersistentClient(path="./rag_store")
coll = chroma.get_or_create_collection("docs")

# 2. Index your docs
def chunk(text: str, size: int = 600, overlap: int = 100) -> list[str]:
    # Fixed-size character windows; each chunk repeats the last `overlap`
    # characters of the previous one.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i + size])
        i += size - overlap
    return chunks

def index_folder(folder: str) -> None:
    docs, ids, metas = [], [], []
    for path in Path(folder).rglob("*.md"):
        text = path.read_text(encoding="utf-8")
        for j, c in enumerate(chunk(text)):
            docs.append(c)
            # Full path rather than path.stem, so same-named files in
            # different subfolders do not overwrite each other.
            ids.append(f"{path}-{j}")
            metas.append({"source": str(path)})
    if not docs:
        return  # nothing to index
    embeddings = embedder.encode(docs, normalize_embeddings=True).tolist()
    coll.upsert(ids=ids, documents=docs, embeddings=embeddings, metadatas=metas)

# 3. Query
client = Anthropic()

def answer(question: str, k: int = 4) -> str:
    q_emb = embedder.encode([question], normalize_embeddings=True).tolist()
    hits = coll.query(query_embeddings=q_emb, n_results=k)
    # Results are nested one list per query; we sent a single query.
    context = "\n\n---\n\n".join(hits["documents"][0])
    sources = [m["source"] for m in hits["metadatas"][0]]
    msg = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=600,
        system=(
            "Answer the question using ONLY the provided context. "
            "If the answer isn't in the context, say so."
        ),
        messages=[{
            "role": "user",
            "content": f"<context>\n{context}\n</context>\n\nQuestion: {question}",
        }],
    )
    return msg.content[0].text + f"\n\nSources: {sorted(set(sources))}"

if __name__ == "__main__":
    index_folder("./my_docs")
    print(answer("How does our refund policy handle digital goods?"))
That’s a working RAG system. Three components: an embedding model (bge-small-en-v1.5 is excellent and tiny), a vector store (ChromaDB is fine for up to ~1M chunks; for larger, look at Qdrant, Weaviate, or pgvector), and an LLM call that consumes the retrieved context.
Frameworks like LangChain and LlamaIndex wrap all this and add streaming, agents, query rewriting, hybrid search. They’re useful but heavy — for a first version the 50-line direct approach above is often clearer to debug. Add a framework when you actually need its features, not by default.
What you tune in a real RAG system, in rough order of impact:
- Chunking strategy. Naive fixed-size splits work but recursive splitters that respect sentence and paragraph boundaries work better. Markdown-aware splitters for markdown.
- Retrieval. Top-k pure-cosine retrieval is the baseline. Hybrid search (BM25 + dense) beats it. Reranking with a cross-encoder (e.g., bge-reranker-v2-m3) on the top 20-50 candidates is another big lift.
- Embedding model quality. A modern English embedder is fine; for multilingual or domain-specific corpora, swap to a model trained on the right data.
- Prompt template. Tell the model what counts as the context, what to do when the answer isn’t present, what format you want.
- The LLM itself. Switch to a bigger model only after the above are tuned — the retrieval is usually the bottleneck, not the generator.
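The first two items in that list can be sketched without any extra dependencies. The block below is an illustration, not the lesson's implementation: `chunk_paragraphs` is a hypothetical paragraph-aware splitter that packs whole paragraphs instead of cutting at fixed offsets, and `reciprocal_rank_fusion` is one common way to merge a BM25 ranking with a dense ranking. The text above does not say which fusion method to use, so reciprocal rank fusion (with its conventional constant of 60) is an assumption here.

```python
def chunk_paragraphs(text: str, max_chars: int = 600) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole; a fuller
    splitter would fall back to sentence boundaries for that case.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into a single ranking.

    Each list contributes 1 / (k + rank) to a document's score, so an id
    that ranks well in several lists rises to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # ids from a keyword search, best first
    dense_hits = ["c", "a", "d"]   # ids from the vector index, best first
    print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

The fused list is what you would hand to a cross-encoder reranker, which then only has to score a few dozen candidates.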
The cost math
A common decision-driving question: at what volume does fine-tuning your own model beat paying per-call?
A rough calculation. Suppose your task averages 800 input tokens and 300 output tokens per call. With a hosted frontier-tier model at, say, $5/M input + $25/M output tokens, each call costs:
(800 / 1_000_000) * 5 + (300 / 1_000_000) * 25
= 0.0040 + 0.0075
= $0.0115 per call
At 1M calls/month: $11,500/month. At 10M calls/month: $115,000/month.
A self-hosted fine-tuned 7B model on a single A100 instance (currently ~$1.50/hour reserved) costs ~$1,100/month, and it can comfortably serve millions of calls if your latency budget allows it. On GPU cost alone the break-even is $1,100 / $0.0115, or roughly 96K calls per month; allowing for overhead beyond the raw GPU bill, it lands somewhere around 100K-200K calls per month, provided a fine-tuned 7B is good enough at your task.
So for low/medium-volume work, hosted is cheaper and better quality. For high-volume specialized work, owning the model wins. The crossover keeps shifting — hosted prices drop ~30% per year while open weights get ~30% better at the same size — so re-run the calculation every six months. The decision isn’t sticky.
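Because the inputs keep moving, it helps to keep the arithmetic in a script you can re-run. The sketch below uses the example figures from this section ($5/M input, $25/M output, $1.50/hour for the GPU); the function names and the 730-hours-per-month convention are assumptions, and the result covers GPU cost only, not engineering time.

```python
HOURS_PER_MONTH = 730  # assumption: average month, instance running around the clock


def cost_per_call(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Hosted-API cost of one call in USD, given per-million-token prices."""
    return (
        input_tokens / 1_000_000 * usd_per_m_input
        + output_tokens / 1_000_000 * usd_per_m_output
    )


def break_even_calls_per_month(per_call_usd: float, self_hosted_usd: float) -> float:
    """Monthly call volume at which the hosted bill equals the self-hosted bill."""
    return self_hosted_usd / per_call_usd


if __name__ == "__main__":
    per_call = cost_per_call(800, 300, usd_per_m_input=5.0, usd_per_m_output=25.0)
    gpu_monthly = 1.50 * HOURS_PER_MONTH
    print(f"per call:   ${per_call:.4f}")
    print(f"GPU/month:  ${gpu_monthly:,.0f}")
    print(f"break-even: {break_even_calls_per_month(per_call, gpu_monthly):,.0f} calls/month")
```

With these inputs the break-even comes out near 95K calls per month on GPU cost alone. Swap in your own token counts and current prices before drawing conclusions.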
When classical ML still wins, and it’s more often than people think
A loud part of the industry behaves as if everything is now an LLM. Don’t. On numerical, tabular, and time-series problems, classical models still beat LLMs on accuracy, latency, cost, and explainability, often by orders of magnitude on latency and cost.
Specifically:
- Predicting churn from a customer-features table. Gradient-boosted trees, every time. (See lesson 54.)
- Demand forecasting / time series. Prophet, ARIMA, or a deep model trained specifically on time series. An LLM is a clown shoe for this.
- Anomaly detection on metrics. Isolation forests, statistical control charts, dedicated detectors.
- Recommendation systems. Two-tower models, matrix factorization, learning-to-rank. LLMs are sometimes used as re-rankers but rarely, if ever, as the core retriever.
- Computer vision on a fixed taxonomy. Fine-tuning a ConvNeXt or ViT beats VLM prompting on cost and accuracy.
The rule of thumb: when the input is naturally a vector of numbers or a structured row, classical ML usually wins. When the input is naturally a paragraph of free-form text or an image you’d describe in words, an LLM has the edge. Most data engineering work lives in the first category.
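To show how little machinery the numeric side can need, here is a minimal sketch of a statistical control chart for a metric, using only the standard library. The function name, the 30-point baseline, and the 3-sigma limit are illustrative assumptions, not tuned recommendations.

```python
from statistics import mean, stdev


def control_chart_anomalies(
    values: list[float],
    baseline_size: int = 30,
    n_sigma: float = 3.0,
) -> list[int]:
    """Return indices of points outside mean +/- n_sigma * std of the baseline.

    The first `baseline_size` points define the control limits; only the
    points after them are checked.
    """
    if baseline_size < 2:
        raise ValueError("baseline_size must be at least 2")
    if len(values) <= baseline_size:
        return []  # nothing beyond the baseline to check
    baseline = values[:baseline_size]
    center = mean(baseline)
    spread = stdev(baseline)
    lower = center - n_sigma * spread
    upper = center + n_sigma * spread
    return [
        i for i in range(baseline_size, len(values))
        if not lower <= values[i] <= upper
    ]


if __name__ == "__main__":
    steady = [100.0, 102.0, 98.0, 101.0, 99.0] * 6   # 30 baseline readings
    recent = [100.0, 103.0, 140.0, 99.0]             # one obvious spike
    print(control_chart_anomalies(steady + recent))
```

The check costs a few comparisons per point and its threshold can be explained in one sentence, which is the accuracy, latency, cost, and explainability argument in miniature.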
The principle: LLMs as a force multiplier, not a replacement
The teams getting the most value out of AI in 2026 aren’t the ones replacing every component with an LLM. They’re the ones who used to take three weeks to build a pipeline and now do it in three days because the LLM took care of the messy text-shaped pieces — the parsing, the classification, the summarization — that used to require hand-tuning.
The engineering didn’t go away. The data work didn’t go away. The need for tests, monitoring, schemas, and version control didn’t go away. What went away was a specific category of friction at the boundary between unstructured human input and structured machine processing. That category was huge. Removing it shifts what’s possible. It does not remove the need for software engineering.
If you take one thing from this lesson: the question on every new project is no longer “what model do we build?” It’s “where does this problem actually live on the spectrum from prompt to fine-tune to from-scratch?” Picking correctly takes you from “can’t be done in our budget” to “ships next quarter.” Picking incorrectly burns money you didn’t need to spend, in either direction.
What’s next
This is the second-to-last lesson of the course. Module 10 has covered the fundamentals of deep learning (lessons 56-57), the transfer-learning + Hugging Face workflow (lesson 58), and the AI-vs-ML decision (this one). The next lesson is the capstone — a look back at what you’ve built across the 60 lessons, and a look forward at where to go next.
References: Anthropic Python SDK (https://docs.anthropic.com/en/api/client-sdks), OpenAI Python SDK (https://platform.openai.com/docs/libraries), sentence-transformers (https://www.sbert.net/), ChromaDB documentation (https://docs.trychroma.com/), BAAI BGE embedding models on Hugging Face, LlamaIndex documentation (https://docs.llamaindex.ai/). Retrieved 2026-05-01.