Data & System Architecture, from the ground up Lesson 79 / 80

ML platform architecture: feature store, model registry, serving

The five layers that modern ML platforms have standardised on, the train-serve skew problem the feature store was invented to solve, and the build-versus-buy calculus for each layer in 2026.

The previous lessons in Module 10 walked through the cross-cutting concerns that any mature platform has to handle: multi-region deployment, high availability, compliance. This lesson takes the same lens to the ML platform. Five years ago a “machine learning platform” meant whatever shape the largest team in the company had bolted together out of Python scripts, S3 buckets, and a Jenkins job that kicked off a training run. By 2026 the architecture has standardised. The vocabulary is shared, the layers are recognisable across companies, and the build-versus-buy question for each layer has a defensible answer for most teams.

This lesson is the architectural tour of those layers. Module 9 of the Python course covers the modelling work itself, and Module 10 of that course covers the deep learning specifics. The architectural concern is different. It is not “which model to train” but “what is the substrate that lets a team train, deploy, monitor, and retrain models without each step being a one-off heroic effort”. With that substrate, models become a normal kind of software artefact, with a predictable pipeline from idea to production.

The five layers

flowchart LR
    DW[(Data warehouse / lakehouse)] --> FS_off[Feature store: offline]
    FS_off --> Train[Training pipeline]
    Train --> Track[Experiment tracking]
    Track --> Reg[Model registry]
    Reg --> Serve[Serving: real-time / batch / streaming]
    FS_on[Feature store: online] --> Serve
    DW --> FS_on
    Serve --> Mon[Monitoring: drift, decay, latency]
    Mon -.retraining trigger.-> Train

The flow reads left to right. The data warehouse or lakehouse from Module 5 is the upstream source of truth. Features are computed once and stored in two faces of the feature store: an offline face for training and a low-latency online face for inference. Training pipelines pull offline features, run experiments tracked in a system that captures code, data, parameters, and metrics, and produce model artefacts that are registered with version metadata. Serving infrastructure consumes the registry plus the online feature store and exposes predictions to applications. Monitoring closes the loop, watching for input drift and output decay, and feeding signals back into retraining.

Each box is a layer with a vocabulary, a set of standard tools, and a build-versus-buy calculus.

Feature store

The feature store is the architectural innovation that made the rest of the stack possible to reason about. Its job is to centralise the computation, storage, and serving of features so that the same feature is computed the same way in training and in production.

The bug it was built to solve has a name: train-serve skew. The data science team computes a feature in a notebook against the warehouse, with a particular SQL definition, a particular handling of nulls, a particular timezone for the date arithmetic. The model is trained, evaluated, and approved. The engineering team then re-implements the same feature in a production service, in a different language, with subtly different null handling. The model in production sees inputs that look almost but not quite like the training data, and the predictions degrade in ways that are hard to detect because the model is still returning numbers that look reasonable. The architectural fix is to move the feature computation into a single layer that both training and serving consume.
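A minimal sketch of that fix, assuming a hypothetical `days_since_signup` feature: the null handling and the timezone choice are decided once, in one function, and both the training path and the serving path call it. The function names, the `-1` sentinel, and the record shape are illustrative assumptions, not the API of any particular feature store.

```python
from datetime import datetime, timezone
from typing import Optional


def days_since_signup(signup_ts: Optional[datetime], as_of: datetime) -> int:
    """Shared feature definition: null handling and timezone are decided once."""
    if signup_ts is None:
        return -1  # illustrative sentinel for "unknown"
    # Normalise both timestamps to UTC before doing date arithmetic.
    delta = as_of.astimezone(timezone.utc) - signup_ts.astimezone(timezone.utc)
    return delta.days


def build_training_row(record: dict, as_of: datetime) -> dict:
    # Training path: called in batch over historical records.
    return {"days_since_signup": days_since_signup(record.get("signup_ts"), as_of)}


def build_serving_row(request: dict, as_of: datetime) -> dict:
    # Serving path: called per request, but reuses the same definition.
    return {"days_since_signup": days_since_signup(request.get("signup_ts"), as_of)}
```

Because both paths delegate to the same function, a change to the null handling cannot land in one path and be forgotten in the other.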

A feature store typically has two faces:

  • Offline store. The historical record of feature values. Backed by a warehouse or lakehouse table, optimised for batch reads at training time. The training pipeline asks for “feature X for these entity ids at these timestamps” and gets a dataframe back, with point-in-time correctness preserved (the value of the feature as it was known at the requested timestamp, not as it is known now).
  • Online store. The latest value of each feature, indexed by entity id, optimised for low-latency lookups at inference time. Backed by a key-value store (lesson 18) like Redis, DynamoDB, or a purpose-built service.
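The difference between the two faces can be sketched with plain Python, assuming a hypothetical in-memory history of feature values. `value_as_of` is the offline-style point-in-time lookup (the value as it was known at the requested timestamp), and `online_value` is the online-style lookup (the latest value only). The data and names are made up for illustration.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: entity id -> list of (valid_from, value), sorted by time.
history = {
    "user_1": [
        (datetime(2026, 1, 1), 3),
        (datetime(2026, 1, 10), 7),
        (datetime(2026, 2, 1), 12),
    ],
}


def value_as_of(entity_id, ts):
    """Offline-style lookup: latest value recorded at or before ts, else None."""
    rows = history.get(entity_id, [])
    times = [t for t, _ in rows]
    idx = bisect_right(times, ts)  # number of rows with valid_from <= ts
    return rows[idx - 1][1] if idx > 0 else None


def online_value(entity_id):
    """Online-style lookup: only the most recent value is kept and served."""
    rows = history.get(entity_id, [])
    return rows[-1][1] if rows else None
```

A training example labelled on 15 January should be built with `value_as_of("user_1", datetime(2026, 1, 15))`, not with the latest value; using the latest value would leak information from after the label was observed.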

The feature definition is the contract between the two. A single feature definition produces both the historical batch fill into the offline store and the streaming or scheduled refresh of the online store, so the training pipeline and the serving pipeline are guaranteed to see the same logic.

The major options as of 2026: Feast (the dominant open-source feature store, lightweight and framework-agnostic), Tecton (managed, from the team that built Uber’s Michelangelo, with stream feature engineering as a first-class concern), Databricks Feature Store (bundled with Databricks and integrated with Unity Catalog and MLflow), and AWS SageMaker Feature Store (the AWS-native option). The build-versus-buy math matches lesson 71. A team with two data scientists and one ML engineer should not build a feature store; they should adopt Feast or whatever their cloud vendor offers. A team running thousands of models with stream features and low-latency requirements may genuinely need to build, or buy a heavier option like Tecton.

Experiment tracking

Every training run produces an experiment: a combination of code at a particular commit, data at a particular snapshot, hyperparameters, training metrics, evaluation metrics, and the resulting model artefact. Experiment tracking is the layer that captures all of that, indexes it, and makes it searchable.
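As a rough sketch of what such a record contains, the following uses only the standard library rather than a real tracker. The `RunRecord` fields mirror the list above; the class, the `find_runs` helper, and the sample values are hypothetical and are not the schema of MLflow or any other tool.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class RunRecord:
    code_commit: str     # commit the training code ran at
    data_snapshot: str   # identifier of the training data snapshot
    params: dict         # hyperparameters
    metrics: dict        # training and evaluation metrics
    artifact_uri: str    # where the model artefact was written

    @property
    def run_id(self) -> str:
        # Derive the id from the inputs, so identical inputs map to the same id.
        payload = json.dumps(
            {"commit": self.code_commit, "data": self.data_snapshot, "params": self.params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def find_runs(runs, metric, at_least):
    """Return runs whose recorded metric meets the threshold, best first."""
    hits = [r for r in runs if r.metrics.get(metric, float("-inf")) >= at_least]
    return sorted(hits, key=lambda r: r.metrics[metric], reverse=True)


runs = [
    RunRecord("a1b2c3d", "snapshot-2026-01", {"lr": 0.1}, {"auc": 0.81}, "models/run-1"),
    RunRecord("a1b2c3d", "snapshot-2026-01", {"lr": 0.01}, {"auc": 0.86}, "models/run-2"),
]
best = find_runs(runs, "auc", at_least=0.85)
```

The point of the sketch is the shape of the record: every field needed to reproduce the run is captured next to the metrics, so the search result links straight back to code, data, and parameters.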

The reason it is a platform concern and not a notebook concern is reproducibility at the team level. Without tracking, “the model we shipped last quarter” is a binary on someone’s laptop and a vague memory of how it was trained. With tracking, the same model is a row in a database with links to the exact code, data, and metrics, and reproducing it means re-running a recorded configuration rather than reconstructing one.

MLflow’s tracking component has become the open-source standard. Weights & Biases is the popular commercial alternative, with stronger collaboration features. Neptune and Comet occupy similar niches. Cloud-vendor offerings (SageMaker Experiments, Vertex AI Experiments) cover the basics for teams already inside one cloud. The tracker has to scale to the run volume the team produces and has to integrate with the feature store and registry on either side; most teams underinvest here until the day a tracker outage blocks the entire ML pipeline.

Model registry

The registry is the source of truth for what models exist, what versions of each, what their lineage is, and which version is currently deployed where. A model in the registry is a tuple: name, version, training run reference, signature, metrics, stage (development, staging, production, archived), and metadata about the training data and code.

MLflow Model Registry is again the de facto open-source standard, and the cloud vendors all have their own. The architectural value is the same as for any source of truth: every other layer of the platform refers to the registry rather than a path on disk. Serving asks the registry for “the production version of model X”. Monitoring asks for “the metrics that were promised at training time, so I can compare them to what we are seeing now”. The registry is also where governance lives: a model promoted to production has to have passed checks on training data lineage, evaluation thresholds, model-card completeness, and deployment plans, and the registry is where those checks are enforced and recorded.
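The registry's two roles, lookup by stage and enforcement of promotion checks, can be sketched in a few lines. This is an in-memory illustration under stated assumptions: the `Registry` class, the check names, and the rule that promoting a new version archives the previous production version are all hypothetical choices, not the behaviour of MLflow Model Registry.

```python
from dataclasses import dataclass, field

STAGES = ("development", "staging", "production", "archived")
# Illustrative governance checks, named after the ones listed in the text.
REQUIRED_FOR_PRODUCTION = {"data_lineage", "evaluation_threshold", "model_card", "deployment_plan"}


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str
    metrics: dict
    stage: str = "development"
    checks_passed: set = field(default_factory=set)


class Registry:
    def __init__(self):
        self._versions = {}  # (name, version) -> ModelVersion

    def register(self, mv: ModelVersion) -> None:
        self._versions[(mv.name, mv.version)] = mv

    def promote(self, name: str, version: int, stage: str) -> None:
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        mv = self._versions[(name, version)]
        if stage == "production":
            missing = REQUIRED_FOR_PRODUCTION - mv.checks_passed
            if missing:
                raise PermissionError(f"missing checks: {sorted(missing)}")
            # Assumed policy: at most one production version per model name.
            for other in self._versions.values():
                if other.name == name and other.stage == "production":
                    other.stage = "archived"
        mv.stage = stage

    def production_version(self, name: str):
        """What serving asks for: the production version of a named model."""
        for mv in self._versions.values():
            if mv.name == name and mv.stage == "production":
                return mv
        return None
```

Serving code would call `production_version("churn")` rather than hard-coding an artefact path, which is what lets a promotion in the registry change what is deployed.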

Serving

Serving is where the model meets the application. The architectural choices fall into three patterns, each with its own infrastructure shape.

  • Real-time inference. The model is wrapped in a service that exposes a REST or gRPC endpoint. An application sends a feature vector or a request that the service enriches with online-store features, and the service returns a prediction within tens of milliseconds. This is the pattern for personalisation, fraud detection, search ranking, and any user-facing surface.
  • Batch inference. The model runs on a schedule against a large input dataset and writes the predictions to a warehouse table or object store. This is the pattern for scoring all customers nightly, generating recommendations for an email campaign, or any use case where the latency budget is hours rather than milliseconds.
  • Streaming inference. The model consumes events from a Kafka topic (lesson 42), enriches them with online features, makes a prediction, and writes the prediction back to another topic. This is the pattern for use cases where decisions need to be made as events arrive but the consumer is itself a downstream system rather than a user.
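The three patterns differ in how requests arrive, not in the prediction logic. A minimal sketch, assuming a hypothetical in-memory online store and a stand-in linear scoring function in place of a real model: one `predict` function enriches a request with online features, and thin wrappers adapt it to batch and streaming inputs. A real service would put `predict` behind a REST or gRPC endpoint and a real stream consumer in front of `stream_predict`.

```python
# Hypothetical online store: entity id -> latest feature values.
ONLINE_STORE = {"user_1": {"orders_30d": 4, "avg_basket": 31.5}}
# Assumed fallback values for entities the online store has not seen.
DEFAULTS = {"orders_30d": 0, "avg_basket": 0.0}


def toy_model(features: dict) -> float:
    # Stand-in for a trained model: a fixed linear score.
    return (
        0.1 * features["orders_30d"]
        + 0.01 * features["avg_basket"]
        + 0.5 * features["is_mobile"]
    )


def predict(request: dict) -> dict:
    """Real-time path: enrich the request with online features, then score."""
    stored = ONLINE_STORE.get(request["user_id"], {})
    features = {
        **DEFAULTS,
        **stored,
        "is_mobile": 1 if request.get("device") == "mobile" else 0,
    }
    return {"user_id": request["user_id"], "score": toy_model(features)}


def batch_predict(requests: list) -> list:
    """Batch path: score a whole dataset in one scheduled job."""
    return [predict(r) for r in requests]


def stream_predict(events):
    """Streaming path: score events one at a time as they arrive."""
    for event in events:
        yield predict(event)
```

Keeping the scoring logic in one function and varying only the wrapper is one way to stop the three serving paths from drifting apart.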

Tools span a wide range. SageMaker, Vertex AI, and Azure ML are the cloud-managed options. BentoML and KServe are open-source serving frameworks that fit into Kubernetes (lesson 55). Seldon and Triton occupy similar niches, with Triton particularly strong for GPU inference. The non-obvious architectural point: the serving layer is application infrastructure, not data infrastructure. Its uptime, latency, and capacity requirements line up with the application it serves, not with the warehouse it was trained against, and the boundary between data-team and application-team ownership has to reflect that.

Monitoring

Monitoring closes the loop. There are three distinct concerns.

Input drift. The distribution of inputs to the model changes over time. The customers signing up today are not the customers from when the model was trained. The features have shifted, and the model’s predictions on the shifted distribution may be worse than its predictions were on the training distribution. Drift detection runs statistical tests on the input features against the training distribution and alerts when the divergence crosses a threshold.
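One statistic often used for this kind of check is the population stability index (PSI), which compares the share of values falling into each bucket in the training sample against the share in a live sample. The sketch below is one possible implementation for a single numeric feature; the bucket edges, the `eps` floor for empty buckets, and the 0.2 alert threshold are assumptions chosen for illustration, and a real deployment would tune them per feature.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples of one numeric feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bucket = sum(v > e for e in edges)  # number of edges strictly below v
            counts[bucket] += 1
        # Floor each share at eps so an empty bucket does not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training = list(range(100))        # stand-in for the training distribution
live = [v + 30 for v in training]  # the same shape, shifted upwards
edges = [25, 50, 75]

DRIFT_THRESHOLD = 0.2              # illustrative alert threshold
drifted = psi(training, live, edges) > DRIFT_THRESHOLD
```

Identical samples give a PSI of zero, and the score grows as the bucket shares diverge, which is what makes it usable as a threshold-based alert.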

Output decay. The model’s predictions are still in the right shape, but the quality has degraded. This requires ground truth, which often arrives late (the prediction was about whether a user would churn; the ground truth is whether they actually churned, observable thirty days later). Output decay monitoring compares predictions against ground truth on a rolling window and alerts when accuracy, precision, recall, or another business-relevant metric drops.
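The late-arriving ground truth is the part that shapes the implementation: predictions have to be held until their outcome is known, and only then scored. A minimal sketch, assuming a hypothetical `DecayMonitor` class, accuracy as the metric, and an illustrative tolerance of five percentage points below the accuracy measured at training time:

```python
from collections import deque


class DecayMonitor:
    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.pending = {}                     # prediction id -> predicted label
        self.outcomes = deque(maxlen=window)  # rolling window of hit / miss

    def record_prediction(self, pred_id, predicted):
        # Hold the prediction until its ground truth arrives.
        self.pending[pred_id] = predicted

    def record_ground_truth(self, pred_id, actual):
        if pred_id not in self.pending:
            return  # ground truth for a prediction we never saw; ignore it
        predicted = self.pending.pop(pred_id)
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None  # nothing has been scored yet
        return sum(self.outcomes) / len(self.outcomes)

    def decayed(self):
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance
```

In the churn example, `record_prediction` would be called at scoring time and `record_ground_truth` thirty days later, so the rolling window always lags the live traffic by the label delay.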

Performance. Latency, throughput, error rate. The same observability concerns the rest of the platform has (lesson 59), with the additional concern that an ML model’s latency tends to be more variable than a typical service’s, because the input distribution affects the inference path.

Tools dedicated to ML observability include Arize, WhyLabs, Evidently (open-source), and Fiddler. Some teams build the basics on top of their existing observability stack (Prometheus, Datadog, Grafana) and only adopt a dedicated tool when the volume justifies it. The architectural principle is the same as elsewhere: the monitoring layer has to feed back into the retraining pipeline, so that drift or decay automatically opens a ticket, triggers a retraining run, or, in the most automated setups, schedules a model refresh.
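The feedback step can be as small as a policy function that maps monitoring signals to an action. The sketch below is one hypothetical policy: the thresholds, the action names, and the rule that both signals together escalate from a ticket to a retraining run are assumptions for illustration, not a recommended standard.

```python
def retraining_action(drift_score, accuracy_drop,
                      drift_threshold=0.2, decay_threshold=0.05):
    """Map monitoring signals to an action for the retraining pipeline."""
    drift = drift_score >= drift_threshold
    decay = accuracy_drop >= decay_threshold
    if drift and decay:
        return "trigger_retraining"  # inputs moved and quality dropped
    if drift or decay:
        return "open_ticket"         # one signal alone: ask a human to look
    return "none"
```

An orchestrator would call this on each monitoring cycle and dispatch on the returned string, which keeps the escalation policy in one reviewable place.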

The build-versus-buy calculus, again

The pattern from lesson 71 repeats here, layer by layer. A small team running a handful of models should buy or adopt managed tools across the entire stack: a cloud vendor’s feature store, MLflow for tracking and registry, the cloud’s serving infrastructure, an open-source monitoring tool. Buying is the default; the maintenance burden of running the platform yourself is heavier than it looks, and the cost of vendor lock-in at small scale is small.

A team running hundreds of models with low-latency serving and stream features starts to see the calculus shift, and a team at the scale of Uber or Netflix has built the whole thing, because at their scale even the managed offerings are not built for them. Uber’s Michelangelo platform, which the company has written about publicly (lesson 48), and Netflix’s Metaflow ecosystem are the case-study artefacts. The lesson is not “build your own ML platform”. The lesson is that the layers are real, the vocabulary is shared, and the architectural shape is recognisable across the industry. A team that knows the layers can read a job posting, an engineering blog, or a vendor pitch and understand what is being described, which is the bar this course aims at.

Cross-references

The ML platform sits on top of nearly everything Modules 5 through 9 introduced. The data warehouse and lakehouse from lessons 33 to 40 are the upstream of the feature store. The streaming infrastructure from lessons 41 to 48 is what stream features ride on. The orchestration layer from lessons 57 and 58 schedules training and batch inference. The observability and SLO practices from lessons 59 and 60 apply directly to the serving and monitoring layers. The cost optimisation work from Module 9 is what keeps the GPU bill from doubling every quarter. The next lesson is the course capstone.

Citations and further reading

  • Feast project, https://feast.dev/ (retrieved 2026-05-01). The dominant open-source feature store.
  • MLflow project, https://mlflow.org/ (retrieved 2026-05-01). The de facto open-source standard for tracking and registry.
  • Tecton, https://www.tecton.ai/ (retrieved 2026-05-01). Managed feature store from the Michelangelo team.
  • Uber Engineering, “Meet Michelangelo: Uber’s Machine Learning Platform”, https://www.uber.com/blog/michelangelo-machine-learning-platform/ (retrieved 2026-05-01). The case study that put feature stores on the industry map.
  • Netflix Tech Blog, “Metaflow: Open-source framework for real-life ML”, https://netflixtechblog.com/ (retrieved 2026-05-01). The other major published ML platform.
  • Chip Huyen, “Designing Machine Learning Systems”, O’Reilly, 2022. The textbook reference for the architectural concerns this lesson covers.
  • Evidently AI documentation on ML monitoring, https://docs.evidentlyai.com/ (retrieved 2026-05-01). The open-source monitoring tool worth knowing even if a team uses something else.