Glossary

A short index of the words I keep using across the lessons. Each entry is a one-paragraph explanation in plain language. If a term shows up in a lesson and the lesson does not stop to define it, it is probably here.

80 terms

Databases

ACID Databases: Atomicity, Consistency, Isolation, Durability. The four guarantees a classic relational transaction gives you: it either fully happens or fully doesn't, it leaves the database in a valid state, it doesn't see other half-finished work, and once it commits it survives a crash.
BASE Databases: Basically Available, Soft state, Eventually consistent. The opposite philosophy from ACID, common in distributed NoSQL stores: the system stays up under partitions, the state may drift between replicas for a while, and reads converge once the cluster catches up.
Columnar storage Databases: Storing each column of a table contiguously on disk instead of each row. Pays off massively for analytical queries that touch a few columns over millions of rows.
OLAP Databases: Online Analytical Processing. Systems built to chew through huge scans, aggregations, and joins for reports and dashboards. Usually columnar, often append-only, and slow on single-row updates by design.
OLTP Databases: Online Transaction Processing. Systems that handle many small, latency-sensitive operations: read your account balance, place an order, update a row. Typically row-oriented and tuned for concurrency.
Polyglot persistence Databases: Using more than one type of database in the same product because no single store fits every workload: maybe a relational DB for orders, a key-value cache for sessions, and a search engine for full-text queries.
Replication Databases: Keeping more than one copy of the same data across different machines. Buys you durability, read scaling, and failover; costs you the headache of keeping replicas in sync.
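
To make the "fully happens or fully doesn't" part of the ACID entry concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer function are hypothetical, made up only for illustration; the one thing it relies on is that a sqlite3 connection used as a context manager commits on success and rolls back if the block raises.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between two hypothetical accounts, all or nothing."""
    # `with conn:` commits if the block finishes and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit above: no half-finished transfer.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
)
conn.commit()

transfer(conn, "alice", "bob", 60)      # succeeds and commits
try:
    transfer(conn, "alice", "bob", 60)  # would overdraw, so it rolls back
except ValueError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed second transfer the balances are exactly what the first transfer left behind, which is the atomicity guarantee in miniature.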

Distributed systems

Bulkhead Distributed systems: Isolating resources (thread pools, connections, queues) per dependency so a failure in one doesn't drown all the others. Named after ship compartments: a leak floods one, not the boat.
CAP theorem Distributed systems: Brewer's result that a distributed shared-data system can guarantee at most two of consistency, availability, and partition tolerance at the same time. In practice partitions happen, so the real choice is between consistency and availability while one is in progress.
Circuit breaker Distributed systems: A wrapper that watches a downstream call and starts failing fast once it has been unhealthy for a while, instead of piling up doomed retries. Lets the system shed load and recover instead of cascading.
Consensus Distributed systems: The problem of getting a group of unreliable nodes to agree on a single value or order of events. Algorithms like Paxos and Raft solve it; they sit at the heart of most systems that promise strong consistency.
Event-driven architecture Distributed systems: A design where services communicate by emitting and reacting to events on a shared log. Loosely coupled and easy to evolve, but you trade clear request-response flows for choreography you have to reconstruct.
Hot key Distributed systems: A single key that receives a disproportionate share of traffic in a partitioned system. Hot keys defeat the whole point of partitioning by sending one shard's machine to its knees while everyone else idles.
Idempotency Distributed systems: A property where running the same operation many times produces the same result as running it once. Critical in pipelines and APIs because retries, replays, and double-deliveries are the rule, not the exception.
Idempotent Distributed systems: An operation whose effect is the same whether it runs once or many times. The single most useful property to design for in a world where retries are inevitable.
Microservices Distributed systems: An architectural style where a system is split into many small, independently deployed services that talk over the network. Buys team autonomy and isolation; pays the cost in operational and design overhead.
PACELC Distributed systems: An extension of CAP: if there is a Partition you choose between Availability and Consistency, Else (in normal operation) you choose between Latency and Consistency. PACELC describes the trade-off your database makes both during failures and on a normal Tuesday.
Partitioning Distributed systems: Dividing data into chunks that can be stored, queried, or processed independently. In analytical systems this often means partitioning by date or tenant; in streaming it means assigning events to a parallelism unit.
Sharding Distributed systems: Splitting one logical dataset across multiple physical machines so each holds only a slice. Lets a system grow past what one box can hold, at the cost of cross-shard queries and rebalancing pain.
Two-phase commit Distributed systems: A protocol for committing a transaction across several systems: first the coordinator asks every participant to prepare and collects their votes, then it tells them all to commit or abort. Correct in theory, painful in practice when the coordinator dies mid-flight and leaves prepared participants blocked.
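
The Circuit breaker entry above is easier to picture with a few lines of code. This is a minimal sketch, not a production implementation: the class name, the max_failures and reset_after parameters, and the injectable clock are all my own assumptions. It counts consecutive failures, fails fast while the circuit is open, and lets one trial call through once the cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a bad trial
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each downstream call as `breaker.call(fetch_profile, user_id)`, where fetch_profile stands in for whatever function talks to the dependency, and treat CircuitOpenError as a signal to serve a fallback.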

Streaming

At-least-once Streaming: A delivery guarantee where every event reaches the consumer at least once, but possibly more. Cheaper to build than exactly-once and totally fine if your downstream is idempotent.
CDC Streaming: Change Data Capture. A pattern for streaming every insert, update, and delete out of a source database as a sequence of events that downstream systems can apply. The cleanest way to keep data warehouses, search indexes, and caches in step with an OLTP store.
Event time Streaming: The time at which an event actually happened in the real world, carried inside the event payload. Almost always what you want to compute on, even though it forces you to deal with out-of-order arrivals.
Exactly-once Streaming: A delivery guarantee where every event is processed once and only once end-to-end. Real systems achieve it through a combination of idempotent writes, transactional sinks, and offset management rather than by literally delivering each message exactly once.
Kappa architecture Streaming: Lambda's cleaner sibling: one streaming pipeline does everything, and a 'replay from the log' job covers the batch role when you need to recompute history. Simpler, but only if your stream processor is good enough.
Lambda architecture Streaming: A pattern with two parallel pipelines: a slow batch layer that recomputes the truth from scratch and a fast streaming layer that serves fresh-but-approximate results. The cost is maintaining the same logic twice.
Processing time Streaming: The wall-clock time at which a stream processor sees the event. Easier to reason about than event time but produces results that depend on system speed and how late inputs were.
Replay Streaming: Re-reading events from the durable log and re-processing them, usually because logic changed or downstream went wrong. Replay is only safe when your processing is idempotent and side effects are bounded.
Watermark Streaming: A timestamp that marks 'we don't expect events older than this anymore' in a stream. Watermarks let a stream processor close windows and emit results without waiting forever for laggards.
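
As a rough illustration of the Watermark and Event time entries, the sketch below counts events in fixed event-time windows and closes a window once the watermark passes its end. The heuristic used here (largest event time seen minus an allowed lateness) is one common choice, not the only one, and the function name and parameters are assumptions of mine.

```python
from collections import defaultdict


def tumbling_counts(events, window=10, allowed_lateness=5):
    """Count events per event-time window, closing windows via a watermark.

    `events` holds integer event-time timestamps in arrival order.
    Returns (results, dropped): the emitted (window_start, count) pairs and
    the timestamps that arrived after their window had already closed.
    """
    open_windows = defaultdict(int)
    results, dropped = [], []
    watermark = float("-inf")
    for ts in events:
        # Heuristic watermark: newest event time seen, minus allowed lateness.
        watermark = max(watermark, ts - allowed_lateness)
        start = ts - ts % window
        if start + window <= watermark:
            dropped.append(ts)  # too late: its window was already emitted
        else:
            open_windows[start] += 1
        # Emit every window whose end is at or behind the watermark.
        for s in sorted(w for w in open_windows if w + window <= watermark):
            results.append((s, open_windows.pop(s)))
    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        results.append((s, open_windows.pop(s)))
    return results, dropped


results, dropped = tumbling_counts([1, 4, 12, 8, 17, 3, 25])
```

In this run the event at time 8 arrives out of order but still lands in its window, while the event at time 3 shows up after the watermark has moved past window 0 and is dropped.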

Batch processing

Avro Batch processing: A row-oriented binary format with a schema embedded in every file. Designed for write-heavy, schema-evolving streaming workloads where you read whole records, not individual columns.
Backfill Batch processing: Re-running a pipeline over a historical date range to fix a bug, recover from an outage, or add a new column. Should be safe to do at any time, which is the whole point of designing pipelines to be idempotent.
Broadcast join Batch processing: A join where the small side is shipped in full to every executor and the big side stays put. Avoids the shuffle entirely and turns a slow join into a fast local lookup, when one side fits in memory.
Bucketing Batch processing: Pre-hashing a table on disk into a fixed number of files based on a key. Two tables bucketed on the same key into the same number of buckets can be joined without a shuffle, and a bucketed table can be filtered on that key without scanning every file.
ELT Batch processing: Extract, Load, Transform. The modern variant: dump raw data into a powerful warehouse first, then transform it in place using SQL. Possible because warehouses got cheap and fast enough to be the compute engine too.
ETL Batch processing: Extract, Transform, Load. The classic shape of a data pipeline: pull data out of a source, reshape it on a separate compute layer, then push the result into a destination ready for consumption.
Lazy evaluation Batch processing: An execution model where transformations are recorded as a plan and only run when an action forces them. Spark uses it to combine many steps into one optimized job instead of materializing every intermediate.
Medallion architecture Batch processing: A three-tier layout for a data lake: bronze for raw landed data, silver for cleaned and conformed, gold for business-ready aggregates. Useful as a default scaffold; not a religion.
ORC Batch processing: Optimized Row Columnar. A columnar storage format from the Hadoop world, similar in spirit to Parquet but with stronger built-in stats and predicate pushdown. Most common in older Hive-era data lakes.
Parquet Batch processing: A columnar storage format for analytics: data is laid out one column at a time so a query can read just the columns it needs. The default file format for nearly every modern lakehouse.
Shuffle Batch processing: The expensive step where a distributed job rearranges data across the network so rows that need to be processed together end up on the same machine. Joins, group-bys, and repartitions all trigger one.
Skew Batch processing: When one partition holds far more rows than the others, so a single executor finishes long after the rest. The classic cause of a Spark job that says 99% complete and stays there for an hour.
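
The Broadcast join entry above boils down to "build a hash map from the small side, then look it up locally in every partition of the big side". Here is a single-process sketch of that idea in plain Python; the function name and the sample rows are hypothetical, and a real engine would ship a copy of the map to each executor instead of sharing one in memory.

```python
def broadcast_join(big_partitions, small_rows, key):
    """Inner-join each partition of the big side against a small lookup table."""
    # "Broadcast" step: one hash map built from the small side. On a cluster,
    # every executor would receive its own copy of this map.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for partition in big_partitions:
        # Each partition is joined where it lives: no rows move between
        # partitions, which is what avoids the shuffle.
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


countries = [
    {"code": "IT", "country": "Italy"},
    {"code": "FR", "country": "France"},
]
orders = [
    [{"order_id": 1, "code": "IT"}, {"order_id": 2, "code": "DE"}],
    [{"order_id": 3, "code": "FR"}],
]
joined = broadcast_join(orders, countries, "code")
```

The order with code "DE" has no match on the small side and simply disappears, as you would expect from an inner join.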

Infrastructure and cloud

Blue-green deploy Infrastructure and cloud: A deployment pattern with two identical environments, blue and green. Traffic goes to one while the other gets the new version; flipping the load balancer is the deploy, and reverting is just flipping back.
Caching Infrastructure and cloud: Keeping a copy of expensive-to-compute data closer to the consumer so the next read is fast. Easy to add, hard to invalidate; the famous quote about cache invalidation being one of the two hardest problems in CS is not a joke.
Canary deploy Infrastructure and cloud: Rolling out a new version to a small slice of traffic first and watching its metrics. If the canary survives, you widen; if it dies, you roll back before most users notice.
Lakehouse Infrastructure and cloud: A storage layer that puts a transactional table format like Delta, Iceberg, or Hudi on top of cheap object storage, so you get warehouse-style ACID and time travel without a warehouse-style price tag.
Multi-region Infrastructure and cloud: Running the same system in two or more geographic regions for resilience or latency. Either active-passive (one region serves, the other waits) or active-active (both serve simultaneously, with all the consistency complexity that implies).
Rate limit Infrastructure and cloud: A cap on how many requests a client can send per unit of time. Protects you from abuse and from a buggy client that suddenly decides to retry forever.
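
One common way to implement the Rate limit entry above is a token bucket. The sketch below is a minimal single-process version under my own assumptions (class name, rate and capacity parameters, injectable clock); a real service would usually keep this state per client and often in a shared store.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so tests can fake time
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `TokenBucket(rate=1, capacity=2)` a client can burst two requests at once and then settles at one request per second.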

Observability and reliability

Dashboard Observability and reliability: A curated view of metrics for a service: a few SLIs, error rates, request volumes, key dependencies. Good dashboards show health; bad dashboards show every metric you ever exposed.
Error budget Observability and reliability: The complement of an SLO: if you target 99.9% you have 0.1% to spend on incidents and risky launches. When the budget is gone, you stop shipping features and pay down reliability debt.
On-call Observability and reliability: A rotation where engineers take responsibility for responding to production alerts during a defined window. The rule of thumb: if you wrote it, you carry the pager when it breaks.
Postmortem Observability and reliability: A blameless write-up after an incident: timeline, root cause, contributing factors, action items. The point is to make the system safer, not to find someone to yell at.
Runbook Observability and reliability: A short, opinionated document that tells the on-call engineer exactly what to do when a specific alert fires. The more boring and copy-paste it reads at 3 a.m., the better.
SLA Observability and reliability: Service Level Agreement. The contractual promise to a customer about reliability, usually with money attached if you miss it. Normally looser than the SLO you target internally, so you have room to miss the internal target before breaching the contract.
SLI Observability and reliability: Service Level Indicator. The actual measurement that feeds your SLO: a request latency, a success rate, a data freshness lag. The thing you put on a dashboard and alert on.
SLO Observability and reliability: Service Level Objective. The internal target for a service's reliability, like '99.9% of read requests in under 200 ms over 30 days'. The lever you actually steer engineering work with.
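
The arithmetic behind the Error budget and SLO entries is short enough to write down. This is a sketch under simple assumptions: the function names are mine, the first helper treats the SLO as time-based availability, and the second treats it as a share of good requests, where a "bad" request is one that failed or missed the latency threshold.

```python
def allowed_bad_minutes(slo, window_days=30):
    """Minutes of unavailability an SLO tolerates over the window."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes


def budget_remaining(slo, total_requests, bad_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1 - slo) * total_requests
    return 1 - bad_requests / allowed_bad


minutes = allowed_bad_minutes(0.999, window_days=30)
remaining = budget_remaining(0.999, total_requests=1_000_000, bad_requests=250)
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget, and 250 bad requests out of a million uses up about a quarter of it.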

Finance and investing

Asset allocation Finance and investing: How you split a portfolio between major buckets like stocks, bonds, and cash. Decades of evidence say this single decision drives most of your long-run risk and return, far more than which specific funds you pick.
Capital gains tax Finance and investing: Tax on the profit when you sell an asset for more than you paid. In Italy most financial gains are taxed at 26%, with a softer 12.5% rate for government bonds and a few special cases.
Compound interest Finance and investing: Interest earned not just on your original principal but also on the interest already accumulated. Over decades it is the difference between linear and exponential growth, and it is the entire engine of long-term investing.
Diversification Finance and investing: Spreading money across many assets so no single one can wreck you. Diversification reduces risk without lowering expected return; it is famously called the only free lunch in finance.
Dollar-cost averaging Finance and investing: Investing a fixed amount on a regular schedule no matter what the market is doing. Smooths your average entry price and, more importantly, removes the urge to time the market. In Italy this is what a PAC ('Piano di Accumulo Capitale') is.
Emergency fund Finance and investing: Liquid savings, usually three to six months of expenses, kept in cash or near-cash for unexpected shocks. Boring on purpose: it exists so a broken boiler or a lost job doesn't force you to sell investments at the worst possible moment.
ETF Finance and investing: Exchange-Traded Fund. A fund that holds a basket of assets (often a whole index) and trades on the stock exchange like a single share. Cheap, transparent, and the default building block of a sane portfolio.
Expense ratio Finance and investing: Another name for TER on US-domiciled funds: the percentage of fund assets eaten by management costs each year. Lower is almost always better.
FIRE Finance and investing: Financial Independence, Retire Early. A movement and a math problem: save aggressively, invest in low-cost index funds, and stop when your portfolio can fund your spending forever. Different tiers (lean, regular, fat) describe the lifestyle you target.
Fondo pensione Finance and investing: An Italian supplementary pension vehicle. Contributions are tax-deductible up to a yearly cap, growth is taxed at a softer rate than ordinary investments, and payouts can be a mix of lump sum and annuity.
Inflation Finance and investing: The general rise in prices over time, which means the same euro buys less next year than this year. A small inflation rate is normal and expected; the danger is to cash that sits still while prices keep moving.
Lump sum Finance and investing: Investing a chunk of money all at once instead of spreading it out. Usually wins on average versus DCA because markets rise more often than they fall, but feels worse when the timing is unlucky.
Nominal return Finance and investing: The headline percentage growth you see on your statement, before adjusting for inflation. Useful for comparing products in the same currency and period; misleading on its own.
Opportunity cost Finance and investing: The value of the next-best thing you give up by making a choice. Cash sitting in a current account doesn't 'cost' anything visible, but the return it could have earned in an ETF is its opportunity cost.
Real return Finance and investing: The growth of your money after stripping out inflation. The only return that matters for long-term planning: 7% nominal in a 5% inflation year is only about 1.9% in real terms.
Rebalancing Finance and investing: Selling a bit of what has grown and buying a bit of what has lagged so your portfolio stays close to its target allocation. A boring habit that quietly imposes 'sell high, buy low' discipline.
Safe Withdrawal Rate Finance and investing: The percentage of a portfolio you can withdraw each year, adjusted for inflation, without running out over a long retirement. The famous '4% rule' comes from US data; for European investors closer to 3% to 3.5% is more honest.
Savings rate Finance and investing: The fraction of your net income that you don't spend. The single biggest lever in personal finance: doubling it can roughly halve your time to financial independence, no matter what your portfolio returns.
TER Finance and investing: Total Expense Ratio. The annual fee an ETF or fund charges, expressed as a percentage of assets. On the same gross return, a 1% TER versus a 0.2% TER can compound into a six-figure difference over a 30-year career of regular contributions.
TFR Finance and investing: Trattamento di Fine Rapporto. A piece of an Italian employee's salary that the employer holds back and pays out at the end of the contract. You can leave it with the company or move it to a fondo pensione, with very different tax outcomes over decades.
Volatility Finance and investing: How much an asset's price swings around, usually measured as the standard deviation of returns. Volatility is risk for short-term holders and noise for long-term ones; the trick is knowing which you are.
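
To put rough numbers on the TER and Compound interest entries, here is a small sketch. Every input is an assumption chosen for illustration, not a forecast: 12,000 invested at the end of each year, a 6% gross annual return, and the TER approximated as a straight deduction from that return.

```python
def future_value(yearly_contribution, gross_return, ter, years):
    """Portfolio value after `years` of end-of-year contributions.

    Assumption: the TER is modelled as a flat deduction from the gross
    annual return, which is a simplification of how fees are charged.
    """
    net_return = gross_return - ter
    value = 0.0
    for _ in range(years):
        # Last year's balance grows at the net rate, then the new money lands.
        value = value * (1 + net_return) + yearly_contribution
    return value


cheap = future_value(12_000, gross_return=0.06, ter=0.002, years=30)
pricey = future_value(12_000, gross_return=0.06, ter=0.01, years=30)
gap = cheap - pricey
print(f"0.2% TER: {cheap:,.0f}  1% TER: {pricey:,.0f}  gap: {gap:,.0f}")
```

Under these inputs the gap lands in six figures; with smaller contributions or a shorter horizon it shrinks accordingly, so treat the result as sensitive to what you plug in.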

Machine learning

Cross-validation Machine learning: Splitting the training data into folds, training on some and testing on the rest, then rotating. Gives a more honest estimate of how the model will perform than a single train/test split.
Feature engineering Machine learning: Turning raw data into the inputs a model can actually use: ratios, dates broken into pieces, categories encoded as numbers. Often the part that decides whether a model is good, no matter how fancy the algorithm is.
Overfitting Machine learning: When a model memorizes the training data so well that it fails on anything new. Like a student who can recite every past exam but flunks a fresh question.
Train/test split Machine learning: Holding back a portion of the data and never letting the model see it during training, so you can later check how it performs on truly unseen examples. The minimum viable defense against fooling yourself.
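
The fold rotation described under Cross-validation can be sketched without any library. The helper names below are hypothetical, and the folds are contiguous slices for readability; in practice you would usually shuffle the data first or use a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(xs, ys, fit, score, k=5):
    """Average score over k folds; `fit` and `score` are caller-supplied."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(
            score(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(scores) / len(scores)
```

Each example ends up in the test fold exactly once, so every data point gets a turn at being "unseen", which is why the averaged score is a more honest estimate than one split.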