For the last six lessons we have been categorising data stores by the shape of their data: tables, documents, key-value pairs, columns, graphs, time series. Each of those categories existed before the previous one became fashionable, and each of them solves a specific kind of query well. This lesson is about a category that did not really exist as a separate product line before 2022, and that the entire industry now treats as table stakes for any application built on top of a large language model. The category is the vector database, and its rise is one of the most abrupt shifts in data-infrastructure history.
The reason it appeared so suddenly is that the workload it serves did not exist at scale before. The workload is “given this query, find the most semantically similar documents in a corpus of millions or billions of items, in under a hundred milliseconds.” That query, phrased that way, is the workhorse of every retrieval-augmented generation pipeline, every modern semantic search box, every “find me products like this one” feature, and a long list of less-obvious use cases. The technology that makes the query cheap enough to be a building block is the topic of this lesson.
The data model: rows are points in space
A vector database stores rows where each row is, primarily, a high-dimensional array of floats. The arrays are typically between 384 and 3072 dimensions, depending on which embedding model produced them. A 384-dimensional vector is what all-MiniLM-L6-v2 from sentence-transformers gives you. A 1536-dimensional vector is what OpenAI’s older text-embedding-ada-002 produced. A 3072-dimensional vector is what text-embedding-3-large produces at full size. Anthropic’s Voyage embeddings, Cohere’s embed-v3, Google’s text-embedding-005: same idea, different dimensions, different training corpora.
Alongside the vector, each row carries metadata: an identifier, the original text or pointer to it, tags, timestamps, anything you want to filter on. The metadata is what lets you say “find me the K nearest vectors, but only among documents tagged ‘pricing’ and modified in the last thirty days.”
The query is not “give me the row with primary key X.” It is “give me the K rows whose vectors are closest to this query vector.” Closeness is measured by one of three distances:
- Cosine similarity: the angle between two vectors, ignoring magnitude. The most common choice for text embeddings, because the magnitude of an embedding does not carry meaning, only the direction does.
- Euclidean distance (L2): the straight-line distance between two points. Useful when magnitude matters, for instance when the embedding model was trained with L2 in mind.
- Dot product: cosine similarity multiplied by the magnitudes. Faster than cosine because there is no normalisation step. Some models are explicitly trained to be queried with dot product.
If you pick the wrong distance for your model, your results are quietly worse but not obviously broken, which is the worst kind of bug. Read the embedding model’s documentation, use the distance it recommends, and write a test that verifies the recommendation has not changed when you upgrade.
Why vectors became important
The reason this stopped being a research curiosity and became a product category is that text embeddings got good. Before about 2020, turning a paragraph into a fixed-size vector that captured its meaning was hard and the results were weak. The vectors clustered by topic, broadly, but you could not rely on “these two paragraphs are about the same thing” being a property of “their vectors are close.” Now you can. Sentence-transformers, OpenAI’s embedding line, Cohere, Voyage, the open-weights options like BGE and Nomic: any of them produces embeddings where semantic similarity in the text reliably maps to geometric closeness in the vector space.
Once that mapping is reliable, a long list of applications becomes a one-liner over a vector index:
- Retrieval-augmented generation (RAG): before asking the LLM “summarise our refund policy,” embed the question, fetch the top K most similar passages from your knowledge base, and feed them to the LLM as context. The LLM stops hallucinating because it is reading the actual policy. RAG is, by 2026, the default architecture for any chatbot or assistant grounded in private data.
- Semantic search: the search box on your documentation site stops requiring the user to guess your keywords. “How do I cancel my subscription” finds the page titled “Ending your plan” because the embeddings know they are about the same thing.
- Recommendations: “products like the one you just viewed” becomes a nearest-neighbor query over the product catalog’s embeddings. No more cold-start, no more painstaking taxonomy-based similarity rules.
- Image and audio similarity: the same trick works for non-text data. CLIP-style models embed images into the same kind of vector space. You can deduplicate a photo library, find lookalikes, do reverse image search, all with the same nearest-neighbor primitive.
- Anomaly detection: cluster the vectors of your normal events. Anything that lands far from any cluster is a candidate anomaly. Useful for fraud, intrusion detection, defective items on a manufacturing line.
The unifying observation is that “things that mean similar stuff” become “vectors that are close to each other,” and a vector database is the index that makes “find the close ones” cheap.
ANN: approximate nearest neighbor
Exact nearest-neighbor search in high dimensions is expensive. To find the single closest vector to a query in a corpus of one million 1536-dimensional vectors, you have to compute one million distances, each of which is a 1536-element dot product. That is doable on modern hardware but slow. Doing it at p99 latency under 50 milliseconds, for thousands of queries per second, on a corpus of a billion vectors, is not.
The trick that makes vector search practical is to give up on exact nearest-neighbor and accept approximate nearest-neighbor (ANN). You build an index that, for any query, returns the K nearest vectors with very high probability, but with a small chance of missing one of them and including a slightly-farther one instead. In exchange, the query becomes orders of magnitude faster.
The dominant ANN algorithm in 2026 is HNSW (Hierarchical Navigable Small World). The intuition: build a layered graph where the top layer connects far-apart points and each lower layer adds finer-grained connections. To search, you start at the top, greedily walk toward the query, descend a layer, repeat. The graph structure means each query touches a small number of vectors instead of all of them. HNSW is what Pinecone, Qdrant, Weaviate, and pgvector all use by default, in slightly different implementations.
Two other algorithms come up. IVF (Inverted File) clusters vectors with k-means, then at query time only searches the closest few clusters. Faster to build than HNSW, slightly less accurate at the same recall. Often combined with product quantization (PQ) to compress the vectors themselves. ScaNN, from Google, is a more recent design that combines partitioning, anisotropic quantization, and a careful re-ranking step. ScaNN is what BigQuery and Vertex AI use under the hood.
The practical implication is that you tune two knobs. The first is recall, the fraction of true nearest-neighbors that the ANN index actually returns. 95% recall is typical for production. 99%+ is achievable but slower. The second is latency, which is inversely correlated with recall. You pick a point on the curve that fits your application and you measure both.
The 2024-2026 vector-database landscape
The category has consolidated around a handful of options.
Pinecone. The first vector database to be a serious commercial product. Fully managed, hosted only. Pleasant API, strong on reliability. Expensive at scale, and the lock-in is real because there is no Pinecone you can run yourself. The right call when you want zero operational burden and your vector volume is small enough that the bill is acceptable. The wrong call when your vector count crosses tens of millions and the bill starts dominating your infrastructure costs.
Qdrant. Open-source, written in Rust. Self-host or use the managed Qdrant Cloud. Fast, well-documented, with a clean filtering API that combines vector search with metadata predicates efficiently. The open-source-with-a-cloud-option model is the dominant pattern in this generation of vector DBs, and Qdrant is one of the strongest implementations.
Weaviate. Open-source, GraphQL-based query interface, built-in support for hybrid search (vector similarity combined with traditional keyword search via BM25). For text-heavy search where users sometimes type questions and sometimes type exact terms, hybrid result quality is meaningfully better than pure-vector or pure-keyword. Weaviate also bundles vectoriser modules that let you skip the “embed your text yourself” step, which is convenient for prototyping and an unnecessary coupling for production.
Milvus. Open-source, distributed architecture designed from the start for billions of vectors. The complexity is correspondingly higher: more moving parts, more operational surface. The right call if you have a billion-vector workload and the team to run a distributed system. The wrong call for a startup that has not yet hit ten million vectors.
pgvector. A Postgres extension that adds a vector column type, distance operators, and an HNSW index. If you are already on Postgres, and your vector volume is under 50 million or so, pgvector is the right answer almost by default. You get vector search alongside your existing relational data, in the same transaction, with the same backup and replication and access control story. Performance is competitive with dedicated vector databases at this scale. Above 100M vectors dedicated systems pull ahead, but most applications never reach that ceiling.
Elasticsearch and OpenSearch. Both have added native vector field types and HNSW indexes. The mirror image of pgvector’s story: if you already run Elasticsearch, adding vector search to the same cluster is straightforward and the hybrid query support is genuinely good. If you are not already on Elasticsearch, do not adopt it just for vectors.
The decision tree, simplified:
- Already on Postgres, under 50M vectors: pgvector.
- Already on Elasticsearch or OpenSearch, want hybrid search: their native vector fields.
- Need hybrid search and not on either of the above: Weaviate.
- Need a managed service and the bill is fine: Pinecone.
- Self-hosting, want simplicity: Qdrant.
- Billions of vectors, dedicated team: Milvus.
A RAG pipeline, end to end
The most common shape of a vector-database deployment in 2026 is a retrieval-augmented generation pipeline. The mechanics:
flowchart LR
subgraph Ingest
A[Source documents] --> B[Chunker]
B --> C[Embedding model]
C --> D[(Vector DB)]
end
subgraph Query
Q[User question] --> E[Embedding model]
E --> F[Vector DB lookup]
F --> G[Top K passages]
G --> H[LLM with context]
H --> R[Answer]
end
D -.-> F
Two halves: the ingest pipeline, which runs whenever documents change, and the query path, which runs on every user question. Ingest splits long documents into chunks (commonly 200 to 800 tokens each, with overlap), embeds each chunk, and writes the vector plus the chunk text and metadata to the database. Query embeds the user’s question with the same model, fetches the top K most similar chunks, and feeds them to the LLM as context.
The details matter. Chunk size affects retrieval quality: too small and the chunks lack context, too large and irrelevant content dilutes the relevant signal. The same model has to be used for both ingest and query, otherwise the vectors live in different spaces and the distance is meaningless. K is usually between 3 and 20 depending on the LLM’s context window and the application’s tolerance for noise. The system prompt has to instruct the LLM to ground its answer in the provided context and to say “I do not know” when the context does not contain the answer, otherwise the hallucination problem returns.
Vectors do not replace relational
The framing that helped me when I was first getting comfortable with this category was: a vector database is not a replacement for your relational database, it is an additional index for an additional kind of query. Your transactional data still belongs in Postgres. Your inventory, your users, your orders, your audit log: same as before. The vector database holds embeddings of the parts of that data you want to search semantically, plus metadata pointers back to the source rows.
A common deployment is Postgres as the system of record, with a vector database (or pgvector inside the same Postgres) as a secondary index. Documents change, a background job re-embeds them and updates the vector store. The vector store is never authoritative for anything; it is a search index. Treat it that way and the architecture stays clean.
The next lesson asks the broader question that this one raises: when does it make sense to run multiple specialised data stores like this, and when is the operational cost of polyglot persistence higher than the benefit? Module 3 finishes there.
Citations and further reading
- Pinecone documentation,
https://docs.pinecone.io(retrieved 2026-05-01). The reference for the managed-service flavour of the category. - Qdrant documentation,
https://qdrant.tech/documentation/(retrieved 2026-05-01). Covers the HNSW implementation, filtering, and the managed-vs-self-hosted decision. - Weaviate documentation,
https://weaviate.io/developers/weaviate(retrieved 2026-05-01). The reference for hybrid search and the GraphQL-style query interface. - pgvector,
https://github.com/pgvector/pgvector(retrieved 2026-05-01). The Postgres extension that quietly disrupted the standalone-vector-DB market. - Yu A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs”, IEEE TPAMI 2020. The HNSW paper.
- Google Research, “Accelerating Large-Scale Inference with Anisotropic Vector Quantization”, 2020. The ScaNN paper.