Graph databases: Neo4j, when relationships are the data

There is a small set of problems for which the relational model is genuinely a bad fit, and the symptom is always the same: your queries grow a forest of JOIN clauses, each one walking one step further along a chain of foreign keys, and the query plan goes from acceptable to terrible somewhere around the fifth hop. The example everyone reaches for is “friends of friends of friends,” but the same shape shows up in fraud detection, recommendation engines, knowledge graphs, and any system where the relationships between entities matter as much as the entities themselves. This lesson is about the family of databases built for exactly this shape.

The headline is short: when relationships are 80% of your queries, a graph database makes those queries trivial and fast. When relationships are 20% of your queries, a relational database with recursive CTEs is fine. The middle ground is where the interesting decisions live.

The data model

A graph database stores two things: nodes and edges.

A node is an entity. A user, a product, a company, a transaction. It has a label (the type) and a set of properties (key-value pairs). A typical node might be (:Person {name: "Narcis", age: 33}), where Person is the label and the curly-brace block is the properties.

An edge is a relationship between two nodes. It has a type, a direction, and optionally its own properties. A typical edge might be (narcis)-[:KNOWS {since: 2018}]->(maria), meaning a directed KNOWS relationship from narcis to maria, with a property since recording when the relationship started.

There is no separate schema in the relational sense that has to be declared up front. The schema is the graph itself. You can add new node types, new relationship types, and new properties at will. The shape of the data is whatever you put in, and the database is happy to traverse it.
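As a rough illustration of the model described above, the following Python sketch represents nodes and edges as plain data classes. The names used here (`Node`, `Edge`, `PropertyGraph`, `neighbours`) are invented for this example; it says nothing about how Neo4j or any other engine stores data internally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # the node's type, e.g. "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                  # key of the node the edge starts at
    rel_type: str                                # relationship type, e.g. "KNOWS"
    target: str                                  # key of the node the edge points to
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A toy in-memory property graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}                          # key -> Node
        self.edges = []                          # list of Edge

    def add_node(self, key, label, **properties):
        self.nodes[key] = Node(label, properties)

    def add_edge(self, source, rel_type, target, **properties):
        self.edges.append(Edge(source, rel_type, target, properties))

    def neighbours(self, key, rel_type):
        """Keys of nodes reachable from `key` over one outgoing edge of `rel_type`."""
        return [e.target for e in self.edges
                if e.source == key and e.rel_type == rel_type]


graph = PropertyGraph()
graph.add_node("narcis", "Person", name="Narcis", age=33)
graph.add_node("maria", "Person", name="Maria")
graph.add_edge("narcis", "KNOWS", "maria", since=2018)

print(graph.neighbours("narcis", "KNOWS"))       # ['maria']
```

Nothing had to be declared before the `since` property was attached to the edge, which is the schema-free behaviour the paragraph above describes.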

flowchart LR
    U1[User: Narcis] -- KNOWS --> U2[User: Maria]
    U2 -- KNOWS --> U3[User: Luca]
    U1 -- BOUGHT --> P1[Product: Espresso machine]
    U2 -- BOUGHT --> P1
    U3 -- LIKED --> P2[Product: Italian moka pot]
    U2 -- LIKED --> P2

A small graph like this lets you ask, in one query: “which products have been liked or bought by people my friends know, but are still unknown to me?” That query, in SQL, is a multi-way self-join across a users table, a friendships table, and a purchases table, with a recursive step to expand the friendship chain. In a graph database, it is four lines.

The query language

The dominant query language for property graphs is Cypher, originally from Neo4j (and used by its managed AuraDB service) and now adopted, with minor variations, by Memgraph, AWS Neptune (in its property-graph mode), and the Apache AGE extension for Postgres. Cypher’s defining feature is that the syntax for matching a graph pattern looks like an ASCII drawing of the pattern.

MATCH (a:Person)-[:KNOWS]->(b:Person)-[:KNOWS]->(c:Person)
WHERE a.name = "Narcis" AND a <> c
RETURN c.name

This finds friends of friends of Narcis, excluding Narcis herself. The MATCH clause draws the pattern: a Person node a, connected by a KNOWS edge to a Person node b, connected by another KNOWS edge to a Person node c. The WHERE filters. The RETURN projects.
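To make the pattern concrete, here is roughly what that MATCH computes, written as a plain Python traversal over a hypothetical adjacency dictionary. The sample data and the function name `friends_of_friends` are assumptions for illustration, and a real engine plans and executes the match quite differently. One deliberate difference: Cypher returns one row per matching path, while this sketch deduplicates the names.

```python
# Directed KNOWS edges: person -> people they know (made-up sample data).
knows = {
    "Narcis": ["Maria"],
    "Maria": ["Luca", "Narcis"],
    "Luca": [],
}


def friends_of_friends(start):
    """Names c such that (start)-[:KNOWS]->(b)-[:KNOWS]->(c) and c != start."""
    result = set()
    for b in knows.get(start, []):       # first hop:  a -> b
        for c in knows.get(b, []):       # second hop: b -> c
            if c != start:               # the `a <> c` filter
                result.add(c)
    return sorted(result)


print(friends_of_friends("Narcis"))      # ['Luca']
```

Each extra hop in the pattern becomes another nested loop here, which is the bookkeeping the Cypher syntax hides.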

Reading Cypher takes ten minutes to get used to and then becomes natural. Writing complex traversals (variable-length paths, shortest paths, weighted traversals) is dramatically more concise than the SQL equivalent. The language is the main reason graph databases feel ergonomic for the workloads they target.

The W3C defines SPARQL, a query language for RDF triple stores, which use a different graph model (subject-predicate-object triples, no properties on edges, schema described separately). SPARQL is the standard in the semantic-web and academic-knowledge-graph world. It is an excellent language for what it does, but most people building application graph databases today choose Cypher over SPARQL because the property-graph model maps more naturally to the kinds of objects applications care about.

Where graph databases shine

The use cases where the graph model unambiguously wins are the ones where multi-hop relationship traversal is the core question.

Recommendation engines

“Users like you also liked X” is the canonical recommendation problem, and at its heart it is a graph traversal. Find users connected to me by some signal (similar purchases, similar ratings, similar social graph), look at what those users have engaged with, and surface the items I have not seen yet. In a graph database this can be a single query. In a relational database it more often ends up as a batch job that materialises a similarity matrix overnight.
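The traversal described above can be sketched in a few lines of Python. This is a toy under stated assumptions: the `liked` data is made up, the function name `recommend` is hypothetical, and "number of shared items" stands in for whatever similarity signal a real system would use.

```python
from collections import Counter

# user -> set of items they have engaged with (made-up sample data)
liked = {
    "me": {"espresso machine"},
    "maria": {"espresso machine", "moka pot", "grinder"},
    "luca": {"moka pot"},
    "ana": {"espresso machine", "grinder"},
}


def recommend(user, top_n=3):
    """Rank unseen items by how strongly similar users are linked to them."""
    mine = liked[user]
    scores = Counter()
    for other, items in liked.items():
        if other == user:
            continue
        overlap = len(mine & items)      # shared items act as the similarity signal
        if overlap == 0:
            continue                     # no path from `user` to `other` through an item
        for item in items - mine:        # only items the user has not seen yet
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


print(recommend("me"))                   # ['grinder', 'moka pot']
```

The path being followed is user, shared item, similar user, new item: a three-hop traversal, which is why the problem maps so directly onto a graph query.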

Fraud detection

Financial fraud often shows up as patterns in the relationship graph that are invisible at the level of individual transactions. A ring of accounts that all transfer money to each other in a circle. An account that suddenly receives transfers from twenty new sources. A device fingerprint shared across accounts that should not know each other. These are graph patterns, and detecting them in real time is what graph databases are particularly good at. Banks and payment processors are the most consistent users of graph databases for this reason.
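The first of those patterns, a ring of accounts passing money in a circle, is a cycle in the transfer graph. The sketch below finds short cycles with a depth-limited search. The data, the function name `find_rings`, and the `max_length` cut-off are all assumptions for illustration; production systems would run this kind of pattern inside the database rather than in application code.

```python
# account -> accounts it has transferred money to (made-up sample data)
transfers = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A", "D"],
    "D": [],
}


def find_rings(transfers, max_length=5):
    """Return simple cycles of at most `max_length` accounts, each reported once."""
    rings = set()

    def walk(start, current, path):
        for nxt in transfers.get(current, []):
            if nxt == start:
                # Closed the loop. Rotate so the smallest account comes first,
                # so the same ring found from different starts looks identical.
                i = path.index(min(path))
                rings.add(tuple(path[i:] + path[:i]))
            elif nxt not in path and len(path) < max_length:
                walk(start, nxt, path + [nxt])

    for account in transfers:
        walk(account, account, [account])
    return sorted(rings)


print(find_rings(transfers))             # [('A', 'B', 'C')]
```

The length limit matters: without it the search cost grows quickly on dense graphs, which is one reason this workload benefits from an engine built for traversal.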

Social networks

Friend-of-a-friend, mutual connections, shortest path between two users, identifying influencers (high-centrality nodes). These are the textbook graph-traversal problems, and they are exactly what social-network features look like.
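Two of these, mutual connections and shortest path, are small enough to sketch directly. The example below assumes an undirected friendship graph stored as adjacency sets; the data and the function names are hypothetical, and the shortest path uses an ordinary breadth-first search.

```python
from collections import deque

# Undirected friendships stored as adjacency sets (made-up sample data).
friends = {
    "Narcis": {"Maria", "Ana"},
    "Maria": {"Narcis", "Luca"},
    "Ana": {"Narcis", "Luca"},
    "Luca": {"Maria", "Ana", "Sofia"},
    "Sofia": {"Luca"},
}


def mutual_connections(a, b):
    """People who are friends with both a and b."""
    return sorted(friends[a] & friends[b])


def shortest_path(start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    previous = {start: None}             # also serves as the visited set
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            path = []
            while person is not None:    # walk back to the start
                path.append(person)
                person = previous[person]
            return path[::-1]
        for friend in sorted(friends[person]):   # sorted for a deterministic result
            if friend not in previous:
                previous[friend] = person
                queue.append(friend)
    return None


print(mutual_connections("Narcis", "Luca"))      # ['Ana', 'Maria']
print(shortest_path("Narcis", "Sofia"))          # ['Narcis', 'Ana', 'Luca', 'Sofia']
```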

Knowledge graphs

Wikidata, the Google Knowledge Graph, and a growing crop of enterprise knowledge graphs store entities (people, places, concepts, events) with rich, typed relationships between them. The questions you ask of a knowledge graph (give me all Italian films directed by women in the 1970s, with the actors who appeared in them) are multi-hop traversals across typed edges. A graph database is the natural store.

Network and dependency graphs

Software dependencies, infrastructure topology, supply chains, organisational hierarchies. Anything where the question “what depends on X, transitively” is a real question. These are graphs, and a graph database makes the transitive question a primitive instead of an essay.
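The transitive question can be sketched as a reachability search over reversed edges. The package names and the function name `transitive_dependents` are made up for this example.

```python
# package -> packages it depends on directly (made-up sample data)
depends_on = {
    "app": ["web", "db-client"],
    "web": ["http"],
    "db-client": ["http", "tls"],
    "http": ["tls"],
    "tls": [],
}


def transitive_dependents(target):
    """Everything that depends on `target`, directly or through other packages."""
    # Invert the edges once: package -> packages that depend on it directly.
    dependents = {}
    for package, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(package)

    seen = set()
    stack = [target]
    while stack:                          # depth-first walk over reversed edges
        current = stack.pop()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)


print(transitive_dependents("tls"))       # ['app', 'db-client', 'http', 'web']
```

In a graph database the edge inversion and the walk collapse into a single variable-length pattern, which is what "a primitive instead of an essay" means in practice.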

The relational alternative

If your problem looks like a graph but is small, or has only a handful of relationship types, or is mostly tabular with a few graph-shaped queries on top, you can do the work in SQL with recursive common table expressions (CTEs).

A recursive CTE in Postgres looks like this for friend-of-a-friend:

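WITH RECURSIVE friends(person_id, depth) AS (
    -- Depth 1: the people Narcis knows directly.
    -- A friendships row (person_id, friend_id) means person_id knows friend_id.
    SELECT b.friend_id, 1
    FROM friendships b
    WHERE b.person_id = (SELECT id FROM persons WHERE name = 'Narcis')

    UNION ALL

    -- Each further step follows one more friendship edge, stopping at depth 2.
    SELECT b.friend_id, f.depth + 1
    FROM friendships b
    JOIN friends f ON b.person_id = f.person_id
    WHERE f.depth < 2
)
SELECT DISTINCT p.name
FROM friends f
JOIN persons p ON p.id = f.person_id
WHERE f.depth = 2
  AND p.name <> 'Narcis';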

This works. It is not pretty, and it is not what an engineer reads on a busy afternoon, but it gives the right answer for two or three hops and small-to-medium graphs. Where it falls apart is when the graph is large, the depth is unbounded, or you start adding constraints on the path itself (only edges with weight > X, only certain edge types, shortest-path-by-weight). Each of these is a few extra clauses in a graph query and an additional layer of complexity in the recursive CTE.
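For readers who want to run a query of this shape without setting up Postgres, here is a minimal sketch using Python's built-in `sqlite3` module, which also supports `WITH RECURSIVE`. SQLite is a substitution made purely so the example is self-contained. The sample rows, the helper name `people_at_depth`, and the convention that a `friendships` row `(person_id, friend_id)` means "person_id knows friend_id" are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE friendships (person_id INTEGER, friend_id INTEGER);
    INSERT INTO persons VALUES (1, 'Narcis'), (2, 'Maria'), (3, 'Luca'), (4, 'Ana');
    -- A row (person_id, friend_id) means person_id knows friend_id.
    INSERT INTO friendships VALUES (1, 2), (2, 3), (2, 1), (3, 4);
""")

QUERY = """
WITH RECURSIVE friends(person_id, depth) AS (
    SELECT f.friend_id, 1
    FROM friendships f
    WHERE f.person_id = (SELECT id FROM persons WHERE name = :name)
    UNION ALL
    SELECT f.friend_id, fr.depth + 1
    FROM friendships f
    JOIN friends fr ON f.person_id = fr.person_id
    WHERE fr.depth < :max_depth
)
SELECT DISTINCT p.name
FROM friends fr
JOIN persons p ON p.id = fr.person_id
WHERE fr.depth = :max_depth AND p.name <> :name
ORDER BY p.name
"""


def people_at_depth(name, max_depth):
    """Names reachable from `name` in exactly `max_depth` hops, excluding `name`."""
    rows = conn.execute(QUERY, {"name": name, "max_depth": max_depth})
    return [row[0] for row in rows]


print(people_at_depth("Narcis", 2))      # ['Luca']
```

The depth bound is what keeps the recursion finite here: the sample data contains a cycle (Narcis knows Maria, Maria knows Narcis), and `UNION ALL` does nothing to stop the query from revisiting it.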

The honest summary: SQL recursive CTEs are fine up to two or three hops on a moderate graph. Past that, the cognitive cost of the SQL grows faster than the cognitive cost of the equivalent Cypher, and the performance starts to degrade.

The cost: a worked example

Take the query “find all products bought by friends of friends of Narcis, with the count of how many of those friends bought each product, sorted by count.”

In Cypher:

MATCH (n:Person {name: "Narcis"})-[:KNOWS*2]->(friend:Person)-[:BOUGHT]->(p:Product)
WHERE friend <> n
RETURN p.name, count(DISTINCT friend) AS buyers
ORDER BY buyers DESC
LIMIT 10

Four lines of substance. Reads like a description of the problem.

In Postgres with recursive CTEs:

WITH RECURSIVE friend_chain(person_id, depth) AS (
    SELECT id, 0 FROM persons WHERE name = 'Narcis'
    UNION ALL
    SELECT f.friend_id, fc.depth + 1
    FROM friendships f
    JOIN friend_chain fc ON f.person_id = fc.person_id
    WHERE fc.depth < 2
),
fof AS (
    SELECT DISTINCT person_id
    FROM friend_chain
    WHERE depth = 2
      AND person_id <> (SELECT id FROM persons WHERE name = 'Narcis')
)
SELECT p.name, count(DISTINCT pu.person_id) AS buyers
FROM fof
JOIN purchases pu ON pu.person_id = fof.person_id
JOIN products p ON p.id = pu.product_id
GROUP BY p.name
ORDER BY buyers DESC
LIMIT 10;

A page of SQL. It works. It is harder to read, harder to modify, and on a graph of millions of users and tens of millions of relationships, it will be slower than the Cypher version, sometimes by a wide margin.

The product landscape

Neo4j: the dominant choice

Neo4j is the graph database most teams reach for first, and for good reason. It has been around since 2007, the Cypher language was invented there, the documentation is excellent, the community is large, and the tooling (Neo4j Browser, Bloom, the Graph Data Science library) is mature. There is an open-source community edition and an enterprise edition with clustering, security, and operational features.

The trade-off, as with most “the dominant single-product choice” stories, is that Neo4j is its own database. You run it separately, you back it up separately, you scale it separately. For workloads where the graph is the heart of the application, this is fine. For workloads where the graph is one feature among many, it is operational overhead.

AWS Neptune

Neptune is AWS’s managed graph database. It supports both property graphs with Gremlin/openCypher and RDF with SPARQL. The two-model support is unusual and useful if you have a foot in both worlds. The pitch is the standard managed-database pitch: less operational burden if you are already on AWS. Functionality and performance are competitive with Neo4j, though the openCypher implementation is not 100% feature-complete relative to Neo4j’s Cypher.

Memgraph

Memgraph is an in-memory graph database, written in C++, with a Cypher-compatible query language. Because it keeps the data in memory, it can be significantly faster than disk-based graph databases for workloads that fit in RAM. The pitch is for streaming graph workloads (real-time fraud detection, real-time recommendation) where latency matters and the working set fits in memory.

Postgres + Apache AGE

Apache AGE is a Postgres extension that adds graph storage and a Cypher-like query language on top of a standard Postgres database. The advantage is that you keep one database. You can mix relational queries and graph queries in the same transaction, and you do not run a second store. The disadvantages are that AGE is less feature-rich than Neo4j, the performance is not as good for graph-heavy workloads, and the ecosystem is smaller.

For a team that already runs Postgres and wants to add modest graph functionality without operational overhead, AGE is a reasonable choice. For a team where the graph is central, Neo4j is the better answer.

Making the call

The decision rule, distilled:

If relationships are the core of your application (fraud detection, recommendation engine, knowledge graph, social network), use a dedicated graph database. Neo4j is the safe default.
If you have one or two graph-shaped queries on top of a mostly-relational application, write them as recursive CTEs in your existing relational database and live with the awkwardness.
If you are in between, and you are already on Postgres, try Apache AGE before adding a second database.
If you are on AWS and you do not want to run anything yourself, Neptune is fine.

The most common mistake is the opposite of the lesson: teams reach for a graph database because graph databases are interesting, decide that everything is a graph, and end up with a system where the graph database is doing 5% of the work and 50% of the operational burden. Most data is tabular. Use the graph store for the actual graph problem. Use the relational store for the rest.

What is next

The next lesson is about search engines, where the data shape is yet again different (text, with relevance scoring) and the dominant store (Elasticsearch and the alternatives that have grown up alongside it) does things that no general-purpose database can match. After that we leave the storage zoo and move into the operational layer of Module 3: replication, sharding, partitioning, and the practicalities of running these systems in production.

Citations and further reading

Neo4j documentation, https://neo4j.com/docs/ (retrieved 2026-05-01). The reference for Cypher, the operational model, and the Graph Data Science library.
The Cypher language reference (openCypher), https://opencypher.org/ (retrieved 2026-05-01). The community-maintained language spec.
Ian Robinson, Jim Webber, Emil Eifrem, “Graph Databases” (O’Reilly, 2nd edition 2015). The foundational textbook. Older but still the best introduction to the data model and the design patterns.
AWS Neptune documentation, https://docs.aws.amazon.com/neptune/ (retrieved 2026-05-01). The dual property-graph and RDF model, with managed-service operational notes.
Memgraph documentation, https://memgraph.com/docs (retrieved 2026-05-01). In-memory architecture and the streaming-graph use case.
Apache AGE documentation, https://age.apache.org/age-manual/master/index.html (retrieved 2026-05-01). The Postgres extension and its current capabilities.