Ai-engineering · June 1, 2026

Graph RAG: When Knowledge Graphs Beat Vector Search

How entity extraction, graph traversal, and vector retrieval work together to answer questions that standard RAG cannot

by Perivitta · 24 min read · Advanced

Introduction

Standard retrieval-augmented generation works well when the answer to a question lives in a single document chunk. You embed the query, find the closest chunks by cosine similarity, and feed them to the LLM. The architecture is simple, fast, and effective for the majority of factual questions.

But a class of questions systematically breaks this approach. "Which engineers worked on both the payments service and the authentication module?" The answer requires connecting information across multiple documents: one about payments contributors, one about auth contributors, and potentially organisational charts linking both. No single chunk contains the answer. Cosine similarity finds chunks about payments and chunks about auth, but not the relation between them. The LLM receives disconnected fragments and either hallucinates a connection or admits it cannot answer.

Graph RAG solves this by building a structured knowledge graph from the document corpus. Entities (people, systems, concepts) become nodes. Relations (worked on, depends on, reports to) become edges. Multi-hop questions become graph traversals that produce exactly the connected context the LLM needs.


1. Where Standard RAG Fails

The failure pattern is consistent: standard RAG retrieves topically relevant chunks but cannot retrieve relationally connected information. Four question types expose this gap.

| Question type | Example | Standard RAG result | Graph RAG result |
| --- | --- | --- | --- |
| Multi-hop | "Who approved the change that caused the outage?" | Finds outage docs and approval docs separately; misses the link | Traverses: outage → change ID → approver node → name |
| Aggregation | "Which services depend on the auth module?" | Returns chunks mentioning auth; misses implicit dependencies | Queries all edges of type DEPENDS_ON pointing to auth node |
| Comparative | "How does our caching strategy differ between service A and B?" | Retrieves caching chunks; may get A-only or B-only context | Retrieves both service nodes with their caching-relation edges |
| Temporal chain | "What events led to the database migration last quarter?" | Finds migration chunk; loses causal chain | Traverses temporal edges backwards from migration event |
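
To make the aggregation row concrete, here is a minimal sketch over a hand-written list of triples. The service names and the `dependents_of` helper are hypothetical and exist only for illustration; a real graph store would answer the same question with an edge-type query.

```python
# Hypothetical triples; in a real system these come from the extraction step.
TRIPLES = [
    ("Payments Service", "DEPENDS_ON", "Auth Module"),
    ("Billing Service", "DEPENDS_ON", "Auth Module"),
    ("Billing Service", "DEPENDS_ON", "Payments Service"),
    ("Alice", "DEPLOYED", "Payments Service"),
]

def dependents_of(target: str, triples: list[tuple[str, str, str]]) -> list[str]:
    """Return every subject that has a DEPENDS_ON edge pointing at `target`."""
    return sorted({s for s, p, o in triples if p == "DEPENDS_ON" and o == target})

print(dependents_of("Auth Module", TRIPLES))
```

The point of the sketch is that the answer is assembled from edges, not from whichever chunks happen to mention "auth".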

2. Knowledge Graphs: Entities, Relations, Triplets

A knowledge graph stores information as (subject, predicate, object) triplets, the same three-part shape that RDF triples use. Every fact in your document corpus becomes one or more triplets.

From the sentence "Alice, a senior engineer, deployed the payments service on 2026-04-12", a graph extractor produces:

  • (Alice, IS_A, Senior Engineer)
  • (Alice, DEPLOYED, Payments Service)
  • (Deployment, OCCURRED_ON, 2026-04-12)
  • (Alice, PERFORMED, Deployment)
[Figure 1 diagram. Nodes: Alice (Engineer), Bob (Engineer), Senior Engineer (role), Deployment (2026-04-12), Payments Service (system), Auth Module (system). Edge labels: IS_A, PERFORMED, DEPLOYED_TO, DEPENDS_ON, APPROVED_BY.]
Figure 1: A small knowledge graph. The multi-hop question "Who approved the deployment that affected the Auth Module?" requires traversing three edges. Vector search would retrieve disconnected chunks about deployments and the auth module separately.
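
The three-edge traversal described in the caption can be sketched in plain Python. The edge wiring below is an assumption reconstructed from the labels in Figure 1, and the helper names (`follow`, `approvers_of_deployments_affecting`) are hypothetical.

```python
from collections import defaultdict

# Edges as (subject, predicate, object). The wiring is an assumption
# reconstructed from the node and edge labels in Figure 1.
EDGES = [
    ("Alice", "IS_A", "Senior Engineer"),
    ("Alice", "PERFORMED", "Deployment"),
    ("Deployment", "DEPLOYED_TO", "Payments Service"),
    ("Payments Service", "DEPENDS_ON", "Auth Module"),
    ("Deployment", "APPROVED_BY", "Bob"),
]

outgoing = defaultdict(list)   # subject -> [(predicate, object)]
incoming = defaultdict(list)   # object  -> [(predicate, subject)]
for s, p, o in EDGES:
    outgoing[s].append((p, o))
    incoming[o].append((p, s))

def follow(nodes, predicate, index):
    """Follow every edge labelled `predicate` from `nodes` using the given index."""
    return {nxt for n in nodes for p, nxt in index[n] if p == predicate}

def approvers_of_deployments_affecting(system: str) -> set[str]:
    services = follow({system}, "DEPENDS_ON", incoming)        # hop 1: who depends on it
    deployments = follow(services, "DEPLOYED_TO", incoming)    # hop 2: what was deployed there
    return follow(deployments, "APPROVED_BY", outgoing)        # hop 3: who approved that

print(approvers_of_deployments_affecting("Auth Module"))
```

Each hop is a lookup on an edge index, which is why the graph can answer this question even though no single sentence states it.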

3. The Graph RAG Architecture

Graph RAG adds two stages to the standard RAG pipeline: graph construction (offline, one-time per corpus) and hybrid retrieval (online, per query). The LLM call at the end is identical to standard RAG.

[Figure 2 diagram. Offline (build once): Documents (raw corpus) → Entity + Relation Extraction → Knowledge Graph (Neo4j). Online (per query): Query + entities → Graph Traversal and Vector Search → LLM, which receives graph paths plus vector chunks as combined context. Standard RAG, for comparison: Query → Vector Search → LLM, which misses relations.]
Figure 2: Graph RAG pipeline. Offline: extract entities and relations, build a knowledge graph. Online: for each query, traverse the graph for connected context, run vector search for relevant passages, merge both into the LLM prompt.

4. Step 1: Entity and Relation Extraction

The first step transforms unstructured text into (subject, predicate, object) triplets. Two approaches exist: a specialised NER model (e.g. spaCy) for fast entity detection, or an LLM for both entities and relations. The LLM approach is slower but handles domain-specific relation types without custom training. The implementation below uses the LLM approach, which works out of the box without a labelled NER training set.


pip install anthropic networkx neo4j sentence-transformers

# extractor.py
import anthropic
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class Triple:
    subject:   str
    predicate: str
    object:    str
    source:    str   # document chunk ID for provenance

EXTRACT_TOOL = {
    "name": "submit_triples",
    "description": "Submit extracted knowledge graph triples from the text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "triples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "subject":   {"type": "string"},
                        "predicate": {"type": "string"},
                        "object":    {"type": "string"},
                    },
                    "required": ["subject", "predicate", "object"]
                }
            }
        },
        "required": ["triples"]
    }
}

EXTRACT_SYSTEM = """You extract knowledge graph triples from technical documentation.
A triple is (subject, predicate, object) where:
- subject and object are named entities (people, systems, services, concepts, dates)
- predicate is a concise verb phrase in UPPER_SNAKE_CASE (e.g. DEPLOYED_BY, DEPENDS_ON, APPROVED_BY)

Extract all factual triples. Prefer specific, unambiguous entities over pronouns or vague references."""

def extract_triples(text: str, chunk_id: str) -> list[Triple]:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=EXTRACT_SYSTEM,
        tools=[EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "submit_triples"},
        messages=[{"role": "user", "content": f"Extract triples from:\n\n{text}"}],
    )
    for block in response.content:
        if block.type == "tool_use" and block.name == "submit_triples":
            return [
                Triple(t["subject"], t["predicate"], t["object"], chunk_id)
                for t in block.input.get("triples", [])
            ]
    return []

# Example (guarded so that importing Triple from this module does not trigger an API call)
if __name__ == "__main__":
    triples = extract_triples(
        "Alice, a senior engineer, deployed the Payments Service on 2026-04-12. "
        "The deployment was approved by Bob. The Payments Service depends on the Auth Module.",
        chunk_id="doc_001_chunk_3",
    )
    for t in triples:
        print(f"  ({t.subject}, {t.predicate}, {t.object})")
    # Illustrative output; the exact triples vary between runs and models:
    # (Alice, IS_A, Senior Engineer)
    # (Alice, DEPLOYED, Payments Service)
    # (Payments Service, DEPLOYED_ON, 2026-04-12)
    # (Alice, DEPLOYMENT_APPROVED_BY, Bob)
    # (Payments Service, DEPENDS_ON, Auth Module)
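
Because the model is only asked, not forced, to follow the prompt's naming rules, it can help to filter its output before it reaches the graph. The sketch below is one possible post-processing step, not part of the original pipeline: `clean_triples`, the `VAGUE` word list, and the predicate regex are all assumptions chosen for illustration.

```python
import re

# Predicates must look like UPPER_SNAKE_CASE, as the extraction prompt requests.
PREDICATE_RE = re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")
# A small, hypothetical list of references too vague to be useful graph nodes.
VAGUE = {"it", "he", "she", "they", "this", "that", "the team", "the service"}

def clean_triples(raw: list[dict]) -> list[dict]:
    """Drop triples that break the prompt's rules and remove case-insensitive duplicates."""
    seen, kept = set(), []
    for t in raw:
        subj = str(t.get("subject", "")).strip()
        pred = str(t.get("predicate", "")).strip()
        obj = str(t.get("object", "")).strip()
        if not subj or not obj or not PREDICATE_RE.match(pred):
            continue                      # missing field or malformed predicate
        if subj.lower() in VAGUE or obj.lower() in VAGUE:
            continue                      # pronoun or vague reference
        key = (subj.lower(), pred, obj.lower())
        if key in seen:
            continue                      # duplicate of an earlier triple
        seen.add(key)
        kept.append({"subject": subj, "predicate": pred, "object": obj})
    return kept

raw = [
    {"subject": "Alice", "predicate": "DEPLOYED", "object": "Payments Service"},
    {"subject": "alice", "predicate": "DEPLOYED", "object": "Payments Service"},
    {"subject": "It", "predicate": "DEPENDS_ON", "object": "Auth Module"},
    {"subject": "Bob", "predicate": "approved by", "object": "Deployment"},
]
print(clean_triples(raw))
```

Only the first triple survives here: the second is a duplicate, the third has a pronoun as its subject, and the fourth has a predicate that is not in UPPER_SNAKE_CASE.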

Extraction cost note: Each LLM call processes one document chunk. At claude-sonnet-4-6 pricing ($3/$15 per million input/output tokens), extracting triples from a 500-token chunk costs roughly $0.002. For a 10,000-chunk corpus, expect $15–30 in extraction costs. This is a one-time offline cost, not a per-query cost.
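
The arithmetic behind that range can be checked with a few lines. This is a rough sketch that uses the prices quoted above; the per-chunk output token counts are assumptions, and it ignores the system prompt and tool schema, which are also billed as input tokens.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens, as quoted above
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens, as quoted above

def extraction_cost(chunks: int, input_tokens: int, output_tokens: int) -> float:
    """Estimated one-time extraction cost in USD for `chunks` LLM calls."""
    per_chunk = (input_tokens * INPUT_PRICE_PER_MTOK
                 + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000
    return chunks * per_chunk

# 10,000 chunks of 500 input tokens, with 0 and 100 output tokens per chunk.
print(f"${extraction_cost(10_000, 500, 0):.2f}")
print(f"${extraction_cost(10_000, 500, 100):.2f}")
```

With 500 input tokens per chunk, the input side alone comes to $15 for 10,000 chunks, and 100 output tokens per chunk brings the total to $30, which brackets the range given above.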


5. Step 2: Building the In-Memory Knowledge Graph

For development and small corpora (under 50,000 nodes), NetworkX provides a sufficient in-memory graph with fast traversal. You can run multi-hop queries directly with NetworkX path-finding algorithms before committing to a production graph database.


# graph_builder.py
import networkx as nx
from extractor import Triple

class KnowledgeGraph:
    def __init__(self):
        self.g = nx.MultiDiGraph()   # directed, allows multiple edges between same nodes

    def add_triple(self, t: Triple) -> None:
        self.g.add_node(t.subject)
        self.g.add_node(t.object)
        self.g.add_edge(t.subject, t.object, predicate=t.predicate, source=t.source)

    def add_triples(self, triples: list[Triple]) -> None:
        for t in triples:
            self.add_triple(t)

    def neighbors(self, entity: str, depth: int = 2) -> list[dict]:
        """Return every edge within 'depth' hops of 'entity', each reported once."""
        if entity not in self.g:
            return []
        results = []
        seen_edges = set()   # (source, target, key): an edge reachable from both ends is reported once
        visited = {entity}
        frontier = [entity]

        for _ in range(depth):
            next_frontier = []
            for node in frontier:
                # Follow outgoing and incoming edges so the neighbourhood ignores edge direction.
                edges = list(self.g.out_edges(node, data=True, keys=True))
                edges += list(self.g.in_edges(node, data=True, keys=True))
                for source, target, key, data in edges:
                    if (source, target, key) in seen_edges:
                        continue
                    seen_edges.add((source, target, key))
                    results.append({"from": source, "relation": data["predicate"], "to": target})
                    other = target if source == node else source
                    if other not in visited:
                        visited.add(other)
                        next_frontier.append(other)
            frontier = next_frontier

        return results

    def find_paths(self, source: str, target: str, max_depth: int = 4) -> list[list]:
        """Find all simple paths between two entities up to max_depth hops."""
        try:
            paths = list(nx.all_simple_paths(self.g, source, target, cutoff=max_depth))
            return paths
        except (nx.NodeNotFound, nx.NetworkXNoPath):
            return []

    def subgraph_as_text(self, entity: str, depth: int = 2) -> str:
        """Convert the neighbourhood of an entity into a readable text block for the LLM."""
        triples = self.neighbors(entity, depth)
        if not triples:
            return f"No connections found for '{entity}'."
        lines = [f"Knowledge graph context for '{entity}':"]
        for t in triples:
            lines.append(f"  {t['from']} --[{t['relation']}]--> {t['to']}")
        return "\n".join(lines)
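
Note that find_paths() returns lists of node names only, so the relation labels are not part of its result. If you want to show a path to the LLM in the same arrow notation as subgraph_as_text(), you need to look the labels up again. The sketch below shows one way to render such a path; `path_as_text` and the `labels` mapping are hypothetical stand-ins (with NetworkX, the labels could be read from the graph's edge data).

```python
def path_as_text(path: list[str], labels: dict[tuple[str, str], list[str]]) -> str:
    """Render a node path as 'A --[REL]--> B --[REL]--> C'."""
    if not path:
        return ""
    parts = [path[0]]
    for u, v in zip(path, path[1:]):
        # Parallel edges between the same two nodes are joined with '|'.
        rel = "|".join(labels.get((u, v), ["?"]))
        parts.append(f"--[{rel}]--> {v}")
    return " ".join(parts)

# Hypothetical edge labels keyed by (source, target).
labels = {
    ("Alice", "Deployment"): ["PERFORMED"],
    ("Deployment", "Bob"): ["APPROVED_BY"],
}
print(path_as_text(["Alice", "Deployment", "Bob"], labels))
```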

6. Step 3: Hybrid Retrieval with Graph and Vector Search

At query time, Graph RAG runs two independent retrievers and merges their outputs before the LLM call. Because neither depends on the other, they can run in parallel; the sample code below calls them one after the other for simplicity. The graph retriever finds structurally connected context. The vector retriever finds semantically similar passages. Together they provide both relational precision and semantic coverage.


# graph_rag.py
import numpy as np
import anthropic
from sentence_transformers import SentenceTransformer
from graph_builder import KnowledgeGraph

client   = anthropic.Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

ENTITY_EXTRACT_TOOL = {
    "name": "submit_entities",
    "description": "Submit the named entities found in the query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "entities": {"type": "array", "items": {"type": "string"},
                        "description": "Named entities: people, systems, services, dates"}
        },
        "required": ["entities"]
    }
}

def extract_query_entities(query: str) -> list[str]:
    """Use Claude to extract named entities from the user query."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system="Extract all named entities (people, systems, services, products, dates) from the query.",
        tools=[ENTITY_EXTRACT_TOOL],
        tool_choice={"type": "tool", "name": "submit_entities"},
        messages=[{"role": "user", "content": query}],
    )
    for block in response.content:
        if block.type == "tool_use":
            return block.input.get("entities", [])
    return []

def graph_retrieve(query: str, graph: KnowledgeGraph, depth: int = 2) -> str:
    """Retrieve graph context by traversing from query entities."""
    entities = extract_query_entities(query)
    if not entities:
        return ""
    context_parts = []
    for entity in entities:
        ctx = graph.subgraph_as_text(entity, depth=depth)
        if ctx:
            context_parts.append(ctx)
    return "\n\n".join(context_parts)

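def vector_retrieve(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    """Standard cosine similarity retrieval over a chunk store."""
    if not chunks:
        return []
    query_emb  = embedder.encode(query, normalize_embeddings=True)
    # Re-encoding every chunk on each query keeps the example short; cache these embeddings in practice.
    chunk_embs = embedder.encode(chunks, normalize_embeddings=True)
    scores = chunk_embs @ query_emb            # dot product equals cosine similarity for unit vectors
    top_idx = scores.argsort()[-top_k:][::-1]  # indices of the top_k scores, best first
    return [chunks[i] for i in top_idx if scores[i] > 0.4]   # 0.4 is a heuristic relevance floor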

def graph_rag_answer(
    query:  str,
    graph:  KnowledgeGraph,
    chunks: list[str],        # plain text chunks from your document store
    system: str = "You are a helpful assistant with access to a knowledge graph and document context.",
) -> str:
    graph_ctx  = graph_retrieve(query, graph)
    vector_ctx = "\n\n".join(vector_retrieve(query, chunks))

    context = ""
    if graph_ctx:
        context += f"[Knowledge Graph Context]\n{graph_ctx}\n\n"
    if vector_ctx:
        context += f"[Document Context]\n{vector_ctx}"

    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1024,
        system=system,
        messages=[{
            "role": "user",
            "content": f"{context}\n\nQuestion: {query}"
        }],
    )
    return response.content[0].text
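
One practical gap in the code above: the graph lookup is an exact string match on node names, so a query that mentions "auth module" will not find a node called "Auth Module". A minimal sketch of one way to bridge this, using the standard library's difflib, is shown below. The `resolve_entity` helper and the 0.8 cutoff are assumptions for illustration, not part of the original pipeline.

```python
import difflib
from typing import Optional

def resolve_entity(mention: str, node_names: list[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a query mention to a graph node name, or None if nothing is close enough."""
    by_lower = {name.lower(): name for name in node_names}
    key = mention.strip().lower()
    if key in by_lower:                       # case-insensitive exact match
        return by_lower[key]
    # Fall back to the single closest name above the similarity cutoff.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    return by_lower[close[0]] if close else None

nodes = ["Auth Module", "Payments Service", "Alice"]
print(resolve_entity("auth module", nodes))
print(resolve_entity("payment service", nodes))
print(resolve_entity("Kubernetes", nodes))
```

In graph_retrieve(), each extracted entity could be passed through a resolver like this before the traversal, with unresolved mentions skipped. String similarity is a crude stand-in for real entity linking, so treat the cutoff as something to tune on your own corpus.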

7. Production with Neo4j

NetworkX keeps the entire graph in RAM and is not persistent across restarts. For production corpora (millions of nodes), use Neo4j, which provides persistent storage, ACID transactions, full-text search, vector indexing, and native Cypher traversal. Scaling reads across multiple machines requires a clustered deployment, which is an Enterprise Edition feature.


# Run Neo4j locally
docker run \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:5

pip install neo4j

# neo4j_graph.py
from neo4j import GraphDatabase
from extractor import Triple

class Neo4jKnowledgeGraph:
    def __init__(self, uri: str = "bolt://localhost:7687",
                 user: str = "neo4j", password: str = "password"):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self) -> None:
        self.driver.close()

    def add_triple(self, t: Triple) -> None:
        with self.driver.session() as session:
            session.run(
                """
                MERGE (s:Entity {name: $subject})
                MERGE (o:Entity {name: $obj})
                MERGE (s)-[r:RELATION {type: $predicate}]->(o)
                SET r.source = $source
                """,
                subject=t.subject, obj=t.object,
                predicate=t.predicate, source=t.source,
            )

    def neighbors(self, entity: str, depth: int = 2) -> list[dict]:
        # Cypher does not accept a parameter as a variable-length bound,
        # so the validated integer is interpolated into the query text.
        depth = max(1, int(depth))
        query = f"""
            MATCH path = (start:Entity {{name: $entity}})-[*1..{depth}]-(end:Entity)
            RETURN path LIMIT 100
        """
        with self.driver.session() as session:
            result = session.run(query, entity=entity)
            triples, seen = [], set()
            for record in result:
                for rel in record["path"].relationships:
                    # Use the stored edge direction, not the order in which the path was walked.
                    triple = (rel.start_node["name"], rel["type"], rel.end_node["name"])
                    if triple not in seen:   # the same edge appears in many paths
                        seen.add(triple)
                        triples.append({"from": triple[0], "relation": triple[1], "to": triple[2]})
            return triples

    def find_paths(self, source: str, target: str, max_depth: int = 4) -> list[list[str]]:
        max_depth = max(1, int(max_depth))   # interpolated for the same reason as in neighbors()
        query = f"""
            MATCH path = allShortestPaths(
              (s:Entity {{name: $source}})-[*1..{max_depth}]-(t:Entity {{name: $target}})
            )
            RETURN [n in nodes(path) | n.name] AS node_names
            LIMIT 5
        """
        with self.driver.session() as session:
            result = session.run(query, source=source, target=target)
            return [record["node_names"] for record in result]

    def subgraph_as_text(self, entity: str, depth: int = 2) -> str:
        triples = self.neighbors(entity, depth)
        if not triples:
            return f"No connections found for '{entity}'."
        lines = [f"Knowledge graph context for '{entity}':"]
        for t in triples:
            lines.append(f"  {t['from']} --[{t['relation']}]--> {t['to']}")
        return "\n".join(lines)

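Drop-in replacement: Neo4jKnowledgeGraph exposes the same neighbors(), find_paths(), and subgraph_as_text() methods as KnowledgeGraph, so you can pass an instance of it as the graph argument to graph_retrieve() and graph_rag_answer() without changing the retrieval code. Load triples with add_triple(), since this class does not define add_triples().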
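
If you want the swap to be explicit in the type hints, one option is a structural interface that both graph classes satisfy. The sketch below is an assumption about how you might organise this, not part of the original code: `GraphStore`, `FakeGraph`, and `graph_context` are hypothetical names.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class GraphStore(Protocol):
    """The read methods that the retrieval code relies on."""
    def neighbors(self, entity: str, depth: int = 2) -> list[dict]: ...
    def find_paths(self, source: str, target: str, max_depth: int = 4) -> list[list]: ...
    def subgraph_as_text(self, entity: str, depth: int = 2) -> str: ...

class FakeGraph:
    """Hypothetical in-memory stand-in, useful in unit tests."""
    def neighbors(self, entity, depth=2):
        return [{"from": entity, "relation": "DEPENDS_ON", "to": "Auth Module"}]

    def find_paths(self, source, target, max_depth=4):
        return [[source, target]]

    def subgraph_as_text(self, entity, depth=2):
        return "\n".join(f"  {t['from']} --[{t['relation']}]--> {t['to']}"
                         for t in self.neighbors(entity, depth))

def graph_context(graph: GraphStore, entity: str) -> str:
    """Accepts any object with the three read methods, whatever its backing store."""
    return graph.subgraph_as_text(entity)

assert isinstance(FakeGraph(), GraphStore)   # checks method presence only, not signatures
print(graph_context(FakeGraph(), "Payments Service"))
```

Annotating the graph parameter with a protocol like this lets a type checker accept either backend, and it makes a lightweight fake easy to substitute in tests.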


8. Microsoft GraphRAG vs Custom Implementation

Microsoft's open-source GraphRAG (released 2024, updated 2025) provides an end-to-end implementation with community detection and two distinct search modes. Understanding the differences helps you decide whether to use it or build your own.

| Feature | Microsoft GraphRAG | Custom implementation (this post) |
| --- | --- | --- |
| Setup | pip install graphrag; one config file | Full control, more code to write |
| Community detection | Built-in Leiden algorithm, cluster summaries | Manual (use networkx community package if needed) |
| Search modes | Local (entity-centric) and Global (community summaries) | Custom hybrid: graph traversal + vector |
| Cost | High: many LLM calls during indexing phase | Lower: one LLM call per chunk for extraction only |
| Model lock-in | Configured for Azure OpenAI by default | Any model via Anthropic or OpenAI SDK |
| Best for | Large document corpora, global thematic questions | Smaller corpora, domain-specific relation types, custom pipelines |

9. When Graph RAG Wins (and When It Does Not)

Graph RAG is not always better than standard RAG. It adds latency, complexity, and an offline indexing cost. Use it selectively.

| Use Graph RAG when | Stick with standard vector RAG when |
| --- | --- |
| Questions require connecting information across documents ("who approved X that caused Y?") | Questions are self-contained and answered by a single passage |
| Your domain has rich, stable entity relations (org charts, system dependencies, causal chains) | Document structure is flat and relation-poor (research papers, policies) |
| Aggregation queries are common ("list all services that depend on X") | Queries are primarily semantic similarity searches over long documents |
| Hallucination grounding is critical and you need to trace answer provenance to specific graph edges | Latency is critical; graph traversal adds 50–200 ms per query |

A practical heuristic: if your users frequently ask questions that start with "who", "which systems", "what caused", or "how are X and Y related", Graph RAG will meaningfully improve answer quality. If questions are mostly "what does X mean" or "summarise the policy on Y", standard vector RAG is sufficient.
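
The heuristic above can be turned into a simple query router. The following is a minimal sketch under the assumption that a handful of regular expressions is good enough for a first pass; the `route` function and the pattern list are hypothetical and deliberately incomplete.

```python
import re

# Patterns taken from the heuristic above; illustrative, not exhaustive.
RELATIONAL_PATTERNS = [
    r"^who\b",
    r"^which (systems|services)\b",
    r"^what caused\b",
    r"^how (are|is) .+ related\b",
]

def route(query: str) -> str:
    """Return 'graph' for relational-looking questions, otherwise 'vector'."""
    q = query.strip().lower()
    if any(re.search(p, q) for p in RELATIONAL_PATTERNS):
        return "graph"
    return "vector"

for q in [
    "Who approved the change that caused the outage?",
    "How are Payments and Auth related?",
    "Summarise the policy on data retention",
]:
    print(f"{route(q):6s} <- {q}")
```

A router like this lets you pay the graph traversal latency only for the questions that are likely to need it. Logging the routing decisions alongside answer quality is a cheap way to find out whether the patterns need to grow.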


10. Key Takeaways

  • Standard RAG fails on multi-hop questions. Vector similarity finds topically relevant chunks but cannot retrieve information that requires connecting multiple entities across documents.
  • Knowledge graphs store facts as (subject, predicate, object) triplets. Extract them offline using an LLM, store them in NetworkX for prototyping or Neo4j for production.
  • Hybrid retrieval combines graph traversal and vector search. Graph paths provide relational precision. Vector chunks provide semantic coverage. Both feed into the same LLM prompt.
  • Entity extraction quality determines graph quality. Invest in a strong extraction prompt and validate triplets on a sample of your corpus before building the full graph. Noisy extraction produces a noisy graph that misleads the LLM.
  • Microsoft GraphRAG is a full framework. Use it for large corpora with global thematic questions. Build custom for smaller corpora with domain-specific relation types or when you need tight cost control over the indexing phase.
  • Not every problem needs a graph. Graph RAG adds latency and an offline indexing cost. Apply it when multi-hop relational questions are a documented failure mode of your current system, not as a default upgrade.
