Building Interactive Knowledge Graphs from Text: A Three-Phase LLM Pipeline

Hook

Most knowledge extraction tools give you a list of entities and call it done. This one goes further: it chunks your document, extracts relationships, then uses community detection and transitive reasoning to infer connections that were never explicitly stated.

Context

Unstructured text is where most organizational knowledge lives—meeting transcripts, research papers, technical documentation, customer feedback. But text is linear, and knowledge is networked. You can't see patterns, clusters, or implicit connections by reading sequentially.

Traditional approaches to knowledge graph construction fell into two camps: rule-based systems that were precise but brittle, requiring extensive entity dictionaries and relationship templates, or full-stack graph database solutions that demanded significant infrastructure and schema design upfront. LLMs changed the game by making entity and relationship extraction possible from arbitrary text without predefined ontologies. But early LLM-based extractors hit a wall with longer documents due to context limits, and they only captured what was explicitly stated, missing the implied connections that make knowledge graphs valuable. The ai-knowledge-graph project addresses both issues with a multi-phase pipeline that processes documents in overlapping chunks, then applies global standardization and relationship inference to reconstruct the complete knowledge network.

Technical Insight

The architecture is deliberately split into three phases, each solving a specific problem in the extraction-to-visualization pipeline. Phase 1 handles the context window limitation by chunking input text with configurable overlap, then extracting Subject-Predicate-Object triples from each chunk via LLM calls. The chunking strategy is critical: overlap ensures that entities mentioned near chunk boundaries appear in multiple contexts, reducing fragmentation.
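The overlapping-window idea behind Phase 1 can be sketched in a few lines. This is not the project's own chunker: `chunk_text` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for tokens.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration; words approximate tokens here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Because each window starts `chunk_size - overlap` words after the previous one, an entity mentioned in the last `overlap` words of one chunk is seen again at the start of the next.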

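Here's what a minimal extraction workflow might look like. The class and parameter names are illustrative; check the repository README for the actual entry point and configuration options: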

from ai_knowledge_graph import KnowledgeGraph

# Initialize with your LLM endpoint (works with Ollama, OpenAI, etc.)
kg = KnowledgeGraph(
    api_base="http://localhost:11434/v1",
    model="llama3.1:8b",
    chunk_size=1000,
    chunk_overlap=200
)

# Extract triples from your document
with open("research_paper.txt", "r") as f:
    text = f.read()

triples = kg.extract_triples(text)
# Returns: [("GPT-4", "uses", "transformer architecture"),
#           ("transformer architecture", "enables", "parallel processing"), ...]

Phase 2 tackles entity normalization—the same entity might be extracted as "GPT-4", "GPT4", and "OpenAI's GPT-4" from different chunks. The system applies text normalization (case folding, whitespace handling) and optional LLM-based disambiguation. This is where the tool makes a smart tradeoff: it can use the LLM to merge semantically similar entities, but this adds cost and latency. For many use cases, simple string normalization plus fuzzy matching is sufficient.
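The cheap path, string normalization plus fuzzy matching, might look like the sketch below. It uses only the standard library; `normalize`, `build_alias_map`, and the 0.85 threshold are assumptions made for illustration, not the project's code.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace."""
    name = name.casefold()
    name = re.sub(r"[^\w\s]", "", name)       # "gpt-4" -> "gpt4"
    return re.sub(r"\s+", " ", name).strip()  # collapse runs of whitespace


def build_alias_map(entities: list[str], threshold: float = 0.85) -> dict[str, str]:
    """Map each surface form to a canonical normalized form (first seen wins)."""
    canonical: list[str] = []        # canonical forms discovered so far
    alias_map: dict[str, str] = {}
    for entity in entities:
        norm = normalize(entity)
        # Reuse the first canonical form that is similar enough, if any.
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, norm, c).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(norm)
            match = norm
        alias_map[entity] = match
    return alias_map
```

With this sketch, "GPT-4" and "GPT4" collapse to the same key, but "OpenAI's GPT-4" does not, because the extra word pushes the similarity ratio well below the threshold. That residual case is the kind of merge the optional LLM-based disambiguation is meant to handle.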

Phase 3 is where things get interesting. The inference engine detects disconnected communities in the graph using Louvain clustering, then applies three strategies to reconnect them: transitive relationship rules (if A→B and B→C, infer A→C for certain predicates), lexical similarity matching between entities in different communities, and LLM-prompted relationship discovery. The example outputs show this can add 370+ relationships, several times the number extracted directly from the text.
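Of the three strategies, the transitive rule is the easiest to illustrate. The sketch below makes a single pass over the triples and proposes A→C wherever A→B and B→C share a predicate from an allow-list. The predicate set and the function name are hypothetical; which predicates the tool actually treats as transitive is not stated here.

```python
Triple = tuple[str, str, str]

# Assumption: only some predicates are safe to chain. This list is illustrative.
TRANSITIVE_PREDICATES = {"is part of", "is a type of", "depends on"}


def infer_transitive(triples: list[Triple]) -> set[Triple]:
    """One pass of transitive inference: A-p->B and B-p->C yields A-p->C."""
    existing = set(triples)

    # Index objects by (subject, predicate) for transitive predicates only.
    by_subject: dict[tuple[str, str], set[str]] = {}
    for s, p, o in triples:
        if p in TRANSITIVE_PREDICATES:
            by_subject.setdefault((s, p), set()).add(o)

    inferred: set[Triple] = set()
    for (s, p), objects in by_subject.items():
        for middle in objects:
            for o in by_subject.get((middle, p), ()):
                candidate = (s, p, o)
                # Skip self-loops and edges that were already extracted.
                if o != s and candidate not in existing:
                    inferred.add(candidate)
    return inferred
```

Restricting the rule to an allow-list matters: chaining a predicate like "uses" or "mentions" produces exactly the kind of speculative edges that inflate the graph.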

The visualization layer uses Pyvis to generate interactive HTML with physics-based layouts. Each community gets a color, and the force-directed algorithm naturally clusters related entities. Hovering reveals relationship labels, and you can drag nodes to explore the network structure. This is significantly more accessible than requiring Neo4j Browser or Gephi.

The flexible backend support is implemented through OpenAI-compatible API endpoints, making it trivial to swap providers:

import os

# Local Ollama
kg = KnowledgeGraph(api_base="http://localhost:11434/v1", model="mistral")

# OpenAI
kg = KnowledgeGraph(api_base="https://api.openai.com/v1", model="gpt-4", api_key=os.getenv("OPENAI_KEY"))

# vLLM inference server
kg = KnowledgeGraph(api_base="http://your-vllm-server:8000/v1", model="meta-llama/Llama-2-70b")

The chunking overlap parameter deserves special attention. Set it too low, and entities spanning chunk boundaries get fragmented into disconnected subgraphs. Set it too high, and you waste tokens re-processing the same text. The default 200 tokens with 1000-token chunks (20% overlap) is a reasonable starting point, but documents with dense entity references may need 30-40% overlap to maintain coherence.
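The token cost of overlap is easy to estimate before running anything. The helper below is a back-of-the-envelope sketch (the name `overlap_cost` is made up for this example) and assumes fixed-size windows that advance by `chunk_size - overlap` tokens.

```python
import math


def overlap_cost(total_tokens: int, chunk_size: int, overlap: int) -> tuple[int, float]:
    """Return (number of chunks, extra tokens sent as a fraction of the document)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if total_tokens <= chunk_size:
        return 1, 0.0

    step = chunk_size - overlap
    n_chunks = 1 + math.ceil((total_tokens - chunk_size) / step)
    # Every boundary between consecutive chunks re-sends `overlap` tokens.
    extra_tokens = (n_chunks - 1) * overlap
    return n_chunks, extra_tokens / total_tokens
```

For a 40,000-token document with 1000-token chunks, a 200-token overlap gives 50 chunks and about 24.5% extra tokens, while a 400-token overlap gives 66 chunks and 65% extra. Doubling the overlap more than doubles the redundant work, because the window also advances more slowly.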

One architectural decision that stands out: the tool keeps graph construction (NetworkX) separate from visualization (Pyvis). This means you can export the NetworkX graph object and perform custom analytics—centrality measures, path finding, subgraph extraction—before or instead of generating the HTML visualization. It's not locked into a single output format.

Gotcha

The relationship inference is powerful but undisciplined. In testing, the tool added 370 inferred relationships to a graph with ~100 extracted ones, roughly 3.7 inferred edges for every extracted one. Some of these are genuinely valuable implicit connections. Others are speculative leaps that the LLM makes based on entity name similarity or common-sense reasoning that may not apply to your domain. There's no confidence scoring, no human-in-the-loop validation, and no way to distinguish between "extracted from text" and "inferred by algorithm" relationships in the output graph.
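One workaround is to tag provenance yourself before the two sets of triples are merged. The sketch below is a do-it-yourself wrapper, not a feature of the tool: the `Edge` dataclass, the `source` labels, and the default confidence values are all assumptions.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    source: str              # "extracted" or "inferred"
    confidence: float = 1.0  # placeholder score, not produced by the tool


def tag_edges(extracted: list[Triple], inferred: list[Triple],
              inferred_confidence: float = 0.5) -> list[Edge]:
    """Merge both sets, remembering where each edge came from."""
    edges = [Edge(s, p, o, "extracted") for s, p, o in extracted]
    seen = set(extracted)
    for triple in inferred:
        if triple not in seen:  # an extracted edge keeps its stronger label
            edges.append(Edge(*triple, source="inferred",
                              confidence=inferred_confidence))
            seen.add(triple)
    return edges


def trusted_only(edges: list[Edge], min_confidence: float = 0.8) -> list[Edge]:
    """Filter out edges below a confidence threshold."""
    return [e for e in edges if e.confidence >= min_confidence]
```

Keeping the label on every edge lets you render inferred edges differently, review them separately, or drop them entirely when accuracy matters more than coverage.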

Processing large documents is sequential and slow. A 50-page document might generate 40+ chunks, each requiring an LLM API call. With a local Ollama model on consumer hardware, expect 5-15 seconds per chunk. With OpenAI's API, you're paying per token for potentially redundant processing of overlapped text. The codebase shows no evidence of batch processing, parallel LLM calls, or result caching. If you're processing a corpus of documents, you'll need to add your own orchestration layer with rate limiting and retry logic.

Verdict

Use if: You need to quickly explore relationships in medium-sized documents (5-50 pages), you have access to a capable local LLM or budget for API calls, and you're doing knowledge discovery or research synthesis where false positives in inferred relationships are acceptable noise. The interactive HTML output makes this particularly valuable for creating shareable visualizations from meeting notes, literature reviews, or strategic documents.

Skip if: You need production-grade accuracy with verifiable provenance for each relationship, you're processing thousands of documents where cost and speed matter, or you need formal ontology compliance. The aggressive inference features make this an exploration tool, not a reliable knowledge base builder.
