npcpy: Enforcing LLM Behavior Through Code, Not Prompts

Hook

What if you could guarantee your LLM agent won't hallucinate API calls—not through clever prompting, but by making it physically impossible at the architecture level?

Context

Every developer who's built with LLMs knows the pain: you craft the perfect prompt, your agent works beautifully in testing, then it hallucinates a filesystem path in production or invents an API endpoint that doesn't exist. The industry's answer has been increasingly elaborate prompt engineering—system prompts, few-shot examples, chain-of-thought reasoning, constitutional AI principles. It's prompt archaeology layered on prompt archaeology.

npcpy takes a fundamentally different approach. Instead of asking the LLM nicely to follow rules, it enforces constraints through software architecture. The Context-Agent-Tool paradigm treats LLMs as language engines wrapped in programmatic guardrails. Want your agent to only access specific APIs? Don't prompt it to behave—give it a Tools object that physically cannot call anything else. Need structured output? Don't beg for JSON—parse and validate at the framework level. It's the difference between asking a junior developer to follow coding standards versus running a linter that blocks their commit. This architectural philosophy, combined with unified interfaces across local and cloud LLM providers, makes npcpy particularly interesting for research teams and developers prototyping agent systems who are tired of prompt whack-a-mole.

Technical Insight

The framework's core abstraction is elegantly simple: Context objects store state and memory, Agent objects orchestrate tool execution, and Tool objects define bounded capabilities. An NPC (the base primitive) is essentially a persona-wrapped LLM client that maintains conversational context without autonomous tool use. Here's the foundational pattern:

from npcpy import NPC

# Persona constraints are code, not prompts
researcher = NPC(
    name="research_assistant",
    model="llama3.2",
    persona="You are a researcher who cites sources.",
    provider="ollama"
)

response = researcher.chat("Explain transformer attention")
print(response.content)

This looks mundane until you see how it scales to Agents with tool execution. The CodingAgent automatically executes code blocks from LLM responses, but here's where the architecture shines—execution happens in language-specific sandboxes:

from npcpy import CodingAgent

coder = CodingAgent(
    name="data_analyst",
    model="gpt-4",
    tools=["python", "shell"],
    auto_execute=True,
    sandbox_mode="docker"  # Isolation at runtime
)

# The agent can reason about code AND execute it
result = coder.chat(
    "Analyze this CSV and plot the distribution",
    context={"file_path": "/data/sales.csv"}
)
# Code blocks in LLM response are automatically executed
# Results are fed back into the conversation context

The auto-execution is controversial but powerful—the agent generates Python, the framework runs it in a sandboxed environment, captures stdout/stderr, and feeds results back to the LLM for continued reasoning. It's a tight REPL loop that eliminates the "generate code, manually run it, paste results back" dance.

Where npcpy gets genuinely interesting is multi-agent coordination through NPCArray. Instead of manually orchestrating agent conversations, you define a collection and let the framework handle turn management:

from npcpy import NPCArray, NPC

# Create specialized agents
researcher = NPC(name="researcher", model="llama3.2", 
                 persona="Find evidence and cite sources")
critic = NPC(name="critic", model="mistral",
             persona="Challenge assumptions, find logical holes")
synthesizer = NPC(name="synthesizer", model="gpt-4",
                  persona="Integrate perspectives into consensus")

# Multi-agent debate system
team = NPCArray([researcher, critic, synthesizer])
conclusion = team.debate(
    "Should we migrate this service to Rust?",
    rounds=3,
    voting_mechanism="consensus"
)

Under the hood, NPCArray manages conversation state, routes messages between agents, and tracks decision convergence. Each agent maintains isolated context until synthesis steps. This is miles simpler than manually implementing multi-agent protocols with LangChain or AutoGen.

The knowledge graph integration is another standout. Rather than bolting on a vector database for RAG, npcpy treats structured knowledge as a first-class citizen:

from npcpy import Agent, KnowledgeGraph

kg = KnowledgeGraph(backend="neo4j")
kg.add_entity("FastAPI", type="framework", properties={"language": "Python"})
kg.add_entity("Pydantic", type="library")
kg.add_relationship("FastAPI", "depends_on", "Pydantic")

agent = Agent(
    name="tech_advisor",
    model="gpt-4",
    knowledge_graph=kg
)

# Agent can query structured relationships, not just embeddings
response = agent.chat("What are FastAPI's dependencies?")

The framework queries the graph based on entities detected in the user message, injects structured context into the prompt, and the LLM reasons over relationships rather than fuzzy semantic search. For technical documentation, API mappings, or any domain with explicit relationships, this is far more reliable than pure vector similarity.

The MCP (Model Context Protocol) support is forward-looking—npcpy can act as both client and server, meaning agents built with it can expose tools to other MCP-compatible systems or consume tools from external MCP servers. This interoperability layer suggests the authors are thinking beyond monolithic agent frameworks toward composable agent ecosystems.

Gotcha

The most glaring issue is documentation maturity. The README provides tantalizing examples but cuts off mid-explanation in several sections, leaving you to spelunk through source code to understand advanced usage. Error messages are often cryptic, and there's minimal guidance on production deployment patterns. This screams "research artifact" rather than "production library."

The auto-execution feature in CodingAgent is powerful but terrifying. Even with sandboxing, you're letting LLM-generated code run in your environment. The framework provides Docker-based isolation, but the defaults are permissive—you need to explicitly configure restrictive sandboxes. There's no audit logging of executed code blocks by default, no built-in rate limiting on execution, and unclear handling of infinite loops or resource exhaustion. If you're not carefully configuring execution boundaries, you're one hallucinated rm -rf away from a bad day. The security model assumes you trust your LLM provider and your prompts, which is optimistic at best.

Performance characteristics are undocumented. How does NPCArray handle 10 agents? 100? What's the memory footprint of maintaining multiple conversation contexts? The knowledge graph integration is elegant but there's no guidance on indexing strategies, query optimization, or scaling beyond toy examples. You're left to discover these boundaries through trial and error.

Verdict

Use if: You're prototyping multi-agent systems in a research context, need unified access to multiple LLM providers without vendor lock-in, want to experiment with structured knowledge graphs for agent memory, or you're tired of prompt engineering fragility and want architectural guardrails instead. It's exceptional for rapid experimentation where you value flexibility over stability. Skip if: You need production-grade reliability with comprehensive documentation, can't accept the security risks of auto-executing LLM-generated code, require proven scaling characteristics, or you're building customer-facing systems where failure modes need to be well-understood. For production, stick with LangChain's battle-tested ecosystem or AutoGen's Microsoft backing. For research and prototyping where npcpy's architectural opinions align with your mental model, it's a refreshingly different take on agent frameworks that might save you from prompt hell.

npcpy: Enforcing LLM Behavior Through Code, Not Prompts

npcpy: Enforcing LLM Behavior Through Code, Not Prompts

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

npcpy: Enforcing LLM Behavior Through Code, Not Prompts

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

4D Gaussian Splatting: How Hexplane Factorization Makes Real-Time Dynamic Scene Rendering Possible

Honcho: The Peer Memory Graph That Replaces RAG for Long-Running Agents

NocoDB: The Self-Hosted Database That Speaks Spreadsheet

Big List of Naughty Strings: The Test Dataset That Breaks Your Input Validation

4D Gaussian Splatting: How Hexplane Factorization Makes Real-Time Dynamic Scene Rendering Possible

Honcho: The Peer Memory Graph That Replaces RAG for Long-Running Agents

NocoDB: The Self-Hosted Database That Speaks Spreadsheet

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]