Building Function Calling for Open-Source LLMs: Inside Hermes-2-Pro's Tool Use Framework

Hook

While OpenAI charges per API call for function calling, Hermes-2-Pro delivers the same capability in open-source models—but the implementation reveals just how much prompt engineering magic happens behind the scenes.

Context

Function calling—the ability for language models to invoke external tools and APIs—transformed LLMs from text generators into actionable agents. OpenAI's GPT models pioneered this with a clean API that let developers define functions and have the model intelligently decide when to call them. But this capability remained locked behind proprietary APIs, with unclear costs and no control over the underlying model.

The open-source community needed a solution that could replicate function calling without the API dependency. NousResearch's Hermes-2-Pro models were specifically fine-tuned for this task, but fine-tuning a model is only half the battle. The Hermes-Function-Calling repository demonstrates the orchestration layer: how to structure prompts, parse outputs, execute functions, and feed results back into the model in a coherent loop. It's a practical blueprint for anyone building agentic capabilities into self-hosted LLMs, showing that function calling isn't magic—it's careful prompt design and execution management.

Technical Insight

The architecture centers on a recursive execution loop that alternates between model inference and function execution. At the heart of the system is the ChatML prompt format, which wraps tool definitions and conversation history in structured XML-like tags. When you define a function using the @tool decorator, the framework automatically generates an OpenAI-compatible JSON schema that gets injected into the system prompt.

Here's how you define a callable function:

@tool
def get_stock_price(symbol: str) -> dict:
    """Retrieves the current stock price for a given ticker symbol.
    
    Args:
        symbol: Stock ticker symbol (e.g., 'AAPL', 'GOOGL')
    
    Returns:
        Dictionary containing price and metadata
    """
    ticker = yf.Ticker(symbol)
    data = ticker.history(period='1d')
    return {
        'symbol': symbol,
        'price': float(data['Close'].iloc[-1]),
        'volume': int(data['Volume'].iloc[-1])
    }

The decorator extracts the function signature, docstring, and type hints to automatically construct the tool schema. When a user asks "What's Apple's current stock price?", the model receives a system prompt containing all available tool definitions, then generates output wrapped in <tool_call> tags with JSON-formatted parameters.

The execution loop is elegantly simple but powerful. After the model generates a tool call, the framework parses the JSON, executes the corresponding Python function, and feeds the result back wrapped in <tool_response> tags. The model then sees this result as part of the conversation history and can either make another tool call or generate a final natural language response. The recursive depth is configurable, preventing infinite loops while allowing multi-step reasoning.

What makes this particularly clever is the dual-mode support. Beyond function calling mode, the framework implements a JSON mode that uses Pydantic schemas for validation. Instead of generating tool calls, the model outputs raw JSON that conforms to a predefined schema. This is useful when you need structured data extraction rather than function execution:

from pydantic import BaseModel

class StockAnalysis(BaseModel):
    sentiment: str
    risk_level: int
    recommendation: str

result = generate_json(
    prompt="Analyze TSLA stock outlook",
    schema=StockAnalysis,
    model="NousResearch/Hermes-2-Pro-Llama-3-8B"
)

The ChatML format is crucial to reliability. Unlike free-form prompting where the model might hallucinate function names or malform JSON, the structured tags provide clear boundaries. The prompt explicitly instructs the model that tool calls must appear between <tool_call> and </tool_call> tags, and each call must be valid JSON. The fine-tuning of Hermes-2-Pro models specifically optimized them to follow these formatting rules consistently.

The implementation handles parallel function calls gracefully. If the model generates multiple <tool_call> blocks in one response, the framework executes them sequentially and aggregates results before the next inference pass. This enables queries like "Compare the stock prices of Apple, Google, and Microsoft" to resolve in a single round trip rather than requiring three separate recursive calls.

One architectural choice worth noting is the stateless design of individual functions. Each function receives only its declared parameters and returns only its output—no shared state or context beyond what's explicitly passed. This makes functions easy to test, debug, and swap out, but it means the model itself must maintain context across calls through the conversation history rather than relying on a persistent execution environment.

Gotcha

The financial domain examples are both a strength and a limitation. While the yfinance integration demonstrates real-world utility, extending beyond stock market queries requires manual coding of new functions. There's no plugin system or dynamic function registration—you edit the Python source and add decorated functions. For production applications spanning multiple domains, you'd need to build your own function registry and schema management.

Error handling is conspicuously minimal. If the model hallucinates a function name that doesn't exist, or generates malformed JSON that fails to parse, the recursive loop can fail ungracefully. There's no retry logic, no fallback to asking the model to correct its output, and no validation that requested functions are actually available. In practice, this means you need to wrap the execution in your own exception handling and potentially implement a validation layer that checks tool calls against available functions before execution.

The dependency on Hermes-2-Pro models and ChatML format creates portability challenges. If you want to use different models—even other Llama-3 variants—you'll need to re-prompt or fine-tune them for the ChatML structure. The framework isn't model-agnostic like LangChain's tool abstractions. Additionally, the examples all use local inference or specific API endpoints; there's no clean abstraction for swapping between inference backends (vLLM, Ollama, TGI, etc.). You're essentially getting a reference implementation rather than a production-ready library.

Verdict

Use if: You're building custom agentic applications on top of Hermes-2-Pro models and need a working template for function calling implementation, you want to understand the mechanics of prompt-based tool use without framework abstraction hiding the details, or you need to implement function calling in a self-hosted environment where you control the entire stack from model to execution. Skip if: You need production-ready function calling with comprehensive error handling and broad model support (use LangChain or the OpenAI API instead), your use case extends beyond the financial domain examples and you don't want to manually code every function, or you're using models other than Hermes-2-Pro that weren't specifically fine-tuned for ChatML-style function calling. This repository shines as educational material and a starting point for custom implementations, but it's not a batteries-included library for general-purpose agentic workflows.

Building Function Calling for Open-Source LLMs: Inside Hermes-2-Pro's Tool Use Framework

Building Function Calling for Open-Source LLMs: Inside Hermes-2-Pro's Tool Use Framework

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]

Building Function Calling for Open-Source LLMs: Inside Hermes-2-Pro's Tool Use Framework

Hook

Context

Technical Insight

Gotcha

Verdict

// KNOWLEDGE GRAPH

// RELATED

Open Interpreter: Running GPT-4 with Root Access to Your Machine

Accomplish: Why Wrapping OpenCode Instead of Building an Agent Runtime Was the Right Bet

NVIDIA Cosmos: A Case Study in Strategic Repository Deprecation

How Ripgrep Makes Searching 10x Faster Than Grep: A Deep Dive Into Rust-Powered Text Search

Open Interpreter: Running GPT-4 with Root Access to Your Machine

Accomplish: Why Wrapping OpenCode Instead of Building an Agent Runtime Was the Right Bet

NVIDIA Cosmos: A Case Study in Strategic Repository Deprecation

// CODEBASE INTELLIGENCE

Best for

Skip when

[ SIMILAR REPOS ]