Gorilla: Building LLMs That Actually Call APIs Without Hallucinating
Hook
Within six months of launch, Gorilla's Berkeley Function Calling Leaderboard became the de facto standard for evaluating function-calling LLMs, processing over 500,000 production requests and exposing that most models hallucinate non-existent API parameters over 40% of the time.
Context
When OpenAI introduced function calling in its Chat Completions API in June 2023, it promised a future where LLMs could interact with external tools reliably. The reality was messier. Models would confidently invoke APIs with non-existent parameters, mix up similar function signatures, or generate syntactically correct but semantically nonsensical calls. The problem wasn't just accuracy; it was trust. You couldn't deploy an LLM that might call delete_all_users() when you meant delete_user(id=123).
The root issue is that general-purpose LLMs weren't trained on the structured, precise task of API invocation. They're great at natural language but terrible at the pedantic exactness APIs demand. Gorilla, developed at Berkeley, attacked this by treating function calling as its own domain requiring specialized training, comprehensive evaluation, and safe execution infrastructure. Rather than bolting function calling onto existing models as an afterthought, Gorilla built the entire stack: training data (APIBench with 1,600+ APIs), models (fine-tuned for function calling), evaluation (the Berkeley Function Calling Leaderboard), and safe execution (GoEx runtime). What started as a research project has evolved into the most complete open-source ecosystem for tool-using LLMs.
Technical Insight
Gorilla's architecture revolves around three core innovations that work together to enable reliable function calling. First is APIBench, a meticulously curated dataset covering HuggingFace, TorchHub, and TensorHub APIs with both positive and negative examples. Unlike typical instruction-tuning datasets, APIBench includes edge cases: deprecated parameters, similar-but-different functions, and ambiguous natural language queries. The training regime fine-tunes models specifically on the translation task from natural language to structured API calls, treating it as a specialized form of code generation.
The second innovation is retrieval-augmented function calling. Instead of expecting models to memorize thousands of API signatures, Gorilla retrieves relevant API documentation at inference time and includes it in the prompt context. This 'test-time adaptation' dramatically reduces hallucination because the model has the actual API spec in front of it. OpenFunctions-V2 applies the same principle through its request format: you supply the function specifications alongside the user message, and the model grounds its call in them. Here's how that looks in practice:
from openai import OpenAI

# Berkeley-hosted endpoint; point base_url at your own server when self-hosting.
client = OpenAI(
    base_url="http://luigi.millennium.berkeley.edu:8000/v1",
    api_key="EMPTY",
)

# JSON Schema description of the one function the model is allowed to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Retrieves the current stock price for a given ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {
                "ticker": {"type": "string", "description": "Stock ticker symbol"},
                "exchange": {"type": "string", "enum": ["NYSE", "NASDAQ"]},
            },
            "required": ["ticker"],
        },
    },
}]

response = client.chat.completions.create(
    model="gorilla-openfunctions-v2",
    messages=[{"role": "user", "content": "What's Apple's stock price?"}],
    tools=tools,
    tool_choice="auto",
)

# The model returns structured function calls; tool_calls is None when it declines to call one
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    function_call = tool_calls[0]
    print(function_call.function.name)       # e.g. get_stock_price
    print(function_call.function.arguments)  # JSON string, e.g. {"ticker": "AAPL", "exchange": "NASDAQ"}
Notice the OpenAI-compatible interface. This is deliberate: OpenFunctions-V2 is meant as a drop-in replacement for OpenAI's function calling, and because the weights are open you can also serve it entirely on your own infrastructure instead of a hosted endpoint. The model supports parallel function calls, multi-turn conversations, and crucially, multiple programming languages. You can pass the same tool definitions for Python functions, JavaScript methods, Java classes, or REST APIs, and the model generates the appropriate invocation syntax.
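To make the retrieval step described earlier more concrete, here is a minimal sketch of the idea: rank a small documentation corpus against the user query and append the best match to the prompt. Everything in it is an assumption for illustration. The API_DOCS entries, the word-overlap scoring (a stand-in for a real retriever such as BM25 or embeddings), and the exact prompt wording are hypothetical, not Gorilla's actual pipeline.

```python
import json
import re

# Hypothetical mini-corpus of API documentation; a real deployment would
# index the provider's actual, up-to-date docs.
API_DOCS = [
    {"api_call": "get_stock_price(ticker, exchange)",
     "description": "Retrieve the current stock price for a ticker symbol"},
    {"api_call": "get_weather(city)",
     "description": "Return the current weather forecast for a city"},
    {"api_call": "translate_text(text, target_lang)",
     "description": "Translate text into a target language"},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, top_k=1):
    """Rank docs by word overlap with the query and keep the top_k best."""
    query_words = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & tokenize(d["api_call"] + " " + d["description"])),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, docs):
    """Append the retrieved documentation to the user query."""
    doc_text = "\n".join(json.dumps(d) for d in docs)
    return f"{query}\nUse this API documentation for reference:\n{doc_text}"


query = "What is the stock price of Apple?"
prompt = build_prompt(query, retrieve(query, API_DOCS))
print(prompt)
```

The design point is that the model never has to recall a signature from its weights: if the documentation changes, you update the corpus rather than retrain the model.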
The third innovation is the Berkeley Function Calling Leaderboard (BFCL), which has become more influential than the models themselves. BFCL doesn't just test if models can call functions—it evaluates them across categories like 'Simple AST' (basic function calling), 'Multiple Functions' (choosing the right function from many options), 'Parallel Functions' (calling multiple functions in one turn), 'Parallel Multiple Functions' (the combination), and 'Relevance Detection' (knowing when NOT to call a function). The leaderboard's V4 iteration adds agentic scenarios: multi-hop reasoning where the output of one function feeds into another, error recovery when APIs return failures, and state management across conversation turns.
What makes BFCL valuable is its honesty. It exposes that proprietary models like GPT-4 still lead significantly in complex scenarios, but also shows that open models like Gorilla's OpenFunctions-V2 or Meta's Llama models achieve 80-90% of that performance with no per-request API fees (you still pay for your own compute). You can run the evaluation suite yourself. Treat the Python interface below as an illustrative sketch of the workflow; the BFCL README documents the currently supported entry points, which may differ:
from bfcl import evaluate

# Evaluate your model on the leaderboard test set.
# your_inference_function is a callable you supply that wraps your model.
results = evaluate(
    model_name="your-model",
    test_categories=["simple", "multiple", "parallel", "relevance"],
    model_handler=your_inference_function,
)

print(results.accuracy_by_category)
print(results.hallucination_rate)
print(results.ast_correctness)        # Structure of the generated call
print(results.execution_correctness)  # Result of actually running it
The evaluation distinguishes between AST correctness (does the parsed call have the right function name, parameters, and argument types?) and execution correctness (does running it produce the right result?). A model might generate get_user(user_id="123") when the function expects an integer: valid Python, but a type mismatch against the function's schema. BFCL catches this.
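The get_user example can be checked mechanically. The sketch below uses Python's standard ast module to parse a generated call and compare it against a schema. It is a simplified illustration of an AST-style check, not BFCL's actual checker: the SCHEMA format and the check_call helper are hypothetical, and only keyword arguments with literal values are handled.

```python
import ast

# Hypothetical schema for the get_user example: user_id must be an integer.
SCHEMA = {"name": "get_user", "parameters": {"user_id": int}, "required": ["user_id"]}


def check_call(call_source, schema):
    """Return a list of problems found in a generated call; empty means it passes."""
    try:
        node = ast.parse(call_source, mode="eval").body
    except SyntaxError as exc:
        return [f"not valid Python: {exc.msg}"]
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return ["not a plain function call"]

    problems = []
    if node.func.id != schema["name"]:
        problems.append(f"unknown function: {node.func.id}")
    if node.args:
        problems.append("positional arguments are not supported by this sketch")

    seen = set()
    for kw in node.keywords:
        if kw.arg is None:
            problems.append("**kwargs unpacking is not supported by this sketch")
            continue
        seen.add(kw.arg)
        expected = schema["parameters"].get(kw.arg)
        if expected is None:
            problems.append(f"hallucinated parameter: {kw.arg}")
            continue
        try:
            value = ast.literal_eval(kw.value)
        except (ValueError, TypeError):
            problems.append(f"{kw.arg}: value is not a literal")
            continue
        # Exact type match, so True is not accepted where an int is expected
        if type(value) is not expected:
            problems.append(
                f"{kw.arg}: expected {expected.__name__}, got {type(value).__name__}"
            )

    for name in schema["required"]:
        if name not in seen:
            problems.append(f"missing required parameter: {name}")
    return problems


print(check_call('get_user(user_id="123")', SCHEMA))            # ['user_id: expected int, got str']
print(check_call("get_user(user_id=123)", SCHEMA))              # []
print(check_call("get_user(user_id=123, admin=True)", SCHEMA))  # ['hallucinated parameter: admin']
```

Note that none of these checks run the function. Confirming that a call does the right thing still requires executing it against a real or mocked backend, which is the job of the execution-based categories.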
Beyond these three innovations, GoEx (Gorilla Execution Engine) addresses the elephant in the room: what happens when your LLM-generated function call actually executes? GoEx implements 'post-facto validation' with undo mechanisms and damage confinement. Each function call runs in a constrained environment where filesystem access, network calls, and resource usage are monitored. If a call violates safety policies, GoEx can roll back changes. This is critical for autonomous agents where you can't manually review every action. The runtime tracks state transitions, maintains execution logs, and provides hooks for custom validation logic before commits become permanent.
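The execute-then-validate pattern is easier to see in miniature. The following sketch is not GoEx's API: UndoableStore, run_with_validation, and the policy function are hypothetical names, and an in-memory dictionary stands in for real side effects. It only shows the shape of the idea: record an inverse for every action, validate the resulting state, then commit or undo.

```python
class UndoableStore:
    """A toy key-value store whose uncommitted writes can be rolled back."""

    def __init__(self):
        self.data = {}
        self._undo_log = []  # stack of callables that reverse each applied action

    def set(self, key, value):
        """Write a value and record how to restore the previous state."""
        if key in self.data:
            old = self.data[key]
            self._undo_log.append(lambda: self.data.__setitem__(key, old))
        else:
            self._undo_log.append(lambda: self.data.pop(key, None))
        self.data[key] = value

    def delete(self, key):
        """Delete a key, if present, and record how to put it back."""
        if key in self.data:
            old = self.data.pop(key)
            self._undo_log.append(lambda: self.data.__setitem__(key, old))

    def commit(self):
        """Make all changes since the last commit permanent."""
        self._undo_log.clear()

    def rollback(self):
        """Reverse uncommitted changes, most recent first."""
        while self._undo_log:
            self._undo_log.pop()()


def run_with_validation(store, action, policy):
    """Execute first, validate the resulting state, then commit or undo."""
    action(store)
    if policy(store):
        store.commit()
        return True
    store.rollback()
    return False


def not_empty(store):
    """Hypothetical safety policy: an action may never leave the store empty."""
    return len(store.data) > 0


def delete_everyone(store):
    """Simulates a destructive LLM-generated call."""
    for key in list(store.data):
        store.delete(key)


store = UndoableStore()
store.set("alice", "active")
store.set("bob", "active")
store.commit()

ok = run_with_validation(store, delete_everyone, not_empty)
print(ok, sorted(store.data))  # False ['alice', 'bob']
```

Real side effects such as sent emails or third-party API writes are not always reversible like a dictionary entry, which is why confinement and policy checks matter alongside undo.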
Gotcha
Gorilla's biggest limitation is ecosystem complexity. The project has sprawled into multiple sub-projects—BFCL for evaluation, OpenFunctions for inference, GoEx for execution, Agent Arena for benchmarking, APIZoo for datasets—and the documentation doesn't always make clear which component solves which problem. If you just want basic function calling, you'll wade through references to multiple models (Gorilla-7B, Gorilla-OpenFunctions-v1, OpenFunctions-v2) and wonder which to use. The answer is OpenFunctions-v2 for most cases, but that's not immediately obvious. The original Gorilla models are largely superseded, yet the repository still prominently features them.
APIBench, while comprehensive, requires continuous maintenance. APIs evolve—parameters get deprecated, new endpoints emerge, authentication schemes change. The dataset captures a moment in time, and models trained on it will eventually drift from reality. Retrieval helps, but only if your retrieval corpus stays current. Production deployment means building infrastructure to keep API documentation synchronized, which Gorilla doesn't provide out of the box. You'll also hit authentication complexity that the examples gloss over. Real-world APIs need OAuth flows, API keys, rate limiting, and retry logic. Gorilla generates the function calls, but you're responsible for the entire execution harness around them.
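As a rough idea of what that execution harness involves, here is a minimal sketch that dispatches a generated call through a registry of allowed functions and retries transient failures with exponential backoff. All names are hypothetical (execute_tool_call, TransientError, the registry layout), the stock price is a placeholder rather than real market data, and authentication and rate limiting are left out entirely.

```python
import json
import time


class TransientError(Exception):
    """Raised by a tool for failures worth retrying (timeouts, rate limits)."""


def execute_tool_call(name, arguments_json, registry,
                      max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run a generated call against a registry of allowed tools, with retries."""
    if name not in registry:
        raise KeyError(f"model requested an unregistered tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    for attempt in range(1, max_attempts + 1):
        try:
            return registry[name](**kwargs)
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff: 0.5s, 1s, ...


calls = {"count": 0}


def get_stock_price(ticker, exchange="NASDAQ"):
    """Stand-in tool that fails once before succeeding; the price is a placeholder."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientError("rate limited")
    return {"ticker": ticker, "exchange": exchange, "price": 100.0}


result = execute_tool_call(
    "get_stock_price",
    '{"ticker": "AAPL"}',
    registry={"get_stock_price": get_stock_price},
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(result)
```

The registry doubles as an allowlist: a hallucinated function name fails with a KeyError before anything executes, instead of reaching your backend.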
Performance gaps remain between open and proprietary models, especially in ambiguous scenarios. When a user query could map to multiple functions or requires multi-hop reasoning, GPT-4 and Claude still outperform OpenFunctions-v2 by 15-20 percentage points on BFCL benchmarks. If your application has zero tolerance for function calling errors, you'll likely still need proprietary models or extensive human-in-the-loop review.
Verdict
Use Gorilla if you're building production systems that need function calling with open-source models, want full control over your inference infrastructure, or require multi-language API support (Python/Java/JavaScript/REST). It's essential for researchers working on tool-using LLMs who need the BFCL evaluation suite to benchmark progress. The leaderboard alone justifies engagement with the project: it's where the community measures function-calling competence. Use it if you need parallel function execution, have compliance requirements preventing third-party API calls, or want to understand the state-of-the-art in open function calling.
Skip it if you're doing simple, low-stakes function calling where OpenAI's or Anthropic's native solutions work fine and cost isn't a concern. Skip it if you want a minimal-dependency solution without navigating a sprawling ecosystem of sub-projects. Skip it if you need the absolute highest accuracy for critical operations; proprietary models still lead in complex scenarios. And definitely skip it if you're not prepared to build execution infrastructure around the models; Gorilla generates the calls, but production-grade execution, error handling, and safety mechanisms are still your problem.