
Gorilla: Teaching LLMs to Actually Call APIs Without Hallucinating

Hook

When LLMs gained function calling capabilities, they often hallucinated non-existent parameters and methods. Gorilla emerged from Berkeley to solve this fundamental problem: how do you teach an LLM to invoke real APIs without inventing fictional methods?

Context

Large language models excel at generating human-like text, but connecting them to external tools has been challenging. The Gorilla project from Berkeley addresses this through multiple innovations: fine-tuned models trained on APIBench (a curated dataset of 1,600+ APIs), the Berkeley Function Calling Leaderboard (BFCL) that evolved from simple accuracy tests to multi-turn agentic scenarios, OpenFunctions models providing open-source alternatives to proprietary function calling, and GoEx runtime for safe execution with undo capabilities. With 12,775 GitHub stars and having served ~500k requests, Gorilla has become a reference implementation for teaching LLMs to interact with APIs reliably.

Technical Insight

System architecture (auto-generated diagram): a user query flows through an API retriever into the Gorilla LLM, which is fine-tuned on the APIBench dataset of 1,600+ APIs; the resulting function-call JSON passes through the GoEx execution engine's safety validation before API execution, with validated calls marked safe and rejected calls undone, and the result is returned to the user. The Berkeley Function Calling Leaderboard evaluates accuracy and relevance across multi-turn and multi-step tests, while GoEx provides damage confinement.

Gorilla’s architecture tackles function calling at three levels: training, evaluation, and execution.

The training foundation is APIBench, which collects API signatures and usage patterns across HuggingFace, TorchHub, and TensorFlow Hub. The models use retrieval-augmented training, meaning they learn to adapt to new APIs at test time by retrieving relevant documentation, rather than purely memorizing training data. This addresses the challenge that APIs evolve constantly.
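The retrieval-augmented pattern can be sketched as: fetch the most relevant API doc at query time and prepend it to the prompt, so the model grounds its call in current documentation instead of stale training data. The doc entries and keyword-overlap scoring below are illustrative stand-ins, not the actual APIBench retriever.

```python
# Minimal sketch of retrieval-augmented API prompting, the pattern the
# Gorilla models are trained for. Docs and scoring are illustrative only.

API_DOCS = [
    {"name": "torchhub/detr", "doc": "DETR: end-to-end object detection model from TorchHub."},
    {"name": "huggingface/whisper", "doc": "Whisper: speech recognition model that transcribes audio to text."},
    {"name": "tfhub/universal-sentence-encoder", "doc": "Encodes sentences into embedding vectors."},
]

def retrieve(query: str, docs=API_DOCS) -> dict:
    """Rank docs by naive keyword overlap with the query (stand-in for a real retriever)."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d["doc"].lower().split())))

def build_prompt(query: str) -> str:
    """Prepend the retrieved API doc so the model grounds its call in documentation."""
    doc = retrieve(query)
    return f"Use this API if relevant:\n{doc['name']}: {doc['doc']}\n\nUser: {query}"

print(build_prompt("transcribe this audio recording to text"))
```

A production retriever would use embeddings rather than keyword overlap, but the contract is the same: the model sees fresh documentation at inference time.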

OpenFunctions-V2 represents the production-ready component. Unlike proprietary solutions, it supports multiple languages (Python, Java, JavaScript, REST), parallel function execution, and complex data types. The model exposes an OpenAI-compatible endpoint, positioning it as a drop-in replacement for proprietary function calling APIs.
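Because the endpoint speaks the OpenAI chat-completions dialect, a request looks like any other tool-calling request. The endpoint URL, model name, and `get_weather` tool below are placeholders; check the Gorilla repo for the current hosted endpoint or point this at your own deployment.

```python
# Sketch of calling an OpenAI-compatible OpenFunctions endpoint with stdlib
# only. ENDPOINT and the model name are hypothetical placeholders.
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical deployment

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def make_payload(user_msg: str) -> dict:
    """Build the same request body an OpenAI SDK client would send."""
    return {
        "model": "gorilla-openfunctions-v2",
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [WEATHER_TOOL],
    }

def call_endpoint(user_msg: str) -> dict:
    """POST the payload; works against any OpenAI-compatible server."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(make_payload(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(json.dumps(make_payload("What's the weather in Berkeley?"), indent=2)[:200])
```

Because the request shape is the standard one, swapping between a proprietary backend and a self-hosted OpenFunctions model is a one-line URL change.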

The Berkeley Function Calling Leaderboard evolved dramatically from V1 to V4. V1 tested simple single-turn API calls. V2 added live, enterprise-contributed scenarios. V3 introduced multi-turn, state-based evaluation where the model must track service state across conversation turns. V4 Agentic tackles real-world complexity: web search with multi-hop reasoning, memory management across sessions, and format sensitivity testing. The leaderboard evaluates not just accuracy but also relevance detection (knowing when NOT to call a function), handling of multiple valid function invocations, and parallel execution correctness.
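The relevance and parallel-call criteria above can be made concrete with a toy scorer: a test case whose expected call list is empty means the correct behavior is to call nothing, and parallel invocations must match as a set, not a sequence. This is an illustrative sketch in the spirit of BFCL, not the actual harness.

```python
# Illustrative scoring of function-call predictions. expected == [] encodes
# the relevance-detection case: the model should produce no call at all.

def normalize(call: dict) -> tuple:
    """Canonical hashable form so parallel calls compare order-insensitively."""
    return (call["name"], tuple(sorted(call.get("args", {}).items())))

def score(expected: list, predicted: list) -> bool:
    if not expected:                      # relevance case: any call is a failure
        return not predicted
    return {normalize(c) for c in expected} == {normalize(c) for c in predicted}

# Parallel-call case: two valid invocations, order must not matter.
expected = [{"name": "get_weather", "args": {"city": "SF"}},
            {"name": "get_weather", "args": {"city": "LA"}}]
predicted = list(reversed(expected))
print(score(expected, predicted))                         # order-insensitive match
print(score([], [{"name": "get_weather", "args": {}}]))   # hallucinated call fails
```

State-based multi-turn evaluation (V3) extends this idea: instead of comparing call syntax, the harness compares the resulting service state after each turn.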

GoEx (Gorilla Execution Engine) addresses what happens when an LLM-generated API call goes wrong. Traditional approaches validate inputs before execution, but GoEx introduces "post-facto validation": assessing actions after they execute. According to the README, the runtime provides safety primitives including "undo" and "damage confinement" abstractions to manage unintended actions and risks. The paper frames these abstractions as a step toward fully autonomous LLM agents that can interact with apps and services without a human in the loop.
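The execute-then-validate-then-revert loop can be sketched in a few lines: run the action, keep an undo closure, and roll back if validation fails. The file of record here is a toy log, and the policy check is invented for illustration; GoEx's real abstractions live in the goex/ directory of the repo.

```python
# Toy sketch of GoEx-style "post-facto validation": execute first, keep an
# undo closure, and revert the side effect if the result fails validation.

def execute_with_undo(action, undo, validate):
    """Run action, then validate its result; on failure, run the undo closure."""
    result = action()
    if validate(result):
        return result, True
    undo()                                # damage confinement: roll back the side effect
    return result, False

# Example: an "API call" that appends to a log; undo removes the entry.
log = []
result, ok = execute_with_undo(
    action=lambda: log.append("charge card $10,000") or log[-1],
    undo=lambda: log.pop(),
    validate=lambda entry: "$10,000" not in entry,   # hypothetical policy: flag large charges
)
print(ok, log)   # validation failed, so the entry was rolled back
```

The interesting design question GoEx raises is which real-world actions admit an undo at all (a database insert does; a sent email does not), which is exactly where damage confinement takes over.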

The Agent Arena collaboration with LMSYS adds community-driven evaluation. Rather than static benchmarks, it lets developers compare agents on real tasks (search, finance, RAG), with a community-driven ranking system and a prompt hub.

Gotcha

Gorilla’s strength—being a comprehensive research framework—creates friction for practitioners seeking plug-and-play solutions. The repository contains multiple sub-projects (APIBench datasets, BFCL evaluation harness, OpenFunctions models, GoEx runtime, Agent Arena) spread across different directories with separate documentation. Newcomers expecting a single unified tool will face a learning curve understanding which components they actually need.

The rapid evolution from V1 to V4 means earlier components may receive less maintenance attention. The original Gorilla models fine-tuned on HuggingFace/TorchHub/TensorFlow Hub APIs (announced May 2023) exist alongside the newer OpenFunctions-V2 (February 2024) and BFCL V4 Agentic (July 2025). Developers should carefully track the changelog to understand which models and evaluation approaches are current. GoEx’s undo and damage confinement abstractions are described conceptually in the README and paper, but developers will need to review the implementation details in the goex directory for practical integration patterns.

Verdict

Use Gorilla if you're building LLM applications that need reliable function calling and you value transparency over convenience, want to fine-tune models on custom APIs, or need rigorous benchmarks to evaluate function-calling accuracy across single-turn, multi-turn, or agentic scenarios. The BFCL leaderboard is essential for researchers comparing model capabilities objectively. OpenFunctions-V2 provides an open-source alternative to proprietary function calling with Apache 2.0 licensing, making it viable for commercial use without vendor lock-in. GoEx's safety abstractions address autonomous agent safety concerns for applications where LLM mistakes have real consequences.

Skip Gorilla if you need a simple, single-purpose tool rather than a research framework with multiple components, are satisfied with proprietary solutions like OpenAI or Anthropic that offer simpler integration, only do basic prompt engineering without tool use, or lack the engineering resources to navigate a multi-faceted project. The documentation assumes familiarity with LLM research concepts. For rapid prototyping, proprietary function calling APIs may offer faster time-to-first-result, even if they sacrifice the customization and transparency that make Gorilla valuable for serious production deployments.
