> your AI agent picks dependencies from memory; give it dated facts — try starlog.dev ↗ vet your agent's deps ↗ vibe-coding is fine. vibe-importing isn’t. — try starlog.dev ↗ vibe-importing isn’t fine ↗ your agent has never seen your private packages — try starlog.dev ↗ facts for private packages ↗ a linter for the dependencies your AI agent picks — try starlog.dev ↗ a linter for agent deps ↗

Back to Articles

StructGPT: Teaching ChatGPT to Query Databases Without Fine-Tuning

[ View on GitHub ]

StructGPT: Teaching ChatGPT to Query Databases Without Fine-Tuning

Hook

What if you could give ChatGPT access to your entire SQL database without exceeding context limits or fine-tuning a single parameter? StructGPT proves it's possible through a clever iterative reasoning loop.

Context

Large Language Models excel at natural language but struggle with structured data. The naive approach—dumping entire database schemas or knowledge graphs into prompts—quickly overwhelms context windows and produces hallucinated SQL queries. Fine-tuning models like T5 or BERT on Text-to-SQL datasets works, but requires expensive retraining for each new database schema and loses the general reasoning capabilities of frontier models like GPT-4.

StructGPT from RUCAIBox introduces a third path: teach LLMs to interact with structured data through specialized interfaces, just like human developers do. Instead of seeing raw tables, the model learns to invoke SQL queries, table lookups, and knowledge graph traversals, receiving only the relevant results. This iterative 'Invoking-Linearization-Generation' pattern lets ChatGPT reason over databases it has never seen, achieving competitive performance on benchmark tasks like Spider (Text-to-SQL), TabFact (table verification), and WebQSP (knowledge graph question answering) without any task-specific training.

Technical Insight

The core architecture revolves around three repeating steps that mirror how experienced developers interact with unfamiliar databases. First, the LLM invokes a specialized interface—SQL execution for relational databases, direct lookup for tables, or entity traversal for knowledge graphs. Second, the framework linearizes the structured response into natural language the LLM can process. Third, the LLM generates either the next query or a final answer. This loop continues until the model has sufficient information to respond.

Here's how StructGPT handles a multi-hop knowledge graph query from MetaQA: 'What movies did the director of Sharknado also direct?' The framework provides the model with access to three KG operations:

# Available KG interfaces in StructGPT
def get_relations(entity):
    """Returns all relation types connected to an entity"""
    pass

def get_neighbors(entity, relation):
    """Returns entities connected via a specific relation"""
    pass

def get_attributes(entity):
    """Returns attribute values for an entity"""
    pass

The LLM's reasoning proceeds iteratively. First invocation: get_neighbors('Sharknado', 'directed_by') returns ['Anthony C. Ferrante']. The linearized response becomes: 'The director of Sharknado is Anthony C. Ferrante.' Second invocation: get_neighbors('Anthony C. Ferrante', 'directed') returns the full filmography. The model then synthesizes the final answer from accumulated context.

For SQL databases, StructGPT takes a different approach. Rather than invoking individual operations, the LLM generates complete SQL queries that the framework executes against the actual database. The key innovation is the prompt structure that teaches ChatGPT to be conservative:

# Simplified prompt template for SQL generation
prompt = f"""
You have access to a database with this schema:
{schema_description}

Previous queries and results:
{history}

Question: {user_question}

Generate a SQL query to find relevant information, or provide the final answer if you have enough information.
If uncertain, query for clarifying information first.
"""

This conservative prompting strategy reduces hallucination. On the Spider benchmark, StructGPT achieved 58.1% execution accuracy with ChatGPT, competitive with specialized fine-tuned models from the GPT-3.5 era, despite having zero training on SQL data.

The linearization step deserves special attention because it solves a critical problem: structured data is dense and repetitive. A SQL result with 50 rows would consume thousands of tokens if naively converted to text. StructGPT applies smart truncation, showing only the first few rows and indicating truncation with metadata like '(47 more rows)'. For knowledge graphs, linearization converts triples into natural sentences: (Anthony_C._Ferrante, directed, Sharknado_2) becomes 'Anthony C. Ferrante directed Sharknado 2.'

The framework maintains a conversation history across iterations, allowing the LLM to build up context gradually. This history includes both the linearized data and the model's reasoning, creating a scratchpad effect similar to chain-of-thought prompting. On complex multi-hop questions from MetaQA's 3-hop split, StructGPT achieved 74.4% accuracy by chaining together three separate knowledge graph lookups, demonstrating genuine iterative reasoning rather than single-shot query generation.

Gotcha

Reproducibility is StructGPT's Achilles heel. The repository documentation explicitly warns that results are non-deterministic due to ChatGPT's inherent randomness. Running the same query twice can yield different SQL, different reasoning paths, and ultimately different answers. For research papers, this is manageable with enough trials. For production systems where users expect consistent behavior, it's a dealbreaker. The framework provides no determinism controls, no temperature settings propagation, and no retry logic for validation.

Cost and API dependency create practical barriers. Every query requires multiple OpenAI API calls—one per reasoning iteration. A single complex question might invoke ChatGPT five times. The repository offers no local model support, no caching layer, and no batch processing optimizations. When I examined the codebase, the SQL execution happens synchronously with no connection pooling, meaning each database query blocks on LLM response. For the WebQSP knowledge graph dataset with 4,737 questions, a full evaluation could easily cost hundreds of dollars at current API pricing. The framework also lacks graceful degradation—if the OpenAI API is down or rate-limited, your entire pipeline stops. There's no fallback to simpler heuristics or cached results.

Verdict

Use if: You're prototyping a research system that needs to demonstrate LLM reasoning over structured data, you have OpenAI API budget, and you value implementation speed over production reliability. StructGPT shines for academic experiments comparing LLM capabilities across multiple structured data types, or for internal demos showing how ChatGPT could interact with your company's database schema. The framework's generality means you can adapt the same core code to SQL, tables, and knowledge graphs with minimal changes. Skip if: You need deterministic results for production applications, want to avoid API costs and vendor lock-in, or require support for local open-source models. The non-reproducibility alone disqualifies StructGPT from most user-facing applications. For production Text-to-SQL, consider fine-tuned models like CodeLlama with constrained decoding, or newer frameworks like LangChain's SQL agents that support multiple LLM backends and provide better error handling. If research is your goal but reproducibility matters, look at UnifiedSKG which offers similar multi-task structured reasoning with controllable fine-tuning.