
ELT-Bench: The First Realistic Benchmark for AI Agents Building Data Pipelines


Hook

While AI agents can now write code and debug software, can they handle the messier reality of extracting data from a legacy MySQL database, transforming it for analytics, and loading it into a data warehouse—all while navigating the quirks of production ELT tools?

Context

The explosion of AI coding agents has produced impressive benchmarks like SWE-bench for GitHub issues and HumanEval for algorithmic challenges. Yet data engineering—a field where practitioners spend countless hours wrangling connectors, debugging schema mismatches, and orchestrating ETL pipelines—has remained largely untested territory for AI agents. This gap matters because data engineering workflows differ fundamentally from pure software engineering: they require understanding diverse data sources, navigating vendor-specific tooling like Airbyte or Fivetran, managing credentials and connections, and reasoning about data transformations across heterogeneous systems.

ELT-Bench emerges from UIUC's Kang Lab to address this evaluation blind spot. Unlike general-purpose agent benchmarks, it evaluates agents on the specific workflow of Extract-Load-Transform pipelines using production-grade tools. The benchmark provisions real data sources in Docker containers, uses Airbyte as the data integration layer that moves data from sources to the warehouse (mirroring how modern data teams actually build pipelines), and validates results in Snowflake. This isn't a simulated environment with mocked APIs. It's the toolchain data engineers use daily, which makes it a high-fidelity test of whether AI agents can handle real-world data infrastructure tasks.

Technical Insight

ELT-Bench's architecture has three phases: setup, agent execution, and evaluation. The setup phase provisions Docker containers for the data sources (PostgreSQL databases, MySQL instances, REST APIs) and populates them with test datasets. It also runs the ground truth pipeline through Airbyte to establish the expected output state in Snowflake. Together, these steps create both the problem space (raw data sources) and the answer key (correctly processed data in the warehouse).

The agent execution phase is where the evaluation happens. An AI agent receives a task description—for example, "Extract customer orders from the PostgreSQL database, load them into Snowflake, and ensure the timestamp fields are properly converted"—and must interact with Airbyte's API or UI to configure sources, destinations, and connections. The agent doesn't just generate code; it must navigate Airbyte's connector ecosystem, configure authentication, map schema fields, and trigger sync operations. Here's what a typical agent interaction might look like:

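# Illustrative sketch only: `airbyte_api` and `agent` are hypothetical helpers,
# and the endpoint paths should be checked against the Airbyte version in use.

# Agent must discover available connectors
response = airbyte_api.get('/v1/source_definitions')
# next() takes the first match and raises StopIteration if there is none
postgres_connector = next(
    c for c in response['sourceDefinitions'] if c['name'] == 'Postgres'
)

# Configure source with connection details
source_config = {
    'sourceDefinitionId': postgres_connector['sourceDefinitionId'],
    'connectionConfiguration': {
        'host': 'source-postgres-container',
        'port': 5432,
        'database': 'orders_db',
        'username': 'readonly_user',
        'password': agent.retrieve_credential('postgres'),
        'schemas': ['public']
    },
    'name': 'Orders Source'
}
source = airbyte_api.post('/v1/sources/create', json=source_config)

# Agent must then configure destination and create connection
# This requires understanding Airbyte's relationship model:
# a connection links one source to one destination.
# `snowflake_destination_id` is assumed to come from an earlier
# destination-creation call that is not shown here.
connection_config = {
    'sourceId': source['sourceId'],
    'destinationId': snowflake_destination_id,
    'syncCatalog': agent.build_sync_catalog(source),
    'schedule': {'units': 1, 'timeUnit': 'hours'},
    'status': 'active'
}
# Submit the connection; without this call the config above is never used
connection = airbyte_api.post('/v1/connections/create', json=connection_config)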

What makes this challenging is the multi-step reasoning required. The agent can't simply execute a linear script—it must query Airbyte to understand available connectors, handle authentication flows that might involve retrieving credentials from multiple sources, interpret error messages when connections fail ("Connection refused" vs "Authentication failed" vs "Schema not found"), and verify that data actually landed in Snowflake with the correct transformations.
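One way to picture the error-interpretation step is a small lookup from failure text to a next action. The sketch below is hypothetical: the `RecoveryAction` names, the `classify_error` function, and the substring matching are assumptions made for illustration, not part of ELT-Bench or Airbyte, and real connector errors are longer and less uniform than the three strings quoted above.

```python
from enum import Enum


class RecoveryAction(Enum):
    """Hypothetical next steps an agent could take after a failed connection."""
    CHECK_HOST_AND_PORT = "check host and port"
    REFRESH_CREDENTIALS = "refresh credentials"
    REDISCOVER_SCHEMA = "rediscover schema"
    ESCALATE = "escalate"


# Assumed mapping from lowercase error substrings to a recovery action.
_ERROR_RULES = [
    ("connection refused", RecoveryAction.CHECK_HOST_AND_PORT),
    ("authentication failed", RecoveryAction.REFRESH_CREDENTIALS),
    ("schema not found", RecoveryAction.REDISCOVER_SCHEMA),
]


def classify_error(message: str) -> RecoveryAction:
    """Return the first matching recovery action, or ESCALATE if none match."""
    lowered = message.lower()
    for needle, action in _ERROR_RULES:
        if needle in lowered:
            return action
    return RecoveryAction.ESCALATE
```

The point of the sketch is that each failure mode calls for a different follow-up, so an agent that retries blindly on every error wastes steps. A message that matches none of the rules falls through to `ESCALATE` rather than being guessed at.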

The evaluation phase compares the agent's output against ground truth using SQL queries against Snowflake. The benchmark runs validation queries that check row counts, schema compatibility, data type conversions, and actual content. This is more nuanced than simple pass/fail: an agent might successfully move data but fail to handle NULL values correctly, or might create the right tables with wrong column types. The evaluation framework captures these gradations of success.
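The checks described above can be sketched as a table comparison. This is a minimal illustration, not ELT-Bench's evaluation code: it uses Python's built-in `sqlite3` as a stand-in for Snowflake so it runs anywhere, and the `compare_tables` function and the table names are assumptions made for the example.

```python
import sqlite3


def compare_tables(conn, expected: str, actual: str) -> dict:
    """Compare two tables on row count, column names/types, and content."""

    def columns(table):
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f"PRAGMA table_info({table})")
        return [(row[1], row[2].upper()) for row in info]

    def rows(table):
        # Sort by repr so rows containing NULLs or mixed types still order
        fetched = conn.execute(f"SELECT * FROM {table}").fetchall()
        return sorted(fetched, key=repr)

    exp_rows, act_rows = rows(expected), rows(actual)
    return {
        "row_count_match": len(exp_rows) == len(act_rows),
        "schema_match": columns(expected) == columns(actual),
        "content_match": exp_rows == act_rows,
    }


# Ground truth vs. an agent output that moved every row but stored
# `total` as text and turned a NULL into an empty string.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE truth_orders (id INTEGER, total REAL, shipped_at TEXT);
    CREATE TABLE agent_orders (id INTEGER, total TEXT, shipped_at TEXT);
    INSERT INTO truth_orders VALUES (1, 9.5, '2024-01-01'), (2, 3.0, NULL);
    INSERT INTO agent_orders VALUES (1, '9.5', '2024-01-01'), (2, '3.0', '');
""")
report = compare_tables(conn, "truth_orders", "agent_orders")
```

Here `report` records a matching row count alongside a schema mismatch and a content mismatch, which is the kind of partial success a single pass/fail flag would hide.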

The repository structure reveals the infrastructure complexity required for realistic evaluation. The docker-compose.yml defines multiple source containers, each pre-loaded with test data. The airbyte/ directory contains configuration templates for different connector types. The evaluation/ module includes SQL query templates that adapt based on which sources and transformations were configured. This isn't a lightweight benchmark you can run in a GitHub Actions runner—it requires Docker, a running Airbyte instance, Snowflake credentials, and potentially several gigabytes of container images. But this heaviness is precisely what makes it realistic: actual data engineering involves orchestrating these exact components.

The benchmark's design also reveals an architectural choice about agent capabilities. Rather than providing a simplified Python SDK that abstracts Airbyte's complexity, ELT-Bench forces agents to interact with Airbyte as a data engineer would—through its REST API or configuration files. This tests whether agents can handle the impedance mismatch between natural language task descriptions and vendor-specific APIs, a core challenge in applying AI to infrastructure automation.

Gotcha

The setup complexity is not for the faint of heart. You'll need Docker with sufficient resources to run multiple database containers simultaneously, a working Airbyte deployment (which itself requires docker-compose orchestration), conda for Python environment management, PostgreSQL client tools for validation queries, and a Snowflake account with appropriate permissions. The README's setup instructions span multiple pages, and there's no "quick start" option. If any component fails—say, Airbyte doesn't start cleanly or Snowflake credentials aren't configured correctly—debugging requires understanding each piece of the stack. For researchers wanting to quickly prototype agent improvements, this overhead is significant.
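Before working through the full setup, a quick preflight check can confirm that the command-line prerequisites are installed. The sketch below is not part of the ELT-Bench repository: the `missing_tools` function is hypothetical, and the executable names `docker`, `conda`, and `psql` are assumptions that may differ by platform. It does not verify that Airbyte is running or that Snowflake credentials are valid.

```python
import shutil

# Assumed executable names; adjust for your platform and install method.
REQUIRED_TOOLS = ("docker", "conda", "psql")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` with no arguments returns an empty list when all three tools are on `PATH`. The `which` parameter exists only so the lookup can be swapped out in tests.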

The benchmark is also sparsely documented. The repository lacks details on the actual task suite: How many benchmark tasks are included? What's the difficulty distribution? What data domains are covered? Are tasks isolated, or do some build on others? Without this context, it's difficult to assess whether an agent's performance on ELT-Bench generalizes to real-world data engineering scenarios or simply overfits to specific Airbyte connector types. Limited adoption (24 stars at time of writing) also means there is less community-contributed documentation and troubleshooting guidance than more established benchmarks offer.

Verdict

Use if: You're actively researching AI agents for infrastructure automation and need to evaluate performance on realistic data engineering workflows. The production-tool approach provides genuinely useful signal about whether agents can handle the complexity of modern data stacks. Also use it if you're building commercial products around AI-powered data pipeline automation and need standardized metrics to track improvement: the ground truth evaluation removes subjectivity from comparing approaches.

Skip if: You're doing general agent research without specific focus on data engineering, as the setup overhead isn't justified. Also skip if you lack the infrastructure (Snowflake account, Docker resources) or if you're prototyping rapidly and need faster iteration cycles. For academic research on general agent capabilities, broader benchmarks like AgentBench provide better effort-to-insight ratios.

ELT-Bench's value is directly proportional to how much you care specifically about data pipeline automation.