ROAPI: Turn Static Files Into Production APIs Without Writing Backend Code
Hook
A single command can turn a 5GB Parquet file sitting in S3 into a production-ready API with REST, GraphQL, and SQL interfaces—no schema definitions, no ORM configuration, no backend code whatsoever.
Context
Every data team has been here: you've got datasets in cloud storage—CSV exports, Parquet files from ETL pipelines, JSON dumps from vendors—and analysts or frontend developers need to query them. The traditional path means writing a backend service: define schemas, set up database imports, write query endpoints, handle pagination, add authentication. For slowly-changing reference data like geographic boundaries, product catalogs, or historical metrics, this is engineering overhead that doesn't match the problem's complexity.
ROAPI emerged from this friction point. Built on Apache Arrow's columnar memory format and DataFusion's query engine, it's a zero-code API server that reads dataset configurations from YAML and exposes them through multiple query interfaces. Point it at files in S3, Google Cloud Storage, your local filesystem, or even Google Sheets, and it automatically infers schemas and serves queries. The architecture is deliberately constrained—read-only, in-memory, analytical queries only—which allows it to do one thing exceptionally well: make static datasets instantly queryable without custom code.
Technical Insight
ROAPI's architecture consists of three layers that work together to transform file formats into query results. At the bottom, the data layer handles loading from 15+ sources including Parquet, Delta Lake, CSV, JSON, MySQL, SQLite, and cloud blob storage. DataFusion's schema inference examines the data structure automatically, eliminating manual schema definition. The middle layer is DataFusion itself, which converts high-level queries into optimized execution plans against in-memory Arrow tables. The top layer provides five query frontends—REST, GraphQL, SQL over HTTP, FlightSQL, and PostgreSQL wire protocol—all querying the same underlying data.
Here's what a minimal configuration looks like. Create a config.yaml file:
addr: 0.0.0.0:8080
tables:
- name: "customer_events"
uri: "s3://my-bucket/events.parquet"
option:
format: "parquet"
- name: "products"
uri: "https://example.com/products.csv"
option:
format: "csv"
has_header: true
Launch the server with roapi -c config.yaml, and immediately you can query via REST:
# Filter and aggregate with SQL-style queries
curl -X POST http://localhost:8080/api/sql \
-d "SELECT product_id, COUNT(*) as event_count \
FROM customer_events \
WHERE event_type = 'purchase' \
GROUP BY product_id"
# Or use GraphQL for precise field selection
curl -X POST http://localhost:8080/api/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ products(filter: {price: {gt: 100}}) { name price category } }"}'
What makes this architecturally interesting is how ROAPI leverages Arrow's columnar format. Traditional row-oriented systems serialize each record independently, but Arrow keeps columns contiguous in memory. When you query SELECT price FROM products WHERE category = 'electronics', DataFusion scans only the category and price columns, skipping irrelevant data entirely. For analytical queries that touch few columns but many rows, this is dramatically faster than JSON parsing or row-based formats.
The dynamic table registration feature demonstrates ROAPI's runtime flexibility. You can add datasets without restarting:
curl -X POST http://localhost:8080/api/tables/sales_2024 \
-H "Content-Type: application/json" \
-d '{
"uri": "s3://analytics/sales/2024.parquet",
"option": {"format": "parquet"}
}'
This immediately makes the table queryable. Behind the scenes, ROAPI loads the file into Arrow tables, infers the schema, and registers it with DataFusion's catalog. This is particularly useful for dynamic dashboards or data exploration tools where users need to load ad-hoc datasets.
Response format negotiation is handled through Accept headers. Request Accept: application/json for standard REST clients, or Accept: application/ for Arrow IPC streams when querying from Arrow-native tools like pandas or Polars. This zero-copy serialization path means Python clients can ingest query results directly into DataFrames without JSON deserialization overhead.
For production deployments, ROAPI compiles to a single 20MB binary with no runtime dependencies. Containerized deployments are straightforward—the official Docker image is 30MB compressed. Memory usage is predictable: loaded datasets reside entirely in RAM, so a 2GB Parquet file needs roughly 2GB of memory plus query execution overhead. DataFusion's memory management includes spill-to-disk for large sorts and aggregations, but the base tables stay in-memory for consistent query latency.
Gotcha
The read-only constraint is fundamental, not a missing feature. ROAPI has no concept of INSERT, UPDATE, or DELETE because the underlying architecture assumes immutable data sources. If you need to modify data, you must update the source files and reload tables. This works fine for ETL pipelines that generate new Parquet files nightly, but breaks down for applications requiring user-generated content or real-time updates.
Memory limitations hit faster than you might expect. ROAPI loads entire datasets into RAM at startup, and while DataFusion handles query execution efficiently, there's no partial loading or pagination at the storage layer. A 10GB Parquet file means 10GB+ of RAM usage. The documentation suggests keeping datasets under 50GB, but practical limits depend on your instance size and query concurrency. If multiple simultaneous queries perform large aggregations, memory spikes can trigger OOM kills. For truly large datasets, you'll need to pre-aggregate or partition the data externally, or migrate to DuckDB or ClickHouse which support disk-backed storage.
Schema evolution can cause surprising failures. ROAPI infers schemas at load time, but if your source data changes structure—a CSV adds a column, a JSON file changes a field from integer to string—existing queries may break. There's no schema versioning or migration system. This is acceptable for controlled datasets but problematic for third-party data sources where you don't control the schema. The PostgreSQL wire protocol support also has limitations; it works with many clients but doesn't implement the full protocol, so advanced features like prepared statements or COPY operations may not work as expected.
Verdict
Use if: You need to expose static or slowly-changing datasets (reference data, historical analytics, ETL outputs, data lake files) as queryable APIs without building custom backends. ROAPI excels for prototyping data products, serving dashboards that query Parquet files in S3, or providing SQL interfaces to CSV/JSON data for analysts. It's ideal when data updates happen in batch (hourly, daily) rather than real-time, and when datasets fit comfortably in memory (under 50GB as a rule of thumb). The multi-protocol support shines when different clients need different query interfaces against the same data. Skip if: You need write operations, real-time data updates, or transactional consistency. Applications with user-generated content, collaborative editing, or fine-grained permissions should use PostgreSQL with PostgREST instead. For datasets exceeding available RAM, use DuckDB for analytical workloads or ClickHouse for massive-scale analytics. ROAPI is deliberately limited in scope, and pushing it beyond read-only analytical queries over modestly-sized datasets means fighting the architecture rather than leveraging it.