getML: How a Custom Database Engine Achieves 1000x Speedup in Automated Feature Engineering
Hook
Most automated feature engineering tools are slow because they’re built on general-purpose databases. getML achieves up to 1000x speedups with a custom database engine specifically designed for feature generation.
Context
Feature engineering on relational data is notoriously time-consuming. Data scientists working with multiple tables—customers, transactions, products—spend significant time crafting aggregations, temporal joins, and rolling windows. Tools like featuretools automated this process, but introduced a new problem: runtime performance. A feature generation job that should take minutes can stretch into hours because general-purpose databases aren't optimized for the specific access patterns of propositionalization algorithms.
getML emerged from this performance bottleneck. Rather than building another Python library atop existing databases, the team created a C++ engine with an integrated in-memory database specifically designed for automated feature engineering. This architecture-first approach enables the kind of speedups—60x to 1000x according to their benchmarks—that fundamentally change how data scientists work. Instead of running feature generation overnight, you can iterate within a single working session.
Technical Insight
The core innovation in getML is its custom database engine optimized for propositionalization. When you define relationships between tables, getML uses specialized data structures that handle temporal relationships and maintain aggregation-friendly layouts in memory. The FastProp algorithm exploits this by performing aggregations directly on these structures without materializing intermediate join results.
Here’s how you’d use getML to build features from a relational schema:
import getml

# Connect to the getML engine
getml.engine.launch()

# Load your data from various sources
orders = getml.DataFrame.from_csv('orders.csv', name='orders')
customers = getml.DataFrame.from_csv('customers.csv', name='customers')

# Define the schema and temporal relationships
orders.set_role(['order_date'], getml.data.roles.time_stamp)
orders.set_role(['customer_id'], getml.data.roles.join_key)
orders.set_role(['revenue'], getml.data.roles.target)
customers.set_role(['customer_id'], getml.data.roles.join_key)

# Create a feature learning pipeline with FastProp
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_features=50,
)

pipe = getml.Pipeline(
    data_model=orders.to_placeholder('orders')
        .join(customers.to_placeholder('customers'), on='customer_id'),
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)

pipe.fit(orders, customers)

# orders_test: a held-out DataFrame loaded and annotated the same way as orders
predictions = pipe.predict(orders_test, customers)
The FastProp algorithm operates by generating candidate features through aggregations (COUNT, AVG, SUM, MIN, MAX), temporal windows (rolling averages over different time horizons), and seasonal patterns. What makes it fast is that these operations happen within the custom database engine using columnar storage and vectorized execution. The engine handles temporal relationships efficiently, so when FastProp needs aggregations over time windows for each entity, the optimized data structures enable rapid computation.
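For comparison, here is a hand-written sketch of the same propositionalization step in pandas, the kind of per-entity aggregation that FastProp enumerates and evaluates automatically inside its engine. The table and column names here are illustrative, not part of the getML API:

```python
import pandas as pd

# A toy child table: several orders per customer.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Hand-crafted propositionalization: one aggregate feature per column/function
# pair, computed per parent entity and ready to join back onto the parent table.
features = orders.groupby("customer_id")["amount"].agg(
    order_count="count",
    amount_avg="mean",
    amount_sum="sum",
    amount_min="min",
    amount_max="max",
)
```

FastProp generates candidates like these across every numeric column, join path, and time window, then keeps only the most predictive ones; the custom engine is what makes evaluating thousands of such candidates cheap.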
The architecture follows a client-server model. On Linux, the C++ engine runs as a native process. On macOS and Windows, it runs inside Docker, which adds deployment overhead but maintains consistency. The Python API communicates with this engine and allows monitoring through the getML Monitor interface.
getML handles time series seasonality through built-in preprocessors. The Seasonal preprocessor extracts cyclical patterns (day of week, month of year) without manual feature crafting. Combined with exponentially weighted moving averages and lagged features generated by FastProp, this covers most time series feature engineering scenarios. The system automatically selects the most predictive features through embedded feature selection, preventing the dimensionality explosion that plagued earlier propositionalization approaches.
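As a sketch of how this plugs in (assuming the getml package and a running engine; the data model is elided and the parameters are illustrative), the Seasonal preprocessor is simply added to the pipeline ahead of the feature learner:

```python
import getml

# Hypothetical sketch: the Seasonal preprocessor extracts cyclical components
# (day of week, month of year, ...) from time stamp columns before FastProp
# runs, so those components are available as inputs to aggregation features.
pipe = getml.Pipeline(
    data_model=...,  # placeholder joins, defined as in the earlier example
    preprocessors=[getml.preprocessors.Seasonal()],
    feature_learners=[getml.feature_learning.FastProp(num_features=50)],
)
```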
For data ingestion, getML supports multiple sources beyond CSV. You can connect directly to PostgreSQL, MySQL, MariaDB, Greenplum, SQLite databases via ODBC, and it can read Pandas DataFrames and JSON for ad-hoc analysis. The engine imports data into its in-memory format, which is where the performance advantage begins—subsequent operations work with the optimized memory layout.
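A hedged sketch of the non-CSV ingestion paths (this assumes a running engine and a reachable database; the credentials, file names, and table names are placeholders, not values from the getML documentation):

```python
import getml
import pandas as pd

# Register a database connection with the engine, then import a table
# directly into getML's in-memory format.
getml.database.connect_postgres(
    host="localhost", dbname="shop", user="analyst", password="secret",
)
orders = getml.DataFrame.from_db(table_name="orders", name="orders")

# Ad-hoc path: hand an in-memory pandas DataFrame to the engine.
df = pd.read_json("events.json")
events = getml.DataFrame.from_pandas(df, name="events")
```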
Gotcha
The licensing model will surprise you: the community edition "must not be used for productive purposes." The Elastic License v2 allows development and research, but the Professional and Enterprise editions are required for production use. This is stated in LICENSE.txt and the documentation. For teams accustomed to truly open-source tools like featuretools or scikit-learn, this is a significant constraint: you can prototype and prove value, but licensing conversations must happen before production deployment.
The Docker requirement on macOS and Windows creates friction. While Linux users get a native binary installable via pip, other platforms must run docker compose to start the getML service before the Python API works. This adds memory overhead, complicates CI/CD pipelines, and requires container orchestration knowledge. Teams working in pure Python environments will need infrastructure changes. The 236 GitHub stars suggest a smaller community compared to more established alternatives, which means you’ll rely more heavily on vendor documentation and official support channels.
Verdict
Use if: You're working with complex relational databases or time series where feature engineering is the bottleneck, you've already tried other tools and found them too slow, and you can accommodate the licensing model for production use. The 60-1000x speedup is real and transformative for large-scale feature generation. The FastProp algorithm provides significant value if feature engineering currently represents a major time investment for your team.
Skip if: You need a production-ready open-source solution without licensing constraints, you're working on simpler datasets where other tools' performance suffices, or you can't accommodate Docker in your deployment pipeline on macOS/Windows. Also consider carefully if you need extensive community support: the smaller user base means you'll rely more on vendor documentation and support contracts than on community resources.