getML: How a Custom Database Engine Achieves 1000x Speedup in Automated Feature Engineering
Hook
Most automated feature engineering tools are slow because they’re built on general-purpose databases. getML achieves up to 1000x speedups with a custom database engine specifically designed for feature generation.
Context
Feature engineering on relational data is notoriously time-consuming. Data scientists working with multiple tables—customers, transactions, products—spend significant time crafting aggregations, temporal joins, and rolling windows. Tools like featuretools automated this process, but introduced a new problem: runtime performance. A feature generation job that should take minutes can stretch into hours because general-purpose databases aren't optimized for the specific access patterns of propositionalization algorithms.
getML emerged from this performance bottleneck. Rather than building another Python library atop existing databases, the team created a C++ engine with an integrated in-memory database specifically designed for automated feature engineering. This architecture-first approach enables the kind of speedups—60x to 1000x according to their benchmarks—that fundamentally change how data scientists work. Instead of running feature generation overnight, you can iterate within a single working session.
Technical Insight
The core innovation in getML is its custom database engine optimized for propositionalization. When you define relationships between tables, getML uses specialized data structures that handle temporal relationships and maintain aggregation-friendly layouts in memory. The FastProp algorithm exploits this by performing aggregations directly on these structures without materializing intermediate join results.
Here’s how you’d use getML to build features from a relational schema:
import getml

# Connect to the getML engine
getml.engine.launch()

# Load your data from various sources
orders = getml.DataFrame.from_csv('orders.csv', name='orders')
customers = getml.DataFrame.from_csv('customers.csv', name='customers')

# Define the schema and temporal relationships
orders.set_role(['order_date'], getml.data.roles.time_stamp)
orders.set_role(['customer_id'], getml.data.roles.join_key)
orders.set_role(['revenue'], getml.data.roles.target)
customers.set_role(['customer_id'], getml.data.roles.join_key)

# Create a feature learning pipeline with FastProp
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_features=50,
)

pipe = getml.Pipeline(
    data_model=orders.to_placeholder('orders')
        .join(customers.to_placeholder('customers'), on='customer_id'),
    feature_learners=[fast_prop],
    predictors=[getml.predictors.XGBoostRegressor()],
)

pipe.fit(orders, customers)

# orders_test: a held-out DataFrame loaded and annotated the same way as orders
predictions = pipe.predict(orders_test, customers)
The FastProp algorithm operates by generating candidate features through aggregations (COUNT, AVG, SUM, MIN, MAX), temporal windows (rolling averages over different time horizons), and seasonal patterns. What makes it fast is that these operations happen within the custom database engine using columnar storage and vectorized execution. The engine handles temporal relationships efficiently, so when FastProp needs aggregations over time windows for each entity, the optimized data structures enable rapid computation.
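For comparison, here is a hand-written sketch of the same propositionalization step in pandas, the kind of per-entity aggregation that FastProp enumerates and evaluates automatically inside its engine. The table and column names here are illustrative, not part of the getML API:

```python
import pandas as pd

# A toy child table: several orders per customer.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Hand-crafted propositionalization: one aggregate feature per column/function
# pair, computed per parent entity and ready to join back onto the parent table.
features = orders.groupby("customer_id")["amount"].agg(
    order_count="count",
    amount_avg="mean",
    amount_sum="sum",
    amount_min="min",
    amount_max="max",
)
```

FastProp generates candidates like these across every numeric column, join path, and time window, then keeps only the most predictive ones; the custom engine is what makes evaluating thousands of such candidates cheap.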
The architecture follows a client-server model. On Linux, the C++ engine runs as a native process. On macOS and Windows, it runs inside Docker, which adds deployment overhead but maintains consistency. The Python API communicates with this engine and allows monitoring through the getML Monitor interface.
getML handles time series seasonality through built-in preprocessors. The Seasonal preprocessor extracts cyclical patterns (day of week, month of year) without manual feature crafting. Combined with exponentially weighted moving averages and lagged features generated by FastProp, this covers most time series feature engineering scenarios. The system automatically selects the most predictive features through embedded feature selection, preventing the dimensionality explosion that plagued earlier propositionalization approaches.
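As a sketch of how this plugs in (assuming the getml package and a running engine; the data model is elided and the parameters are illustrative), the Seasonal preprocessor is simply added to the pipeline ahead of the feature learner:

```python
import getml

# Hypothetical sketch: the Seasonal preprocessor extracts cyclical components
# (day of week, month of year, ...) from time stamp columns before FastProp
# runs, so those components are available as inputs to aggregation features.
pipe = getml.Pipeline(
    data_model=...,  # placeholder joins, defined as in the earlier example
    preprocessors=[getml.preprocessors.Seasonal()],
    feature_learners=[getml.feature_learning.FastProp(num_features=50)],
)
```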
For data ingestion, getML supports multiple sources beyond CSV. You can connect directly to PostgreSQL, MySQL, MariaDB, Greenplum, SQLite databases via ODBC, and it can read Pandas DataFrames and JSON for ad-hoc analysis. The engine imports data into its in-memory format, which is where the performance advantage begins—subsequent operations work with the optimized memory layout.
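A hedged sketch of the non-CSV ingestion paths (this assumes a running engine and a reachable database; the credentials, file names, and table names are placeholders, not values from the getML documentation):

```python
import getml
import pandas as pd

# Register a database connection with the engine, then import a table
# directly into getML's in-memory format.
getml.database.connect_postgres(
    host="localhost", dbname="shop", user="analyst", password="secret",
)
orders = getml.DataFrame.from_db(table_name="orders", name="orders")

# Ad-hoc path: hand an in-memory pandas DataFrame to the engine.
df = pd.read_json("events.json")
events = getml.DataFrame.from_pandas(df, name="events")
```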
Gotcha
The licensing model will surprise you: the community edition "must not be used for productive purposes." The Elastic License v2 allows development and research, but the Professional and Enterprise editions are required for production use. This is stated in LICENSE.txt and the documentation. For teams accustomed to truly open-source tools like featuretools or scikit-learn, this is a significant constraint: you can prototype and prove value, but licensing conversations must happen before production deployment.
The Docker requirement on macOS and Windows creates friction. While Linux users get a native binary installable via pip, other platforms must run docker compose to start the getML service before the Python API works. This adds memory overhead, complicates CI/CD pipelines, and requires container orchestration knowledge. Teams working in pure Python environments will need infrastructure changes. The 236 GitHub stars suggest a smaller community compared to more established alternatives, which means you’ll rely more heavily on vendor documentation and official support channels.
Verdict
Use if: You're working with complex relational databases or time series where feature engineering is the bottleneck, you've already tried other tools and found them too slow, and you can accommodate the licensing model for production use. The 60-1000x speedup is real and transformative for large-scale feature generation. The FastProp algorithm provides significant value if feature engineering currently represents a major time investment for your team.
Skip if: You need a production-ready open-source solution without licensing constraints, you're working on simpler datasets where other tools' performance suffices, or you can't accommodate Docker in your deployment pipeline on macOS/Windows. Also consider carefully if you need extensive community support: the smaller user base means you'll rely more on vendor documentation and support contracts than on community resources.