Relational Graphs End Information Loss from Manual Feature Engineering
The Hidden Cost of Manual Feature Engineering, and Why Relational Graphs Change Everything
Every organization sits on a huge amount of operational data stored in relational databases. But according to Jure Leskovec, professor at Stanford and chief scientist at Kumo.ai, today's AI is largely blind to this data. The conventional approach (manual feature engineering, table flattening, endless ETL pipelines) is expensive and slow, and it discards information that attention-based models can now preserve. The less obvious implication: the real bottleneck is not model architecture or compute. It is a data preparation process that has not changed in 30 years. This conversation lays out why predictive modeling is ripe for disruption, and why teams that invest in relational transformers now can build a lasting competitive advantage. Data scientists, ML engineers, and technical leaders will find it useful for understanding how to move past the old approach without waiting for a trillion-parameter foundation model.
Why the Obvious Fix (Engineering Features) Makes Things Worse
Most teams approach predictive modeling the same way they did in the 1990s: take a set of relational tables, manually aggregate transaction histories into summary statistics, concatenate everything into a single training table, and then train a model. The problem is not just that this is labor-intensive. Leskovec points out it takes about "two full-time employees per model" and "half a year to build a model and put it in production." The deeper issue is that aggregation destroys information. Every time you convert a sequence of transactions into a count over seven days, you lose the ordering, temporal patterns, and interactions between purchases.
"When you take a sequence of transactions and aggregate it into a transaction count over the last seven days, you have thrown away a lot of information."
The obvious solution, engineering more features, creates a feedback loop that compounds over time. More features mean more signals to update in real time as new data arrives: each new transaction triggers a cascade of recalculations across every aggregated view. And in an adversarial setting such as fraud detection, the model's accuracy degrades as attackers adapt, forcing you to invent new features, retrain, and fall behind again. The system does more than resist improvement. It punishes effort, because every round of new features gives the other side time to adapt.
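To make the information loss from aggregation concrete, here is a minimal sketch in Python. The two customers, their timestamps, and the helper names are invented for illustration. Both customers get the same seven-day transaction count, even though one buys steadily and the other makes a burst of purchases within minutes.

```python
from datetime import datetime, timedelta

def count_last_7_days(timestamps, as_of):
    """Classic engineered feature: number of transactions in the 7 days before `as_of`."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in timestamps if window_start <= t < as_of)

def gaps_in_minutes(timestamps):
    """Time between consecutive transactions, which the count above throws away."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]

as_of = datetime(2024, 1, 8)

# Hypothetical customer A: four purchases spread evenly over the week.
steady = [datetime(2024, 1, day, 12) for day in (1, 3, 5, 7)]
# Hypothetical customer B: four purchases within ten minutes on the last day.
burst = [datetime(2024, 1, 7, 23, minute) for minute in (50, 52, 55, 59)]

steady_count = count_last_7_days(steady, as_of)
burst_count = count_last_7_days(burst, as_of)

print(steady_count, burst_count)   # the aggregated feature is identical
print(gaps_in_minutes(steady))     # gaps of two days
print(gaps_in_minutes(burst))      # gaps of a few minutes
```

A model trained only on the count cannot tell these two customers apart. A model that sees the raw sequence can.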
The Missing Modality: Databases as Graphs
Leskovec's insight is that relational databases are already graphs. Foreign keys are edges. Rows are nodes. Columns are node attributes. The reason we have not treated them that way is that we lacked architectures that could attend over this structure efficiently. His team's relational transformer generalizes the attention mechanism from text, where tokens are words, to a database, where "tokens" become cells, rows, and the relationships between them.
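The mapping from tables to a graph can be sketched in a few lines of plain Python. The toy tables, the `foreign_keys` dictionary, and the helper names below are assumptions made for illustration; this is not the representation used by Kumo or any particular library. It also shows how a cross-table question like "who buys the same products as me?" becomes a short walk along foreign-key edges.

```python
# Toy tables: each row is a dict and "id" is the primary key (illustrative data).
tables = {
    "customers": [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}],
    "products": [{"id": 10, "title": "Kettle"}, {"id": 11, "title": "Mug"}],
    "orders": [
        {"id": 100, "customer_id": 1, "product_id": 10, "amount": 30.0},
        {"id": 101, "customer_id": 2, "product_id": 10, "amount": 30.0},
        {"id": 102, "customer_id": 2, "product_id": 11, "amount": 8.0},
    ],
}
# Foreign keys: (child table, column) -> parent table.
foreign_keys = {
    ("orders", "customer_id"): "customers",
    ("orders", "product_id"): "products",
}

def database_to_graph(tables, foreign_keys):
    """Rows become nodes keyed by (table, id); foreign-key references become edges."""
    nodes = {(table, row["id"]): row for table, rows in tables.items() for row in rows}
    edges = []
    for (child, column), parent in foreign_keys.items():
        for row in tables[child]:
            edges.append(((child, row["id"]), (parent, row[column])))
    return nodes, edges

def neighbors(node, edges):
    """Nodes connected to `node`, treating foreign-key edges as undirected."""
    linked = set()
    for a, b in edges:
        if a == node:
            linked.add(b)
        elif b == node:
            linked.add(a)
    return linked

def co_buyers(customer_id, edges):
    """Customers who ordered at least one product that `customer_id` ordered."""
    start = ("customers", customer_id)
    my_orders = neighbors(start, edges)
    my_products = {p for o in my_orders for p in neighbors(o, edges) if p[0] == "products"}
    their_orders = {o for p in my_products for o in neighbors(p, edges)}
    buyers = {c for o in their_orders for c in neighbors(o, edges) if c[0] == "customers"}
    return buyers - {start}

nodes, edges = database_to_graph(tables, foreign_keys)
print(len(nodes), len(edges))
print(co_buyers(1, edges))
```

Nothing is aggregated or flattened here: every row keeps its original columns, and the relationships stay explicit.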
Unlike the quadratic attention in standard transformers, relational attention is highly structured: attention inside a row, inside a column, and across tables via foreign keys. This is not a trade-off; it is a design feature that respects the data's natural organization. The immediate benefit is productivity: skip the ETL, let the model attend directly over raw tables. The second-order effect is more interesting: the model achieves "super human accuracy" because it maintains the fidelity of the original relational structure. Instead of guessing which aggregations matter, the attention mechanism learns which information to combine, including cross-table signals like "what are the characteristics of people who buy the same products as me?"
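One way to picture this structure is as a sparse attention mask over cells. The sketch below is an assumption about how such a mask could be built, not the relational transformer's actual implementation; the cell layout, `fk_links`, and `may_attend` are all hypothetical names. A cell may attend to another cell only if they share a row, share a column, or sit in rows joined by a foreign key.

```python
# Each token is one cell: (table, row_id, column). Illustrative schema.
cells = [
    ("customers", 1, "name"), ("customers", 1, "age"),
    ("customers", 2, "name"), ("customers", 2, "age"),
    ("orders", 100, "amount"), ("orders", 101, "amount"),
]
# Row-level foreign-key links: order row -> customer row (hypothetical data).
fk_links = {
    (("orders", 100), ("customers", 1)),
    (("orders", 101), ("customers", 2)),
}

def may_attend(a, b):
    """True if cell `a` is allowed to attend to cell `b` under the three rules."""
    row_a, row_b = a[:2], b[:2]
    same_row = row_a == row_b
    same_column = a[0] == b[0] and a[2] == b[2]
    linked = (row_a, row_b) in fk_links or (row_b, row_a) in fk_links
    return same_row or same_column or linked

# Boolean mask: mask[i][j] says whether cell i may attend to cell j.
mask = [[may_attend(a, b) for b in cells] for a in cells]
allowed = sum(sum(row) for row in mask)
total = len(cells) ** 2
print(f"{allowed} of {total} cell pairs are allowed")
```

Even on this tiny example the mask rules out a third of all cell pairs. On real databases, where most rows are unrelated to each other, the allowed fraction would be far smaller than in full quadratic attention.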
This is where conventional wisdom fails when extended forward. Many assume that graph neural networks (GNNs) already solve this. But Leskovec notes that GNNs suffer from oversmoothing: as you go too many hops away from a node, all neighborhoods start to look the same. Attention-based architectures avoid this because they can selectively attend over a large context window without averaging everything to mush. The consequence: you can now reason over deeper connections that were previously lost.
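Oversmoothing is easy to reproduce in a toy setting. The sketch below uses plain mean aggregation over neighbors with no learned weights, which is a deliberate simplification of a GNN layer and not any specific architecture. One node starts with a distinctive feature value, and after enough rounds of averaging every node ends up with nearly the same value.

```python
# Path graph 0-1-2-3-4; each node also lists itself (a self-loop).
adjacency = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2, 3], 3: [2, 3, 4], 4: [3, 4]}
# Only node 0 carries a distinctive signal at the start.
features = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}

def mean_aggregate(features, adjacency):
    """One simplified message-passing layer: replace each value by its neighborhood mean."""
    return {
        node: sum(features[m] for m in nbrs) / len(nbrs)
        for node, nbrs in adjacency.items()
    }

def spread(features):
    """Gap between the largest and smallest node value; 0 means indistinguishable."""
    return max(features.values()) - min(features.values())

spreads = []
current = features
for layer in range(50):
    current = mean_aggregate(current, adjacency)
    spreads.append(spread(current))

print(f"spread after 1 layer:   {spreads[0]:.4f}")
print(f"spread after 50 layers: {spreads[-1]:.6f}")
```

Each extra layer reaches one hop further, but it also averages away the differences between nodes. That is the trade-off selective attention is meant to avoid.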
Foundation Models for Tabular Data: Smaller, Cheaper, and Calibrated
Here is where the conversation gets wild. Leskovec describes a "ChatGPT moment but for predictive problems": a pre-trained model that is agnostic of both the database and the task. You can ask it, "predict churn for me (no purchase in next 30 days)," and get an answer immediately, with accuracy estimates and natural language explanations. Then change the definition: "churn means less than $10 monthly spend." The model adapts without retraining. This is in-context learning for structured data.
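To see why changing the task definition is normally expensive, it helps to write the two churn definitions down as code. The sketch below is illustrative only: the transactions, function names, and the choice of a 30-day window as a stand-in for "monthly" are assumptions. In a conventional pipeline, switching from one function to the other means relabeling and retraining; the claim in the conversation is that an in-context model takes the definition as part of the query instead.

```python
from datetime import date, timedelta

# Hypothetical raw transactions: (customer_id, purchase date, amount in dollars).
transactions = [
    ("a", date(2024, 3, 5), 40.0),
    ("b", date(2024, 3, 20), 4.0),
    ("b", date(2024, 3, 25), 3.0),
]
customers = ["a", "b", "c"]
as_of = date(2024, 3, 1)

def amounts_in_window(customer, start, days=30):
    """Amounts of this customer's purchases in [start, start + days)."""
    end = start + timedelta(days=days)
    return [amt for cid, day, amt in transactions if cid == customer and start <= day < end]

def churn_no_purchase(customer, as_of):
    """Definition 1: no purchase in the next 30 days."""
    return len(amounts_in_window(customer, as_of)) == 0

def churn_low_spend(customer, as_of, threshold=10.0):
    """Definition 2: less than $10 spent in the next 30 days."""
    return sum(amounts_in_window(customer, as_of)) < threshold

labels_v1 = {c: churn_no_purchase(c, as_of) for c in customers}
labels_v2 = {c: churn_low_spend(c, as_of) for c in customers}
print(labels_v1)
print(labels_v2)
```

Customer "b" is the interesting case: active under the first definition, churned under the second. Same raw data, different task.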
Unlike LLMs, these models do not need billions of parameters or $5 billion GPU clusters. Leskovec's single-table models are around 25 million parameters. Multi-table models are larger but still orders of magnitude smaller than GPT-scale. Why? Because the information is already in the data: the model does not need to memorize the internet. It needs to learn how to spot patterns. And because the prediction objective is continuous (you penalize the distance between prediction and truth), the model gets a much sharper signal during training. There is no "lobotomizing" post-training phase needed to align outputs. The result is a calibrated, honest model that can even estimate its own uncertainty by holding out some in-context examples.
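The hold-out idea can be sketched without any foundation model at all. In the example below, `predict_proba` is a nearest-neighbor stand-in, clearly not a relational foundation model, and the labeled context is invented. The point is only the mechanics: set aside some labeled context examples, predict them from the rest, and score the predictions with a distance-based measure (here the Brier score, the mean squared gap between predicted probability and outcome).

```python
def predict_proba(support, x, k=3):
    """Stand-in predictor (an assumption, not a foundation model):
    mean label of the k support examples closest to x."""
    nearest = sorted(support, key=lambda example: abs(example[0] - x))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def brier_score(probs, labels):
    """Mean squared distance between predicted probability and true outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

# Hypothetical labeled context: (days since last purchase, churned? 1 or 0).
context = [
    (2, 0), (5, 0), (8, 0), (12, 0), (20, 1),
    (25, 0), (30, 1), (40, 1), (55, 1), (60, 1),
]

# Hold out every third example; the rest stays available as context.
held_out = context[::3]
support = [example for i, example in enumerate(context) if i % 3 != 0]

probs = [predict_proba(support, x) for x, _ in held_out]
labels = [y for _, y in held_out]
held_out_error = brier_score(probs, labels)
print(f"Brier score on held-out context examples: {held_out_error:.3f}")
```

A low score on the held-out examples is evidence that predictions on new rows can be trusted to a similar degree; a high score is a warning before anything ships.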
"Skip the ETL, take a BFI model GPU based, and just let it train directly under all data."
The delayed payoff here is significant. While everyone is chasing bigger LLMs, the teams that invest in relational foundation models will have cheaper, faster, and more accurate decision-making systems for their core operational data. And because these models can run on a single GPU (or even an iPhone for small ones), the barrier to entry is dramatically lower.
Debugging Data with Attention Traces
One of the most underrated capabilities Leskovec highlights is the explanatory power of attention. Because the model attends over actual cells and relationships, you can back-trace a prediction to see exactly which table, row, and column drove the outcome. This turns the model into a forensic tool. You can detect data leakage (e.g., a column that accidentally includes future information), discover unseen signals, and even ask counterfactual questions: "If I offer this customer a 10% discount, what happens to the probability of closing?" The implications for sales, customer support, and risk assessment are enormous. The model does more than predict; it reveals hidden dynamics in your own data that you did not know existed.
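A leakage check on top of such a trace could look like the sketch below. The trace format is hypothetical: the source does not describe what an attention trace looks like in practice, so the fields `weight` and `known_at`, and the function `find_leaks`, are assumptions for illustration. The idea is to flag any cell the prediction leaned on whose value only became known after the prediction time.

```python
from datetime import datetime

prediction_time = datetime(2024, 6, 1)

# Hypothetical trace: cells a prediction relied on, with an importance weight
# and the time at which each cell's value became known.
trace = [
    {"table": "orders", "row": 100, "column": "amount",
     "weight": 0.15, "known_at": datetime(2024, 5, 20)},
    {"table": "customers", "row": 1, "column": "plan",
     "weight": 0.10, "known_at": datetime(2024, 1, 3)},
    {"table": "tickets", "row": 7, "column": "cancel_reason",
     "weight": 0.75, "known_at": datetime(2024, 6, 14)},
]

def find_leaks(trace, prediction_time, min_weight=0.05):
    """Cells that mattered to the prediction but were not yet known when it was made."""
    leaks = [
        cell for cell in trace
        if cell["known_at"] > prediction_time and cell["weight"] >= min_weight
    ]
    return sorted(leaks, key=lambda cell: cell["weight"], reverse=True)

leaks = find_leaks(trace, prediction_time)
leaked_share = sum(cell["weight"] for cell in leaks)

for cell in leaks:
    print(f'{cell["table"]}.{cell["column"]} (row {cell["row"]}) weight={cell["weight"]}')
print(f"share of importance coming from future data: {leaked_share:.2f}")
```

In this invented example, most of the prediction's weight sits on a cancellation reason recorded two weeks after the prediction date, which is exactly the kind of column that makes a model look accurate offline and fail in production.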
Key Action Items
- Over the next quarter: Audit your feature engineering pipeline. Map the time and headcount spent on manual feature creation for each predictive model. Leskovec cites "two full-time employees per model" - if that resonates, relational transformers are a direct replacement that eliminates this step.
- Immediately: Test for data leakage using attention traces. If you have an existing predictive model, run it through a relational transformer's explanation mechanism to see if any features are inadvertently using future data. The debugging capability alone can save months of retraining.
- Over the next 6 to 12 months: Experiment with a relational foundation model on a high-value use case like churn prediction or lead scoring. Start with a small dataset (thousands of examples) to validate the approach. Use tools like KumoFM (relationalfoundationmodel.ai) or PyTorch Geometric to compare against your current pipeline.
- 12 to 18 months: Consider building a domain-specific relational foundation model if you have a large, unique dataset (e.g., banking transactions, supply chain data). Leskovec notes this requires "millions of dollars" of GPU investment, not billions, and it is already happening at enterprise scale.
- Over the next year: Integrate relational models into agent systems. As agents become more autonomous, they need predictive reasoning over structured data, not just knowledge retrieval. Leskovec reports 30% improvements in sales team effectiveness when agents use structured predictions with counterfactuals.
- Monitor Snowflake and data warehouse integrations for seamless deployment. Leskovec mentions these technologies will be available pre-installed in major data warehouses soon, lowering the operational overhead.
- Long-term: Shift from ETL-based to GPU-based training loops. The real competitive advantage comes from eliminating the latency and complexity of traditional data preparation. Teams that wait for infrastructure to catch up will lose the head start.