The AI Revolution is Being Held Back by Old Data Problems. This Conversation Reveals the Hidden Costs and the Path to True Data Readiness.
This podcast episode, featuring Harsha Chintalapani, co-founder and CTO of Collate, examines why Artificial Intelligence, particularly Large Language Models (LLMs), struggles with production data. The core thesis is that much-hyped AI capabilities are being hobbled by fundamental, long-standing issues in how organizations manage and understand their structured data. These aren't new problems; AI has simply amplified them, exposing the fragility of data ecosystems built without a strong foundation of metadata and governance. Companies aiming to leverage AI effectively must confront these hidden consequences, moving beyond mere data processing to true data understanding. The discussion is aimed at data engineers, CTOs, and product leaders who are investing in AI and finding that results fall short of expectations, and it offers a roadmap for getting their data ready.
The Invisible Walls: Why AI Stumbles on Structured Data
The promise of AI, especially with LLMs, often includes the ability to make sense of vast amounts of data, even unstructured text. However, when these models encounter the structured, real-time production data that powers most businesses, they frequently falter. This isn't a failure of AI itself, but a symptom of deeper, systemic issues within data management that have been brewing for years. Harsha Chintalapani, drawing from his extensive experience at Yahoo, Hortonworks, and Uber, illuminates these challenges, revealing how a lack of semantic understanding and robust metadata management creates invisible walls that even sophisticated AI cannot easily breach.
The problem, as Chintalapani explains, is not with processing power or algorithmic sophistication. It's with the data itself. At Uber, for instance, a core concept like "customer" had wildly different definitions depending on whether you asked marketing, sales, or engineering. This ambiguity, while manageable in smaller teams, becomes a significant impediment when scaled. Imagine an LLM trying to generate a customer health report when the definition of "customer" is a moving target, or when the data required is scattered across hundreds of similarly named tables, some of which are outdated. This leads to critical errors, like the infamous instance at Uber where an analyst underreported trip numbers because they used an incorrect table.
"The problems are everyday or every other day problems are of different scopes. The problems are not so much in processing the data itself, but understanding the data. Like when the schema change, what breaks downstream? If you ask a business concept such as location, it depends on which team, which user you ask or which engineer you ask, you get a different definition."
-- Harsha Chintalapani
This lack of shared understanding, or semantics, is a human problem that AI inherits and amplifies. When data infrastructure becomes self-service, the explosion of tables and datasets, often with inconsistent naming conventions, creates a discovery nightmare. Finding the right data, ensuring its freshness and quality, and trusting its lineage become monumental tasks. AI, without this context, is essentially flying blind. It cannot discern which "customer" table is the authoritative source, nor can it understand the nuances of a "customer health" metric if those definitions were never explicitly documented and codified. This is where the immediate benefits of AI hit a hard ceiling, revealing the downstream consequences of neglecting data governance.
The Compounding Cost of Ambiguity
The impact of these data issues extends far beyond mere inconvenience. When critical business decisions are made on flawed data, the consequences can be severe. Chintalapani recounts how a lack of proper pipeline ownership and understanding at Uber led to significant accounting errors, including a massive misallocation of ad spend and incentives targeting incorrect demographics. This wasn't a case of the AI failing to process data; it was a case of the AI being fed fundamentally incorrect or misinterpreted data due to systemic issues. The immediate problem--processing data--was solved with technology, but the downstream effect--understanding and trusting that data--was left unaddressed, creating a hidden cost that compounded over time.
"For example, we spent huge amounts of money... in where we targeted it around demographic for the ads and incentives because the data came in the wrong. You go through the model and model says that, hey, this demographic, if you advertise certain coupons, they tend to buy, you know, more of Uber Eats, more of Uber rides or whatnot, right? So those are the implications of not doing the data right."
-- Harsha Chintalapani
The traditional approach of separating transactional and analytical databases, while necessary for performance, doesn't solve the semantic problem. Data warehouses and lakes provide efficient storage and processing, but they don't inherently imbue data with meaning. The real challenge lies in bridging the gap between raw data and actionable business insights. This requires a disciplined approach to metadata management, where the definitions, ownership, and lineage of data are meticulously documented and made accessible. Without this, even the most advanced AI will struggle to deliver on its promise, leaving organizations with expensive tools and unreliable results.
Building the Foundation: Metadata as the Bedrock for AI
The solution, as advocated by Chintalapani, lies in treating metadata not as an afterthought, but as a foundational principle. At Uber, the team tackled this by first focusing on understanding the metadata of all their systems. They defined standardized schemas for metadata, including ownership, data quality signals, and observability metrics. This wasn't about manually cataloging everything, which is an impossible task at scale, but about automating the process of scanning and ingesting metadata.
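To make the idea of a standardized metadata schema plus automated scanning concrete, here is a minimal sketch. It is not the schema Uber used: the `TableMetadata` fields and the `scan_sqlite` helper are assumptions chosen for illustration, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """One catalog entry. Field names are illustrative, not a standard."""
    name: str
    columns: dict[str, str]      # column name -> declared type
    owner: str = "unassigned"    # assigned later by a person or a policy
    row_count: int = 0           # a crude observability signal
    tags: list[str] = field(default_factory=list)


def scan_sqlite(conn: sqlite3.Connection) -> list[TableMetadata]:
    """Build catalog entries by reading SQLite's own system catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = []
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cols = {
            row[1]: row[2]
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        }
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        catalog.append(TableMetadata(name=table, columns=cols, row_count=count))
    return catalog


# Stand-in for a production source: a tiny in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")

catalog = scan_sqlite(conn)
for entry in catalog:
    print(entry)
```

The point of the design is that nothing is typed in by hand: the scan can run on a schedule, and humans only add what the source system cannot know, such as ownership and tags.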
This systematic approach provides a "map" for the data. When an analyst or an AI needs to find information, they can query this metadata layer. For example, searching for "customer data" can reveal its location, ownership, and associated quality metrics. This capability is crucial for AI. By exposing the semantic understanding of data--what a "customer" means, how "ARR" is calculated, and which tables contain PII--organizations can enable AI to query data effectively and accurately.
"Essentially, it becomes a map for your data... Once you have that, you can actually give the metadata to the to the LLMs so that they can understand on how to understand, you know, customer AR, how to create a query of a customer AR."
-- Harsha Chintalapani
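One way to picture handing the "map" to an LLM is sketched below. Everything here is hypothetical: the catalog entries, the glossary definitions, the `certified` flag, and the `search` and `build_context` helpers are invented for illustration, and no model is actually called. The sketch only shows how metadata could be turned into context that steers a text-to-SQL prompt toward the authoritative table.

```python
# Hypothetical catalog entries; table names and owners are made up.
CATALOG = [
    {
        "table": "scratch.customer_arr_old",
        "description": "Deprecated copy of customer ARR",
        "owner": "unknown",
        "columns": {"cust": "INTEGER", "arr": "NUMERIC"},
        "certified": False,
    },
    {
        "table": "finance.customer_arr",
        "description": "Annual recurring revenue per paying customer",
        "owner": "finance-data",
        "columns": {"customer_id": "INTEGER", "arr_usd": "NUMERIC"},
        "certified": True,
    },
]

# Hypothetical business glossary agreed on by the teams involved.
GLOSSARY = {
    "customer": "An account with at least one paid invoice",
    "ARR": "Annual recurring revenue in USD, excluding one-time fees",
}


def search(catalog: list[dict], term: str) -> list[dict]:
    """Return entries mentioning the term, certified tables first."""
    term = term.lower()
    hits = [
        e for e in catalog
        if term in e["table"].lower() or term in e["description"].lower()
    ]
    return sorted(hits, key=lambda e: not e["certified"])


def build_context(question: str, catalog: list[dict], glossary: dict[str, str]) -> str:
    """Collect glossary terms and certified tables relevant to a question."""
    terms = [t for t in glossary if t.lower() in question.lower()]
    lines = ["Business definitions:"]
    lines += [f"- {t}: {glossary[t]}" for t in terms]
    lines.append("Certified tables:")
    seen = set()
    for t in terms:
        for entry in search(catalog, t):
            if entry["certified"] and entry["table"] not in seen:
                seen.add(entry["table"])
                cols = ", ".join(f"{c} {typ}" for c, typ in entry["columns"].items())
                lines.append(f"- {entry['table']} ({cols}), owner: {entry['owner']}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


context = build_context("What is the total ARR by customer?", CATALOG, GLOSSARY)
print(context)
```

Filtering on the certification flag before the prompt is built means the deprecated look-alike table never reaches the model, which is the failure mode described earlier with similarly named tables.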
The OpenMetadata platform, co-created by Chintalapani, embodies this philosophy. It aims to provide a unified platform for discovery, governance, data quality, and observability, all built on a semantic metadata graph. This approach allows for end-to-end data lineage, enabling users and AI to trace data from its source to its consumption. By making the implicit explicit through metadata, organizations can move from a reactive firefighting mode to a proactive, AI-ready state. This requires a cultural shift, treating data engineering with the same rigor as software engineering, complete with ownership, documentation, and quality standards.
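End-to-end lineage is what lets a team answer the question "when the schema changes, what breaks downstream?". The sketch below is not how any particular platform implements lineage; it assumes lineage has already been collected into a plain adjacency map (the `LINEAGE` dictionary and its asset names are hypothetical) and uses a breadth-first traversal to list everything that depends on a given table.

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> assets that read from it.
LINEAGE = {
    "raw.trips": ["staging.trips_clean"],
    "staging.trips_clean": ["marts.daily_trips", "marts.driver_incentives"],
    "marts.daily_trips": ["dashboard.exec_trips"],
    "marts.driver_incentives": [],
    "dashboard.exec_trips": [],
}


def downstream(lineage: dict[str, list[str]], start: str) -> list[str]:
    """Return every asset reachable from `start`, nearest dependents first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# Which assets need review if staging.trips_clean changes its schema?
impacted = downstream(LINEAGE, "staging.trips_clean")
print(impacted)
```

Running the same traversal on reversed edges would answer the opposite question: which sources feed a given dashboard.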
Key Action Items
Immediate Action (Next Quarter):
- Inventory Existing Data Assets: Begin by cataloging all data sources, tables, and critical dashboards. This doesn't need to be exhaustive initially; focus on high-impact areas first.
- Define Core Business Concepts: Convene cross-functional teams (marketing, sales, product, engineering) to explicitly define 5-10 critical business terms (e.g., "customer," "ARR," "active user"). Document these definitions.
- Automate Metadata Ingestion: Implement tools or scripts to automatically pull metadata from your primary data sources (databases, data warehouses).
- Identify Critical Dashboards: Pinpoint the 5-10 most important business dashboards that drive daily decisions. This will help prioritize data lineage efforts.
Medium-Term Investment (Next 6-12 Months):
- Establish Data Ownership: Assign clear owners to critical data assets and pipelines. This is crucial for accountability and knowledge transfer.
- Implement Data Quality Checks: Begin defining and implementing automated data quality tests for key datasets, focusing on freshness, completeness, and validity.
- Document Data Lineage: Use metadata tools to map the flow of data for your critical dashboards and reports, tracing it back to its source.
- Explore Semantic Layer Tools: Investigate and pilot platforms that build a semantic layer or knowledge graph on top of your metadata, enabling more sophisticated data discovery and AI interaction.
Longer-Term Investment (12-18+ Months):
- Develop a Data Governance Framework: Formalize policies and procedures for data management, access, security, and compliance, integrating metadata and quality standards.
- Integrate AI with Metadata: Leverage your documented metadata and semantic understanding to enable AI tools (like text-to-SQL or LLM-driven analytics) to query data accurately and reliably.
- Foster a Data-Centric Culture: Continuously reinforce the importance of data quality, documentation, and semantic understanding across all departments, treating data with the same engineering rigor as code. This requires ongoing training and communication.
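The automated data quality tests named in the medium-term items (freshness, completeness, and validity) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the sample rows, the 24-hour freshness window, and the rule that ARR must be non-negative are all hypothetical, and a real deployment would run such checks against the warehouse rather than an in-memory list.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable


def is_fresh(last_loaded_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Freshness: the dataset was loaded within the allowed window."""
    return now - last_loaded_at <= max_age


def completeness(rows: list[dict], column: str) -> float:
    """Completeness: fraction of rows where the column is not null."""
    if not rows:
        return 0.0
    return sum(row.get(column) is not None for row in rows) / len(rows)


def invalid_rows(rows: list[dict], column: str, is_valid: Callable[[Any], bool]) -> list[dict]:
    """Validity: non-null values that fail a business rule."""
    return [
        row for row in rows
        if row.get(column) is not None and not is_valid(row[column])
    ]


# Hypothetical sample of a customer ARR table.
rows = [
    {"customer_id": 1, "arr_usd": 1200.0},
    {"customer_id": 2, "arr_usd": None},
    {"customer_id": 3, "arr_usd": -50.0},
]

# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)

report = {
    "fresh": is_fresh(last_load, timedelta(hours=24), now),
    "arr_completeness": completeness(rows, "arr_usd"),
    "negative_arr": invalid_rows(rows, "arr_usd", lambda v: v >= 0),
}
print(report)
```

Publishing results like these alongside each table's metadata gives analysts and AI tools a trust signal before they query the data.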