
AI Transforms Data Engineering: New Assets, Testing, and Uptime Demands

TL;DR

  • AI models are transforming data engineering: they turn unstructured data into structured assets and introduce new data types such as vector embeddings, requiring data preparation and storage to be adapted for AI inference.
  • The rise of AI necessitates treating experimentation and evaluation as fundamental testing practices, expanding the scope beyond traditional data quality and integration tests to build confidence in evolving systems.
  • Interactive AI applications demand higher data reliability and uptime SLAs compared to traditional BI, shifting operational focus from batch processing to continuous availability for user-facing systems.
  • Generative AI accelerates development cycles by producing code and insights, shortening the path from business ideas to actionable answers and reducing reliance on manual analysis.
  • The integration of AI blurs traditional data engineering and ML engineering roles, requiring tighter collaboration and a broader skill set to manage complex, end-to-end data pipelines.
  • New data asset types, such as vector embeddings for AI models, require data engineers to adapt structuring and storage techniques, moving beyond traditional tabular formats.
  • Orchestration tools now encompass agentic AI systems, demanding new patterns and practices for managing complex execution loops and data access beyond traditional ETL workflows.

Deep Dive

The practice of data engineering is undergoing a significant transformation, blurring the lines with AI engineering as generative AI and large language models become integral to data preparation, asset creation, and operational reliability. This evolution necessitates a shift from deterministic ETL workflows to probabilistic approaches, demanding new data asset types like vectors and knowledge graphs, and fundamentally altering expectations for data timeliness and uptime, especially for interactive, user-facing AI applications.

The integration of AI into data engineering is reshaping core responsibilities and workflows. Previously distinct roles are now converging, requiring tighter collaboration between data engineers, analytics engineers, and machine learning engineers to deliver AI-powered products rapidly. AI models can now process unstructured data into structured assets, a capability that was largely sidelined in traditional data engineering. This means data engineers must incorporate language models and probabilistic technologies into their workflows, transforming raw information into useful knowledge in new ways.

The emergence of vector databases and embeddings represents a new class of data assets, requiring data engineers to structure data for efficient retrieval by AI models at inference time. Furthermore, the service level agreements (SLAs) for AI-related data assets are drastically different from traditional BI use cases; downtime that might be acceptable for a data warehouse can be catastrophic for a customer-facing AI application. This pressure for continuous availability is driving new operational characteristics and the need for memory stores. The use of retrieval augmented generation (RAG) also highlights a renewed interest in graph technologies for building knowledge and semantic graphs.
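
To make the retrieval pattern concrete, here is a minimal, self-contained sketch of an in-memory vector store with cosine-similarity search. The `embed` function is a trigram-hashing stand-in, not a real embedding model, and `InMemoryVectorStore` is a hypothetical class; a production system would use a trained embedding model and a dedicated vector database.

```python
import math

# Hypothetical in-memory "vector store" sketch. embed() is a hashing
# stand-in for a real embedding model; production systems use trained
# models and dedicated vector databases.

DIM = 64

def embed(text: str) -> list[float]:
    """Placeholder embedding: hash character trigrams into a unit vector."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

class InMemoryVectorStore:
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        # Vectors are unit-normalised, so the dot product is cosine similarity.
        scored = sorted(self.items,
                        key=lambda it: -sum(a * b for a, b in zip(it[1], q)))
        return [text for text, _ in scored[:k]]

store = InMemoryVectorStore()
store.add("invoice totals by region for Q3")
store.add("customer support chat transcripts")
print(store.search("support conversations with customers"))
```

The same shape, storing embeddings once at write time so that lookups at inference time are a cheap similarity scan, is what vector databases industrialize with approximate nearest-neighbor indexes.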

The pace of AI development and adoption is accelerating, demanding faster dataset onboarding and increased data accessibility for AI models. This shift is driven by AI's ability to generate code and insights, shortening the cycle from business idea to actionable answer. For instance, natural language processing capabilities of LLMs can now parse customer feedback from various sources, potentially reducing the need for dedicated business analysts for certain tasks. This also impacts orchestration, where traditional ETL orchestrators are now being overlaid with the complexities of agentic AI systems, requiring new governance, security, and access control considerations.

A critical consequence of this evolution is the necessity to integrate experimentation and evaluation as fundamental testing practices, moving beyond traditional data quality monitoring and unit tests. As models become core to pipelines, confidence building through rapid evaluation of new models and data is essential to maintain momentum and adapt to changing behaviors without fear of breakage.

This increased dimensionality of complexity, moving from deterministic software to data engineering and now to AI engineering, requires data practitioners to adapt their skills, collaborate more closely with AI teams, and understand the capabilities and limitations of AI models to operate effectively at higher levels of abstraction. The biggest gap in current data management tooling is the lack of established patterns and practices for effectively integrating AI and evolving data structures and delivery to accommodate these new access paradigms.
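
As a rough illustration of why agentic systems strain traditional ETL orchestrators, the sketch below replaces a fixed DAG with a step-by-step loop in which a planner chooses the next action; a real system would call an LLM where `stub_planner` is used here. The iteration cap and tool allow-list stand in for the governance and access-control considerations mentioned above; all names and tools are hypothetical.

```python
# Hypothetical agentic execution loop: the "plan" is chosen step by step
# rather than declared up front as a DAG. A hard step budget and a tool
# allow-list act as minimal governance and access control.

ALLOWED_TOOLS = {
    "fetch_orders": lambda: [{"id": 1, "total": 120.0}, {"id": 2, "total": 80.0}],
    "sum_totals": lambda orders: sum(o["total"] for o in orders),
}

def stub_planner(state: dict):
    """Stand-in for an LLM deciding the next action from current state."""
    if "orders" not in state:
        return ("fetch_orders", None)
    if "total" not in state:
        return ("sum_totals", state["orders"])
    return ("done", None)

def run_agent(planner, max_steps: int = 5) -> dict:
    state: dict = {}
    for _ in range(max_steps):  # cap the loop: agents can otherwise run forever
        action, arg = planner(state)
        if action == "done":
            return state
        if action not in ALLOWED_TOOLS:  # governance: reject unlisted tools
            raise PermissionError(action)
        tool = ALLOWED_TOOLS[action]
        result = tool() if arg is None else tool(arg)
        state["orders" if action == "fetch_orders" else "total"] = result
    raise RuntimeError("step budget exhausted")

print(run_agent(stub_planner))
```

Unlike a static pipeline, the sequence of tool calls here is decided at runtime, which is exactly what makes lineage, retries, and access control harder for conventional orchestrators to reason about.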

Action Items

  • Audit AI data pipelines: Assess 5-10 critical AI data flows for unstructured data processing and vector embedding generation.
  • Create runbook template: Define four required sections (setup, common failures, rollback, monitoring) for AI-specific data assets.
  • Implement experimentation workflows: Integrate evaluation as a core testing practice for 3-5 AI model deployments.
  • Measure AI data asset SLAs: Track uptime and latency for 5-10 customer-facing AI data services.
  • Refactor orchestration: Evaluate existing ETL orchestrators for suitability with agentic AI loops, targeting 2-3 key use cases.

Key Quotes

"The introduction of generative AI and AI Engineering to the technical ecosystem is changing the scope of responsibilities for data engineers and other data practitioners. Of note is the fact that: AI models can be used to process unstructured data sources into structured data assets, AI applications require new types of data assets, the SLAs for data assets related to AI serving are different from BI/warehouse use cases, the technology stacks for AI applications aren't necessarily the same as for analytical data pipelines, because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology."

Tobias Macey explains that the advent of generative AI is fundamentally altering the role of data engineers. He highlights several key shifts, including the ability of AI to transform unstructured data, the emergence of new data asset requirements for AI applications, and the altered service level agreements (SLAs) for AI-related data compared to traditional business intelligence use cases. Macey also notes the challenges posed by new technology stacks and the lack of standardized terminology in this rapidly evolving field.


"Because everything is so new there is not a lot of prior art, and the prior art that does exist isn't necessarily easy to find because of differences in terminology."

Tobias Macey points out that the novelty of AI engineering means there is a scarcity of established practices and documentation. He further elaborates that even when existing knowledge is available, it can be difficult to access or understand due to the inconsistent use of terminology across different projects and teams. This lack of a common language and established methods creates a significant hurdle for engineers entering this space.


"Another aspect of the ways that these language models and generative systems have changed the work of data engineers is that we have new types of data assets that we're responsible for, with vector databases and vectors being the biggest example. Before, we would have tabular data; you would maybe do some feature engineering on that and then send that to the training and maybe hydration of a machine learning model for being able to create some prediction or perform some action. But now we need to take structured, semi-structured, and unstructured data, turn them into those vector embeddings, and then store them in a relatively new technology for many, in the form of these vector databases, so that the models can retrieve that at inference time and be able to do it quickly and effectively."

Tobias Macey describes how generative AI introduces new data asset types, specifically mentioning vector databases. He contrasts this with previous practices involving tabular data for machine learning models. Macey explains that data engineers now need to convert various data formats into vector embeddings and store them in vector databases for efficient retrieval during AI model inference. This represents a significant evolution in data structuring and management for AI applications.


"Unless you were working in an organization that relies heavily on real-time streaming data, the SLA, or service level agreement, or the reliance on that data and its timeliness has changed a lot, and the uptime in particular of that data has changed a lot. If you were building something for a data warehouse that was feeding into a business intelligence system, there's a pretty high probability that if it goes down for 15 minutes or an hour in the middle of the night, well, you reload the data warehouse or update data or fix some bugs; it's not going to be that big of a deal. But if you are running a vector store and it is powering a customer-facing LLM that is doing inference and providing an interactive use case, if that same downtime happens, it's a much bigger deal."

Tobias Macey highlights the increased criticality of data timeliness and uptime due to AI applications. He contrasts the impact of downtime for traditional business intelligence systems with that of customer-facing AI models, such as those using vector stores for LLM inference. Macey emphasizes that for interactive AI use cases, even short periods of downtime are far more consequential, necessitating higher reliability standards for data infrastructure.


"Maybe one of the biggest new requirements, especially for data engineers, is the fact that the way that we test these different workflows has to change. We have things like data quality monitoring, we have unit tests, we have integration tests to make sure that if the logic breaks, if some new data comes in that breaks our prior assumptions, we would react to it and fix it. But as the models start to become more of the core execution, either in terms of the actual pipelines themselves or in terms of the serving of what the data is feeding into, everybody in the whole workflow needs to be able to include experimentation and evaluation as that means of building confidence and verifying changes and maintaining functionality."

Tobias Macey argues that the testing methodologies for data workflows must evolve with the integration of AI models. He explains that traditional methods like data quality monitoring, unit tests, and integration tests are insufficient when models become central to execution. Macey stresses the necessity of incorporating experimentation and evaluation into the workflow to build confidence, verify changes, and ensure functionality in AI-driven systems.
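
A minimal sketch of evaluation as a first-class testing practice, with a keyword classifier standing in for an LLM call: score the model against a small labelled set and fail the run when accuracy drops below a threshold. All names, labels, and the threshold are illustrative.

```python
# Hypothetical evaluation gate: score a model (stubbed as a keyword
# classifier) on a labelled set and fail the pipeline on regression,
# treating evaluation the way unit tests treat deterministic logic.

EVAL_SET = [
    ("the app crashes on login", "bug"),
    ("please add dark mode", "feature_request"),
    ("love the new dashboard", "praise"),
    ("export to CSV would be great", "feature_request"),
]

def classify(feedback: str) -> str:
    """Stand-in for an LLM call that labels customer feedback."""
    text = feedback.lower()
    if "crash" in text or "error" in text:
        return "bug"
    if "add" in text or "would be great" in text:
        return "feature_request"
    return "praise"

def evaluate(model, eval_set, threshold: float = 0.75) -> float:
    correct = sum(model(text) == label for text, label in eval_set)
    accuracy = correct / len(eval_set)
    # Gate the deploy: raise instead of silently shipping a regression.
    if accuracy < threshold:
        raise RuntimeError(f"eval accuracy {accuracy:.2f} below {threshold}")
    return accuracy

print(f"accuracy: {evaluate(classify, EVAL_SET):.2f}")
```

Running this gate on every model or prompt change is the probabilistic analogue of a failing unit test: it converts "the model feels worse" into a concrete, reviewable signal.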

Resources

External Resources

Articles & Papers

  • "From Data Engineering to AI Engineering: Where the Lines Blur" (Data Engineering Podcast) - Discussed as a solo episode reflecting on the evolution of data engineering due to AI.

Websites & Online Resources

  • dataengineeringpodcast.com/prefect - Referenced for information on Prefect's capabilities in data orchestration.
  • dataengineeringpodcast.com/bruin - Referenced for information on the Bruin open-source framework.
  • dataengineeringpodcast.com/datafold - Referenced for information on Datafold's AI-powered Migration Agent.
  • mongodb.com/Build - Referenced for information on MongoDB's platform for building AI applications.

Podcasts & Audio

  • Data Engineering Podcast - The primary podcast for this episode, discussing modern data management.
  • AI Engineering Podcast - Mentioned as a related podcast exploring the world of building AI systems.

Other Resources

  • Hadoop era - Referenced as a historical period in data management.
  • Cloud warehouse era - Referenced as a historical period in data management.
  • MLOps - Discussed as an evolution in operationalizing machine learning workflows.
  • Vector databases - Mentioned as a new type of data asset and storage technology for AI.
  • Knowledge graphs - Referenced as a technology driven by the use of language models.
  • Semantic graphs - Referenced as a technology driven by the use of language models.
  • Retrieval Augmented Generation (RAG) - Mentioned as a concept driving interest in graph technologies.
  • Agentic AI systems - Discussed in the context of new orchestration requirements.
  • Data quality monitoring - Mentioned as a testing practice for data workflows.
  • Unit tests - Mentioned as a testing practice for data workflows.
  • Integration tests - Mentioned as a testing practice for data workflows.
  • Experimentation - Discussed as a fundamental testing practice for AI systems.
  • Evaluation - Discussed as a fundamental testing practice for AI systems.
  • Deterministic software - Contrasted with the complexity of AI engineering.
  • Data pipelines - Referenced as a core component of data engineering.
  • ETL - Mentioned as the traditional workflow pattern managed by orchestration tools.
  • Dagster - Mentioned as an orchestration tool.
  • Airflow - Mentioned as an orchestration tool.
  • Prefect - Mentioned as an orchestration tool.
  • MongoDB - Discussed as a platform for developers to ship AI apps.
  • Podcast.__init__ - Mentioned as a related podcast covering the Python language.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.