PuppyGraph Enables Zero-Copy Graph Querying On Existing Data Lakes

Original Title: Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

As AI agents and real-time analytics raise the demand for connected data, PuppyGraph offers a "zero-copy" approach: graph queries run directly against existing data lakes and warehouses. In this conversation, Weimo Liu describes both the technical design and the shift it implies for how teams access and use connected data. The usual cost of adopting a graph database, a heavy ETL burden and a duplicated copy of the data, is avoided, so complex, interconnected datasets can be queried where they already live. The discussion is most relevant to data engineers, architects, and AI practitioners who want to build sophisticated, responsive systems without the usual data movement overhead, and who see faster iteration and deeper insight as a competitive advantage.

The Unseen Cost of Data Silos: Reimagining Graph Analytics

The persistent obstacle to using graph data has been the friction of getting it into a graph-native format. Traditional graph databases often require extensive ETL, which leads to data duplication, added operational complexity, and long delays. Weimo Liu, co-founder of PuppyGraph, addresses this with an engine that queries data where it lives: in data lake table formats such as Iceberg, Delta, and Hudi, or even in transactional databases. This "zero-copy" approach is more than a convenience; it removes a major bottleneck and allows rapid exploration and analysis of connected data.

The practical effect is that organizations can start querying right away instead of spending months loading data. That shortens the time to discover important patterns, especially in domains like cybersecurity, where real-time analysis of logs can mean the difference between detecting a threat and suffering a breach. Liu describes how a security research team used PuppyGraph to analyze logs stored in Iceberg and uncovered a botnet that was still active long after the attackers were believed to have been apprehended. This was not a planned use case; it followed directly from making complex graph analysis available on existing data.

"Since 2022, GPT was becoming popular and some of my friends are founders of big large model projects and they shared with me that no one will write SQL or any other query in the future and agents will do everything. And we feel that, 'Oh, this is a big opportunity.'"

This quote frames PuppyGraph's genesis: a response to the growing need for AI agents that require a sophisticated understanding of data. Agents, unlike human users, are not limited to queries a person would comfortably write in SQL. They can generate and execute complex graph traversals, which demands an engine that handles such queries efficiently on large, distributed datasets. PuppyGraph's architecture, inspired by MPP (Massively Parallel Processing) systems and designed to shard data by address rather than by node, targets the notorious "supernode" problem, in which a single highly connected node concentrates work on one machine. Per the conversation, this allows multi-hop traversals to scale linearly, a workload that often overwhelms traditional graph databases. Running 10-hop traversals in seconds rather than hours or days makes applications practical that previously were not.
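To make the supernode point concrete, here is a minimal sketch, not PuppyGraph's actual implementation. It assumes, purely for illustration, that edges are spread across partitions independently of their source node, so a highly connected node's edges do not all land on one worker, and that each hop is a scan of every partition against the current frontier. The names `partition_edges` and `k_hop` and the sample graph are hypothetical.

```python
def partition_edges(edges, num_partitions):
    """Spread edges round-robin so no partition owns all edges of one node."""
    partitions = [[] for _ in range(num_partitions)]
    for i, edge in enumerate(edges):
        partitions[i % num_partitions].append(edge)
    return partitions


def k_hop(partitions, start, hops):
    """Return the set of nodes reachable from `start` in at most `hops` hops."""
    visited = {start}
    frontier = {start}
    for _ in range(hops):
        next_frontier = set()
        # Each partition could be scanned by a separate worker in parallel;
        # here they are scanned one after another for simplicity.
        for part in partitions:
            for src, dst in part:
                if src in frontier and dst not in visited:
                    next_frontier.add(dst)
        if not next_frontier:
            break  # nothing new to expand
        visited |= next_frontier
        frontier = next_frontier
    return visited


# Made-up graph: a supernode "hub" with 1000 outgoing edges, plus a short chain.
edges = [("hub", f"n{i}") for i in range(1000)] + [("n0", "x"), ("x", "y")]
partitions = partition_edges(edges, 4)
sizes = [len(p) for p in partitions]

one_hop = k_hop(partitions, "hub", 1)
three_hops = k_hop(partitions, "hub", 3)
```

In this toy setup the hub's 1000 edges are divided almost evenly across the four partitions, so expanding the hub costs each worker roughly a quarter of the scan instead of loading one worker with all of it.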

The Trade-off: Latency vs. Immediacy

While PuppyGraph aims for sub-second to single-digit-second query performance, its positioning matters. It is not designed for the millisecond-level latency of in-memory transactional databases. Instead, it optimizes for the "warm" path: rapid access to data that resides in the data lake. Liu explains that fetching data from S3, for instance, adds around 50-100 milliseconds of overhead. Adaptive caching mitigates this by using Iceberg's metadata to serve recently accessed data from local disk or memory. The design follows from prioritizing access to existing data over duplication: the immediate benefit is the elimination of ETL, and the trade-off is a somewhat higher baseline latency than purely in-memory solutions. For analytical workloads and agentic reasoning, that latency is often a small price for the gain in data accessibility and the reduction in operational overhead.
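The caching idea can be sketched as follows. This is an illustrative stand-in rather than PuppyGraph's cache: a plain least-recently-used policy replaces whatever adaptive policy the engine uses, and the cache is keyed by data-file path on the assumption that files referenced by table metadata are immutable once written. `FileCache` and `fake_remote` are hypothetical names, and the file names are made up.

```python
from collections import OrderedDict


class FileCache:
    """Tiny LRU cache keyed by data-file path (hypothetical sketch)."""

    def __init__(self, fetch_remote, capacity):
        self.fetch_remote = fetch_remote
        self.capacity = capacity
        self.entries = OrderedDict()
        self.remote_reads = 0

    def read(self, path):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            return self.entries[path]
        data = self.fetch_remote(path)  # slow path, e.g. an object-store read
        self.remote_reads += 1
        self.entries[path] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return data


def fake_remote(path):
    # Stand-in for an object-store GET; a real call would add network latency.
    return f"bytes-of-{path}"


cache = FileCache(fake_remote, capacity=2)
for path in ["a.parquet", "b.parquet", "a.parquet", "c.parquet", "a.parquet"]:
    cache.read(path)
```

Of the five reads in this sequence, only three go to the remote store; the two repeat reads of `a.parquet` are served locally, which is where the 50-100 millisecond fetch overhead mentioned above would be saved.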

"We don't optimize for 10 milliseconds or 20 milliseconds query at all. So, in that case, there are a lot of in-memory solutions and we just kind of gave up on it. What we are good at is like sub-second or single-digit second."

This candid admission highlights where PuppyGraph excels. It targets use cases where the time saved by avoiding ETL and data movement far outweighs the difference between millisecond and sub-second response times. For tasks like entity resolution, cybersecurity analysis, or powering AI agents that need to understand complex relationships, this performance profile is more than adequate and, in fact, represents a significant improvement over traditional methods.

Bridging the Graph and Data Engineering Worlds

Historically, the graph community and the data engineering community have operated somewhat in parallel. Data engineers focused on building robust pipelines for structured data, while graph specialists grappled with the complexities of graph storage and processing. PuppyGraph actively bridges this gap. By building on open table formats like Iceberg, it integrates seamlessly into existing data lake architectures. This means data engineers don't need to learn entirely new systems or build separate pipelines for graph analytics.

Liu emphasizes that PuppyGraph's operator-based engine, which treats graph operations as collections of node and edge operators, allows for MPP and vectorized evaluation. This matters because columnar storage, as in the Parquet files commonly used under Iceberg, is memory-efficient for analytical queries. Unlike row-based storage, which suits transactional updates, columnar storage lets PuppyGraph fetch only the attributes a query needs, which reduces memory footprint and improves performance. This is what allows PuppyGraph to scale graph computations on massive datasets, a problem that has hampered many earlier attempts at distributed graph processing.
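The difference between the two layouts can be shown with a small pure-Python sketch. It is a toy model of column projection, not Parquet or PuppyGraph code; the table, the column names, and the helpers `to_columns`, `scan_rows`, and `scan_columns` are all made up for illustration. Each scan returns the requested attributes along with a count of how many stored values it had to touch.

```python
rows = [
    {"id": 1, "name": "alice", "country": "US", "risk": 0.1},
    {"id": 2, "name": "bob", "country": "DE", "risk": 0.7},
    {"id": 3, "name": "carol", "country": "JP", "risk": 0.4},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def scan_rows(rows, wanted):
    """Row layout: every value of every row is read, then projected."""
    touched = 0
    out = []
    for row in rows:
        touched += len(row)
        out.append(tuple(row[c] for c in wanted))
    return out, touched


def scan_columns(columns, wanted):
    """Column layout: only the requested columns are read."""
    touched = sum(len(columns[c]) for c in wanted)
    out = list(zip(*(columns[c] for c in wanted)))
    return out, touched


columns = to_columns(rows)
row_result, row_touched = scan_rows(rows, ["id", "risk"])
col_result, col_touched = scan_columns(columns, ["id", "risk"])
```

Both scans return the same `(id, risk)` pairs, but the row scan touches all twelve stored values while the column scan touches six. The gap widens as tables gain more attributes that a given traversal never reads.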

"Since the graph community is separate from the data engineer community. The data engineer community is involved a lot, but the graph community has not changed a lot in the last 10 years. I think if we can bring the capability together, it not just benefits agents as we designed, but also benefits human users as well."

This integration is not just about technical compatibility; it's about democratizing graph analytics. By making graph querying accessible within familiar data lake environments, PuppyGraph empowers a broader range of users, from data scientists to business analysts, to uncover insights hidden within their connected data. The ability to define graph schemas on logical views or use flexible mappings for denormalized tables further reduces the barrier to entry.
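As a rough illustration of mapping a denormalized table onto a graph without copying it, consider the sketch below. The mapping format is hypothetical and is not PuppyGraph's schema syntax; the table contents and the names `vertices` and `edges` are invented for the example. The point is only that vertices and edges can be derived from existing rows at query time.

```python
# A made-up denormalized log table: each row mentions a user and a device.
logins = [
    {"user_id": "u1", "device_id": "d1", "ts": "2024-01-01T10:00"},
    {"user_id": "u2", "device_id": "d1", "ts": "2024-01-01T10:05"},
    {"user_id": "u1", "device_id": "d2", "ts": "2024-01-02T09:00"},
]

# Hypothetical mapping: vertex label -> key column,
# edge label -> (source label, target label, property columns).
mapping = {
    "vertices": {"User": "user_id", "Device": "device_id"},
    "edges": {"LOGGED_IN_FROM": ("User", "Device", ["ts"])},
}


def vertices(table, mapping, label):
    """Derive the distinct vertex ids for a label from the table's rows."""
    key = mapping["vertices"][label]
    return sorted({row[key] for row in table})


def edges(table, mapping, label):
    """Yield (source, target, properties) tuples lazily, one per row."""
    src_label, dst_label, props = mapping["edges"][label]
    src_key = mapping["vertices"][src_label]
    dst_key = mapping["vertices"][dst_label]
    for row in table:
        yield row[src_key], row[dst_key], {p: row[p] for p in props}


users = vertices(logins, mapping, "User")
devices = vertices(logins, mapping, "Device")
login_edges = list(edges(logins, mapping, "LOGGED_IN_FROM"))

# Users who share device "d1", found by reading the edges derived above.
shared_d1 = sorted({src for src, dst, _ in login_edges if dst == "d1"})
```

One table here yields two vertex types and one edge type, and nothing is written to a separate graph store: the `edges` generator reads the original rows each time it is called.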

Actionable Takeaways

  • Immediate Action: Evaluate your current data lake/warehouse for graph analytics potential. Identify use cases where complex relationships are critical but current analysis is slow or impossible due to ETL.
  • Immediate Action: Explore PuppyGraph's capabilities for querying existing Iceberg, Delta, or Hudi tables. Test a small, representative dataset to understand performance and ease of use.
  • Immediate Action: Consider how AI agents in your organization could leverage direct graph access for more sophisticated reasoning and task execution.
  • Short-Term Investment (1-3 months): Begin modeling key entities and relationships from your existing tabular data into a graph schema within PuppyGraph. Focus on areas like entity resolution or cybersecurity log analysis.
  • Short-Term Investment (1-3 months): Investigate how PuppyGraph can complement existing transactional graph databases by offloading analytical workloads, reducing costs and improving performance for both systems.
  • Medium-Term Investment (6-12 months): Integrate PuppyGraph into your data pipelines to support agentic workflows, enabling agents to query and reason over your connected data directly.
  • Long-Term Investment (12-18 months): Develop internal expertise in graph data modeling and querying using PuppyGraph to unlock deeper insights and build more intelligent applications. This requires embracing the discomfort of learning a new paradigm but promises significant competitive advantage through a more nuanced understanding of your data's interconnectedness.

---
This content is a personally curated review and synopsis derived from the original podcast episode.