Lakehouse Observability: Petabyte-Scale Data Management and Interactive Troubleshooting
The conventional wisdom in observability is that real-time data analysis demands specialized, often proprietary, tools that quickly become prohibitively expensive and fragmented. This conversation with Jacob Leverich, co-founder and CTO of Observe, reveals a powerful counter-narrative: applying lakehouse architectures to observability workloads can unlock effectively unlimited scale and drastically improve economics. The less obvious consequence of this approach is not just cost savings, but a fundamental shift in how organizations can leverage their operational data, enabling deeper insights and broader access for previously underserved teams. This analysis is crucial for engineering leaders, SREs, and data architects seeking to break free from the limitations of traditional observability stacks and build a more unified, cost-effective, and powerful data foundation.
The observability landscape has long been dominated by a "best-of-breed" approach, leading to a costly and fragmented ecosystem. Teams often find themselves in "war rooms from hell," with dozens of individuals staring at disparate tools, each trying to piece together a coherent picture of system behavior. This fragmentation isn't just an inconvenience; it drives up costs significantly, forcing organizations to ration data usage and leading to incomplete visibility. As Jacob Leverich explains, this is a direct consequence of architectures often designed in a pre-cloud era, relying on local disk storage and complex replication strategies. The economic and operational burden becomes unsustainable as data volumes explode.
Leverich's journey, from sysadmin to Splunk engineer to Observe co-founder, highlights a critical realization: the core problems of observability--understanding system health, troubleshooting issues, and collaborating effectively--are fundamentally data management problems. He saw firsthand how tools like Splunk, while powerful, were architected before the advent of cloud-native principles like separated storage and compute. This led to the exploration of lakehouse architectures, which leverage commodity object storage for infinite scalability and durability at a fraction of the cost. The challenge, however, was adapting these architectures, typically optimized for batch analytics, to the low-latency, interactive demands of observability.
The Streaming Ingest Conundrum: Bridging Latency and Efficiency
A primary hurdle in applying lakehouse principles to observability is the inherent need for low-latency data ingestion and querying. Unlike business intelligence, where a few minutes or hours of delay might be acceptable, observability requires near real-time insights to address outages and performance degradations. Leverich outlines a multi-stage approach to tackle this. First, data collection relies on robust, vendor-neutral tools like the OpenTelemetry Collector, ensuring broad compatibility. The data is then streamed into Kafka, providing a durable buffer. This Kafka layer is crucial; it allows Observe to acknowledge data receipt to collectors quickly, ensuring data durability, while simultaneously batching data for efficient loading into the lakehouse.
This batching process presents a classic trade-off between latency and efficiency. Small batches lead to high overhead and cost, while large batches introduce unacceptable delays. Observe's solution involves a dynamic loader that balances these factors. For organizations generating terabytes or petabytes of data daily, this batching mechanism naturally results in sufficiently sized partitions within minutes, mitigating the perceived latency issue. This counter-intuitive finding--that at scale, throughput becomes the primary bottleneck, not latency--is a key insight for anyone considering lakehouse architectures for high-volume streaming data.
"The reality is that it's a throughput problem and really the trick is like actually I need enough parallel processing you know like kind of throughput to just even handle the data and the latency doesn't end up being your biggest problem."
-- Jacob Leverich
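To make the batching trade-off concrete, here is a minimal sketch of a size-or-time-threshold loader in Python. It assumes Kafka access via confluent_kafka and Parquet output via pyarrow; the topic name, thresholds, and landing path are illustrative and are not Observe's actual implementation.

```python
import json
import time

import pyarrow as pa
import pyarrow.parquet as pq
from confluent_kafka import Consumer

# Illustrative thresholds: flush whenever either one is crossed.
MAX_BATCH_BYTES = 128 * 1024 * 1024   # large batches amortize per-file overhead
MAX_BATCH_AGE_S = 60                  # caps added latency during quiet periods

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "lakehouse-loader",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,        # only commit offsets after a batch is durably written
})
consumer.subscribe(["otel-telemetry"])  # hypothetical topic fed by the OpenTelemetry Collector

def flush(records):
    """Write one columnar file; a real loader would commit this to a lakehouse table."""
    table = pa.Table.from_pylist(records)
    pq.write_table(table, f"landing/telemetry-{int(time.time())}.parquet")

batch, batch_bytes, batch_start = [], 0, time.monotonic()
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is not None and msg.error() is None:
        batch.append(json.loads(msg.value()))
        batch_bytes += len(msg.value())

    too_big = batch_bytes >= MAX_BATCH_BYTES
    too_old = batch and (time.monotonic() - batch_start) >= MAX_BATCH_AGE_S
    if too_big or too_old:
        flush(batch)                          # land the batch durably first...
        consumer.commit(asynchronous=False)   # ...then acknowledge offsets to Kafka
        batch, batch_bytes, batch_start = [], 0, time.monotonic()
```

In practice, many such loaders would run in parallel across topic partitions, which is why, at terabyte-to-petabyte daily volumes, throughput rather than per-batch latency becomes the dominant engineering concern.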
Curating the Lake: From Raw Data to Actionable Insights
Simply dumping raw observability data into a lakehouse table is insufficient for interactive querying. The sheer volume and semi-structured nature (often JSON blobs) of logs, metrics, and traces can lead to significant read amplification and slow query performance. Observe addresses this by organizing data around specific use cases, creating curated, columnarized tables for logs, metrics, and traces. This "streaming ETL" approach transforms raw telemetry into structured formats optimized for querying. This strategy tackles two critical problems: columnization ensures that queries only read the necessary data, and organizational structure minimizes read amplification, effectively mimicking the benefits of traditional indexing without the associated overhead.
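As a rough illustration of that curation step, the sketch below promotes a few commonly queried fields out of raw JSON log blobs into typed columns and partitions the output by service and hour; the field names and layout are assumptions for the example, not Observe's schema.

```python
import json
import uuid
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.dataset as ds

def curate_log_batch(raw_records: list[bytes]) -> pa.Table:
    """Promote commonly queried fields from raw JSON log blobs into typed columns."""
    rows = []
    for blob in raw_records:
        event = json.loads(blob)
        ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
        rows.append({
            # Promoted columns: most queries never need to touch the raw payload.
            "event_time": ts,
            "event_hour": ts.strftime("%Y-%m-%d-%H"),
            "service": event.get("service", "unknown"),
            "severity": event.get("severity", "INFO"),
            "message": event.get("message", ""),
            # Keep the full record for ad-hoc drill-down.
            "raw": json.dumps(event),
        })
    return pa.Table.from_pylist(rows)

def write_curated(table: pa.Table, root: str = "curated/logs") -> None:
    """Partition by service and hour so scans read only the relevant slices."""
    ds.write_dataset(
        table,
        base_dir=root,
        format="parquet",
        partitioning=["service", "event_hour"],
        basename_template=f"{uuid.uuid4().hex}-{{i}}.parquet",  # avoid clobbering earlier batches
        existing_data_behavior="overwrite_or_ignore",
    )
```

The partitioning choice is what stands in for an index here: a query scoped to one service and a narrow time window only ever opens the files under that slice of the directory tree.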
Beyond data organization, Observe abstracts SQL away from the end-user. Instead of expecting SREs to write complex SQL queries, the platform breaks down user requests into a series of optimized, potentially parallelized SQL queries. This allows for rapid retrieval of recent data, crucial for interactive troubleshooting, while still leveraging the power of the underlying lakehouse. This focus on user experience, meeting users where they are with familiar interfaces and query paradigms, is a vital, though often overlooked, aspect of successful lakehouse adoption in specialized domains like observability.
"We've built our our kind of workflows and uis and query apis to abstract sql away from the user so that we can play a bunch of tricks and so one of those tricks is that when one of our users queries data say they query you know all the air logs over the past 24 hours our back end will actually take that and break it into a series of sql queries."
-- Jacob Leverich
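A minimal sketch of that query-splitting idea follows: one 24-hour log search is broken into hour-aligned SQL sub-queries and fanned out concurrently, with the most recent windows returned first. The table name, SQL dialect, and execute_sql callback are assumptions for the example, not Observe's query API.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone

def hourly_subqueries(search: str, end: datetime, hours: int = 24) -> list[str]:
    """Split one user request into hour-aligned SQL queries, newest window first."""
    queries = []
    for i in range(hours):
        hi = end - timedelta(hours=i)
        lo = hi - timedelta(hours=1)
        queries.append(
            # Inlined for readability only; a real system would parameterize user input.
            "SELECT event_time, service, message FROM curated_logs "
            f"WHERE event_time >= '{lo.isoformat()}' AND event_time < '{hi.isoformat()}' "
            f"AND severity = 'ERROR' AND message ILIKE '%{search}%' "
            "ORDER BY event_time DESC"
        )
    return queries

def run_in_parallel(execute_sql, queries: list[str], workers: int = 8):
    """Fan the sub-queries out to the warehouse. Because recent hours come first,
    the UI can start rendering results while older windows are still scanning."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        yield from pool.map(execute_sql, queries)

# Usage, with any DB-API-style connection supplying execute_sql:
# for rows in run_in_parallel(lambda q: connection.execute(q).fetchall(),
#                             hourly_subqueries("timeout", datetime.now(timezone.utc))):
#     render(rows)
```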
The Unforeseen Benefits: Beyond Cost and Scale
The application of lakehouse architectures to observability unlocks surprising benefits beyond just cost reduction and scalability. Leverich notes the unexpected adoption by product support teams, who gain the ability to investigate issues and answer customer questions with unprecedented depth, diverting tickets from engineering and improving overall customer satisfaction. Furthermore, the integration of AI capabilities, particularly agentic workflows, has proven transformative. Support analysts, even those without deep technical backgrounds, can now ask complex questions about system behavior, with AI agents navigating the data context graph and generating the necessary queries. This leads to quantifiable improvements in mean time to resolution (MTTR) and overall productivity, demonstrating tangible business value that often eluded traditional observability tools.
The evolution of table formats, particularly Iceberg's support for JSON shredding (as seen in v3), is a significant enabler for these use cases. This capability directly addresses the challenge of handling semi-structured data, a common characteristic of observability telemetry. By allowing JSON data to be automatically columnized, Iceberg removes a major barrier, making lakehouse architectures far more amenable to observability workloads and paving the way for broader adoption and innovation in this space.
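To make the idea of shredding concrete, the toy example below pulls recurring JSON paths out into their own typed columns using pyarrow; it illustrates the concept only and does not use Iceberg's variant/shredding machinery itself.

```python
import json
import pyarrow as pa

raw_events = [
    b'{"ts": 1718000000, "service": "checkout", "http": {"status": 503, "latency_ms": 912}}',
    b'{"ts": 1718000001, "service": "search", "http": {"status": 200, "latency_ms": 41}}',
]

# Shredding: recurring JSON paths become their own typed columns, so a query like
# "HTTP 5xx in the last hour" reads two narrow columns instead of every raw blob.
records = [json.loads(e) for e in raw_events]
shredded = pa.table({
    "ts": pa.array([r["ts"] for r in records], type=pa.int64()),
    "service": pa.array([r["service"] for r in records], type=pa.string()),
    "http_status": pa.array([r["http"]["status"] for r in records], type=pa.int32()),
    "latency_ms": pa.array([r["http"]["latency_ms"] for r in records], type=pa.int64()),
})
print(shredded.schema)
```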
Key Action Items:
Immediate Action (Next Quarter):
- Evaluate Current Observability Costs: Quantify the total spend on existing observability tools, including licensing, infrastructure, and operational overhead. Identify areas of potential overspending due to data rationing or inefficient management.
- Map Data Silos: Document all systems where observability data (logs, metrics, traces) is currently stored. Understand the integration points and the manual effort required to correlate data across these silos.
- Pilot OpenTelemetry: If not already in use, begin piloting the OpenTelemetry Collector for data collection across a subset of applications or infrastructure. This establishes a vendor-neutral foundation for future data pipelines.
- Explore Lakehouse Concepts: For teams with significant data volumes or high costs, begin researching lakehouse architectures (e.g., using S3/ADLS with formats like Iceberg, Hudi, or Delta Lake) and their potential application to observability.
Longer-Term Investments (6-18 Months):
- Develop Streaming ETL Capabilities: Invest in building or adopting streaming ETL pipelines capable of handling high-volume, low-latency data ingestion into a lakehouse. Focus on balancing batching efficiency with real-time requirements.
- Implement Use-Case-Driven Data Curation: Design and implement data models that organize observability data by use case (e.g., application logs, network traffic, traces) and columnarize it for efficient querying.
- Abstract Querying: Develop or adopt platforms that abstract raw SQL for end-users, allowing for optimized query execution and familiar interfaces for common observability tasks like log searching and dashboarding.
- Integrate AI for Analysis: Explore integrating AI capabilities, such as agentic workflows, to empower users with natural language querying and automated troubleshooting assistance, particularly for support and less technical teams. This requires investing in platforms that can expose data context effectively to AI models.
- Adopt Modern Table Formats: Commit to using open table formats like Apache Iceberg (especially v3+), which offer robust support for semi-structured data and schema evolution, both crucial for long-term data management and interoperability; a minimal sketch of creating such a table follows this list.
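As a starting point for that last item, here is a hedged sketch of creating a partitioned Iceberg table for curated logs with pyiceberg; the catalog name, namespace, and schema are illustrative, and the v3 variant/shredding features discussed earlier are beyond this basic example.

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.partitioning import PartitionField, PartitionSpec
from pyiceberg.schema import Schema
from pyiceberg.transforms import DayTransform
from pyiceberg.types import NestedField, StringType, TimestamptzType

# Catalog connection details come from pyiceberg's own configuration
# (e.g. ~/.pyiceberg.yaml); "default" and the identifiers below are illustrative,
# and the "observability" namespace is assumed to already exist in the catalog.
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="event_time", field_type=TimestamptzType(), required=True),
    NestedField(field_id=2, name="service", field_type=StringType(), required=False),
    NestedField(field_id=3, name="severity", field_type=StringType(), required=False),
    NestedField(field_id=4, name="message", field_type=StringType(), required=False),
)

table = catalog.create_table(
    identifier="observability.logs",
    schema=schema,
    # Day-level partitioning on event_time keeps time-bounded scans cheap.
    partition_spec=PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="event_day")
    ),
)
```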