F3: Decoupled Layout and WASM Enable an Extensible Data Format

Original Title: Unfreezing The Data Lake: The Future-Proof File Format

F3, billed as a future-proof file format, is presented not as an incremental improvement but as a re-architecture meant to break the logjam of legacy data formats that are increasingly misaligned with modern hardware and evolving workloads. The conversation surfaces a hidden cost: the formats we rely on to store vast amounts of data now hold back innovation, particularly in AI and machine learning. By decoupling layout decisions and embedding the code needed to decode each file, F3 offers a path to a more adaptable and extensible data infrastructure. Data engineers, ML engineers, and architects grappling with performance bottlenecks and the slow pace of format evolution will find in F3 a blueprint for systems that can keep pace with technological change, and an edge for teams that adopt its principles early.

The CPU Bottleneck: Why Compression Isn't Enough Anymore

Hardware progress has shifted the performance landscape for data storage. Storage and network speeds have surged while single-core CPU performance has plateaued. Traditional file formats were designed for I/O-bound systems, where shrinking files through aggressive compression was the main lever, and that design is now hitting a wall. Xinyu Zheng explains that the overhead of decompression and decoding has become a significant bottleneck.

"The first thing is that many previous assumptions simply do not hold anymore. For example, in the old Hadoop era, when all your storage is hard drives connected by slow network, when you design a file format, what you want is better compression ratio and smaller files, because that typically means better query performance. And the reason is that the bottleneck of the whole query stack is on IO, right? And smaller files just save IO bandwidth. And right now, since they are different, you cannot only optimize for smaller files, for better compression ratios. You have to also think about the overhead when you are doing the decompression, like how much computational cost you have to pay when you are reading the data back from the file, right? Because now the CPU is the bottleneck instead of the IO. Therefore the encodings and compression inside the file format has to be changed."

This observation exposes a failure of conventional wisdom: the assumption that what worked yesterday will work today. Layering heavier compression onto formats like Parquet or ORC is no longer sufficient. The priority shifts to lightweight encodings that are cheap for the CPU to decode. And because single-core performance has stagnated, decoders also need to exploit data parallelism, through SIMD instructions on CPUs and SIMT execution on GPUs, a direction F3 aims to address.
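To make "lightweight, CPU-efficient decoding" concrete, here is a minimal sketch of frame-of-reference encoding, a well-known lightweight scheme. The episode summary does not say which encodings F3 ships, so treat this as an illustration of the general idea rather than F3's method. The function names are made up for the example.

```python
from array import array


def for_encode(values):
    """Frame-of-reference: store the minimum once, then small offsets.

    Illustrative only; assumes a non-empty list of integers.
    """
    base = min(values)
    offsets = [v - base for v in values]
    bits = max(offsets).bit_length()
    # Pick the narrowest unsigned array type that can hold every offset.
    for typecode in "BHIQ":
        if array(typecode).itemsize * 8 >= bits:
            return base, array(typecode, offsets)
    raise ValueError("value range too wide for this sketch")


def for_decode(base, offsets):
    """Decoding is one addition per value, with no entropy decoding step."""
    return [base + o for o in offsets]


# Hypothetical column: 1,000 timestamps spaced three seconds apart.
values = [1_700_000_000 + 3 * i for i in range(1000)]
base, offsets = for_encode(values)
restored = for_decode(base, offsets)
```

The decode loop is a single addition per element, the kind of branch-free work that vectorizes well, whereas a general-purpose compressor has to run its full decompression routine before any value is usable.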

Wide Tables and Random Access: The ML Workload Mismatch

Beyond hardware, the nature of data workloads has dramatically changed. The rise of machine learning has introduced new demands that legacy formats struggle to meet. Zheng points to two key issues: "wide table projection" and random access patterns.

"Wide table projection" refers to scenarios where a dataset has thousands of features (columns) but a query needs only a handful of them. Columnar formats like Parquet store each column separately, yet they still incur significant overhead on such wide tables. The file footer holds the metadata for every column and, in Parquet, is serialized with Thrift, so a reader must parse all of it no matter how few columns it actually wants. This becomes a substantial performance drag.

The other major challenge is random access, crucial for both ML training and serving. During training, data is typically read in a shuffled, non-sequential order, which generally helps models converge. For model serving, especially in vector search applications, retrieving the specific records that matched a query requires efficient random reads. Zheng notes that formats like Parquet are not designed for this, so performance degrades through I/O and computational read amplification: the reader fetches and decodes far more data than the lookup actually needs.
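A rough way to reason about read amplification is to compare the bytes a reader must fetch and decode with the bytes it actually wanted. The sketch below assumes a hypothetical column split into pages, where a page is the smallest unit that can be read and decoded. The page sizes and row counts are invented for the example and are not measurements of Parquet or F3.

```python
from bisect import bisect_right


def locate_page(page_first_rows, row_id):
    """Return the index of the page holding row_id.

    page_first_rows[i] is the first row number stored in page i (sorted).
    """
    return bisect_right(page_first_rows, row_id) - 1


def read_amplification(page_sizes, page_first_rows, row_id, value_bytes):
    """Bytes fetched and decoded per byte actually needed for one lookup."""
    page = locate_page(page_first_rows, row_id)
    return page_sizes[page] / value_bytes


# Hypothetical layout: four 1 MiB pages, each holding 131,072 eight-byte values.
rows_per_page = 131_072
page_first_rows = [i * rows_per_page for i in range(4)]
page_sizes = [1_048_576] * 4

page = locate_page(page_first_rows, 200_000)
amplification = read_amplification(page_sizes, page_first_rows, 200_000, 8)
```

Under these assumed numbers, fetching one 8-byte value means reading and decoding an entire 1 MiB page, so smaller or independently addressable units reduce the cost of each lookup.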

"We saw that when people are doing feature engineering, for example, there will be table with thousands of columns, and each column is just a feature added by the feature engineers, and the query only wants to read very few columns. We call this pattern wide table projection, and Parquet is not optimized for this pattern, right? You will always have to parse the entire file footer, which contains the metadata of those thousands of columns, despite that you only want to read one column."
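One way to avoid parsing metadata for every column is to put a fixed-width offset table in front of the per-column metadata, so a reader can jump straight to the entry it needs. The sketch below uses an invented footer layout (a count, an offset table, then JSON blobs) purely to show the access pattern. It is not Parquet's Thrift footer and not F3's actual metadata format.

```python
import json
import struct


def build_indexed_footer(columns):
    """Serialize column metadata as: count | offset table | metadata blobs."""
    blobs = [json.dumps(c).encode("utf-8") for c in columns]
    offsets, pos = [], 0
    for blob in blobs:
        offsets.append(pos)
        pos += len(blob)
    offsets.append(pos)  # sentinel: end of the last blob
    header = struct.pack("<I", len(columns))
    table = struct.pack(f"<{len(offsets)}I", *offsets)
    return header + table + b"".join(blobs)


def read_column_meta(footer, index):
    """Parse only the metadata blob for one column."""
    (count,) = struct.unpack_from("<I", footer, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Two adjacent table entries give the blob's start and end.
    start, end = struct.unpack_from("<2I", footer, 4 + 4 * index)
    blob_base = 4 + 4 * (count + 1)
    return json.loads(footer[blob_base + start : blob_base + end])


# Hypothetical wide table with 5,000 feature columns.
columns = [{"name": f"feature_{i}", "data_offset": i * 100} for i in range(5000)]
footer = build_indexed_footer(columns)
meta = read_column_meta(footer, 4242)
```

Here the reader touches a 4-byte count, two table entries, and one small blob, regardless of how many columns the table has.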

This mismatch between the capabilities of existing formats and the requirements of modern ML workloads means that teams are effectively fighting their storage layer. The delay in fetching and processing data directly impacts model development cycles and the responsiveness of AI applications. F3’s design, by decoupling layout and metadata, and by enabling self-decoding files, aims to directly address these limitations, offering a more agile and efficient foundation for ML data.

The Extensibility Trap: Why New Encodings Go Unused

A significant hurdle in evolving data formats is the sheer number of implementations and the difficulty in achieving consensus. Zheng highlights that the proliferation of Parquet implementations across different languages and vendors has created a "lowest common denominator" effect. New features, even if standardized, are rarely adopted because ensuring compatibility across all these disparate systems is a monumental task.

"The consequence of so many implementations is that people are afraid of using new features of the format, although the format spec itself is evolving. Like, we can see Parquet is adding new features year by year, right? But the people just tend to use the most basic one, which we call it the lowest common denominator. Therefore the format is really hard to evolve. It really takes a long time to align all the implementations on one new feature so that everyone is not freak out to use that."

This creates a frustrating cycle where innovation in format specifications is stifled by the practical realities of widespread adoption. F3 tackles this head-on with two core ideas: a decoupled, flexible layout and self-decoding files using WebAssembly (WASM). The decoupled layout separates concerns like I/O unit size, dictionary scope, and encoding choices, allowing for more granular tuning. More critically, the self-decoding approach embeds WASM kernels directly within the files. This means new encodings can be adopted without requiring every downstream engine to be upgraded. A reader simply needs to be able to execute WASM, significantly lowering the barrier to adopting advanced features and creating a durable competitive advantage for those who can leverage these new capabilities first.
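The self-decoding idea can be sketched as a dispatch step in the reader. In the sketch below, a plain Python function stands in for a WASM kernel embedded in the file, so the example runs without a WASM runtime. A real reader would instantiate the embedded WASM bytes instead. The encoding names, the registry, and the choice to prefer a native decoder when one exists are assumptions made for illustration, not details confirmed in the episode summary.

```python
from typing import Callable, Dict, List

Decoder = Callable[[bytes], List[int]]

# Decoders this hypothetical reader ships natively.
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain_u8": lambda payload: list(payload),
}


def embedded_delta_kernel(payload: bytes) -> List[int]:
    """Stand-in for a WASM kernel stored in the file: delta decoding."""
    out, total = [], 0
    for delta in payload:
        total += delta
        out.append(total)
    return out


def decode_column(encoding: str, payload: bytes,
                  embedded_kernels: Dict[str, Decoder]) -> List[int]:
    """Use a native decoder if the reader has one, else the file's kernel."""
    decoder = NATIVE_DECODERS.get(encoding) or embedded_kernels.get(encoding)
    if decoder is None:
        raise LookupError(f"no decoder available for {encoding!r}")
    return decoder(payload)


# The file carries a kernel for an encoding this reader has never seen.
file_kernels = {"delta_u8": embedded_delta_kernel}
decoded = decode_column("delta_u8", bytes([10, 1, 1, 2]), file_kernels)
plain = decode_column("plain_u8", bytes([1, 2, 3]), {})
```

The reader decodes "delta_u8" without having been upgraded to know that encoding, which is the property that lets writers adopt new encodings ahead of the readers.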

Key Action Items

  • Immediate Action (Next Quarter): Evaluate current data format bottlenecks, particularly in ML/AI workloads. Identify specific pain points related to CPU-bound decoding, wide table projections, and random access.
  • Immediate Action (Next Quarter): Explore the integration of Apache Arrow as a de facto data transfer standard within your data pipelines. This aligns with F3's design principles and facilitates broader ecosystem compatibility.
  • Short-Term Investment (3-6 Months): Begin experimenting with F3's proof-of-concept implementation or similar next-generation formats in non-critical development environments. Focus on understanding its performance characteristics for your specific use cases.
  • Short-Term Investment (3-6 Months): Investigate composable query systems like DataFusion, DuckDB, or Velox, which are designed to work with modern data formats and can provide early access to F3's benefits.
  • Medium-Term Investment (6-12 Months): Consider how F3's decoupled layout and self-decoding capabilities can be leveraged to support emerging multimodal data types (images, audio, video) and associated AI workloads.
  • Medium-Term Investment (6-12 Months): Plan for the potential decoupling of file format logic from table format layers (like Iceberg) to enable greater flexibility and future-proofing of your data lake architecture.
  • Long-Term Investment (12-18 Months): Develop strategies for managing and verifying WASM kernels centrally, potentially at the data lake layer, to improve security and avoid storing the same kernel redundantly across files.

---
This content is a personally curated review and synopsis derived from the original podcast episode.