F3: Decoupled Layout and WASM Enable Extensible Data Format

TL;DR

  • F3's decoupled layout separates I/O units, dictionary scope, and encoding choices, enabling flexible optimization beyond Parquet's fixed row group constraints.
  • Self-decoding files embedding WebAssembly kernels allow F3 to adopt new encodings without requiring universal engine upgrades, fostering extensibility.
  • F3 addresses Parquet's CPU-bound decoding and metadata overhead for wide-table projections by rethinking data layout and encoding strategies.
  • The format's design supports random access patterns crucial for ML training and serving, mitigating read amplification issues in older formats.
  • Decoupling file and table formats allows F3 to integrate with table layers, centralizing and verifying WASM kernels for enhanced security and efficiency.
  • F3's extensibility, particularly through WASM, can extend beyond encodings to indexing or filtering, paving the way for future data management innovations.

Deep Dive

The F3 "future-proof file format" aims to address limitations in current columnar formats like Parquet and ORC by rethinking data layout and encodings for evolving hardware and workloads. This new format is designed for efficiency, interoperability, and extensibility, enabling faster adoption of new technologies and supporting diverse data types, particularly for AI-driven applications.

F3's core innovation lies in its decoupled, flexible layout and its self-decoding capability. Unlike Parquet, which tightly couples concepts like I/O units, dictionary scope, and encoding choices, F3 separates these, allowing for granular tuning and optimization. This flexibility is crucial because hardware performance trends (faster storage and networks, but stagnating single-core CPU speeds) render old assumptions obsolete. File formats must now balance compression with lightweight, parallelizable decoding, often leveraging SIMD instructions. Moreover, evolving workloads, such as handling thousands of features in AI models (wide-table projections) or requiring rapid random access for ML training and serving, are not well supported by the current generation of formats. F3's decoupled layout allows for independent optimization of these aspects, improving performance for these new use cases.
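To make the decoupling concrete, the sketch below models a layout descriptor in which each column independently chooses its I/O unit size, dictionary scope, and encoding, rather than having a single row group fix all three at once as in Parquet. All names and fields here are invented for illustration; they are not F3's actual API.

```python
from dataclasses import dataclass

# Hypothetical per-column layout knobs, set independently of one another.
# In Parquet the row group couples the I/O unit, dictionary scope, and
# encoding boundary together; here each is tuned per column.
@dataclass
class ColumnLayout:
    name: str
    io_unit_bytes: int       # granularity of reads from storage
    dict_scope_rows: int     # rows sharing one dictionary (0 = no dictionary)
    encoding: str            # chosen per column, per chunk

# A wide ML feature table can mix tiny hot columns with huge blob columns:
layout = [
    ColumnLayout("user_id",   io_unit_bytes=64 * 1024,
                 dict_scope_rows=0, encoding="delta"),
    ColumnLayout("country",   io_unit_bytes=64 * 1024,
                 dict_scope_rows=1_000_000, encoding="dict+rle"),
    ColumnLayout("embedding", io_unit_bytes=8 * 1024 * 1024,
                 dict_scope_rows=0, encoding="plain"),
]

for col in layout:
    print(col.name, col.io_unit_bytes, col.encoding)
```

The point of the sketch is that a reader can fetch the small `country` column in fine-grained I/O units while streaming `embedding` in large ones, without one global row-group size forcing a compromise.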

A significant second-order implication of F3's design is its approach to extensibility through WebAssembly (WASM). Instead of requiring broad community consensus and widespread implementation updates for new encodings, F3 embeds WASM kernels within files. This allows new, highly efficient encoding algorithms to be adopted by individual users or teams without waiting for ecosystem-wide standardization. This self-decoding capability simplifies the adoption of new features, as older readers can still process files containing new encodings, fostering a more agile evolution of the format. This contrasts sharply with the slow, fragmented adoption of new features in formats like Parquet, where numerous implementations create a "lowest common denominator" effect. Furthermore, F3's native support for Apache Arrow, a de facto standard for data transfer, and its integration potential with composable query systems like DataFusion and DuckDB, are critical for broader ecosystem adoption.
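The forward-compatibility mechanism described above can be illustrated with a toy dispatcher: the reader first tries its built-in decoders and, only for an unknown encoding ID, falls back to a kernel shipped inside the file itself. In this sketch a plain Python callable stands in for a sandboxed WASM module; the encoding names and block structure are invented for illustration, not F3's actual wire format.

```python
# Built-in decoders an "old" reader knows about (the fast native path).
BUILTIN_DECODERS = {
    "plain": lambda payload: list(payload),
}

def decode_block(block):
    enc = block["encoding"]
    if enc in BUILTIN_DECODERS:
        return BUILTIN_DECODERS[enc](block["payload"])
    # Old reader, new encoding: run the kernel embedded in the file.
    # In F3 this would be a WASM module executed in a sandbox; here it
    # is simulated by a plain callable carried alongside the data.
    return block["embedded_kernel"](block["payload"])

# A "new" run-length encoding this reader has never heard of:
def rle_kernel(payload):
    out = []
    for value, count in payload:
        out.extend([value] * count)
    return out

new_block = {"encoding": "rle-v2",
             "payload": [(7, 3), (9, 1)],
             "embedded_kernel": rle_kernel}
print(decode_block(new_block))  # decodes despite the unknown encoding ID
```

This is the essence of self-decoding: the file carries enough executable logic that an unmodified reader can still materialize data written with encodings invented after the reader shipped.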

The implications of F3 extend to how data lakes and table formats can evolve. By decoupling file format concerns from table format management, F3 enables greater combinations of functionalities. The self-decoding WASM kernels, while enhancing file autonomy, can also be managed more centrally at the data lake or table format layer. This offers potential benefits like reduced redundant storage of WASM code and, critically, enhanced security through centralized verification of WASM binaries, mitigating concerns about executing arbitrary code embedded in files. This layered approach allows for more robust and secure adoption of advanced features.
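One way to realize the centralized-management idea is a content-addressed kernel store at the table-format layer: files reference kernels by hash, the catalog stores each verified binary once, and unverified kernels are refused. The store and its verification flow below are illustrative assumptions, not part of F3 or any table-format specification.

```python
import hashlib

class KernelStore:
    """Content-addressed store for decode kernels, deduplicated by hash."""

    def __init__(self):
        self._blobs = {}        # sha256 hex digest -> kernel bytes
        self._verified = set()  # digests that passed security review

    def register(self, kernel: bytes, verified: bool = False) -> str:
        digest = hashlib.sha256(kernel).hexdigest()
        self._blobs[digest] = kernel  # stored once, however many files use it
        if verified:
            self._verified.add(digest)
        return digest

    def fetch(self, digest: str) -> bytes:
        # Refuse to hand out kernels that were never centrally vetted,
        # mitigating the arbitrary-code-in-files concern.
        if digest not in self._verified:
            raise PermissionError("kernel has not been verified")
        return self._blobs[digest]

store = KernelStore()
ref = store.register(b"\x00asm-fake-module", verified=True)
# Files can now carry only the 64-character hash instead of the binary.
print(ref[:8], len(store.fetch(ref)))
```

Registering the same kernel from a thousand files yields one stored copy and one verification decision, which is the storage and security win the paragraph describes.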

The key takeaway is that F3 represents a paradigm shift towards more adaptable and future-oriented data storage. By embracing flexibility in layout and extensibility through WASM, it directly addresses the limitations imposed by legacy assumptions and the rapid pace of technological change. This design is poised to unlock new capabilities for AI and other data-intensive workloads, moving beyond mere performance optimization to solve critical issues of interoperability, extensibility, and long-term maintainability in data infrastructure.

Action Items

  • Audit F3 file format: Evaluate 3-5 core encoding algorithms for CPU-bound decoding overhead.
  • Implement F3 WASM runtime: Integrate WebAssembly kernels for custom encoding within 2-week sprint.
  • Design F3 layout extensibility: Decouple IO units, dictionary scope, and encoding choices for 5 core data types.
  • Track F3 adoption metrics: Measure integration with 3-5 major compute engines (e.g., DataFusion, DuckDB).
  • Evaluate F3 for multimodal data: Analyze storage and access patterns for images, audio, and video within 10 key AI use cases.

Key Quotes

"The full name of F3 is 'future-proof file format,' and as the name indicates, we aim to build a next-generation columnar file format that is future proof. In order to achieve this goal, when we designed the format at the beginning, we tried to make it efficient, interoperable, and extensible at the same time. That's the basic principle when we designed the format, or the high-level goals."

Xinyu Zheng explains that the F3 project aims to create a next-generation columnar file format designed for future adaptability. The core principles guiding its design are efficiency, interoperability, and extensibility, addressing limitations in existing formats.


"The first thing is that many previous assumptions simply do not hold anymore. For example, in the old Hadoop era, when all your storage is hard drives connected by a slow network, when you design a file format what you want is a better compression ratio and smaller files, because that typically means better query performance. The reason is that the bottleneck of the whole query stack is on I/O, and smaller files just save I/O bandwidth. Right now, since things are different, you cannot only optimize for smaller files and better compression ratios; you have to also think about the overhead when you are doing the decompression, like how much computational cost you have to pay when you are reading the data back from the file, because now the CPU is the bottleneck instead of the I/O. Therefore the encodings and compression inside the file format have to change, which basically means they have to be lightweight. By lightweight I mean the decoding and the decompression cannot consume so many cycles that the process becomes the bottleneck."

Xinyu Zheng highlights how hardware evolution, specifically faster storage and stagnant single-core CPU performance, necessitates a shift in file format design. Zheng argues that optimizing solely for compression and smaller file sizes, as done in older formats, is no longer sufficient; lightweight decoding that minimizes CPU overhead is now crucial.
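A minimal frame-of-reference (FOR) sketch illustrates the kind of lightweight encoding Zheng describes: subtract a per-block minimum so values fit in small integers, and decode with a single addition per value, which is branch-free and straightforward to vectorize with SIMD, unlike general-purpose compressors that spend many cycles per byte. This is an illustration of the idea, not F3's actual encoder.

```python
def for_encode(values):
    """Frame-of-reference encode: store the block minimum plus small deltas."""
    base = min(values)
    return base, [v - base for v in values]

def for_decode(base, deltas):
    """Decoding is one addition per value: cheap, parallelizable work."""
    return [base + d for d in deltas]

base, deltas = for_encode([1_000_003, 1_000_001, 1_000_007])
print(base, deltas, for_decode(base, deltas))
```

The deltas here fit in a handful of bits instead of the original 32 or 64, so a real implementation would bit-pack them; the decode cost stays one add (plus an unpack) per value either way.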


"The recent focus on vector data is perhaps the most visible change in storage requirements. What are some of the other custom types of data that might need to be supported in the file storage layer? ... Besides vectors, I think there are also other data types like images, audio, and video. Those data types actually impose a challenge on the file format if you really want to store them in place in a file format. First, those data types are much larger than the traditional ones: think of an integer as just 8 bytes, and strings are maybe tens of bytes, but an image is at the level of megabytes and a video is hundreds of megabytes or even gigabytes. So when you fit them into a single file, you will have to make sure data with such a large variety of sizes do not mess up with each other. When you are storing those images, videos, and audio, you are also going to store some metadata: some descriptions, some feature flags your model got from those images and videos. Those features, those flags, that metadata are just traditional data; it's integers or strings or floats. But you are going to want to store them together, and then there comes the problem of how to align that small data together with the large data, and how you should partition the file so that when you are querying those two types of data together your query performance will not get affected, and also when you are storing those two types of data together you still have a good compression ratio."

Xinyu Zheng discusses the challenges posed by multimodal data, such as images, audio, and video, in file storage. Zheng explains that these large data types, alongside traditional metadata, require careful organization within a file to maintain query performance and compression efficiency.
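One common way to keep megabyte-scale blobs from interfering with small metadata columns is to write the blobs contiguously in their own region and keep only (offset, length) pairs in the columnar part, so a metadata-only query never touches blob bytes. The sketch below invents a minimal version of this layout for illustration; it is not F3's actual multimodal design.

```python
def pack(rows):
    """Split rows into a small columnar metadata part and a blob region."""
    blob_region = bytearray()
    meta = []
    for label, blob in rows:
        # Small, compressible fields stay columnar; blobs go out-of-line.
        meta.append({"label": label,
                     "offset": len(blob_region),
                     "length": len(blob)})
        blob_region.extend(blob)
    return meta, bytes(blob_region)

def fetch_blob(meta_row, blob_region):
    """Random access to one blob via its (offset, length) pointer."""
    start = meta_row["offset"]
    return blob_region[start:start + meta_row["length"]]

meta, blobs = pack([("cat", b"\x89PNG-bytes"), ("dog", b"\xff\xd8JPEG-bytes")])
# Scanning labels reads only `meta`; blobs are fetched on demand:
print([m["label"] for m in meta], fetch_blob(meta[1], blobs))
```

Because the metadata columns stay homogeneous, they compress well, while each blob remains a single contiguous read, which addresses both concerns raised in the quote.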


"The key challenge of introducing a new format is that of adoption. What are the provisions in F3 that might simplify the adoption of the format in the broader ecosystem? (e.g. integration with compute frameworks) ... I think first, the self-decoding ability I just mentioned is actually a very attractive feature and is actually a strength for F3 to be integrated into the broader ecosystem, because once it's integrated it's much easier than other formats to evolve. For other formats, for example, when you are proposing a new encoding you have to go through the process of submitting a pull request on GitHub, getting it merged, and then you have to make sure all the readers are upgrading to the new version so they can use the new feature. But right now in F3 we have this kind of forward-compatible ability where old readers can even adopt the new features, the new encodings. So I think that's actually a very attractive feature for F3 to get adopted."

Xinyu Zheng identifies the self-decoding capability as a key feature for F3's adoption. Zheng explains that this allows older readers to utilize new features and encodings, simplifying the adoption process compared to formats that require universal upgrades to leverage new functionalities.


"The second thing is that we want to make the encoding extensible. As I just mentioned, inside the data block you are going to need some encoding and compression to make the data block as small as possible. But we found that in Parquet, although in the past few years Parquet has also added some new encodings to the file format, and in academia we see many different research proposals for new encodings, none of those new encodings are getting used by users today. So that's the problem, and we were thinking about how to solve it in a way that is forward compatible, and then we came up with the idea that we can actually make the file format self-decoding."

Xinyu Zheng addresses the challenge of new encodings not being adopted in existing formats like Parquet. Zheng proposes making the file format "self-decoding" as a solution, enabling forward compatibility and easier integration of new encoding algorithms.


"The biggest gap is that when people are designing software or technology they should really try to make it future proof, as in the name of the F3 project. It has to be forward looking, because when we are designing software we don't want it to only fit the current workload or the current hardware; we want it to be reused in the future, or survive for a long time. I think that's the biggest lesson I learned from this project, and I also think that's a gap for many technologies."

Xinyu Zheng concludes by stating that the most significant gap in current software and technology design is the lack of a "future-proof" approach. Zheng emphasizes the importance of forward-looking design that allows technologies to remain relevant and reusable across evolving workloads and hardware.

Resources

External Resources

Articles & Papers

  • "F3 Paper" (VLDB) - Discussed as the paper detailing the F3 project, a "future-proof file format."
  • "Formats Evaluation Paper" (VLDB) by Xinyu Zheng - Discussed as the first paper to systematically study the internals of Parquet and ORC to quantify problems and identify areas for improvement.
  • "SAL Paper" (VLDB) - Mentioned in the context of file format research.
  • "Towards functional decomposition of file format" - Mentioned as a paper that argues for decoupling index from data, suggesting that index tuning should be independent of the file format.
  • PAX (Partition Attributes Across) - Mentioned as terminology related to data layout in file formats.

Tools & Software

  • F3 GitHub - Mentioned as the repository for the F3 project.
  • Parquet - Discussed as a widely adopted columnar file format with limitations in its design and implementation.
  • ORC - Discussed as a widely adopted columnar file format with limitations in its design and implementation.
  • Arrow - Mentioned as the de facto standard for transferring data across API boundaries and systems, with F3 having native support for it.
  • Protocol Buffers - Mentioned as a serialization format similar to Thrift, which Parquet uses for metadata encoding.
  • Lance - Mentioned as a project that has both a file format and a table format.
  • Vortex File Format - Mentioned as an example file format that has native support for DataFusion and DuckDB, facilitating adoption.
  • DataFusion - Mentioned as a composable query system engine that F3 aims to integrate with.
  • DuckDB - Mentioned as a composable query system engine that F3 aims to integrate with.
  • DuckLake - Mentioned as a project that supports Parquet and other existing file formats.
  • Velox - Mentioned as a composable query system engine that F3 aims to integrate with.

People

  • Xinyu Zheng - PhD researcher discussing F3, the "future-proof file format."
  • Andy Pavlo - Collaborator/advisor of Xinyu Zheng, discussed in relation to public conversations about Parquet and Arrow integration.
  • Justin Mccarry - Collaborator/advisor of Xinyu Zheng, discussed in relation to public conversations about Parquet and Arrow integration.
  • Wes McKinney - Mentioned in the context of data management research.

Organizations & Institutions

  • Tsinghua University - Institution where Xinyu Zheng is a PhD student.
  • University of Wisconsin Madison - Institution where Xinyu Zheng obtained their bachelor's degree.
  • RisingWave - Mentioned as an internship experience for Xinyu Zheng.
  • Tencent Cloud - Mentioned as an internship experience for Xinyu Zheng.
  • CMU - Mentioned in relation to public seminars and research.
  • Databricks - Mentioned as a company with proprietary implementations of Parquet.
  • Snowflake - Mentioned as a company with proprietary implementations of Parquet.
  • Apache Arrow - Mentioned as the de facto standard for data transfer.
  • Apache Parquet - Mentioned as a widely adopted columnar file format.
  • Apache ORC - Mentioned as a widely adopted columnar file format.
  • Apache Iceberg - Mentioned as a table format that was originally coupled with Parquet but is moving towards decoupling.
  • Apache DataFusion - Mentioned as a composable query system engine.

Websites & Online Resources

  • xinyuzeng.xyz - Personal website of Xinyu Zheng.

Other Resources

  • F3 ("future-proof file format") - A next-generation columnar file format designed to be efficient, interoperable, and extensible.
  • WebAssembly (WASM) - A binary instruction format for a stack-based virtual machine, used in F3 to embed decoding algorithms.
  • Columnar formats - A type of file format where data is organized by columns rather than rows.
  • Metadata - Information that describes data, used in file formats like Parquet.
  • Thrift - A protocol used for encoding metadata in Parquet.
  • Row groups - Partitions of data within a Parquet file.
  • Dictionary encoding - A compression technique used in columnar formats.
  • SIMD instructions - Single Instruction, Multiple Data instructions used for parallel processing on CPUs.
  • SIMT - Single Instruction, Multiple Threads, used for parallel processing on GPUs.
  • OLAP workload - Online Analytical Processing workload, typically involving batch-oriented analytics.
  • Wide table projection - A query pattern where a table has many columns, but only a few are read.
  • Random access - Accessing data in a non-sequential manner, often used in ML training and serving.
  • Read amplification - An increase in the amount of data read from storage due to inefficient access patterns.
  • Table format - A metadata layer on top of file formats that provides features like ACID transactions and multi-version management.
  • ACID transactions - Atomicity, Consistency, Isolation, Durability, a set of properties for reliable transaction processing.
  • Composable data infrastructure - A system where different data components can be easily integrated.
  • Lineage tracking - The process of tracking the origin and transformations of data.
  • Data quality monitoring - The process of ensuring the accuracy and reliability of data.
  • Governance enforcement - The process of ensuring compliance with data policies and regulations.
  • AI - Artificial Intelligence, a field driving new data requirements.
  • Multimodal data - Data that includes different types of information, such as text, images, and audio.
  • Vector data - Data represented as numerical vectors, commonly used in machine learning.
  • Vector search - A type of search that uses vector representations to find similar items.
  • Scalable indexing - Indexing techniques designed to handle large datasets efficiently.
  • Secondary index - An index that provides an alternative way to access data in a table.
  • Scale index - A type of secondary index that considers data scanning along with filtering.
  • Zone map - A data structure storing per-chunk min/max statistics, used in formats like Parquet to skip irrelevant data during scans.
  • Bloom filter - A probabilistic data structure used to test whether an element is a member of a set.

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.