Malloy: Human-Centric Data Interaction Beyond SQL
TL;DR
- Malloy treats SQL as an assembly language, enabling humans to interact with data through a more intuitive, hierarchical model rather than verbose, low-level SQL queries.
- Malloy's design prioritizes a user-centric, interactive data journey, preserving context so analysis remains open-ended and iterative, unlike traditional, dead-end data reduction methods.
- By integrating semantic modeling with a query language, Malloy allows data to be viewed and queried as meaningful information, not just raw storage, enriching the data landscape with each question asked.
- Malloy's composable nature and focus on human understanding make it a surprisingly effective target for LLM-generated queries, bridging natural language interactions with structured data analysis.
- Malloy aims to abstract away the complexities of relational algebra and storage details, allowing users to focus on expressing information needs and relationships, rather than SQL's mathematical constructs.
- The language's design incorporates familiar SQL keywords and gestures where appropriate, easing adoption for experienced data professionals while introducing novel concepts for more effective data interaction.
- Malloy's potential extends to transformation pipelines, enabling context to flow through stages and allowing for richer decision-making based on historical analysis, moving beyond SQL's table-centric limitations.
Deep Dive
Malloy aims to redefine data interaction by providing a human-centric, composable language that generates SQL, addressing the inherent limitations and verbosity of SQL itself. This approach shifts the focus from the intricacies of database storage and relational algebra to the natural way humans think about and explore data, thereby enabling more intuitive and maintainable analytics.
The core problem Malloy tackles is that SQL, while powerful for efficient query execution, often acts as a barrier to intuitive data exploration and analysis. Its mental model, shaped by its age and its roots in relational algebra, can be opaque to users trying to answer complex business questions. Malloy, born from its creators' experience at Looker, introduces a semantic modeling layer tightly coupled with a query language. This integration lets users define data in terms of business meaning rather than storage structure, preserving context throughout the analytical journey. As a result, data exploration becomes more interactive and open-ended: instead of producing a static table, each query enriches the data landscape, so subsequent questions can be asked with greater insight. This contrasts with traditional SQL-based pipelines, where context is lost once data is transformed into tables, limiting further iterative analysis.
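To make the idea of a semantic layer that compiles to SQL more concrete, here is a minimal Python sketch. It is an analogy only: it is not Malloy syntax and not Malloy's API. The Source class, the flights table, and every dimension and measure name are hypothetical, chosen purely for illustration. The point is that business definitions are written once and then reused by each new question, while SQL is only the generated output.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Source:
    """Hypothetical semantic model: a table plus named business definitions."""
    table: str
    dimensions: Dict[str, str] = field(default_factory=dict)  # name -> SQL expression
    measures: Dict[str, str] = field(default_factory=dict)    # name -> SQL aggregate

    def query(self, group_by: List[str], aggregate: List[str]) -> str:
        """Compile a request phrased in business terms into a SQL string."""
        select_parts = [f"{self.dimensions[d]} AS {d}" for d in group_by]
        select_parts += [f"{self.measures[m]} AS {m}" for m in aggregate]
        sql = f"SELECT {', '.join(select_parts)} FROM {self.table}"
        if group_by:
            # Group by the ordinal positions of the dimension columns.
            positions = ", ".join(str(i + 1) for i in range(len(group_by)))
            sql += f" GROUP BY {positions}"
        return sql


# Definitions are stated once (all names here are illustrative assumptions).
flights = Source(
    table="flights",
    dimensions={"carrier": "carrier", "dep_year": "EXTRACT(YEAR FROM dep_time)"},
    measures={"flight_count": "COUNT(*)", "total_distance": "SUM(distance)"},
)

# Each follow-up question reuses the same definitions instead of restating SQL.
by_carrier = flights.query(group_by=["carrier"], aggregate=["flight_count"])
by_year = flights.query(group_by=["dep_year"], aggregate=["flight_count", "total_distance"])
print(by_carrier)
print(by_year)
```

In this toy version the "context" that survives between questions is just the dictionary of definitions; the episode describes Malloy as carrying much richer context than this.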
Malloy's design prioritizes the developer experience: it is treated as a modern programming language whose code should be composable, readable, and maintainable. This contrasts with SQL's original, largely unmet goal of being accessible to business users. By adopting a hierarchical data model as the default, Malloy simplifies how users conceptualize and interact with complex datasets, regardless of the underlying storage format. The language is implemented in TypeScript and integrates with development tools such as VS Code. It compiles to SQL, which positions SQL, not Malloy, as the "assembly language" for data execution. A second-order implication is a potential shift in the skills data professionals need, toward proficiency in expressive data modeling languages and away from deep SQL optimization.
Malloy has also proven surprisingly effective as a target for Large Language Models (LLMs) that generate queries from natural-language prompts, suggesting a future of closer human-AI collaboration in data analysis. The project is open source and invites community contribution, with work aimed at maturing areas such as dimensional filtering, advanced aggregation strategies, and the remaining gaps that currently require escaping to raw SQL.
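The hierarchical default mentioned above can be illustrated with a small Python sketch that reshapes flat, SQL-style rows into nested records. This is an assumption-laden toy, not how Malloy implements nesting: the nest function, the sample rows, and the field names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical flat rows, shaped the way a SQL engine would return them.
rows = [
    {"carrier": "AA", "origin": "SFO", "flights": 3},
    {"carrier": "AA", "origin": "JFK", "flights": 5},
    {"carrier": "UA", "origin": "SFO", "flights": 2},
]


def nest(rows, parent_key, child_key, value_key):
    """Group flat rows into one record per parent, each with a nested child list."""
    children = defaultdict(list)
    for row in rows:
        children[row[parent_key]].append(
            {child_key: row[child_key], value_key: row[value_key]}
        )
    return [
        {
            parent_key: parent,
            "total": sum(child[value_key] for child in kids),
            "by_" + child_key: kids,
        }
        for parent, kids in children.items()
    ]


nested = nest(rows, "carrier", "origin", "flights")
for record in nested:
    print(record)
```

Each carrier becomes a single record that keeps its per-origin breakdown attached, so a follow-up question about one carrier's origins does not require going back to the flat table.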
Key Quotes
"Malloy was just, you know, we had some time to to sort of take a fresh cut at thinking about how to interact with data and our frustrations with Looker resulted in Malloy."
Michael Toy explains that Malloy emerged from a period of reflection after his experience at Looker. This indicates that Malloy is not just a new tool, but a response to perceived limitations or challenges encountered in previous data interaction paradigms.
"SQL is a computer language and we've learned a lot about computer languages since then and if someone were designing SQL today they they would write a different language that didn't have all those problems because we've learned a lot about software engineering and so I am passionate about all those things and and I've tried to fix absolutely all those problems in the design of Malloy."
Michael Toy highlights that SQL, being an older language, predates many modern software engineering principles. He asserts that Malloy was designed with these later learnings in mind, aiming to address the inherent problems in SQL by incorporating contemporary language design concepts.
"The core problem that you are trying to solve with Malloy is that all interactions with data should be interactive and open ended so that at one point the traditional way that data worked is there's a big room with spinning disks in them and then there's some people who write SQL to reduce that down to a smaller subset of data and then there's some people who worked on that smaller subset of data who then hand a curated version of that to decision makers and then all decisions are being made on three generations away from the original data."
Michael Toy articulates the central issue Malloy addresses: the traditional, often multi-step and removed, process of data analysis. He contrasts this with Malloy's goal of enabling direct, interactive, and open-ended data exploration, where insights are closer to the original data.
"We think of SQL as the assembly language of data like if you want to do things with data the machine that you that should be doing your data thing is SQL. We just shouldn't ever ask humans to write SQL that's just an unfair thing in the same way that we don't ask humans to write assembly language."
Michael Toy positions SQL as the underlying execution layer for data operations, analogous to assembly language for computer programs. He argues that while SQL is powerful for machines, it is too complex and inefficient for direct human use, advocating for a higher-level abstraction like Malloy.
"One of the most important activities in some sort of long lived maintainable piece of software is -- I open it up and throw it on a page -- and then with my eyeballs I understand something about its structure. And so it's like I am flying over a city and I can see that there's the football stadium and there's the high school."
Michael Toy emphasizes the importance of visual comprehension and structural clarity in software design. He uses the analogy of an aerial city view to illustrate how a well-structured language allows users to quickly grasp the overall organization and identify specific components without needing to examine every detail.
"Malloy is designed for humans to understand it turned out to be a really good target for LLMs to generate like I want to make a really good query experience for users and I want them to interact in a natural language and then I want to generate something which actually generates queries."
Michael Toy expresses surprise at Malloy's unexpected utility for Large Language Models (LLMs). He explains that Malloy's human-centric design, intended for intuitive data interaction, makes it an effective target for LLMs to generate queries from natural language prompts.
Resources
External Resources
Music
- "The Hug" by The Freak Fandango Orchestra, from the album "Love, death, and a drunken monkey" - Intro and outro music, used under a Creative Commons license.
People
- Michael Toy - Co-creator of Malloy, interviewed about the language and its development.
- Lloyd Tabb - Co-creator of Malloy, influential in its design and user experience focus.
- Tobias Macey - Host of the Data Engineering Podcast.
Organizations & Institutions
- Malloy - Open-source language for building composable and maintainable analytics and data models.
- Data Engineering Podcast - Podcast where the interview with Michael Toy took place.
- Google - Acquired Looker; Malloy was developed within Google as a research project.
- Meta - Malloy was used as a research project within Meta.
- Microsoft - Mentioned in the context of a video discussing language transitions.
- Netscape - Where JavaScript was invented; Michael Toy and Lloyd Tabb both worked there.
- Looker - Data platform co-founded by Lloyd Tabb, where Michael Toy also worked; their experience there was influential in Malloy's development.
- Cash App - Relies on Prefect for data operations.
- Cisco - Relies on Prefect for data operations.
- Whoop - Trusts Prefect for data operations.
- 1Password - Trusts Prefect for data operations.
- Datafold - Offers an AI-powered Migration Agent for data migrations.
- Bruin - An open-source framework for data integration and workflow management.
- MongoDB - A flexible, unified platform for developers, including AI applications.
Tools & Software
- Malloy - A modern language for building composable and maintainable analytics and data models.
- SQL - The assembly language of data, which Malloy generates.
- TypeScript - The language in which Malloy is currently written.
- VS Code - Has a Malloy VSCode Plugin for an IDE experience.
- Prefect - Orchestration tool used by Cash App and Cisco for ML models, streaming data, and real-time processing.
- Datafold's Migration Agent - An AI-powered solution for data migrations.
- Bruin - An open-source framework for data integration and workflow management.
- MongoDB - A platform for developers, including AI applications.
- dbt - Mentioned in relation to data transformation pipelines and migration credits.
- Ruby - The language Looker was initially written in.
Websites & Online Resources
- dataengineeringpodcast.com/prefect - URL for information on Prefect.
- dataengineeringpodcast.com/datafold - URL for information on Datafold.
- dataengineeringpodcast.com/bruin - URL for information on Bruin.
- MongoDB.com/Build - URL for starting with MongoDB.
- the-michael-toy.github.io/sudopoet/contact-me - Michael Toy's contact website.
- malloydata.dev - Official website for Malloy.
- linkedin.com/in/lloydtabb - Lloyd Tabb's LinkedIn profile.
- wikipedia.org/wiki/SQL - Wikipedia page for SQL.
- cloud.google.com/looker - Google Cloud Looker page.
- docs.cloud.google.com/looker/docs/what-is-lookml - Documentation for LookML.
- getdbt.com - Website for dbt.
- wikipedia.org/wiki/Relational_algebra - Wikipedia page for Relational Algebra.
- typescriptlang.org - Website for TypeScript.
- ruby-lang.org - Website for Ruby.
- marketplace.visualstudio.com/items?itemName=malloydata.malloy-vscode - Malloy VSCode Plugin.
- docs.malloydata.dev/documentation/setup/cli - Documentation for Malloy CLI.
- docs.malloydata.dev/documentation/language/expressions - Documentation for Malloy Pick Statement.
- freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug - Link to "The Hug" music.
- freemusicarchive.org/music/The_Freak_Fandango_Orchestra/ - Link to The Freak Fandango Orchestra's music.
- creativecommons.org/licenses/by-sa/3.0/ - Creative Commons license link.
Other Resources
- Hierarchical Data - The default mental model Malloy aims to use.
- Semantic Models - A core idea in Malloy, tightly coupled with a query language.
- Composable, Maintainable Language - A goal for Malloy.
- Query Language - Malloy aims to be a composable and maintainable query language.
- Assembly Layer - SQL is considered the assembly layer for data.
- Developer Experience - A focus area for Malloy, including its TypeScript implementation, VS Code integration, and CLI.
- Ecosystem - Malloy aims to build an ecosystem around its language.
- LLM-generated queries - Malloy is a good target for LLM-generated queries.
- Dimensional Filtering - A near-term roadmap area for Malloy.
- Aggregation Strategies - Malloy is working on better aggregation strategies.
- Relational Algebra - A concept that Malloy does not directly focus on in its design.
- ETL - Orchestration tools built for simple ETL are contrasted with Malloy's approach.
- Data Migrations - A problem addressed by Datafold's Migration Agent.
- Composable Data Infrastructure - A concept that Bruin aims to simplify.
- Lineage Tracking - A feature handled by Bruin.
- Data Quality Monitoring - A feature handled by Bruin.
- Governance Enforcement - A feature handled by Bruin.
- ACID Compliant - A characteristic of MongoDB.
- Enterprise-ready - A characteristic of MongoDB.
- AI Apps - MongoDB is designed to help ship AI apps.
- Fortune 500 - Many Fortune 500 companies trust MongoDB.
- Rows and Columns - MongoDB allows thinking beyond traditional rows and columns.
- Semantic Layer - Malloy is considered a semantic layer.
- Metrics Definitions - Malloy aims to integrate semantic modeling and metrics definitions.
- Data Vault - A data modeling technique mentioned in the context of hierarchical data.
- Anchor Modeling - A data modeling technique mentioned in the context of hierarchical data.
- Star Schemas - A data modeling technique mentioned in the context of hierarchical data.
- Snowflake Schemas - A data modeling technique mentioned in the context of hierarchical data.
- PII (Personally Identifiable Information) - Type details that can be flagged and propagated through data workflows.
- Open Source Project - Malloy is an open-source project.
- GitHub - Where Malloy's code is hosted.
- Slack - Where the Malloy community can interact.
- Podcast.__init__ - Another podcast from the same host, covering the Python language.
- AI Engineering Podcast - A podcast about building AI systems.