Bridging Prototype and Production: Data Science Coding Practices
TL;DR
- Data scientists should extract reusable code into Python scripts to enable easier unit testing and prevent errors, as notebooks often lead to messy, unmanageable code that hinders reproducibility.
- Adopting configuration files for parameters, rather than hardcoding "magic numbers," significantly improves code readability, maintainability, and facilitates experimentation by centralizing changeable values.
- Utilizing tools like Marimo, which stores notebooks as plain .py files, bridges the gap between interactive exploration and production-ready code by enabling standard Python practices like unit testing and modular imports.
- Dependency management issues, particularly across different operating systems, can cause silent errors or outright failures; modern tools like uv offer more robust solutions with lock files for exact reproducibility.
- Switching to data manipulation libraries like Polars offers advantages in expressiveness and lazy execution, allowing for more efficient processing of large datasets without loading everything into memory (see the sketch after this list).
- Implementing a standardized project structure with dedicated folders for source code, notebooks, tests, and configurations, often facilitated by tools like Cookiecutter, streamlines project setup and onboarding.
- Committing code frequently and in small increments using Git is crucial for data science projects, allowing for easier rollback of errors and smoother integration through pull requests, preventing merge conflicts.
- Unit tests serve not only to verify functionality across various cases but also act as living documentation, clarifying a function's behavior and expected inputs/outputs for collaborators.
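To make the Polars bullet above concrete, here is a minimal sketch of lazy execution. The file name sales.csv and the column names are hypothetical; the calls follow current Polars conventions.

```python
import polars as pl

# Build a lazy query plan; no data is read or computed yet.
lazy_totals = (
    pl.scan_csv("sales.csv")                      # hypothetical input file
    .filter(pl.col("amount") > 0)                 # hypothetical column
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

# collect() runs the optimized plan; Polars can prune columns and push
# filters down instead of loading the whole file into memory.
result = lazy_totals.collect()
print(result)
```

Compared with an eager read, expressing the pipeline lazily lets the library decide how much data actually needs to be materialized.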
Deep Dive
The core argument of this podcast episode is that data scientists must adopt production-ready coding practices to bridge the gap between prototype development and deployable software, a shift that requires moving beyond the ad-hoc nature of notebooks towards more structured, reusable, and testable code. This transition is crucial for efficient collaboration with machine learning engineers and for ensuring the reliability and scalability of data science projects, ultimately impacting the speed and success of bringing data-driven insights into real-world applications.
The implications of this shift extend across several critical areas. Firstly, the reliance on Jupyter notebooks, while useful for exploratory analysis and visualization, creates significant reproducibility challenges. When code is executed out of order or dependencies are not meticulously managed, the output can vary, leading to silent errors or outright failures for others attempting to replicate the work. This directly impacts collaboration, as machine learning engineers often face substantial rework to make data science code production-ready. The episode highlights that adopting a project structure with dedicated source code, notebook, and test directories, alongside tools like uv for dependency management and pytest for testing, transforms this workflow. uv, in particular, addresses the common pain point of dependency conflicts and ensures consistent environments, a stark contrast to the often-unpredictable nature of pip installations within notebooks.
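As a rough illustration of that structure, the sketch below assumes a hypothetical project with src/, notebooks/, and tests/ directories, where a notebook and a pytest test both import the same function from the source package; the module and function names are made up for the example.

```python
# src/features.py -- a hypothetical module in the source directory
def add_price_per_unit(rows):
    """Return a copy of each row with a derived price_per_unit field."""
    return [
        {**row, "price_per_unit": row["price"] / row["quantity"]}
        for row in rows
    ]

# notebooks/explore.py (or a notebook cell) imports it for exploration:
#     from src.features import add_price_per_unit
#
# tests/test_features.py exercises the same function with pytest:
#     from src.features import add_price_per_unit
#
#     def test_add_price_per_unit():
#         rows = [{"price": 10.0, "quantity": 2}]
#         assert add_price_per_unit(rows)[0]["price_per_unit"] == 5.0
```

Because the notebook and the tests share one implementation, fixing a bug in src/ fixes it everywhere, and the exploratory code no longer has to be rewritten before it can be tested.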
Secondly, the concept of modularity and reusability is paramount. By refactoring reusable code into Python scripts and functions, data scientists can import these components into notebooks or other scripts, promoting cleaner code, easier testing, and reduced duplication. This is further reinforced by the introduction of configuration files (e.g., YAML) to manage hardcoded values, often termed "magic numbers." Separating these parameters from the core logic makes code more readable, maintainable, and experiment-friendly, as values can be adjusted without altering the underlying code. Tools like cookiecutter facilitate the creation of standardized project templates, automating the setup of these structured environments and reducing the initial overhead for new projects.
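One way to apply the configuration-file idea is sketched below, assuming a hypothetical config.yaml and PyYAML for loading; the parameter names are illustrative, not from the episode.

```python
# config.yaml (hypothetical):
#   test_size: 0.2
#   random_state: 42
#   min_rows: 100

import yaml  # PyYAML


def load_config(path="config.yaml"):
    """Read tunable parameters from a YAML file instead of hardcoding them."""
    with open(path) as f:
        return yaml.safe_load(f)


config = load_config()

# The logic stays free of magic numbers; changing an experiment only
# means editing config.yaml, not the code.
test_size = config["test_size"]
random_state = config["random_state"]
```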
Finally, the episode advocates for a paradigm shift towards writing testable code, even within the data science context. This involves breaking down functions into smaller, single-responsibility units and embracing testing frameworks like pytest. The benefits are twofold: tests verify that code behaves correctly across a range of inputs, catching subtle errors that manual inspection would miss, and they serve as living documentation of what a function expects and returns. The introduction of tools like Marimo is presented as a significant step forward, offering the interactive interface of notebooks while generating .py files under the hood. This hybrid approach allows for the benefits of scripting--like unit testing and dependency tracking--within a familiar notebook-like experience, automatically managing cell dependencies and preventing common pitfalls of traditional notebooks, such as repeated variable names that break reproducibility. The underlying theme is that while prototyping in notebooks is acceptable, production-bound code demands discipline, structure, and rigorous testing, transforming data science from an isolated craft into an integrated engineering discipline.
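A minimal sketch of that testing style, with a made-up single-responsibility function and pytest tests that double as documentation of its expected behavior:

```python
# src/cleaning.py (hypothetical)
def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    if not observed:
        raise ValueError("at least one non-missing value is required")
    mid = len(observed) // 2
    median = (
        observed[mid]
        if len(observed) % 2
        else (observed[mid - 1] + observed[mid]) / 2
    )
    return [median if v is None else v for v in values]


# tests/test_cleaning.py (hypothetical), run with `pytest`
import pytest


def test_fill_missing_with_median():
    assert fill_missing_with_median([1.0, None, 3.0]) == [1.0, 2.0, 3.0]


def test_all_missing_raises():
    with pytest.raises(ValueError):
        fill_missing_with_median([None, None])
```

The test names and assertions tell a collaborator, at a glance, what inputs the function accepts and how edge cases are handled.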
Action Items
- Create project template: Include a directory structure (src, notebooks, tests, config) and a pyproject.toml with uv and Ruff configurations for new data science projects.
- Implement configuration files: Separate hardcoded values into YAML files to improve code readability, reusability, and testability.
- Refactor notebook code: Move reusable functions and classes from notebooks into Python scripts for better modularity and testability.
- Audit dependency management: For 3-5 critical projects, compare pip and uv dependency resolution to identify potential silent errors or inconsistencies.
- Draft unit tests: For 5-10 core functions identified in refactored code, write unit tests using pytest to ensure functionality across various inputs.
Key Quotes
"I saw the pain of machine learning engineer so my manager he's a very good data scientist but he's not very good at coding so a lot of time he would the machine learning engineer needs to like rewrite a lot of his code so I want to like you know how do I write the code in a way that when I give it to him he can immediately like just minimal edit and then use it for production that's one thing another thing I found is it's a pain for me to you know write a lot of code in a notebook and then later try to understand what I wrote and also you know if I run my notebook in different orders then it will show different output so it's very messy at the same time it's not reusable I cannot it's really hard for me to reuse it for a different project that use similar pieces of code like how do I extract it out so that I can reuse different components and another thing that goes with it is you know you need testing right how do I make sure that this code works for different circumstances so the book is about kind of good Python practices you can say for data scientists including writing Python variables Python classes Python functions like when should you even use Python classes because sometimes you know you don't it's not really necessary for data science projects also you need testing also like hard coding right like a lot of time like in Python projects in a data science project you see a lot of global variables being used hard coded variables and um I offer the alternative which is using a configuration so you have a configuration file where you put all the values so you can very easily change it right but at the same time you have your all code logic in a different place so that it looks very clean and also you can unit test independent of the hard coded values."
Khuyen Tran explains that the motivation for her book, "Production Ready Data Science," stems from her experience as a data scientist in a startup. She observed the difficulties machine learning engineers faced when trying to use code written by data scientists, often requiring significant rewrites. Tran also highlights the personal challenges of managing code in notebooks, including issues with reproducibility, understanding past work, and code reusability. The book aims to provide data scientists with good Python practices to create code that is immediately usable by engineers and is more organized and testable.
"I really like uv because of the fact that it have two files one is the lock files that you know that lock the dependencies which is like if you wanted to reproduce the exact dependencies you can do it but at the same time it had a pi project com with more flexible dependencies so yeah that's become that's why I really enjoy uv is also very fast yeah yeah the lock files would solve that say potentially pandas issue you're mentioning before yeah now you know you're going to get the exact version yeah which which is good yeah and also I would say not the problem and I I wrote it in the book but like with pip right like compare uv with pip for example a lot of the time you know when you run pip install something and then like there's a dependency issue but it's like install anyway the dependency conflict that is not like something to be ignored and if there's like there could be like potential downstream issues that come with like different like you don't match dependencies that the package required but with uv like if that happens it's gonna stop you right so you can try to find a way to match the dependencies of the packages that you use together you feel like it it presents you with information that you can kind of work off of compared to pip yes in that sense okay yeah yeah yeah"
Khuyen Tran expresses a strong preference for the tool uv as a dependency manager. She highlights that uv uses both lock files for exact dependency reproduction and pyproject.toml for more flexible dependency management, contributing to its speed and reliability. Tran contrasts this with pip, noting that pip can sometimes install packages despite dependency conflicts, which can lead to downstream issues. uv, on the other hand, is described as stopping the process when conflicts arise, enabling users to resolve them and ensure better compatibility between packages.
"Marimo is solved that exact problems right and oh also another thing Jupiter notebook is json based and it's like oh Marimo it allows you to write it as a notebook but then at the same time it's a dot py file like that is amazing like you have the best of both worlds the interactivity and you have a python script and if it needs a python script under the hood then you can do whatever you want with it just like a python script you can do unit testing very easily you you can just import it the functions or whatever from that notebook right like the dot py notebook yes exactly good practice also like another thing I really like Marimo like if you change you know like in notebook if you change something in one cell you you need to remember like which one depend on that one so that it of course you can run everything but you probably don't want to run everything if your notebook is very heavy so you run the dependent cells and to do it manually it's very difficult right to remember that but with Marimo it automatically run the dependent cells I also like the fact that it's like keep track of which variable like the dependencies between cells and oh another thing is it integrated very well with uv right so like if you have uv and you have Marimo it's like very nice yeah because one problem with notebook is you would like pip install something right I see people do that like in the notebook like pip install something yeah and it might even do it like deeper into the notebook like in cell 25 or something like that you know and they don't really specify the version of the package right or like there's no what what I try to say is it says like it says let's say you pip install pandas and then later I don't know one year later maybe you run that again with pip install pandas but it will be a different pandas version right if it installed from scratch so there are dependencies issues that would a notebook but with Marimo what happened is you don't even do pip install so let's say you import something like you let's say you import plotly right and then it will detect it and you say oh hey you're missing plotly dependencies like you want to install it and then when you install it it is tracked so what happens is the next time it's a track in like the top of the notebook which is a py file so the next time you can do it it used with uv so uv I don't remember the exact command but like uv run something something but like it reproduce exactly because you know it the code stay the same if the dependencies are exactly the same you will have the same output as you've shared these techniques with other people what has been the reaction of it like what have people thought of you know when you you you do a lot of writing you get a lot of comments on your your site and maybe even back with your other data science comrades or colleagues what would they say about Marimo and the changes there they I think they like it it does take some time like when I heard I told them about the features like oh this is amazing I want
Resources
External Resources
Books
- "Production Ready Data Science" by Khuyen Tran - Mentioned as a handbook for data scientists on turning prototyping into production-ready code.
Articles & Papers
- "The top six Python libraries for visualizations" (Real Python) - Mentioned as an article previously featured on another show.
People
- Khuyen Tran - Author of "Production Ready Data Science" and guest on the podcast.
- Christopher Bailey - Host of The Real Python Podcast.
- Leodanis Pozo Ramos - Author of a Real Python tutorial on Python project management with uv.
- Rodrigo Girão Serrão - Presenter of the Real Python video course on Python project management with uv.
- Marco Gorelli - Creator of the Narwhals package for interoperability between DataFrame libraries.
Organizations & Institutions
- CodeCut AI - Khuyen Tran's website and newsletter.
- Real Python - Platform for Python tutorials, articles, and video courses.
- Medium - Blogging platform where Khuyen Tran previously wrote.
- PyPI - The Python Package Index, the official repository for Python packages.
Courses & Educational Resources
- Python Project Management with uv (Real Python Video Course) - Course covering the use of uv for package and project management, dependency installation, virtual environments, and publishing packages.
Tools & Software
- uv - Python package and project manager, alternative to Poetry and Pip.
- Ruff - Python formatter and linter.
- Cookiecutter - Tool for creating projects from templates.
- Pandas - Python library for data manipulation and analysis.
- Polars - DataFrame library known for speed and expressive syntax.
- PySpark - Python API for Apache Spark, used for distributed DataFrame processing.
- Matplotlib - Python plotting library.
- Plotly - Data visualization library.
- Marimo - Notebook environment that allows writing notebooks as .py files.
- Jupyter Notebook - Interactive computing environment.
- Git - Version control system.
- GitHub - Platform for version control and collaboration.
- Pytest - Python testing framework.
- YAML - Data serialization language used for configuration files.
- LangChain - Framework for developing applications powered by language models.
- LlamaIndex - Framework for connecting LLMs to external data for indexing and retrieval.
- LangGraph - Tool for building stateful, multi-actor applications with LLMs.
Other Resources
- CI pipeline (Continuous Integration) - Automated testing and integration process for code changes.
- Configuration files - Files used to store parameters and settings for projects.
- Magic numbers - Undocumented, hard-coded numerical values in code.
- Test-Driven Development (TDD) - Software development process where tests are written before code.
- LLM (Large Language Model) - AI models trained on vast amounts of text data.