DiskCache: Persistent, Cross-Process Caching for Python Applications
TL;DR
- DiskCache leverages SQLite for persistent, cross-process caching, enabling efficient reuse of expensive computations and data across applications without the overhead of network-bound services like Redis.
- By persisting cache data to disk, DiskCache overcomes the limitations of in-memory caches like functools.lru_cache, allowing cached results to survive process restarts and be shared among multiple application instances.
- DiskCache's decorator-based memoization and dictionary-like interface simplify the integration of caching into Python code, reducing redundant computations for LLM calls, data processing, and web request handling.
- The library offers advanced features like configurable eviction policies, custom serialization, and sharding (via FanoutCache), providing flexibility to optimize storage and performance for diverse use cases, from web apps to data science notebooks.
- DiskCache's performance can rival or exceed network-based caches like Redis for local operations, making it a compelling choice for developers seeking to improve application speed and reduce cloud infrastructure costs by utilizing local SSD storage.
- The library's integration with frameworks like Django via diskcache.DjangoCache allows for seamless adoption into existing web applications, replacing traditional caching backends with a disk-based, high-performance alternative.
- DiskCache's ability to handle complex Python objects through pickling, with optimizations for native types and support for custom serializers like JSON with compression, ensures broad compatibility and efficient data storage.
Deep Dive
DiskCache offers a powerful yet simple solution for accelerating Python applications by persisting computation results to disk, thereby avoiding redundant, expensive operations. While traditional caching often relies on in-memory solutions like functools.lru_cache or network-dependent services like Redis, DiskCache leverages SQLite to provide robust, cross-process, persistent storage, making it a versatile tool for everything from web applications to data science workflows. Its ease of integration, particularly via a decorator, lowers the barrier to entry for substantial performance gains, letting developers focus on core logic rather than cache management.
The core utility of DiskCache lies in its ability to act as a persistent dictionary, storing any pickleable Python object directly to disk. This persistence means that cached data survives application restarts, a critical advantage over in-memory caches. This capability is particularly impactful in scenarios involving computationally intensive tasks, such as LLM calls, expensive database queries, or complex data transformations. By caching the results of these operations, developers can dramatically reduce execution times and associated costs. For instance, repeated calls to an LLM for the same input can be instantly resolved from the cache, saving both time and money. The library also offers advanced features like cache expiration, configurable eviction policies, and support for multiple, independently managed cache files, allowing for fine-grained control over data management and storage.
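Under the hood, this persistent-dictionary behavior boils down to a SQLite table of pickled values. The sketch below is not diskcache's actual implementation (which adds transactions, eviction policies, and many optimizations); it is a minimal standard-library illustration of the pattern, with `PersistentDict` as a hypothetical name:

```python
import os
import pickle
import sqlite3
import tempfile

class PersistentDict:
    """A minimal persistent key-value store: SQLite for storage, pickle for values."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)"
        )

    def __setitem__(self, key, value):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, pickle.dumps(value)),
        )
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return pickle.loads(row[0])

    def close(self):
        self.conn.close()

path = os.path.join(tempfile.mkdtemp(), "cache.db")
d = PersistentDict(path)
d["answer"] = {"tokens": 42, "text": "expensive result"}
d.close()

# A fresh connection (standing in for a restarted process) sees the same data,
# which is exactly the property functools.lru_cache lacks.
d2 = PersistentDict(path)
print(d2["answer"]["tokens"])  # 42
```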
DiskCache's implications extend to how applications are architected and deployed. Its ability to function as a shared cache across multiple processes, facilitated by its SQLite backend and cross-process safety, is especially beneficial for web applications. Unlike in-memory caches that are isolated to individual processes, DiskCache allows a cluster of web servers to share cached data, preventing redundant computations and reducing load on backend services like databases. Furthermore, the library's performance, often rivaling or even exceeding that of networked in-memory caches like Redis due to the elimination of network latency, suggests a shift in how developers consider performance optimizations. The ability to use inexpensive disk space as a substitute for costly RAM for caching can lead to significantly leaner and more cost-effective deployments, especially in cloud environments where storage is considerably cheaper than memory. This makes DiskCache a compelling alternative for many caching needs, democratizing high-performance caching for a wider range of Python projects.
Action Items
- Audit disk cache usage: Identify 3-5 core functions or modules benefiting from caching (e.g., LLM calls, markdown rendering, data processing) to prioritize implementation.
- Implement disk cache decorator: Apply diskcache's memoize() decorator (on a Cache instance) to 2-3 identified functions to automatically cache results and reduce redundant computation.
- Create separate cache directories: For 3 distinct caching needs (e.g., LLM responses, static asset generation, data transformations), establish unique cache directories to improve organization and manageability.
- Evaluate cache eviction strategy: For caches exceeding 10,000 items, configure an eviction policy (e.g., LRU, TTL) to prevent excessive disk space usage.
- Benchmark custom serialization: For text-heavy data, implement and test a custom JSON or compressed serialization strategy (e.g., using zlib) for 1-2 key caching scenarios to assess disk space savings.
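On the eviction item: diskcache exposes per-item expiry and an eviction policy through its configuration, but the underlying TTL bookkeeping is simple enough to sketch. This toy in-memory version (class and method names are illustrative, not diskcache's API) shows the core idea of recording a deadline per entry and evicting lazily on read:

```python
import time

class TTLCache:
    """Toy TTL cache: each entry records a deadline and is
    dropped lazily the first time it is read after expiring."""

    def __init__(self, default_ttl=60.0):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (deadline, value)

    def set(self, key, value, ttl=None):
        deadline = time.monotonic() + (ttl if ttl is not None else self.default_ttl)
        self._store[key] = (deadline, value)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        deadline, value = entry
        if time.monotonic() >= deadline:  # expired: evict and report a miss
            del self._store[key]
            return default
        return value

cache = TTLCache()
cache.set("fresh", "hello", ttl=10.0)
cache.set("stale", "bye", ttl=0.0)  # already past its deadline
print(cache.get("fresh"), cache.get("stale"))
```

A real disk-backed cache additionally needs a size-based policy (e.g., LRU) so expired-but-unread entries don't accumulate on disk; diskcache handles that for you.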
Key Quotes
"Your cloud SSD is sitting there bored, and it would like a job. Today we're putting it to work with DiskCache, a simple, practical cache built on SQLite that can speed things up without spinning up Redis or other extra servers. Once you start to see what it can do, a universe of possibilities opens up. We're joined by Vincent Warmerdam to dive into DiskCache."
Vincent Warmerdam introduces diskcache as a practical caching solution built on SQLite. The author highlights that diskcache can significantly speed up operations without requiring additional services like Redis. This suggests diskcache offers a more integrated and potentially simpler approach to performance optimization.
"I think the classical thing you would do in Python is you have this decorator in functools, right? The lru_cache. That's the 'hello world' of caching. But the thing is that it's all in memory, so if you were to reboot your Python process, you lose all that caching. That's why people historically resorted to Redis, which I think is the most well-known caching tool."
Michael Kennedy explains that Python's built-in lru_cache is an in-memory solution, meaning cached data is lost upon process restart. He notes that historically, developers have turned to external tools like Redis for persistent caching, indicating a common need for solutions that survive application restarts.
"The simplest way I usually describe it: it really behaves like a dictionary, except you persist to disk, and under the hood it's using SQLite. That doesn't cover everything, but you get quite close."
Vincent Warmerdam describes diskcache's core functionality as a dictionary-like interface that persists data to disk using SQLite. This interpretation emphasizes diskcache's ease of use, making complex disk-based storage feel as familiar as working with in-memory Python dictionaries.
"Operationally, it's a separate thing that you have to run. It has to be secure, because if your data gets out... we're talking about Postgres here, not SQLite per se. For Postgres, yes: it's running somewhere people can SSH in if you're not careful, and you've got to be mindful of passwords and all that stuff. And it can go down; it could just become unavailable because you've screwed up something, right? It's a thing you have to manage. And the complexity of running your app: it used to be just one thing I could run in a Docker container, and now I've got different servers to coordinate, and there are firewalls. It just takes it so much higher in terms of complexity, whereas SQLite is a file, you know what I mean?"
Michael Kennedy contrasts the operational complexity of managing a separate database server like PostgreSQL with SQLite. He highlights that PostgreSQL requires managing security, availability, and inter-server communication, whereas SQLite, being file-based, simplifies deployment and management.
"So I have this notebooks repository where I have LLMs just write fun little notebooks. I always check the results, obviously, just to be clear on that. But one thing that was pretty cool to see: if you just use the normal data type and you pickle it, you get a certain size. If you just have a very short, normal Python dictionary, the difference is negligible; you shouldn't use this JSON trick. But the moment you get text-heavy, where there's a lot of text you're inputting and some repetition of characters, or if you really do something that's highly compressible, it is not unheard of to get 80-90% savings on your disk space."
Vincent Warmerdam discusses using custom serialization with diskcache, specifically mentioning JSON compression with zlib. He explains that for text-heavy data with repetition, this approach can lead to significant disk space savings, demonstrating how diskcache's flexibility can optimize storage for specific data types.
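The savings Warmerdam describes are easy to reproduce with the standard library alone. The snippet below compares plain pickling against JSON-plus-zlib for a deliberately repetitive, text-heavy payload; it demonstrates the size effect only, not diskcache's actual serializer hooks (diskcache lets you plug in a custom disk/serializer for this):

```python
import json
import pickle
import zlib

# Highly repetitive, text-heavy payload: the kind of data the quote
# says compresses well (LLM prompts/responses, rendered markdown, ...).
record = {
    "prompt": "summarize this document " * 200,
    "response": "the document says " * 300,
}

pickled = pickle.dumps(record)
compressed = zlib.compress(json.dumps(record).encode("utf-8"), level=6)

savings = 1 - len(compressed) / len(pickled)
print(f"pickle: {len(pickled)} bytes, json+zlib: {len(compressed)} bytes, "
      f"savings: {savings:.0%}")
```

For small, non-repetitive values the compression overhead isn't worth it, which matches the "negligible for a short dictionary" caveat in the quote.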
"The new NVMe SSD disks are so fast, and also they're not memory, and memory is expensive nowadays; people want to build data centers with it, I've heard. And on the cloud, this is another really interesting aspect, probably more of a web dev side of things: if you do lru_cache, or to a bigger degree run a whole separate server, even if it is just a Docker container that holds a bunch of stuff in memory, that's going to take more memory on your VM or your cloud deployment. If you just say, well, I have this 160-gig hard drive that's a high-speed NVMe drive, maybe I could just put a bunch of stuff there, and you can really thin down your deployments. Not just because it's not in memory in a cache somewhere; if you're not having any form of cache, you might be able to dramatically lower how much compute you need and avoid the memory. There are layers of how this could shave things off. It's one of those things of, can I pay for disk instead? Oh, that's a whole lot cheaper. What else do I gotta do? You just gotta write a decorator."
Michael Kennedy argues that with the increasing speed and decreasing cost of SSDs compared to RAM, using disk-based caching becomes more economically viable. He suggests that by leveraging disk cache, developers can potentially reduce memory requirements and overall compute needs, leading to thinner and more cost-effective deployments.
Resources
External Resources
Books
- "LLM Building Blocks" by Vincent Warmerdam - Mentioned as the source of an enthusiastic discussion about diskcache.
Articles & Papers
- "Code Archaeology" (Notebook) - Mentioned as a project that uses diskcache to generate charts showing code changes over time.
People
- Vincent Warmerdam - Guest, developer at Marimo, and author of the "LLM Building Blocks" course.
- Michael Kennedy - Host, PSF fellow, and coder with over 25 years of experience.
- Jeremy Howard - Mentioned in relation to the cloud provider Plash.
Organizations & Institutions
- Marimo - Company where Vincent Warmerdam works, developing modern Python notebooks.
- Plash - A Python-focused cloud provider offering persistent SQLite as a database.
- OpenAI - Mentioned in relation to open-weight LLM models.
- S3 - Mentioned as a blob storage service for backups.
- AWS (Amazon Web Services) - Mentioned in relation to S3.
- Digital Ocean - Mentioned as a cloud provider.
- Fly.io - Mentioned as a cloud provider.
- GitHub Pages - Platform used to host charts generated by the "Code Archaeology" notebook.
- Sentence Transformers - Mentioned as a project where charts show a move from an academic lab to Hugging Face.
- Hugging Face - Mentioned as the current home for the Sentence Transformers project.
- Scikit-learn - Mentioned as a project with a significant code change when a new maintainer took over.
- Django - Mentioned as a framework with a large codebase and a diskcache Django cache plugin.
- Redis - Mentioned as a popular caching tool.
- Memcached - Mentioned as another caching tool.
- Postgres - Mentioned as a database that can be used for caching.
- MongoDB - Mentioned as a database used in the host's website backend.
- YouTube - Mentioned in relation to parsing YouTube IDs for embedding.
- Amazon - Mentioned in relation to S3.
- Google - Mentioned in relation to its product lifecycle.
- Microsoft - Mentioned in relation to its product lifecycle.
- Python Software Foundation (PSF) - Michael Kennedy is a fellow.
- National Football League (NFL) - Mentioned in a previous podcast episode context.
- New England Patriots - Mentioned in a previous podcast episode context.
- Pro Football Focus (PFF) - Mentioned in a previous podcast episode context.
Tools & Software
- diskcache - A Python caching library built on SQLite.
- SQLite - A database engine used by diskcache.
- Redis - A popular in-memory data structure store, used as a cache.
- Memcached - An in-memory distributed cache.
- Postgres - A relational database management system.
- MongoDB - A NoSQL document database.
- Jupyter - A notebook environment.
- Docker - Containerization platform.
- Docker Compose - Tool for defining and running multi-container Docker applications.
- Git - Version control system.
- Sphinx - A documentation generator.
- MkDocs - Mentioned as a replacement for Sphinx in Scikit-learn.
- Boto3 - An AWS SDK for Python.
- Parquet - A columnar storage file format.
- Polars - A DataFrame library.
- DuckDB - An in-process analytical data management system.
- Celery - A distributed task queue.
- RabbitMQ - A message broker.
- Litestream - A service for backing up SQLite databases to S3.
Websites & Online Resources
- talkpython.fm - Website for the Talk Python To Me podcast, hosting past episodes.
- talkpython.fm/youtube - YouTube channel for live streams of the podcast.
- PyPI (Python Package Index) - Repository for Python packages.
- GitHub - Platform for hosting code repositories.
Other Resources
- SQL - Query language.
- LLMs (Large Language Models) - Mentioned as a use case for caching due to compute cost and slowness.
- Serialization - The process of converting an object into a format that can be stored or transmitted.
- Pickle Serialization - Python's built-in serialization method.
- HTTP Caching - Caching of web requests.
- CDN (Content Delivery Network) - Used for caching static assets.
- S3 (Amazon Simple Storage Service) - Blob storage service.
- Blob Storage - A type of cloud storage.
- Transactions - A sequence of operations performed as a single logical unit.
- Concurrency - The ability of different parts of a program to be executed out-of-order or in parallel.
- Thread Safety - Ensuring that a piece of code can be executed by multiple threads simultaneously without causing data corruption.
- Process Safety - Ensuring that a piece of code can be executed by multiple processes simultaneously without causing data corruption.
- FanoutCache - A diskcache cache type that uses sharding to distribute data across multiple SQLite files.
- Sharding - A database technique for partitioning data across multiple tables or servers.
- Django Cache - A Django-compatible cache interface using diskcache.
- Deque (Double-ended queue) - A data structure that allows adding and removing elements from both ends.
- Queue - A data structure that follows the First-In, First-Out (FIFO) principle.
- Task Queue - A system for managing and executing background tasks.
- diskcache Index - A persistent, ordered, mutable mapping backed by a cache directory.
- Barriers, Throttling, Semaphores - Synchronization primitives used in concurrent programming.
- Eviction Policies - Strategies for removing items from a cache when it is full.
- Tags - A feature in diskcache for categorizing cached items.
- Expiry - Setting a time limit for cached items.
- Max Key Size - Limiting the size of individual cache keys.
- Cache Size - Limiting the total disk space used by a cache.
- Least Recently Stored - Default eviction policy for diskcache.
- Least Recently Used (LRU) - An eviction policy that removes the least recently accessed item.
- JSON (JavaScript Object Notation) - A lightweight data-interchange format.
- Zlib - A data compression library.
- Text Compression - Reducing the size of text data.
- Embeddings - Numerical representations of data, often used in machine learning.
- Float16 - A half-precision floating-point data type with reduced precision.
- Quantiles - Values that divide a dataset into equal parts.
- Parquet - A columnar storage file format optimized for disk representation.
- Arrow - An in-memory columnar data format.
- Columnar Format - A data storage format where data is organized by columns rather than rows.
- Partitions - Dividing data into smaller, manageable chunks.
- ACID Compliance - Properties of database transactions (Atomicity, Consistency, Isolation, Durability).
- Linter - A tool that analyzes code for stylistic errors and potential bugs.
- Async - Asynchronous programming.
- Naming things, Cache invalidation, Off-by-one errors - Common challenges in software development.