Durable Execution: Essential Foundation for AI Agents
The AI Agent Era Demands a New Foundation: Why Durable Execution is No Longer Optional
A conversation with Tamar Avishai, CEO of Temporal, surfaces an infrastructure requirement that is often overlooked in the rush toward AI agents: durable execution. The immediate appeal of agents is their ability to automate complex tasks and generate code, but Avishai draws attention to what happens when they fail. When a long-running, token-intensive process dies partway through, the result is not a minor inconvenience but a real loss of money and time. The discussion is relevant to any technical leader, product manager, or engineer building or relying on AI agents, because systems that reliably survive inevitable failures are a precondition for autonomous agents doing useful work.
The Unseen Cost of Agentic Chaos: Why "Starting Over" Is a Catastrophic Failure
The current excitement around AI agents is palpable, with capabilities ranging from generating code to conducting deep research. However, the underlying infrastructure to support these increasingly complex and expensive operations is often an afterthought. Tamar Avishai, CEO of Temporal, argues that the prevailing mindset of simply "starting over" when an AI agent fails is a fundamentally flawed approach, especially as these agents undertake longer, more resource-intensive tasks. This isn't merely about efficiency; it's about economic viability and the practical realization of AI's promise.
Consider the analogy of a busy restaurant kitchen. Orders come in, each with a unique sequence of steps. A restaurant's primary concern is that every order is processed correctly and delivered to the customer exactly once. The kitchen staff doesn't necessarily focus on the minutiae of every potential failure -- a station going down, a chef taking a break, or a new staff member lacking context. Their focus is on the business logic of preparing each dish. Durable execution, as provided by Temporal, acts as the unseen orchestrator, ensuring that despite the inevitable chaos of a busy kitchen (or a complex distributed system), each order is processed to completion.
"At the end of the day, there is a very clear outcome that the restaurant is looking for: every order gets processed in a very specific sequence of steps, and eventually every order translates to being delivered to a customer exactly once. This is exactly what durable execution provides. We completely abstract out state management for you. As a developer building an order management system, you just code up your business logic, and we are the execution authority making sure every order gets processed exactly once in the presence of all sorts of chaos and failures in the system."
This abstraction is precisely what’s missing in many AI agent implementations. When an AI agent, tasked with a three-hour deep research job that consumes thousands of tokens, fails midway, the entire investment is lost. The initial implementation of Temporal, born out of necessity at Uber, addressed this kind of loss with a system that remembers the state of any running process and recovers it after a failure. This capability is no longer a "nice-to-have" but a "mission-critical" requirement as AI agents become more autonomous and expensive.
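To put a rough number on that loss, the sketch below compares the expected amount of work under two policies: restarting the whole job after any failure, and retrying only the step that failed. This is a simplified model and not something from the conversation. It assumes the job is a sequence of equal-cost steps that each succeed independently with the same probability, and the function names and example figures are hypothetical.

```python
def expected_steps_restart(n_steps: int, p_success: float) -> float:
    """Expected step executions when any failure restarts the job from step 1.

    Uses the standard result for the expected number of trials needed to
    see n consecutive successes: (1 - p^n) / ((1 - p) * p^n).
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    if p_success == 1.0:
        return float(n_steps)
    all_ok = p_success ** n_steps
    return (1.0 - all_ok) / ((1.0 - p_success) * all_ok)


def expected_steps_resume(n_steps: int, p_success: float) -> float:
    """Expected step executions when completed steps are remembered and
    only the failed step is retried: each step takes 1/p attempts on average."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return n_steps / p_success


if __name__ == "__main__":
    # Hypothetical job: 20 steps, each succeeding 95% of the time.
    steps, p = 20, 0.95
    print(f"restart from scratch: {expected_steps_restart(steps, p):.1f} step executions")
    print(f"resume from failure:  {expected_steps_resume(steps, p):.1f} step executions")
```

Under these assumptions, the 20-step job costs roughly 36 step executions on average when every failure means starting over, versus roughly 21 when the job resumes where it stopped. The gap widens quickly as jobs get longer or individual steps get less reliable.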
The Long Shadow of Distributed Systems: From Uber's Microservices to Agentic Loops
The genesis of Temporal lies in the complex, hyper-growth environment of Uber, which had, by some accounts, more microservices than engineers. Building stateful applications in such a distributed environment led to a significant state management mess. The initial solution, Cadence, was designed to guarantee the execution of functions from start to finish, abstracting away infrastructure failures for developers. This proved invaluable for use cases like a three-day retry policy for a banking API integration, where Kafka's stream-processing nature wasn't suited for such long-running retries.
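To make the three-day retry case concrete, here is a minimal sketch of how such a retry schedule could be computed. It is not Temporal's or Cadence's API. The parameter names (`initial`, `backoff`, `max_interval`, `deadline`) and the chosen values are assumptions for illustration.

```python
from datetime import timedelta
from typing import List


def retry_schedule(
    initial: timedelta,
    backoff: float,
    max_interval: timedelta,
    deadline: timedelta,
) -> List[timedelta]:
    """Return the wait before each retry, using exponential backoff capped at
    max_interval, and stopping before total wait time would exceed the deadline."""
    if initial <= timedelta(0) or backoff < 1.0:
        raise ValueError("initial must be positive and backoff must be >= 1")
    delays: List[timedelta] = []
    elapsed = timedelta(0)
    delay = initial
    while elapsed + delay <= deadline:
        delays.append(delay)
        elapsed += delay
        delay = min(delay * backoff, max_interval)
    return delays


if __name__ == "__main__":
    # Hypothetical policy: start at 1 minute, double each time,
    # never wait more than 1 hour, give up after 3 days.
    schedule = retry_schedule(
        initial=timedelta(minutes=1),
        backoff=2.0,
        max_interval=timedelta(hours=1),
        deadline=timedelta(days=3),
    )
    print(len(schedule), "retries, total wait", sum(schedule, timedelta(0)))
```

Computing the schedule is the easy part. The hard part, and the reason a durable execution system is useful here, is that the position in this schedule has to survive process restarts and deployments over the full three days, which an in-memory sleep loop cannot guarantee.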
A more extreme example at Uber was the loyalty program. Workflows ran "forever" for each rider, tracking trips and awarding points. The entire state for each rider was kept within the workflow itself, with no database backing. This system faced a critical bug where a trip completion event could reset a rider's points to zero. While the bug was rolled back, corrupted workflows remained. Temporal's event-sourcing foundation allowed them to go back in time, reset workflows to a specific point before the bug, and replay events with the corrected code, recovering all corrupted workflows. This capability--handling versioning, long-lived applications, and recovery from user errors--is where Temporal shines, offering a level of reliability that few other technologies can match.
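The reset-and-replay recovery described for the loyalty program can be illustrated with a toy event-sourced balance. This is a sketch of the general idea only, not Temporal's implementation: the `Event` type, the handler functions, the build IDs, and the point values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    kind: str
    points: int
    build_id: str  # which build of the code processed this event originally


Handler = Callable[[int, Event], int]


def apply_fixed(balance: int, event: Event) -> int:
    """Correct logic: a completed trip adds points."""
    return balance + event.points if event.kind == "trip_completed" else balance


def apply_buggy(balance: int, event: Event) -> int:
    """The bug: a completed trip resets the balance to zero."""
    return 0 if event.kind == "trip_completed" else balance


def run(history: List[Event], handlers: Dict[str, Handler]) -> int:
    """Recompute the balance the way it was actually produced, build by build."""
    balance = 0
    for event in history:
        balance = handlers[event.build_id](balance, event)
    return balance


def reset_and_replay(
    history: List[Event], handlers: Dict[str, Handler], bad_build: str, fixed: Handler
) -> int:
    """Reset to just before the first event the bad build handled,
    then replay every later event with the corrected code."""
    reset_point = next(
        (i for i, e in enumerate(history) if e.build_id == bad_build), len(history)
    )
    balance = run(history[:reset_point], handlers)
    for event in history[reset_point:]:
        balance = fixed(balance, event)
    return balance


HANDLERS: Dict[str, Handler] = {"build-1": apply_fixed, "build-2": apply_buggy}
HISTORY = [
    Event("trip_completed", 10, "build-1"),
    Event("trip_completed", 20, "build-1"),
    Event("trip_completed", 5, "build-2"),   # processed by the buggy build
    Event("trip_completed", 15, "build-2"),  # processed by the buggy build
]

if __name__ == "__main__":
    print("corrupted balance:", run(HISTORY, HANDLERS))
    print("recovered balance:", reset_and_replay(HISTORY, HANDLERS, "build-2", apply_fixed))
```

The recovery works only because the events themselves were never lost and each one carries the build that processed it. The balance is derived data, so it can be thrown away and rebuilt from the history.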
"Because we know, when the workflow made forward progress, what the build ID was, we could actually reset the workflow: go back in time, reset the workflow to that point, and replay all of the events after that point with the new code. And suddenly you recovered all of the corrupted workflows."
This ability to manage state and recover from errors is directly transferable to the agentic world. As AI agents move from short, interactive prompts to long-duration background processing, their need for durability and long-running state management becomes paramount. The "agentic loop"--where a model plans and invokes tools--maps directly to a Temporal workflow. This is particularly relevant as agents become more asynchronous and perform more meaningful work without continuous human oversight.
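As a rough illustration of how an agentic loop can be made durable, the sketch below journals each tool result to disk and, on restart, replays the journal instead of calling the tools again. It does not use Temporal. The `Journal` class, the `plan` function standing in for the model, and the tool names are hypothetical, and the planner is assumed to be deterministic given the observations so far (a real system would also need to record the model's outputs).

```python
import json
import os
import tempfile
from typing import Callable, Dict, List, Optional, Tuple


class Journal:
    """Append-only record of completed tool results, one JSON value per line."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: List[str] = []
        if os.path.exists(path):
            with open(path) as f:
                self.results = [json.loads(line) for line in f if line.strip()]

    def append(self, result: str) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(result) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the result durable before moving on
        self.results.append(result)


def plan(observations: List[str]) -> Optional[str]:
    """Stand-in for the model: choose the next tool, or None when done."""
    steps = ["search", "summarize"]
    return steps[len(observations)] if len(observations) < len(steps) else None


def run_agent(journal: Journal, tools: Dict[str, Callable[[List[str]], str]]) -> List[str]:
    observations = list(journal.results)  # replayed from the journal, not recomputed
    while (tool_name := plan(observations)) is not None:
        result = tools[tool_name](observations)
        journal.append(result)
        observations.append(result)
    return observations


def demo() -> Tuple[List[str], Dict[str, int]]:
    calls = {"search": 0, "summarize": 0}

    def search(obs: List[str]) -> str:
        calls["search"] += 1
        return "3 sources found"

    def summarize(obs: List[str]) -> str:
        calls["summarize"] += 1
        if calls["summarize"] == 1:
            raise RuntimeError("simulated worker crash")
        return f"summary of: {obs[-1]}"

    tools = {"search": search, "summarize": summarize}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.jsonl")
        try:
            run_agent(Journal(path), tools)
        except RuntimeError:
            pass  # the first attempt dies partway through
        final = run_agent(Journal(path), tools)  # a restarted process resumes here
    return final, calls


if __name__ == "__main__":
    print(demo())
```

In the demo, the expensive `search` step runs once even though the agent crashes and restarts, because its result is read back from the journal. Only the step that failed is executed again.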
The Agentic Stack: From "MS-DOS Era" to Swarms of Specialized Intelligence
Avishai likens the current state of AI agents to the "MS-DOS era"--functional within a sandbox but limited. The future, he predicts, is a "swarm of agents" collaborating to solve complex problems. This shift necessitates a robust execution authority capable of handling state management across multiple agents. Temporal is uniquely positioned here, not by creating another platform, but by integrating with existing ones and providing the underlying state management, scalability, and reliability.
The rise of coding agents, which are becoming increasingly long-lived and capable of producing complex software without direct human coding, is a prime example. OpenAI's Codex, for instance, leverages Temporal to orchestrate various tools in the background, handling millions of executions, spikes, and failures. This demands an execution authority that can manage these complex patterns at scale.
"The way we typically talk about that is: these models tell you what to do, but you need an execution authority. How does that work get done? Yeah, at a very large scale, that's the next problem, right?"
Deep research agents also benefit immensely. They often involve interaction with humans, requiring context to be maintained across multiple interactions. Temporal handles this state management, preventing loss of work. Furthermore, these agents frequently perform parallel tasks (e.g., scraping data, making API calls) and then consolidate the results, a complex orchestration logic that Temporal naturally supports. The recoverability and state management are crucial, especially given the token costs associated with these long-running research tasks.
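The parallel-then-consolidate pattern mentioned above can be sketched with plain `asyncio`. This shows only the shape of the orchestration, not durability: in a durable execution platform each branch's result would also be recorded so that a crash does not repeat finished branches. The `fetch_source` stub and the source names are hypothetical.

```python
import asyncio
from typing import Dict, List, Tuple


async def fetch_source(name: str) -> str:
    """Stand-in for a scrape or API call; names starting with 'bad' fail."""
    await asyncio.sleep(0.01)  # placeholder for network latency
    if name.startswith("bad"):
        raise ConnectionError(f"{name} unreachable")
    return f"notes from {name}"


async def gather_research(sources: List[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fan out one task per source, then consolidate successes and failures."""
    results = await asyncio.gather(
        *(fetch_source(name) for name in sources),
        return_exceptions=True,  # one failed branch should not cancel the others
    )
    findings: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for name, result in zip(sources, results):
        if isinstance(result, Exception):
            failures[name] = str(result)
        else:
            findings[name] = result
    return findings, failures


if __name__ == "__main__":
    found, failed = asyncio.run(gather_research(["docs", "bad-endpoint", "papers"]))
    print("findings:", found)
    print("failures:", failed)
```

Collecting failures separately, rather than letting the first exception abort everything, keeps the work that did succeed. That matters when each branch has already spent tokens or API quota.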
The trend towards specialization is also evident. Instead of reinventing the wheel for tasks like airline reservations or hotel bookings, specialized agents will handle these functions. The challenge then becomes orchestrating calls across dozens of these specialized agents to achieve end-to-end business outcomes--a distributed systems problem at scale. Temporal's vision of "durable RPC" aims to address this need for stitching together swarms of agents and managing state across them.
Beyond Execution: Observability as a Free Benefit
A significant, often overlooked, benefit of Temporal's event-sourcing model is the built-in audit trail. Every step of a workflow's execution is recorded, providing complete visibility into agent behavior. This is invaluable for debugging, improving agent performance, and even for business analytics. Organizations can map business transactions to Temporal workflows, sharing execution histories to visualize progress and identify issues. For non-deterministic AI agents, this observability is critical for understanding their actions and implementing guardrails.
"Everything in your system is auditable, and you get so much visibility into your business transactions now, just because of that nature."
This comprehensive record of execution history can power entirely new product surface areas for business analytics, identifying slow tool invocations or high failure rates. While observability has always been a challenge, especially in cloud architectures, the explosion of AI agents is pushing these boundaries to new scales, requiring creative solutions.
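As an example of the kind of analysis an execution history makes possible, the sketch below summarizes failure rates and average latency per tool. The event shape (`tool`, `status`, `duration_ms`) is an assumption made for illustration and is not Temporal's actual history format.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def summarize_history(events: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group recorded tool invocations by tool and compute simple health metrics."""
    by_tool: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for event in events:
        by_tool[event["tool"]].append(event)

    report: Dict[str, Dict[str, float]] = {}
    for tool, calls in by_tool.items():
        failures = sum(1 for c in calls if c["status"] == "failed")
        report[tool] = {
            "calls": len(calls),
            "failure_rate": failures / len(calls),
            "mean_ms": mean(c["duration_ms"] for c in calls),
        }
    return report


if __name__ == "__main__":
    # Hypothetical history records for two tools.
    history = [
        {"tool": "web_search", "status": "completed", "duration_ms": 100},
        {"tool": "web_search", "status": "failed", "duration_ms": 200},
        {"tool": "web_search", "status": "completed", "duration_ms": 300},
        {"tool": "code_exec", "status": "completed", "duration_ms": 50},
    ]
    for tool, stats in summarize_history(history).items():
        print(tool, stats)
```

Because the history is a by-product of execution, a report like this requires no extra instrumentation in the agent itself.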
The Future of Agents: From Sandboxes to Durable Swarms
The future of long-running agents, as envisioned by Avishai, involves breaking free from sandboxes and operating as a swarm. This necessitates a robust, industry-wide standard for durable RPC and asynchronous tool invocation. Temporal is investing in initiatives like Project Nexus to drive such standards, recognizing that this is a problem the entire industry faces.
While some believe SaaS is dead, Avishai argues that value will shift to APIs and agents that expose core business logic and data. Traditional enterprises, with their well-defined business processes and systems of record, will likely adopt a hybrid approach, using deterministic orchestration for core processes and agents for automatable tasks. Temporal's ability to provide both a system of record and an execution authority makes it applicable to both scenarios.
The advice for founders today, gleaned from Temporal's journey through the 2021 boom and subsequent market correction, is clear: focus on solving customer needs and delivering tangible value. While capital is important, resilience and strong margins are paramount. Temporal's success lies in its unwavering focus on providing a durable, reliable foundation for increasingly complex applications, a foundation that is becoming indispensable in the age of AI agents.
Key Action Items:
- Implement Durable Execution for Long-Running AI Tasks: Prioritize integrating platforms like Temporal for any AI agent performing tasks that are expensive to restart (e.g., deep research, complex code generation, multi-step workflows).
  - Immediate Action: Identify current AI agent workflows that are prone to failure and assess the cost of restarts.
  - Investment (6-12 months): Begin piloting durable execution solutions for these critical workflows.
- Abstract State Management: Treat AI agent state management as a core infrastructure problem, not an application-level concern.
  - Immediate Action: Document the state required for your key AI agents and how it is currently managed.
  - Investment (3-6 months): Evaluate solutions that provide automated, transparent state management for agent workflows.
- Leverage Observability for Agent Improvement: Utilize the built-in audit trails and execution histories provided by durable execution platforms to understand agent behavior, identify bottlenecks, and improve performance.
  - Immediate Action: Begin tracking key metrics for AI agent failures and restart points.
  - Investment (6-12 months): Integrate observability tools that can ingest and analyze workflow execution data for actionable insights.
- Plan for Agent Swarms and Durable RPC: Anticipate a future where multiple specialized agents collaborate. Investigate or contribute to standards for asynchronous communication and state management between these agents.
  - Longer-Term Investment (12-18 months): Research and experiment with patterns for orchestrating multiple agents.
- Focus on Core Business Value, Not Just Growth: Emulate Temporal's resilience by prioritizing solving genuine customer problems and building a sustainable business model, rather than purely chasing growth at all costs.
  - Immediate Action: Re-evaluate current growth strategies against customer value delivery and profitability.
  - Longer-Term Investment (Ongoing): Cultivate a culture of disciplined execution and financial prudence.
- Embrace Specialization and Orchestration: Recognize that the future involves specialized agents, and the key value will lie in orchestrating them effectively to deliver end-to-end business outcomes.
  - Immediate Action: Identify discrete tasks within your business processes that could be handled by specialized agents.
  - Investment (9-15 months): Explore how to integrate and orchestrate these specialized agents to create new business capabilities.
- Invest in Developer Experience for Production Readiness: Simplify the loop for developers to build, test, run, and deploy agents to production, ensuring reliability and scalability from the outset.
  - Immediate Action: Map the current developer workflow for building and deploying agents.
  - Investment (6-12 months): Invest in tooling and processes that streamline the path to production for AI agents.