Moving From Manual Observability To Causal World Modeling
The Hidden Complexity of AI-Driven Production
AI coding agents have moved the bottleneck of software development from writing code to operating it. Generating code is now easy, but the systems that code runs in have become more fragile. The core issue is that AI-generated code often works in isolation but fails unpredictably where systems interact. This conversation argues that the traditional observability-first approach is no longer enough: the sheer volume of telemetry, combined with the complexity of agentic systems, has outpaced human cognitive capacity. Teams that move from manual dashboard monitoring to automated, causal world modeling will be better placed to keep systems reliable as their infrastructure scales.
The Illusion of Correctness at Scale
The primary risk of AI-augmented development is not bad code, but the decay of architectural understanding. As Anish Agarwal, CEO of Traversal, notes, AI agents produce code that often appears functionally correct to a human reviewer, creating a false sense of security. However, this local correctness masks a deeper, systemic vulnerability.
Despite doing what you want to do functionally for that one task that you wanted to do, your understanding how it is going to behave overall has dropped because you are just seeing did it get the thing done at that specific point in time.
-- Anish Agarwal
The consequence is a delayed cost. Teams ship faster, but they accumulate architectural debt that surfaces only when the system meets conditions the AI agent never accounted for, such as weather, traffic, or the behavior of other services. Conventional wisdom says that engineers should own their code in production, but Agarwal argues that in a microservices environment the code itself is secondary to the interaction patterns between services. When those interactions break, the traditional manual debugging process, which relies on tribal knowledge and static dashboards, fails to scale.
Why More Data Is a Red Herring
Many organizations respond to production instability by instrumenting more, hoping that a higher volume of logs or metrics will reveal the root cause. This is a trap. Agarwal suggests that the problem is not a lack of data, but a lack of contextual processing.
Most observability tools are built for human-centric dashboarding, which involves pre-decided cuts of data that an engineer monitors. Agentic systems, however, require a different query pattern. They need to perform soft joins of semantic data across logs, metrics, and traces to identify causal dependencies.
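As a rough illustration of what a soft join might look like, the sketch below pairs log and trace events that are close in time and similar in wording, instead of requiring an exact shared key. Everything here is an assumption made for illustration: the `Event` fields, the 30-second window, the 0.5 threshold, and the use of `difflib.SequenceMatcher` as a cheap stand-in for semantic similarity (a production system would more likely use embeddings). It is not the method Traversal uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Event:
    source: str       # "log", "metric", or "trace" (illustrative labels)
    service: str
    timestamp: float  # seconds since epoch
    text: str


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soft_join(left, right, window_s=30.0, min_score=0.5):
    """Pair events that are close in time and similar in wording.

    Unlike an exact join on a shared key, this tolerates missing trace IDs
    and inconsistent field names. Returns (left, right, score) tuples,
    best match first.
    """
    pairs = []
    for a in left:
        for b in right:
            if abs(a.timestamp - b.timestamp) > window_s:
                continue  # too far apart in time to be plausibly related
            score = similarity(a.text, b.text)
            if score >= min_score:
                pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


logs = [Event("log", "payments-api", 100.0, "payments-db connection timeout")]
traces = [
    Event("trace", "checkout", 110.0, "payments-db connection timeout after 5s"),
    Event("trace", "checkout", 500.0, "payments-db connection timeout after 5s"),
]
matches = soft_join(logs, traces)
```

Only the trace at timestamp 110.0 is paired with the log line; the identical trace at 500.0 falls outside the time window. The nested loop is quadratic, so a real implementation would need to bucket events by time before comparing them.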
The observability data that most companies have that are of a certain scale is more than enough to go deal with this issue. You do not need to go produce more data.
-- Anish Agarwal
By mining existing telemetry to build a production world model, teams can move beyond simple pattern matching. This model acts like a gym for reinforcement learning, allowing the system to simulate causal hops across the infrastructure without requiring constant manual re-instrumentation. The advantage belongs to those who shift from collecting data to mapping the causal relationships within it.
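A heavily simplified way to picture causal hops is a breadth-first walk over a dependency graph. The sketch below assumes the world model has already been reduced to a plain mapping from a failing service to the services it affects; the service names and the `causal_hops` helper are hypothetical, and a real world model would carry probabilities and timing rather than bare edges. It does not attempt the reinforcement-learning aspect described above.

```python
from collections import deque

# Hypothetical edges mined from telemetry: cause -> directly affected services.
CAUSAL_EDGES = {
    "payments-db": ["payments-api"],
    "payments-api": ["checkout", "billing-worker"],
    "checkout": ["web-frontend"],
}


def causal_hops(edges, origin, max_hops=3):
    """Breadth-first walk returning {service: hop distance from origin}."""
    reached = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if reached[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in edges.get(node, []):
            if nxt not in reached:
                reached[nxt] = reached[node] + 1
                queue.append(nxt)
    return reached


blast_radius = causal_hops(CAUSAL_EDGES, "payments-db")
```

Starting from `payments-db`, the walk reaches `payments-api` in one hop, `checkout` and `billing-worker` in two, and `web-frontend` in three. Lowering `max_hops` narrows the search to the immediate neighborhood of the fault.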
The Trade-off Between Autonomy and Trust
The path to self-driving production is not a sudden transition but a series of incremental steps in change management. Agarwal describes the progression as levels of autonomy, similar to those used for self-driving cars. The mistake many teams make is attempting to jump straight to L5, fully autonomous self-healing, without first establishing the world model required to prevent hallucinations.
The competitive advantage lies in the middle ground: using AI to automate the laborious parts of maintenance, such as summarizing incidents, pulling historical context, and executing known runbooks. In the short term this produces a "faster horse" effect, but the long-term advantage goes to teams that integrate these autonomous agents into their CI/CD pipeline to forecast how code will behave in production before it is deployed.
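One way to make incremental autonomy concrete is to gate each agent action on the level of autonomy a team has granted. The sketch below is an assumption throughout: the level names, the action names, and the mapping between them are invented for illustration and are not Agarwal's definitions. It shows the general idea of letting low-risk tasks run freely while higher-risk ones wait for a human.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    # Level names are illustrative, loosely mirroring driving-automation levels.
    SUGGEST = 1
    ASSIST = 2
    SCOPED = 3
    BROAD = 4
    SELF_HEALING = 5


# Hypothetical minimum level at which each action may run without approval.
REQUIRED_LEVEL = {
    "summarize_incident": Autonomy.SUGGEST,
    "pull_historical_context": Autonomy.SUGGEST,
    "run_known_runbook": Autonomy.SCOPED,
    "rollback_deploy": Autonomy.SELF_HEALING,
}


def decide(action: str, granted: Autonomy) -> str:
    """Return 'execute', 'needs_approval', or 'reject' for a proposed action."""
    required = REQUIRED_LEVEL.get(action)
    if required is None:
        return "reject"  # unknown actions are never run automatically
    return "execute" if granted >= required else "needs_approval"


summary_decision = decide("summarize_incident", Autonomy.ASSIST)
rollback_decision = decide("rollback_deploy", Autonomy.SCOPED)
```

With these assumed mappings, a team operating at the `ASSIST` level gets incident summaries automatically, while a rollback proposed at the `SCOPED` level is routed to a human for approval.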
Key Action Items
- Audit your human-in-the-loop bottlenecks: Identify the top 5 repetitive tasks your SREs perform during an incident, such as creating channels or querying logs. Automate these via LLM agents over the next quarter to free up cognitive bandwidth.
- Shift from dashboards to causal queries: Stop adding new metrics that nobody looks at. Instead, invest in tools or internal processes that can perform soft joins across your existing logs and traces to map system dependencies.
- Implement pre-flight evaluations: Begin evaluating AI-generated code not just for syntax, but for production behavior. Use your existing world model to forecast how new code will impact existing SLAs before merging. This pays off in 12 to 18 months as your system complexity grows.
- Unify data types: Stop treating logs, metrics, and traces as silos. Use LLMs to semantically link these data types to provide a unified view of system health.
- Adopt an L3 troubleshooting framework: For the next 6 months, focus on building agentic workflows for specific, well-defined classes of issues, such as Kubernetes networking errors, rather than trying to solve incident response in general.
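The pre-flight evaluation item above could take the form of a merge gate in CI. The sketch below assumes a world model that can already forecast p99 latency per service for a proposed change; the `Sla` type, the `preflight_violations` helper, the 10% safety margin, and all numbers are hypothetical. Producing the forecast itself is the hard part and is out of scope here.

```python
from dataclasses import dataclass


@dataclass
class Sla:
    service: str
    max_p99_ms: float


def preflight_violations(slas, forecast_p99_ms, margin=0.1):
    """Return services whose forecast p99 latency exceeds the SLA budget.

    The budget is the SLA limit reduced by a safety margin. A service with
    no forecast is treated as a violation, so missing data blocks the merge.
    """
    violations = []
    for sla in slas:
        budget = sla.max_p99_ms * (1 - margin)
        forecast = forecast_p99_ms.get(sla.service)
        if forecast is None or forecast > budget:
            violations.append(sla.service)
    return violations


slas = [Sla("checkout", 300.0), Sla("search", 200.0)]
forecast = {"checkout": 280.0, "search": 150.0}  # hypothetical model output
blocked = preflight_violations(slas, forecast)
```

Here `checkout` is flagged because its forecast of 280 ms exceeds the 270 ms budget, even though it is still under the 300 ms SLA, while `search` passes. A CI job could fail the merge whenever the returned list is non-empty.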