Resilience Engineering: Designing Systems for Failure, Not Prevention

Original Title: The Joy of Unplugging Cables: Kelly Shortridge on Security Resilience

In this conversation, Kelly Shortridge, CPO at Fastly, challenges conventional security wisdom by advocating for resilience engineering: designing systems for failure rather than solely for prevention. The core thesis is that perfect security is an illusion; true robustness lies in how systems recover from inevitable disruptions. This perspective reveals hidden consequences: a prevention-first focus breeds a false sense of control, leaving organizations brittle when unexpected events occur. The conversation also highlights how traditional compliance can actively hinder security and how software's unique capacity for simulation remains vastly underutilized. This analysis matters for technical leaders, security practitioners, and anyone building complex systems who wants to move beyond superficial metrics toward genuine, adaptable resilience.

The Illusion of Prevention: Embracing Failure as a Design Principle

The prevailing security mindset often centers on building impenetrable walls, a strategy that, as Kelly Shortridge argues, fundamentally misunderstands the nature of complex systems. The pursuit of perfect prevention is not only unattainable but actively detrimental, fostering a brittle confidence that crumbles under the weight of inevitable failures. This perspective shift, from prevention to resilience engineering, means acknowledging that "things will go wrong" and focusing on how to "minimize impact and make sure we can evolve to meet the moment." The hidden consequence of this prevention-first approach is a system that appears secure but is, in reality, fragile, unprepared for the "bizarre ways that they can fall apart" that reality consistently presents.

Consider the analogy of an airplane’s coffee maker. This seemingly innocuous appliance, in a highly specific scenario, could trigger a critical failure. This isn't a failure of prevention in the traditional sense; the coffee maker wasn't designed to be an attack vector. Instead, it's a failure of foresight, an example of how intricate systems can experience cascading failures from unexpected sources. Shortridge points out that "there are just so many examples of your best intentions. Reality is stranger than fiction." This highlights a core tenet of systems thinking: the interconnectedness of components means that even the most secure elements can be compromised by unforeseen interactions. The focus on prevention often overlooks these complex, emergent failure modes, leading to a false sense of security.

"First, there's no such thing as a perfectly secure system. I think there are so many esoteric failures out there."

-- Kelly Shortridge

The implication for software development is profound. Instead of chasing an unattainable perfect state, teams should embrace the concept of designing for failure. This means building systems that can gracefully degrade, recover quickly, and adapt to new conditions. The alternative is a system that, while perhaps passing every compliance check, buckles under the first unexpected stressor. This is where the true cost of a prevention-only mindset becomes apparent: it’s not just about the immediate security breach, but the long-term inability to adapt and survive.
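To make this concrete, here is a minimal sketch of what designing for failure can look like in application code, assuming a hypothetical flaky dependency: a simple circuit breaker that stops hammering an unhealthy service and degrades gracefully to a fallback instead of letting the failure cascade. The names and thresholds are illustrative, not a prescription.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks unhealthy, then probe for recovery."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before re-probing
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the breaker is open, skip the unhealthy dependency entirely
        # and serve the degraded-but-working fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = primary()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: recommendations degrade to a cached default instead of an outage.
breaker = CircuitBreaker()
# page = breaker.call(fetch_recommendations, lambda: CACHED_DEFAULTS)
```

The point is not this particular pattern but the posture: the failure path is designed, tested code rather than an afterthought.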

The Metrics Trap: When Confidence Becomes Complacency

A significant hurdle in achieving genuine resilience is the pervasive issue of "metrics theater." Security teams, often under pressure to demonstrate progress, can fall into the trap of tracking metrics that provide a superficial sense of control rather than genuine insight into system robustness. Shortridge identifies this as a key differentiator between teams that think they are resilient and those that are. The giveaway? "When the security team feels a sense of control, it probably means that they don't have a lot of resilience, because part of resilience is embracing the fact that there will be things well outside of your control." This embrace of the uncontrollable is a hallmark of systems thinking, recognizing that complex environments are inherently unpredictable.

The "cyber squirrel" anecdote, where squirrels caused widespread power grid failures, serves as a potent reminder of how simple, non-malicious actors can wreak havoc on critical infrastructure. These are not systems designed with malicious intent in mind, but they fail because they are not engineered to withstand unexpected physical intrusions. Similarly, in software, focusing solely on known threats and vulnerabilities leaves systems exposed to the "cyber squirrels" of the digital world -- the unexpected bugs, the emergent behaviors, the novel attack vectors.

"If you were trying to control everything and make things as deterministic as possible, you have already failed in my view, because the world is not deterministic."

-- Kelly Shortridge

The danger of metrics theater is that it masks underlying fragility. A dashboard showing a high percentage of vulnerabilities patched might obscure the fact that the most critical systems are still vulnerable to novel attacks, or that the patching process itself introduces new, unforeseen complexities. This creates a dangerous feedback loop: teams feel good about their metrics, stop probing for deeper issues, and become increasingly brittle. The true advantage, the "delayed payoff," comes from teams that resist this urge, that continue to probe and test, and that build resilience through continuous, often uncomfortable, experimentation.
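As a concrete contrast, here is a minimal sketch, using hypothetical incident data, of a metric oriented toward resilience rather than coverage: time to recovery computed from detection and restoration timestamps, instead of a patch-percentage number that can stay green while the system grows more brittle.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, restored) timestamps per disruption.
incidents = [
    (datetime(2024, 5, 1, 9, 14), datetime(2024, 5, 1, 9, 41)),
    (datetime(2024, 5, 9, 22, 3), datetime(2024, 5, 10, 1, 55)),
    (datetime(2024, 6, 2, 14, 30), datetime(2024, 6, 2, 14, 52)),
]

durations = sorted(restored - detected for detected, restored in incidents)
mttr = sum(durations, timedelta()) / len(durations)
worst = durations[-1]  # the tail matters more than the average

print(f"mean time to recovery: {mttr}, worst case: {worst}")
```

Recovery time is harder to game than a vulnerability count, which is precisely what makes it a more honest signal.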

The Software Advantage: Simulation as a Strategic Weapon

While physical systems face inherent limitations in testing and simulation, software offers a unique and largely untapped advantage: the ability to replicate and test complex environments with relative ease. Shortridge highlights this disparity, noting that while "you can't replicate a realistic clone of New York City to see if there's a certain level of trash blocking sewer drains," you can do precisely that in the digital realm. This capability is not merely a theoretical nicety; it's a strategic imperative for building resilient systems.

The reluctance to leverage this capability is a critical failure. Teams often shy away from "chaos engineering" experiments, fearing the disruption or the perceived complexity. However, Shortridge argues for starting with smaller, lower-impact experiments. Testing basic assumptions, like whether a login page still works without specific headers, can be done with minimal risk, especially when requests can be duplicated. This contrasts sharply with the "rm -rf the customer database" scenario, which is obviously a high-consequence experiment. The key is to recognize that experiments fall on a spectrum of risk and to begin somewhere on the low end.

"My favorite leveraging, actually Fastly's Compute, which is kind of like a high-performance serverless, you can think of it that way. It's just a little function that strips out cookies just to see like, 'Hey, does your login site work?'"

-- Kelly Shortridge
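The quote describes a Fastly Compute function; as a platform-agnostic stand-in, here is a minimal sketch of the same low-stakes experiment using only Python's standard library: request a login page twice, once with a cookie and once without, and flag any divergence. The URL and cookie value are placeholders.

```python
import urllib.error
import urllib.request

LOGIN_URL = "https://example.com/login"  # placeholder target

def probe(with_cookie: bool) -> int:
    """Request the login page, optionally with a (hypothetical) session cookie."""
    req = urllib.request.Request(LOGIN_URL)
    if with_cookie:
        req.add_header("Cookie", "session=placeholder")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

baseline = probe(with_cookie=True)
stripped = probe(with_cookie=False)
print(f"with cookie: {baseline}, without: {stripped}")
if baseline != stripped:
    print("Assumption violated: the login flow depends on cookies in a way "
          "worth understanding before an outage discovers it for you.")
```

Run against a staging environment, or on duplicated (shadow) traffic as described above, this stays low-risk while still testing a real assumption.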

This deliberate, controlled introduction of chaos is not about creating disorder for its own sake, but about understanding the system's boundaries and failure modes before they are exposed by real-world events. The "competitive advantage" here is forged in the proactive identification and mitigation of weaknesses. Teams that invest in this kind of simulation are building a deeper understanding of their systems, leading to more robust designs and faster recovery times when incidents inevitably occur. This is where the "delayed payoff" truly manifests -- in a system that is not just functional, but fundamentally resilient.

Compliance vs. Security: The Box-Checking Trap

A recurring theme is the tension between traditional compliance frameworks and actual security outcomes. Shortridge and others argue that well-intentioned regulations can ossify practices, leading to a situation where "being more compliant doesn't actually result in better security outcomes." The example of auditors insisting on SSH access, even when a security team has deemed it too risky and has removed it, perfectly illustrates this disconnect. The checklist, the "checkbox," becomes the goal, rather than the underlying security posture.

This creates a scenario where teams are tying "their investments and their spend to just checking those boxes, which is a disservice to their overall mission." The problem is compounded by the fact that these checklists are often based on outdated assumptions, remnants of a "status quo that no longer exists." The system evolves, but the compliance framework remains static, eroding resilience over time.

The implication is that security leaders must actively challenge these rigid frameworks. Shortridge advocates for a mindset of asking "But why?" -- probing the rationale behind compliance requirements and pushing back when they conflict with actual security best practices. This means treating "ignorance" as a superpower, as someone working at Microsoft described it: a fresh perspective is free to question long-held, unexamined assumptions.

"Well-intentioned regulation in this space very quickly calcifies and ossifies. Like, what helped in year zero through maybe even year three may end up actually eroding resilience long term."

-- Kelly Shortridge

The danger here is that compliance becomes a substitute for genuine security thinking. Teams focus on passing audits rather than on understanding and mitigating real-world risks. This leads to a false sense of security, where systems are technically "compliant" but remain vulnerable to sophisticated attacks or unexpected failures. The true advantage lies with organizations that can navigate this tension, prioritizing actual security outcomes while strategically meeting compliance obligations, rather than letting compliance dictate their security posture.

Actionable Takeaways for Building Resilience

  • Embrace Failure as a Design Principle: Shift from a purely preventative security mindset to one that actively designs for failure and recovery. Understand that systems will fail, and focus on minimizing impact and enabling rapid evolution.
  • Leverage Software's Simulation Advantage: Invest in chaos engineering and other simulation techniques. Start with small, low-risk experiments to test assumptions and understand system boundaries. This requires a cultural shift to embrace controlled disruption.
  • Question Compliance Dogma: Critically evaluate compliance requirements. Ask "why" and push back when rigid adherence to checklists hinders actual security outcomes. Prioritize security effectiveness over mere compliance.
  • Combat Metrics Theater: Identify and discard metrics that provide a false sense of security. Focus on metrics that genuinely reflect resilience, recovery time, and system adaptability, even if they are less "sexy" or harder to measure.
  • Foster Cross-Functional Collaboration: Encourage security teams to collaborate with platform engineering, SREs, and even "attackers" (internal red teams) to identify blind spots and develop more robust experiments.
  • Own Your Dependencies: Understand that adopting third-party services or libraries means accepting responsibility for their potential failures. Conduct thorough due diligence and build resilience into your integration points (see the sketch after this list).
  • Develop "Good Taste" Through Practice: For early-career professionals, focus on gaining hands-on experience by "pulling wires" and experimenting. This builds the judgment and intuition necessary for effective resilience engineering, which cannot be solely outsourced to automation or LLMs. This pays off in the long term by developing deeply capable engineers.
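On the "Own Your Dependencies" point above, here is a minimal sketch of building resilience into an integration boundary, assuming a hypothetical third-party callable: bounded, jittered retries followed by an explicit fallback, so the failure mode is chosen rather than inherited.

```python
import random
import time

def call_with_fallback(dependency, fallback, attempts: int = 3):
    """Invoke a third-party dependency with bounded retries, then fall back."""
    for attempt in range(attempts):
        try:
            return dependency()
        except Exception:
            # Exponential backoff with jitter, so many callers retrying at
            # once don't synchronize into a thundering herd.
            time.sleep((2 ** attempt) * 0.1 * random.random())
    return fallback()

# Usage: a flaky geocoding call degrades to a coarse default, not an error page.
# location = call_with_fallback(lambda: geocode(address), lambda: REGION_DEFAULT)
```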

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.