Resilience Engineering: Designing Systems for Failure, Not Prevention

Original Title: The Joy of Unplugging Cables: Kelly Shortridge on Security Resilience

In this conversation, Kelly Shortridge, CPO at Fastly, challenges conventional security wisdom by advocating for resilience engineering: designing systems for failure rather than solely for prevention. The core thesis is that perfect security is an illusion; true robustness lies in how systems recover from inevitable disruptions. This perspective reveals hidden consequences: a prevention-first focus breeds a false sense of control, leaving organizations brittle when unexpected events occur. The conversation also highlights how traditional compliance can actively hinder security and how software's unique capacity for simulation remains vastly underutilized. This analysis matters for technical leaders, security practitioners, and anyone building complex systems who wants to move beyond superficial metrics toward genuine, adaptable resilience.

The Illusion of Prevention: Embracing Failure as a Design Principle

The prevailing security mindset often centers on building impenetrable walls, a strategy that, as Kelly Shortridge argues, fundamentally misunderstands the nature of complex systems. The pursuit of perfect prevention is not only unattainable but actively detrimental, fostering a brittle confidence that crumbles under the weight of inevitable failures. This perspective shift, from prevention to resilience engineering, means acknowledging that "things will go wrong" and focusing on how to "minimize impact and make sure we can evolve to meet the moment." The hidden consequence of this prevention-first approach is a system that appears secure but is, in reality, fragile, unprepared for the "bizarre ways that they can fall apart" that reality consistently presents.

Consider the analogy of an airplane’s coffee maker. This seemingly innocuous appliance, in a highly specific scenario, could trigger a critical failure. This isn't a failure of prevention in the traditional sense; the coffee maker wasn't designed to be an attack vector. Instead, it's a failure of foresight, an example of how intricate systems can experience cascading failures from unexpected sources. Shortridge points out that "there are just so many examples of your best intentions. Reality is stranger than fiction." This highlights a core tenet of systems thinking: the interconnectedness of components means that even the most secure elements can be compromised by unforeseen interactions. The focus on prevention often overlooks these complex, emergent failure modes, leading to a false sense of security.

"First, there's no such thing as a perfectly secure system. I think there are so many esoteric failures out there."

-- Kelly Shortridge

The implication for software development is profound. Instead of chasing an unattainable perfect state, teams should embrace the concept of designing for failure. This means building systems that can gracefully degrade, recover quickly, and adapt to new conditions. The alternative is a system that, while perhaps passing every compliance check, buckles under the first unexpected stressor. This is where the true cost of a prevention-only mindset becomes apparent: it’s not just about the immediate security breach, but the long-term inability to adapt and survive.
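To make this concrete, here is a minimal sketch of what designing for failure can look like in application code, assuming a hypothetical flaky dependency: a simple circuit breaker that stops hammering an unhealthy service and degrades gracefully to a fallback instead of letting the failure cascade. The names and thresholds are illustrative, not a prescription.

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency looks unhealthy, then probe for recovery."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before re-probing
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        # While the breaker is open, skip the unhealthy dependency entirely
        # and serve the degraded-but-working fallback.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: allow one probe through
        try:
            result = primary()
            self.failures = 0  # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

# Usage: recommendations degrade to a cached default instead of an outage.
breaker = CircuitBreaker()
# page = breaker.call(fetch_recommendations, lambda: CACHED_DEFAULTS)
```

The point is not this particular pattern but the posture: the failure path is designed, tested code rather than an afterthought.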

The Metrics Trap: When Confidence Becomes Complacency

A significant hurdle in achieving genuine resilience is the pervasive issue of "metrics theater." Security teams, often under pressure to demonstrate progress, can fall into the trap of tracking metrics that provide a superficial sense of control rather than genuine insight into system robustness. Shortridge identifies this as a key differentiator between teams that think they are resilient and those that are. The giveaway? "When the security team feels a sense of control, it probably means that they don't have a lot of resilience, because part of resilience is embracing the fact that there will be things well outside of your control." This embrace of the uncontrollable is a hallmark of systems thinking, recognizing that complex environments are inherently unpredictable.

The "cyber squirrel" anecdote, where squirrels caused widespread power grid failures, serves as a potent reminder of how simple, non-malicious actors can wreak havoc on critical infrastructure. These are not systems designed with malicious intent in mind, but they fail because they are not engineered to withstand unexpected physical intrusions. Similarly, in software, focusing solely on known threats and vulnerabilities leaves systems exposed to the "cyber squirrels" of the digital world -- the unexpected bugs, the emergent behaviors, the novel attack vectors.

"If you were trying to control everything and make things as deterministic as possible, you have already failed in my view, because the world is not deterministic."

-- Kelly Shortridge

The danger of metrics theater is that it masks underlying fragility. A dashboard showing a high percentage of vulnerabilities patched might obscure the fact that the most critical systems are still vulnerable to novel attacks, or that the patching process itself introduces new, unforeseen complexities. This creates a dangerous feedback loop: teams feel good about their metrics, stop probing for deeper issues, and become increasingly brittle. The true advantage, the "delayed payoff," comes from teams that resist this urge, that continue to probe and test, and that build resilience through continuous, often uncomfortable, experimentation.
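As a concrete contrast, here is a minimal sketch, using hypothetical incident data, of a metric oriented toward resilience rather than coverage: time to recovery computed from detection and restoration timestamps, instead of a patch-percentage number that can stay green while the system grows more brittle.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, restored) timestamps per disruption.
incidents = [
    (datetime(2024, 5, 1, 9, 14), datetime(2024, 5, 1, 9, 41)),
    (datetime(2024, 5, 9, 22, 3), datetime(2024, 5, 10, 1, 55)),
    (datetime(2024, 6, 2, 14, 30), datetime(2024, 6, 2, 14, 52)),
]

durations = sorted(restored - detected for detected, restored in incidents)
mttr = sum(durations, timedelta()) / len(durations)
worst = durations[-1]  # the tail matters more than the average

print(f"mean time to recovery: {mttr}, worst case: {worst}")
```

Recovery time is harder to game than a vulnerability count, which is precisely what makes it a more honest signal.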

The Software Advantage: Simulation as a Strategic Weapon

While physical systems face inherent limitations in testing and simulation, software offers a unique and largely untapped advantage: the ability to replicate and test complex environments with relative ease. Shortridge highlights this disparity, noting that while "you can't replicate a realistic clone of New York City to see if there's a certain level of trash blocking sewer drains," you can do precisely that in the digital realm. This capability is not merely a theoretical nicety; it's a strategic imperative for building resilient systems.

The reluctance to leverage this capability is a critical failure. Teams often shy away from "chaos engineering" experiments, fearing the disruption or the perceived complexity. However, Shortridge argues for starting with smaller, lower-impact experiments. Testing basic assumptions, like whether a login page still works without specific headers, can be done with minimal risk, especially when requests can be duplicated. This contrasts sharply with the "rm -rf the customer database" scenario, which is obviously a high-consequence experiment. The key is to recognize that experiments fall on a spectrum of risk and to begin somewhere on the low end.

"My favorite leveraging, actually Fastly's Compute, which is kind of like a high-performance serverless, you can think of it that way. It's just a little function that strips out cookies just to see like, 'Hey, does your login site work?'"

-- Kelly Shortridge
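The quote describes a Fastly Compute function; as a platform-agnostic stand-in, here is a minimal sketch of the same low-stakes experiment using only Python's standard library: request a login page twice, once with a cookie and once without, and flag any divergence. The URL and cookie value are placeholders.

```python
import urllib.error
import urllib.request

LOGIN_URL = "https://example.com/login"  # placeholder target

def probe(with_cookie: bool) -> int:
    """Request the login page, optionally with a (hypothetical) session cookie."""
    req = urllib.request.Request(LOGIN_URL)
    if with_cookie:
        req.add_header("Cookie", "session=placeholder")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

baseline = probe(with_cookie=True)
stripped = probe(with_cookie=False)
print(f"with cookie: {baseline}, without: {stripped}")
if baseline != stripped:
    print("Assumption violated: the login flow depends on cookies in a way "
          "worth understanding before an outage discovers it for you.")
```

Run against a staging environment, or on duplicated (shadow) traffic as described above, this stays low-risk while still testing a real assumption.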

This deliberate, controlled introduction of chaos is not about creating disorder for its own sake, but about understanding the system's boundaries and failure modes before they are exposed by real-world events. The "competitive advantage" here is forged in the proactive identification and mitigation of weaknesses. Teams that invest in this kind of simulation are building a deeper understanding of their systems, leading to more robust designs and faster recovery times when incidents inevitably occur. This is where the "delayed payoff" truly manifests -- in a system that is not just functional, but fundamentally resilient.

Compliance vs. Security: The Box-Checking Trap

A recurring theme is the tension between traditional compliance frameworks and actual security outcomes. Shortridge and others argue that well-intentioned regulations can ossify practices, leading to a situation where "being more compliant doesn't actually result in better security outcomes." The example of auditors insisting on SSH access, even when a security team has deemed it too risky and has removed it, perfectly illustrates this disconnect. The checklist, the "checkbox," becomes the goal, rather than the underlying security posture.

This creates a scenario where teams are tying "their investments and their spend to just checking those boxes, which is a disservice to their overall mission." The problem is compounded by the fact that these checklists are often based on outdated assumptions, remnants of a "status quo that no longer exists." The system evolves, but the compliance framework remains static, eroding resilience over time.

The implication is that security leaders must actively challenge these rigid frameworks. Shortridge advocates for a mindset of asking "But why?" -- probing the rationale behind compliance requirements and pushing back when they conflict with actual security best practices. This means treating "ignorance" as a superpower, as someone working at Microsoft described it: a fresh perspective is free to question long-held, unexamined assumptions.

"Well-intentioned regulation in this space very quickly calcifies and ossifies. Like, what helped in year zero through maybe even year three may end up actually eroding resilience long term."

-- Kelly Shortridge

The danger here is that compliance becomes a substitute for genuine security thinking. Teams focus on passing audits rather than on understanding and mitigating real-world risks. This leads to a false sense of security, where systems are technically "compliant" but remain vulnerable to sophisticated attacks or unexpected failures. The true advantage lies with organizations that can navigate this tension, prioritizing actual security outcomes while strategically meeting compliance obligations, rather than letting compliance dictate their security posture.

Actionable Takeaways for Building Resilience

  • Embrace Failure as a Design Principle: Shift from a purely preventative security mindset to one that actively designs for failure and recovery. Understand that systems will fail, and focus on minimizing impact and enabling rapid evolution.
  • Leverage Software's Simulation Advantage: Invest in chaos engineering and other simulation techniques. Start with small, low-risk experiments to test assumptions and understand system boundaries. This requires a cultural shift to embrace controlled disruption.
  • Question Compliance Dogma: Critically evaluate compliance requirements. Ask "why" and push back when rigid adherence to checklists hinders actual security outcomes. Prioritize security effectiveness over mere compliance.
  • Combat Metrics Theater: Identify and discard metrics that provide a false sense of security. Focus on metrics that genuinely reflect resilience, recovery time, and system adaptability, even if they are less "sexy" or harder to measure.
  • Foster Cross-Functional Collaboration: Encourage security teams to collaborate with platform engineering, SREs, and even "attackers" (internal red teams) to identify blind spots and develop more robust experiments.
  • Own Your Dependencies: Understand that adopting third-party services or libraries means accepting responsibility for their potential failures. Conduct thorough due diligence and build resilience into your integration points (see the sketch after this list).
  • Develop "Good Taste" Through Practice: For early-career professionals, focus on gaining hands-on experience by "pulling wires" and experimenting. This builds the judgment and intuition necessary for effective resilience engineering, which cannot be solely outsourced to automation or LLMs. This pays off in the long term by developing deeply capable engineers.
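On the "Own Your Dependencies" point above, here is a minimal sketch of building resilience into an integration boundary, assuming a hypothetical third-party callable: bounded, jittered retries followed by an explicit fallback, so the failure mode is chosen rather than inherited.

```python
import random
import time

def call_with_fallback(dependency, fallback, attempts: int = 3):
    """Invoke a third-party dependency with bounded retries, then fall back."""
    for attempt in range(attempts):
        try:
            return dependency()
        except Exception:
            # Exponential backoff with jitter, so many callers retrying at
            # once don't synchronize into a thundering herd.
            time.sleep((2 ** attempt) * 0.1 * random.random())
    return fallback()

# Usage: a flaky geocoding call degrades to a coarse default, not an error page.
# location = call_with_fallback(lambda: geocode(address), lambda: REGION_DEFAULT)
```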

---
Handpicked links, AI-assisted summaries. Human judgment, machine efficiency.
This content is a personally curated review and synopsis derived from the original podcast episode.