Category: Operational Resilience Author: Cody Swidler Tags: operational resilience, DORA, FFIEC, ownership, program design, BC/DR

Every major regulatory framework touching operational resilience (DORA, FFIEC, the SEC's business continuity guidance, the FCA's SYSC rules) says roughly the same thing: know your important business services, understand what threatens them, and prove you can recover.

That's not complicated guidance. Most organizations have been aware of these expectations for years.

And yet, when a real disruption hits — a cloud outage, a ransomware incident, a critical third-party failure — a surprising number of programs that looked solid on paper fall apart in practice. Recovery times are missed. Escalation paths are unclear. The playbook assumes resources that aren't available. Leadership gets inconsistent information.

The frameworks aren't the problem. The ownership is.

What Breaks During a Real Incident

I've run resilience programs through actual activations — not just tabletop exercises, but real incidents affecting real operations. The failures are almost never technical. They're organizational.

The most common ones:

Nobody owns the decision. Plans define roles, but in a live incident, roles get fuzzy. Who declares a crisis? Who authorizes a failover? Who communicates with customers? When those decisions aren't pre-assigned — with specific names, not job titles — they stall. And in a time-sensitive recovery, stalling has a cost.

The plan was written by the continuity team, not by the people doing the work. Business impact analysis (BIA) outputs and recovery procedures built top-down, without deep input from the engineers and operators who actually run the systems, are aspirational documents. They reflect what someone thought the recovery would look like, not what it actually takes.

Testing validated process, not capability. A tabletop that walks through a scenario and concludes "the plan worked" is almost always a false positive. You haven't tested whether your teams can execute under pressure, whether your recovery tooling actually works at scale, or whether your recovery time objective (RTO) assumptions hold up against a real failure mode. Process validation is not capability validation.

Third parties are treated as out of scope. Most material outages at technology companies involve a vendor or cloud provider at some point in the chain. If your resilience program doesn't have an honest picture of your concentration risk, your single points of failure, and your contractual recovery obligations with critical vendors, your program has a blind spot in the most likely place a real incident will originate.
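One way to keep that picture honest is to hold the dependency map as data and check it mechanically instead of by inspection. The Python sketch below is a minimal, hypothetical illustration, not a prescribed method: the service names, vendor names, and the inventory structure are all assumptions, and it treats a dependency function as having an alternate only when a second vendor listed for it could carry the load on its own. It looks only at direct (third-party) dependencies, not at the vendors' own suppliers.

```python
from collections import defaultdict

# Hypothetical dependency map: each important business service lists, per
# dependency function, every vendor that could carry that function on its own.
SERVICES = {
    "trade-execution": {"hosting": ["CloudA"], "market-data": ["FeedX", "FeedY"]},
    "client-onboarding": {"hosting": ["CloudA"], "identity": ["IdVendor"]},
    "statements": {"hosting": ["CloudA", "CloudB"], "print": ["PrintCo"]},
}


def single_points_of_failure(services):
    """Return sorted (service, function, vendor) tuples where one vendor covers a function."""
    spofs = []
    for service, functions in services.items():
        for function, vendors in functions.items():
            if len(vendors) == 1:
                spofs.append((service, function, vendors[0]))
    return sorted(spofs)


def vendor_concentration(services, threshold=2):
    """Return vendors that at least `threshold` services depend on, mapped to those services."""
    dependents = defaultdict(set)
    for service, functions in services.items():
        for vendors in functions.values():
            for vendor in vendors:
                dependents[vendor].add(service)
    return {
        vendor: sorted(users)
        for vendor, users in dependents.items()
        if len(users) >= threshold
    }


if __name__ == "__main__":
    for spof in single_points_of_failure(SERVICES):
        print("single point of failure:", spof)
    print("concentration:", vendor_concentration(SERVICES))
```

In this toy inventory, the check flags CloudA as a dependency shared by all three services and as the only hosting option for two of them.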

Resilience Needs an Owner, Not a Committee

The organizational design question matters more than people want to admit. Resilience programs that are "owned" by a committee, a steering group, or a shared accountability model tend to be programs where nobody is actually accountable.

In practice, effective resilience programs have someone whose job it is to push — to schedule the tests, challenge the recovery assumptions, escalate when critical systems lack adequate plans, and tell engineering leadership when their architecture choices are creating unacceptable recovery exposure. That person needs authority, not just influence.

At organizations where I've built these programs, the most important structural change was almost always getting resilience into engineering conversations early — at the architecture review stage, before the system is built — rather than retrofitting continuity requirements onto systems that were never designed for them.

The SRE Connection

Site Reliability Engineering and operational resilience are solving the same problem from different angles. SRE focuses on availability, fault tolerance, and incident response. Resilience focuses on recovery, continuity, and regulatory obligations. They belong in the same conversation.

When resilience teams and SRE teams are aligned (sharing runbooks, conducting joint chaos experiments, and agreeing on recovery time and recovery point objectives, RTO and RPO, that are validated through actual testing rather than assumption), the result is a program that works because it is grounded in how the systems actually behave, not in how the documentation says they should.
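Agreed targets only mean something once each one is compared with what a test actually measured. The Python sketch below is a hypothetical illustration of that comparison: it checks RTO and RPO targets against measured results from recovery exercises and flags any service whose targets have never been tested. The class names, field names, and sample figures are assumptions, and it treats the most recent result per service as representative, which a real program might tighten (for example, by also requiring that the test be recent).

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Target:
    service: str
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


@dataclass
class ExerciseResult:
    service: str
    recovery_time: timedelta  # measured time from disruption to restored service
    data_loss: timedelta      # measured window of data that could not be recovered


def validate(targets, results):
    """Map each service to 'untested', 'met', or 'missed ...' based on its latest result."""
    # Assumes `results` is in chronological order, so later entries overwrite earlier ones.
    latest = {r.service: r for r in results}
    report = {}
    for t in targets:
        r = latest.get(t.service)
        if r is None:
            report[t.service] = "untested"
            continue
        missed = []
        if r.recovery_time > t.rto:
            missed.append("RTO")
        if r.data_loss > t.rpo:
            missed.append("RPO")
        report[t.service] = ("missed " + ", ".join(missed)) if missed else "met"
    return report


# Hypothetical sample data for illustration only.
TARGETS = [
    Target("trade-execution", rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    Target("statements", rto=timedelta(hours=24), rpo=timedelta(hours=4)),
    Target("client-onboarding", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
]
RESULTS = [
    ExerciseResult("trade-execution", recovery_time=timedelta(minutes=95),
                   data_loss=timedelta(minutes=2)),
    ExerciseResult("statements", recovery_time=timedelta(hours=6),
                   data_loss=timedelta(hours=1)),
]

if __name__ == "__main__":
    print(validate(TARGETS, RESULTS))
```

Treating "untested" as its own status, rather than folding it into "met", keeps an assumed RTO from passing as a validated one.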

What Regulators Are Actually Looking For

Whether you're operating under DORA, FFIEC, or SEC guidance, regulators aren't looking for perfect documentation. They're looking for evidence that your organization understands its dependencies, has tested its ability to recover from plausible scenarios, and has governance structures that ensure resilience is a real operational priority — not a compliance exercise.

That means testing programs with documented results. It means showing that findings from tests drive actual improvements. It means being able to articulate your important business services, your recovery objectives, and your current state of preparedness with specificity.

Organizations that treat resilience as an ownership problem — who owns this, who tests it, who is accountable for the outcome — tend to do well in those conversations. Organizations that treat it as a documentation problem tend to find out the hard way that the documentation doesn't hold up when someone looks closely.

Cody Swidler is the founder of PivotRisk and a Principal Program Manager, Enterprise Resiliency at Apex Clearing. He has designed and operated resilience programs at Microsoft, Twilio, Box, Zayo, and Miro.