Making the Pivot to an SRE-Driven Organization

Category: Operational Resilience
Author: Cody Swidler
Tags: SRE, site reliability engineering, SLO, error budget, resilience, chaos engineering, reliability

If you're a risk or resilience leader walking into an organization where engineering runs on Site Reliability Engineering, you're going to feel friction fast. You'll ask for an annual DR test and get a confused look. You'll request an RTO and hear "we measure availability against an SLO." You'll propose a control and watch an engineer's eyes glaze over. The instinct is to assume the engineers don't take resilience seriously. The reality is usually the opposite: they take it more seriously than your framework does — they've just built a different, often better, system for it. The pivot is learning to work in their language instead of imposing yours.

What an SRE-Driven Organization Actually Believes

SRE, which came out of Google and is now widespread, treats reliability as an engineering discipline with hard numbers. A few of its core ideas reframe everything a resilience program cares about:

Service Level Objectives (SLOs) define the target reliability of a service in measurable terms — say, 99.95% availability over a rolling 28-day window. This is the RTO/RPO conversation, but continuous and quantified rather than an assumption written into a plan once a year.

Error budgets are the complement of the SLO (100% minus the target): the amount of unreliability you're allowed before you have to stop shipping features and fix stability. It's a risk appetite statement, expressed in a way engineering actually enforces against itself. That's a more operational risk appetite than most ERM programs ever achieve.
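To make the arithmetic concrete, here is a minimal sketch in Python of how an error budget falls out of an SLO. It assumes a purely time-based availability SLO; many teams measure against request counts instead, and the function names are illustrative rather than taken from any SRE tool. For the 99.95% over 28 days example above, the budget works out to about 20 minutes of downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime in the window: (1 - SLO) * window length."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    return 1.0 - downtime / error_budget(slo_target, window)


# Hypothetical service: 99.95% target, rolling 28-day window, 5 minutes of downtime so far.
window = timedelta(days=28)
budget = error_budget(0.9995, window)
left = budget_remaining(0.9995, window, timedelta(minutes=5))
print(f"budget: {budget.total_seconds() / 60:.2f} min, remaining: {left:.0%}")
```

The remaining fraction is the number a risk function can read the same way engineering does: how much appetite is left in this window.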

Toil reduction and automation mean SRE teams are biased against manual, repeatable work — including manual compliance work. A control that depends on someone remembering to do a thing every quarter is, to an SRE, a defect waiting to happen.

Blameless postmortems treat every incident as a systems-learning opportunity, not a search for who to punish. This is one of the most mature incident-learning cultures you'll encounter, and it's a gift to a resilience program — if you don't trample it with a compliance-flavored root-cause process.

Why Traditional GRC Collides With It

The friction is structural. Traditional resilience and GRC are periodic, document-centric, and point-in-time: the annual DR test, the policy attestation, the control reviewed once a year. SRE is continuous, telemetry-centric, and always-on. When you bring a point-in-time mindset into a continuous environment, you produce work that engineers correctly perceive as theater — a tabletop that simulates a failure their chaos experiments already trigger in production weekly, or a control attestation that captures a moment that was already stale by the time it was signed.

This is the same gap I've written about with automation in GRC: the value isn't in digitizing the annual ritual, it's in moving to continuous validation. SRE organizations already live there. Your job is to meet them, not drag them back.

How to Pivot

Map your requirements onto their instruments. Don't ask for an RTO — ask what the SLO is and whether it satisfies your recovery requirement. Don't ask whether they test recovery — ask to see the results of their last chaos experiment or game day. Most of what you need already exists as engineering telemetry; your job is to translate it into the risk and regulatory language your stakeholders need, not to make engineering generate a parallel set of artifacts.
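One way to picture that translation is a small script that takes recovery times engineering already records from game days or chaos experiments and compares them with the recovery requirement you own. This is a sketch under stated assumptions: the GameDayResult record, the service names, and the RTO values are all hypothetical, and a real team's experiment output will look different.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class GameDayResult:
    """Hypothetical record of one recovery exercise; not a real tool's schema."""
    service: str
    scenario: str
    time_to_recover: timedelta


def rto_evidence(results, rto_by_service):
    """Per service: worst observed recovery time and whether it meets the RTO."""
    worst = {}
    for r in results:
        if r.service not in worst or r.time_to_recover > worst[r.service]:
            worst[r.service] = r.time_to_recover
    # Services with no stated RTO are skipped rather than guessed at.
    return {
        svc: {"worst_recovery": t, "meets_rto": t <= rto_by_service[svc]}
        for svc, t in worst.items()
        if svc in rto_by_service
    }


# Illustrative data only.
results = [
    GameDayResult("payments", "zone failover", timedelta(minutes=12)),
    GameDayResult("payments", "database failover", timedelta(minutes=25)),
    GameDayResult("ledger", "zone failover", timedelta(minutes=8)),
]
rto = {"payments": timedelta(minutes=15), "ledger": timedelta(minutes=15)}
for service, evidence in rto_evidence(results, rto).items():
    print(service, evidence)
```

The design choice is to report the worst observed recovery instead of the average, since a recovery requirement is a ceiling and not a typical case.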

Treat the error budget as risk appetite. When you can express resilience expectations in terms engineering already enforces, you stop being the function that slows them down and become the function that helps calibrate the budget. That's a fundamentally different relationship.

Validate capability, don't certify documents. The highest-value thing a resilience leader does in an SRE shop is design exercises that test the things automated experiments don't: cross-team coordination, decision-making under ambiguity, communication, the human side of incident command. Chaos engineering proves the system fails over; a well-run tabletop proves the people do. Those are complementary, and the second is where you add value an SLO dashboard can't. A structured tabletop exercise playbook is the right instrument for the human layer.

Automate your own controls. Nothing earns credibility with SREs faster than killing toil, including your own. If a control can be evidenced by a query against telemetry instead of a quarterly manual attestation, make that change. Then you're reducing toil too, and that's a language they respect.
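As an illustration of a control evidenced by a query, the sketch below evaluates a hypothetical control ("every in-scope system has a successful restore test within the last 90 days") directly from event records. The control itself, the event fields (system, status, completed_at), and the system names are assumptions made up for the example; in practice the events would come from whatever telemetry store engineering already runs.

```python
from datetime import datetime, timedelta, timezone


def evaluate_restore_control(events, systems, now, max_age=timedelta(days=90)):
    """Pass only if every in-scope system has a recent successful restore test."""
    latest_success = {}
    for e in events:
        if e["status"] != "success":
            continue
        ts = e["completed_at"]
        if e["system"] not in latest_success or ts > latest_success[e["system"]]:
            latest_success[e["system"]] = ts
    # Exceptions: systems with no successful test, or only a stale one.
    exceptions = sorted(
        s for s in systems
        if s not in latest_success or now - latest_success[s] > max_age
    )
    return {
        "control_passed": not exceptions,
        "exceptions": exceptions,
        "evaluated_at": now.isoformat(),
    }


# Illustrative data only.
now = datetime(2024, 6, 30, tzinfo=timezone.utc)
events = [
    {"system": "db-a", "status": "success", "completed_at": now - timedelta(days=10)},
    {"system": "db-b", "status": "success", "completed_at": now - timedelta(days=120)},
    {"system": "db-c", "status": "failed", "completed_at": now - timedelta(days=5)},
]
print(evaluate_restore_control(events, ["db-a", "db-b", "db-c"], now))
```

Run on a schedule, a check like this produces a timestamped result and a list of exceptions each time, which is closer to continuous validation than a quarterly sign-off.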

This Is an Operating Model Shift, Not a Vocabulary Swap

It would be easy to read the above as "use their words." It's deeper than that. Pivoting to an SRE-driven organization means rethinking where resilience work lives, who owns it, and how it flows — engineering owns continuous reliability, and your function shifts from doing resilience to governing it, validating the human layer, and translating between the telemetry and the regulators. That's an operating model change, and it's the same pattern I keep returning to: the structure has to match how the organization actually works, not how the framework assumes it works.

Where to Start

Sit with an SRE team and ask them to walk you through their SLOs, their last incident postmortem, and their chaos engineering practice. Don't bring a control checklist. You're there to learn the system they've already built, and you'll usually find that roughly 70% of what your framework asks for already exists in a more rigorous form. Your value is in the remaining 30%: the cross-functional coordination, the regulatory translation, the enterprise risk view, and the human-response validation that pure reliability engineering doesn't cover.

The risk leaders who thrive in SRE organizations stopped trying to make engineering conform to GRC. They made GRC fluent in engineering. That's the pivot.

Cody Swidler is the founder of PivotRisk and a Principal Program Manager, Enterprise Resiliency at Apex Clearing. He has built and scaled GRC, resilience, and risk programs across Microsoft, Twilio, Box, Zayo, and Miro.
