Business Continuity Operations
Business continuity operations is the discipline of keeping critical business processes running through disruption: cyberattack, natural disaster, supplier failure, pandemic, power outage. The deliverables are a Business Impact Analysis (which processes can the business survive losing, and for how long?), a Recovery Time Objective (RTO: how fast must each process be restored?), a Recovery Point Objective (RPO: how much data loss is acceptable?), and a tested playbook that names humans, systems, alternate sites, and communication channels. KnowMBA POV: business continuity plans that have never been tested under realistic conditions are documents, not capabilities. The only proof is a live exercise where the primary system is unavailable and the recovery actually works inside the stated RTO.
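To make the vocabulary concrete, RTO and RPO can be treated as testable constraints rather than labels. A minimal Python sketch; the process name and numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class BIAEntry:
    process: str
    rto_hours: float  # how fast the process must be restored
    rpo_hours: float  # how much data loss is acceptable, expressed as time

def recovery_proven(entry: BIAEntry, measured_recovery_hours: float,
                    measured_data_loss_hours: float) -> bool:
    """A plan is a capability only if a live exercise lands
    inside both objectives; anything else is a document."""
    return (measured_recovery_hours <= entry.rto_hours
            and measured_data_loss_hours <= entry.rpo_hours)

# Illustrative: an order system with a 4-hour RTO whose last test took 11 hours
orders = BIAEntry("order-management", rto_hours=4, rpo_hours=1)
print(recovery_proven(orders, 11, 0.5))  # False: the RTO was missed
```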
The Trap
The trap is treating BCM as a compliance binder maintained by a junior risk analyst, refreshed annually for the auditor and never exercised. When the real incident arrives, the call tree is stale, the alternate site contract has lapsed, the backup tapes have not been tested for restore in 18 months, and the people named in the plan have left the company. The other trap: optimizing for the disaster you have already imagined (fire in HQ, snowstorm) and being unprepared for the disaster you have not (ransomware that encrypts your backups, multi-region cloud outage, a supplier whose own continuity plan failed).
What to Do
Run a Business Impact Analysis to rank every process by Maximum Tolerable Downtime. For each Tier-1 process (revenue-critical, safety-critical, regulatory-critical), define RTO and RPO, document the recovery runbook, and pre-position the resources (alternate site, hot standby, manual workaround). Then test annually with at least one full-scale exercise per Tier-1 process: primary system off, clock running, executives in the room. Track Mean Time To Recover from each exercise and shrink it.
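A sketch of the triage step, assuming a simple three-tier cut on Maximum Tolerable Downtime; the thresholds are illustrative choices, not a DRII or ISO standard:

```python
def tier(mtd_hours: float) -> int:
    """Assign a continuity tier from Maximum Tolerable Downtime.
    Cut points are assumptions for illustration."""
    if mtd_hours <= 4:
        return 1  # revenue-, safety-, or regulatory-critical
    if mtd_hours <= 24:
        return 2
    return 3

# Hypothetical BIA output: process -> MTD in hours
processes = {"payments": 1, "e-commerce storefront": 4, "payroll": 72}
for name, mtd in sorted(processes.items(), key=lambda kv: kv[1]):
    print(f"{name}: MTD {mtd}h -> Tier {tier(mtd)}")
```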
Formula
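The plan's working quantities reduce to one constraint chain and one exposure estimate (a simplified model for sizing the stakes, not an industry-standard formula):

Backup or replication interval ≤ RPO (worst-case data loss equals the interval); RTO ≤ Maximum Tolerable Downtime; recovery is proven only when measured MTTR ≤ RTO.

Annual downtime exposure ≈ incident probability per year × expected recovery hours × cost per hour of downtime.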
In Practice
When NotPetya malware tore through Maersk in June 2017, every Windows machine on the corporate network (49,000 laptops, 4,000 servers, 2,500 applications) was wiped within seven minutes. The company moved 20% of global container shipping. Recovery worked only because a single domain controller in Ghana had been offline during a power outage and held an uninfected copy of Active Directory; a team flew the disk to the UK and rebuilt the directory from it. Manual processes (paper, WhatsApp, personal email) kept ports operating for ten days. Maersk later disclosed the incident cost ~$300M and rebuilt its IT infrastructure with a continuity-first design.
Pro Tips
- 01
Test the test. Most BCP exercises are choreographed: participants know the scenario, the system is restored under controlled conditions, and the result is a green box on a slide. A real exercise hides the scenario, allows the recovery to fail, and measures actual time-to-recover honestly.
- 02
Backups you have not test-restored within the last 90 days are theoretical backups. The 2017 Maersk recovery worked because of a single accidental offline DC, not because of a planned air-gapped backup; most companies are not that lucky. (A minimal staleness check is sketched after this list.)
- 03
Name a single Incident Commander with pre-delegated authority for each Tier-1 scenario. Decision-by-committee during an outage costs hours; the IC's job is to make calls and document them, not to seek consensus.
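A minimal sketch of the 90-day rule from tip 02 (names are hypothetical; "verified" must mean an actual test restore, not a backup-completion alert):

```python
from datetime import datetime, timedelta

def theoretical_backups(last_verified_restore: dict[str, datetime],
                        max_age_days: int = 90) -> list[str]:
    """Flag backup sets with no verified test restore inside the window."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [name for name, ts in last_verified_restore.items() if ts < cutoff]

# Hypothetical restore-test log: the offsite tapes have silently gone stale
log = {
    "core-db": datetime(2024, 5, 20),
    "tape-offsite": datetime(2024, 1, 3),
}
print(theoretical_backups(log))  # any set listed here is a future incident
```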
Myth vs Reality
Myth
"ISO 22301 certification means we are resilient"
Reality
ISO 22301 certifies that you have a documented BCMS, not that you can actually recover. Maersk had a mature BCM program before NotPetya. The certification audits paperwork; the disruption audits capability. Treat the certification as the floor, not the ceiling.
Myth
"Cloud providers handle continuity for us"
Reality
Cloud SLAs cover the provider's infrastructure, not your application. AWS us-east-1 outages (December 2021, June 2023) took down customers who had assumed multi-AZ was sufficient. True continuity requires multi-region or multi-cloud architecture, tested failover, and an operational runbook your on-call can execute at 3am, none of which the cloud provider supplies by default.
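What the smallest piece of that runbook might look like: a health-check-and-decide sketch with hypothetical endpoints, standing in for whatever DNS or load-balancer failover your architecture actually uses:

```python
import urllib.request

# Hypothetical per-region health endpoints
REGIONS = {
    "primary": "https://us-east-1.example.com/health",
    "standby": "https://eu-west-1.example.com/health",
}

def healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def target_region() -> str:
    """Fail over only when the primary is down and the standby is up.
    Traffic cutover itself stays in the runbook, owned by the on-call."""
    if healthy(REGIONS["primary"]):
        return "primary"
    if healthy(REGIONS["standby"]):
        return "standby"
    raise RuntimeError("both regions unhealthy: escalate per the BCP")
```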
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your company sets an RTO of 4 hours for the order management system. The last full failover test took 11 hours and required a vendor on-call. What is the right next action?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Tier-1 Process Recovery Time Objective
Common RTO targets across industries; regulator expectations for financial services and healthcare are stricter.

Mission-critical (payments, trading): < 1 hour
Customer-facing (e-commerce, support): 1-4 hours
Internal operational systems: 4-24 hours
Back-office / batch: 1-3 days
Source: Disaster Recovery Institute International (DRII) and ISO 22301 practitioner benchmarks
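If you want to calibrate programmatically, the table above reduces to a lookup; the upper bounds are the table's, the code itself is a sketch:

```python
# Upper-bound RTO per category, in hours (from the benchmark table)
RTO_BANDS = {
    "mission-critical": 1,
    "customer-facing": 4,
    "internal-operational": 24,
    "back-office": 72,
}

def within_benchmark(category: str, stated_rto_hours: float) -> bool:
    return stated_rto_hours <= RTO_BANDS[category]

print(within_benchmark("customer-facing", 4))   # True
print(within_benchmark("mission-critical", 4))  # False: target is under 1 hour
```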
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Maersk
2017
On 27 June 2017 the NotPetya wiper malware spread from a Ukrainian tax-software update into Maersk's network and erased 49,000 Windows endpoints, 4,000 servers and 2,500 applications in roughly seven minutes. Active Directory was destroyed everywhere except a single domain controller in Ghana that had been offline during a local power outage. A team retrieved the disk and rebuilt AD from it. Container terminals ran on paper, WhatsApp and personal email for ten days. Total cost was disclosed at about $300M; Maersk subsequently rebuilt its IT estate with continuity-first architecture (immutable backups, segmented networks, scripted restore).
Endpoints wiped in 7 minutes: 49,000
Disclosed cost of incident: ~$300M
Manual operations period: ~10 days
Recovery enabled by: 1 accidentally offline domain controller
Recovery worked through luck, not design. The post-NotPetya rebuild (air-gapped backups, segmented networks, scripted restore) is the design that should have existed before the incident.
Hypothetical: Mid-sized US Hospital System
Composite, 2022-2024
A 6-hospital regional system suffers a ransomware attack that encrypts the EHR and most clinical systems. The published BCP had RTOs of 2-4 hours; actual restore takes 12 days because online backups were also encrypted and the offline tape rotation had failed silently for 7 weeks. Diversion of ambulances, cancelled elective procedures, and overtime staffing cost $40-60M before counting regulatory penalties and class actions.
Stated RTO (clinical systems): 2-4 hours
Actual recovery time: 12 days
Estimated direct cost: $40-60M
Backup tape rotation failure undetected for: 7 weeks
Untested backups are a future incident. Quarterly restore-tests with documented evidence are the only credible proof; backup completion alerts are not.
Decision scenario
The Untested BCP
You are the new CRO of a $1.5B specialty insurer. Your predecessor maintained an ISO 22301-certified BCP. In your first week, you discover the last full failover test of the policy administration system was 26 months ago and took 18 hours against a stated 4-hour RTO. The CIO says a real test would 'put the quarter at risk' and recommends another tabletop exercise instead.
Stated RTO (policy admin): 4 hours
Last actual recovery time: 18 hours
Months since last live test: 26
Daily premium written: $4M
Decision 1
The CIO offers a tabletop exercise as the alternative. The CFO wants to avoid quarter-end disruption. The board's audit committee chair has just asked, informally, whether the BCP would survive a real ransomware event.
Run the tabletop. It is lower-risk, satisfies the audit committee question, and the certification stays clean.
Schedule a live failover test in a low-volume window within 90 days, with a clear decision-rights sheet, an Incident Commander, and an agreement that a failed test is information, not a fireable offense. Pair it with capex for warm standby and automation to close the RTO gap. (Optimal)
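A back-of-envelope estimate of what that gap is worth, assuming premium writing stops entirely during an outage: $4M per day is roughly $167K per hour, so the 14-hour spread between the stated RTO (4 hours) and the last measured recovery (18 hours) puts about $2.3M of written premium at risk per incident, before claims-handling delays, regulatory exposure, and reputational cost.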
Beyond the concept
Turn Business Continuity Operations into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.