Business Continuity Operations
Business continuity operations is the discipline of keeping critical business processes running through disruption: cyberattack, natural disaster, supplier failure, pandemic, power outage. The deliverables are a Business Impact Analysis (which processes can the business survive losing, and for how long?), a Recovery Time Objective (RTO: how fast must each process be restored?), a Recovery Point Objective (RPO: how much data loss is acceptable?), and a tested playbook that names humans, systems, alternate sites, and communication channels. KnowMBA POV: business continuity plans that have never been tested under realistic conditions are documents, not capabilities. The only proof is a live exercise where the primary system is unavailable and the recovery actually works inside the stated RTO.
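To make the vocabulary concrete, RTO and RPO can be treated as testable constraints rather than labels. A minimal Python sketch; the process name and numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class BIAEntry:
    process: str
    rto_hours: float  # how fast the process must be restored
    rpo_hours: float  # how much data loss is acceptable, expressed as time

def recovery_proven(entry: BIAEntry, measured_recovery_hours: float,
                    measured_data_loss_hours: float) -> bool:
    """A plan is a capability only if a live exercise lands
    inside both objectives; anything else is a document."""
    return (measured_recovery_hours <= entry.rto_hours
            and measured_data_loss_hours <= entry.rpo_hours)

# Illustrative: an order system with a 4-hour RTO whose last test took 11 hours
orders = BIAEntry("order-management", rto_hours=4, rpo_hours=1)
print(recovery_proven(orders, 11, 0.5))  # False: the RTO was missed
```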
The Trap
The trap is treating BCM as a compliance binder maintained by a junior risk analyst, refreshed annually for the auditor and never exercised. When the real incident arrives, the call tree is stale, the alternate site contract has lapsed, the backup tapes have not been tested for restore in 18 months, and the people named in the plan have left the company. The other trap: optimizing for the disaster you have already imagined (fire in HQ, snowstorm) and being unprepared for the disaster you have not (ransomware that encrypts your backups, multi-region cloud outage, a supplier whose own continuity plan failed).
What to Do
Run a Business Impact Analysis to rank every process by Maximum Tolerable Downtime. For each Tier-1 process (revenue-critical, safety-critical, regulatory-critical), define RTO and RPO, document the recovery runbook, and pre-position the resources (alternate site, hot standby, manual workaround). Then test annually with at least one full-scale exercise per Tier-1 process: primary system off, clock running, executives in the room. Track Mean Time To Recover from each exercise and shrink it.
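A sketch of the triage step, assuming a simple three-tier cut on Maximum Tolerable Downtime; the thresholds are illustrative choices, not a DRII or ISO standard:

```python
def tier(mtd_hours: float) -> int:
    """Assign a continuity tier from Maximum Tolerable Downtime.
    Cut points are assumptions for illustration."""
    if mtd_hours <= 4:
        return 1  # revenue-, safety-, or regulatory-critical
    if mtd_hours <= 24:
        return 2
    return 3

# Hypothetical BIA output: process -> MTD in hours
processes = {"payments": 1, "e-commerce storefront": 4, "payroll": 72}
for name, mtd in sorted(processes.items(), key=lambda kv: kv[1]):
    print(f"{name}: MTD {mtd}h -> Tier {tier(mtd)}")
```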
Formula
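The plan's working quantities reduce to one constraint chain and one exposure estimate (a simplified model for sizing the stakes, not an industry-standard formula):

Backup or replication interval ≤ RPO (worst-case data loss equals the interval); RTO ≤ Maximum Tolerable Downtime; recovery is proven only when measured MTTR ≤ RTO.

Annual downtime exposure ≈ incident probability per year × expected recovery hours × cost per hour of downtime.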
In Practice
When NotPetya malware tore through Maersk in June 2017, every Windows machine on the corporate network (49,000 laptops, 4,000 servers, 2,500 applications) was wiped within seven minutes. The company moved 20% of global container shipping. Recovery worked only because a single domain controller in Ghana had been offline during a power outage and held an uninfected copy of Active Directory; a team flew the disk to the UK and rebuilt the directory from it. Manual processes (paper, WhatsApp, personal email) kept ports operating for ten days. Maersk later disclosed the incident cost ~$300M and rebuilt its IT infrastructure with a continuity-first design.
Pro Tips
- 01
Test the test. Most BCP exercises are choreographed: participants know the scenario, the system is restored under controlled conditions, and the result is a green box on a slide. A real exercise hides the scenario, allows the recovery to fail, and measures actual time-to-recover honestly.
- 02
Backups you have not test-restored within the last 90 days are theoretical backups. The 2017 Maersk recovery worked because of a single accidental offline DC, not because of a planned air-gapped backup; most companies are not that lucky. (A minimal staleness check is sketched after this list.)
- 03
Name a single Incident Commander with pre-delegated authority for each Tier-1 scenario. Decision-by-committee during an outage costs hours; the IC's job is to make calls and document them, not to seek consensus.
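A minimal sketch of the 90-day rule from tip 02 (names are hypothetical; "verified" must mean an actual test restore, not a backup-completion alert):

```python
from datetime import datetime, timedelta

def theoretical_backups(last_verified_restore: dict[str, datetime],
                        max_age_days: int = 90) -> list[str]:
    """Flag backup sets with no verified test restore inside the window."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    return [name for name, ts in last_verified_restore.items() if ts < cutoff]

# Hypothetical restore-test log: the offsite tapes have silently gone stale
log = {
    "core-db": datetime(2024, 5, 20),
    "tape-offsite": datetime(2024, 1, 3),
}
print(theoretical_backups(log))  # any set listed here is a future incident
```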
Myth vs Reality
Myth
"ISO 22301 certification means we are resilient"
Reality
ISO 22301 certifies that you have a documented BCMS, not that you can actually recover. Maersk had a mature BCM program before NotPetya. The certification audits paperwork; the disruption audits capability. Treat the certification as the floor, not the ceiling.
Myth
"Cloud providers handle continuity for us"
Reality
Cloud SLAs cover the provider's infrastructure, not your application. AWS us-east-1 outages (December 2021, June 2023) took down customers who had assumed multi-AZ was sufficient. True continuity requires multi-region or multi-cloud architecture, tested failover, and an operational runbook your on-call can execute at 3am, none of which the cloud provider supplies by default.
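What the smallest piece of that runbook might look like: a health-check-and-decide sketch with hypothetical endpoints, standing in for whatever DNS or load-balancer failover your architecture actually uses:

```python
import urllib.request

# Hypothetical per-region health endpoints
REGIONS = {
    "primary": "https://us-east-1.example.com/health",
    "standby": "https://eu-west-1.example.com/health",
}

def healthy(url: str, timeout: float = 3.0) -> bool:
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def target_region() -> str:
    """Fail over only when the primary is down and the standby is up.
    Traffic cutover itself stays in the runbook, owned by the on-call."""
    if healthy(REGIONS["primary"]):
        return "primary"
    if healthy(REGIONS["standby"]):
        return "standby"
    raise RuntimeError("both regions unhealthy: escalate per the BCP")
```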
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
Your company sets an RTO of 4 hours for the order management system. The last full failover test took 11 hours and required a vendor on-call. What is the right next action?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
Tier-1 Process Recovery Time Objective
Common RTO targets across industries; regulator expectations for financial services and healthcare are stricter.

Mission-critical (payments, trading): < 1 hour
Customer-facing (e-commerce, support): 1-4 hours
Internal operational systems: 4-24 hours
Back-office / batch: 1-3 days
Source: Disaster Recovery Institute International (DRII) and ISO 22301 practitioner benchmarks
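If you want to calibrate programmatically, the table above reduces to a lookup; the upper bounds are the table's, the code itself is a sketch:

```python
# Upper-bound RTO per category, in hours (from the benchmark table)
RTO_BANDS = {
    "mission-critical": 1,
    "customer-facing": 4,
    "internal-operational": 24,
    "back-office": 72,
}

def within_benchmark(category: str, stated_rto_hours: float) -> bool:
    return stated_rto_hours <= RTO_BANDS[category]

print(within_benchmark("customer-facing", 4))   # True
print(within_benchmark("mission-critical", 4))  # False: target is under 1 hour
```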
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Maersk
2017
On 27 June 2017 the NotPetya wiper malware spread from a Ukrainian tax-software update into Maersk's network and erased 49,000 Windows endpoints, 4,000 servers and 2,500 applications in roughly seven minutes. Active Directory was destroyed everywhere except a single domain controller in Ghana that had been offline during a local power outage. A team retrieved the disk and rebuilt AD from it. Container terminals ran on paper, WhatsApp and personal email for ten days. Total cost was disclosed at about $300M; Maersk subsequently rebuilt its IT estate with continuity-first architecture (immutable backups, segmented networks, scripted restore).
Endpoints wiped in 7 minutes: 49,000
Disclosed cost of incident: ~$300M
Manual operations period: ~10 days
Recovery enabled by: 1 accidentally offline domain controller
Recovery worked through luck, not design. The post-NotPetya rebuild (air-gapped backups, segmented networks, scripted restore) is the design that should have existed before the incident.
Hypothetical: Mid-sized US Hospital System
Composite, 2022-2024
A 6-hospital regional system suffers a ransomware attack that encrypts the EHR and most clinical systems. The published BCP had RTOs of 2-4 hours; actual restore takes 12 days because online backups were also encrypted and the offline tape rotation had failed silently for 7 weeks. Diversion of ambulances, cancelled elective procedures, and overtime staffing cost $40-60M before counting regulatory penalties and class actions.
Stated RTO (clinical systems): 2-4 hours
Actual recovery time: 12 days
Estimated direct cost: $40-60M
Backup tape rotation failure undetected for: 7 weeks
Untested backups are a future incident. Quarterly restore-tests with documented evidence are the only credible proof; backup completion alerts are not.
Decision scenario
The Untested BCP
You are the new CRO of a $1.5B specialty insurer. Your predecessor maintained an ISO 22301-certified BCP. In your first week, you discover the last full failover test of the policy administration system was 26 months ago and took 18 hours against a stated 4-hour RTO. The CIO says a real test would 'put the quarter at risk' and recommends another tabletop exercise instead.
Stated RTO (policy admin): 4 hours
Last actual recovery time: 18 hours
Months since last live test: 26
Daily premium written: $4M
Decision 1
The CIO offers a tabletop exercise as the alternative. The CFO wants to avoid quarter-end disruption. The board's audit committee chair has just asked, informally, whether the BCP would survive a real ransomware event.
Run the tabletop. It is lower-risk, satisfies the audit committee question, and the certification stays clean.
Schedule a live failover test in a low-volume window within 90 days, with a clear decision-rights sheet, an Incident Commander, and an agreement that a failed test is information, not a fireable offense. Pair it with capex for warm standby and automation to close the RTO gap. (Optimal)
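A back-of-envelope estimate of what that gap is worth, assuming premium writing stops entirely during an outage: $4M per day is roughly $167K per hour, so the 14-hour spread between the stated RTO (4 hours) and the last measured recovery (18 hours) puts about $2.3M of written premium at risk per incident, before claims-handling delays, regulatory exposure, and reputational cost.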
Beyond the concept
Turn Business Continuity Operations into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.