Backup & Disaster Recovery

RPO / RTO against ground truth, backup restoration tests, failover architecture, runbook quality, and incident-response readiness. Because you do not have backups until you have restored from them.

Audit category · 06 of 08

Scope

What this audit covers

Backup and DR programs accumulate entropy quietly. Everything looks fine on a green dashboard until the day a restoration is actually needed and the retention window was shorter than anyone remembered, or the replica was missing the last six months of schema changes. This audit is designed to find those problems before a real incident does.

RPO / RTO mapping

Recovery Point and Recovery Time objectives declared versus delivered — measured, not assumed. Per-service, per-tier, per-region.

Backup inventory

What is backed up, what is not, retention policies, cross-region replication, immutability / object-lock status, ransomware isolation.

Restoration testing

Actual test restores from actual backups. Timed. Validated. The only metric that matters.

Failover architecture

Warm / cold / hot standby posture per service, traffic-shift mechanics, DNS / global-accelerator / Front Door readiness, cross-region data parity.

Runbooks & IR

Documented DR runbooks, tabletop and live-exercise history, pager coverage, escalation trees, legal / PR notification paths.

Data-loss risk

Snapshots with short retention, single-region-only backups, unencrypted backups, backups in the same account as prod (a single compromise deletes both).

Why it matters

Untested backups are not backups

Every major cloud outage and every ransomware write-up repeats the same lesson: the organizations that recover are the ones that practiced, and the ones that practiced had the gaps surfaced ahead of time. Declared RPO and RTO targets that have never been validated are aspirational numbers on a slide.

SOC 2 Availability, ISO 27001 A.12.3 / A.17, HIPAA contingency-plan requirements, and essentially every enterprise buyer's security questionnaire now explicitly ask whether DR has been tested. The audit both closes the operational risk and generates the evidence.

Method

How we assess it

Four work streams, ordered so findings compound.

Stream 1

Business impact analysis

Map services to criticality tiers. Every tier gets an RPO / RTO target agreed with business owners, not inherited from engineering defaults.

Stream 2

Backup posture review

Inventory every backup policy, snapshot schedule, replica pair, and cross-region copy. Validate against declared RPOs.
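As a rough illustration of validating a snapshot schedule against a declared RPO, the sketch below computes the largest gap between consecutive snapshots, including the time since the newest one. The function names and inputs are hypothetical; in practice the timestamps would come from whatever backup inventory the platform exposes.

```python
from datetime import datetime, timedelta


def worst_case_gap(snapshot_times, now):
    """Largest interval between consecutive snapshots, counting the
    stretch from the newest snapshot up to `now`."""
    points = sorted(snapshot_times) + [now]
    return max(later - earlier for earlier, later in zip(points, points[1:]))


def meets_rpo(snapshot_times, now, declared_rpo):
    """True only if the worst observed gap fits inside the declared RPO.
    A service with no snapshots at all cannot meet any RPO."""
    if not snapshot_times:
        return False
    return worst_case_gap(snapshot_times, now) <= declared_rpo
```

The worst gap, not the average, is the number to compare: a schedule that usually runs hourly but skipped three hours once has delivered a three-hour RPO.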

Stream 3

Restoration test

Pick representative workloads. Restore to a clean environment. Measure time. Validate data integrity. Document failures.
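A minimal sketch of what "measure time, validate data integrity" can look like for a single file-based restore. Everything here is an assumption for illustration: `restore_fn` stands in for whatever actually performs the restore, and the expected checksum is assumed to have been recorded when the backup was taken.

```python
import hashlib
import time
from pathlib import Path


def sha256_of(path):
    """Stream the file in 1 MiB chunks so large restores do not exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_restore_test(restore_fn, restored_path, expected_sha256, rto_seconds):
    """Time a restore (hypothetical callable) and verify the restored file."""
    started = time.monotonic()
    restore_fn()
    elapsed = time.monotonic() - started

    restored = Path(restored_path)
    intact = restored.exists() and sha256_of(restored) == expected_sha256
    return {
        "elapsed_seconds": elapsed,
        "within_rto": elapsed <= rto_seconds,
        "data_intact": intact,
    }
```

A checksum only proves the bytes match; database restores usually also need an application-level check, such as row counts or a consistency query, which this sketch does not attempt.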

Stream 4

DR exercise

Tabletop or live (your choice). Run a realistic scenario — region failure, ransomware, account compromise — and observe how the team, the runbooks, and the infrastructure respond.

Deliverables

What you get

RPO / RTO matrix — declared versus delivered, per service, with the evidence behind the delivered numbers.
Backup posture report — coverage gaps, retention mismatches, immutability and isolation posture, ransomware-resilience rating.
Restoration-test results — timed runs, data-integrity validation, failure modes encountered, remediation per finding.
DR runbook review — gap list, recommended structure, example runbooks for the top three scenarios if none exist.
Exercise report — tabletop or live-exercise findings, people / process / technology gaps separated, after-action recommendations.
Executive summary — one page. The three investments with the largest resilience lift.

Patterns

Common findings

Backups exist but have never been restored.

Snapshots are taken. Lifecycle policies run. Nobody has actually restored one in twelve months. A first restore in anger is a terrible time to discover that the restored EBS volume will not boot or the database dump is corrupt.

Backups live in the same account as production.

A compromised production account with admin-equivalent credentials can delete the backups too. Backups belong in a separate account, ideally a separate cloud organization, with object-lock / immutability enabled.
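The isolation and data-loss checks described in this audit can be expressed as simple rules over a backup inventory. The sketch below is illustrative only: the `BackupRecord` fields are hypothetical, and the 30-day retention floor is an assumed default, not a recommendation for any particular workload.

```python
from dataclasses import dataclass


@dataclass
class BackupRecord:
    # Hypothetical inventory row; field names are assumptions.
    name: str
    account_id: str
    regions: set
    encrypted: bool
    object_lock: bool
    retention_days: int


def risk_flags(record, prod_account_id, min_retention_days=30):
    """Return the data-loss risk flags that apply to one backup."""
    flags = []
    if record.account_id == prod_account_id:
        flags.append("same-account-as-prod")
    if len(record.regions) < 2:
        flags.append("single-region")
    if not record.encrypted:
        flags.append("unencrypted")
    if not record.object_lock:
        flags.append("not-immutable")
    if record.retention_days < min_retention_days:
        flags.append("short-retention")
    return flags
```

A backup that returns an empty list has passed these five checks and nothing more; it still has to be restored before it counts.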

RPO claimed at minutes, delivered in hours.

Customers are told the RPO is fifteen minutes. The replica lag dashboard shows the replica is typically four hours behind. The declared number is aspirational.

DR region tested once during launch, never since.

Original failover exercise passed, confidence declared, configuration drifted. Services added since have no counterparts in the DR region. Quarterly live exercises catch this; annual is the minimum.
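Drift of this kind is straightforward to detect once both regions can be listed. A minimal sketch, assuming each region's deployed services are available as a list of names (how that list is obtained is left out and will vary by platform):

```python
def dr_drift(primary_services, dr_services):
    """Compare service names deployed in the primary and DR regions."""
    primary, dr = set(primary_services), set(dr_services)
    return {
        # Services that would not exist after a failover.
        "missing_in_dr": sorted(primary - dr),
        # Leftovers in DR that no longer exist in the primary region.
        "orphaned_in_dr": sorted(dr - primary),
    }
```

Matching names only shows that a counterpart exists. Version, configuration, and data parity need their own checks, which is what a live exercise is for.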

Runbooks assume knowledge only the original author had.

“Then run the failover script”: no link, no path, no discussion of what happens if it fails. Runbooks should be written for a pager-response engineer at 3 a.m., not a domain expert at 2 p.m.

No separation between backup and ransomware recovery.

Recovery from an accidental delete and recovery from a destructive compromise are not the same problem. The former needs speed; the latter needs immutable, isolated, and verifiably clean copies.

FAQ

Questions we get asked

Does the restoration test affect production?

No. Restores are performed into an isolated recovery environment with its own accounts, networks, and IAM. Production remains unchanged.

Live DR exercise or tabletop?

Both are valuable; they surface different kinds of gaps. Most clients do a tabletop first and schedule a live exercise for the following quarter once the runbook gaps found in the tabletop are fixed.

What RPO / RTO is realistic?

It depends on the service tier, the architecture, and the budget. For most SaaS products the realistic ranges are RPO 5–60 minutes (with asynchronous replication or frequent snapshots) and RTO 1–4 hours (with a pre-provisioned warm standby). Tighter numbers are achievable, for example near-zero RPO with synchronous replication, but they cost more than most revenue curves justify.

How does this satisfy our SOC 2 / ISO 27001 / HIPAA requirements?

SOC 2 Availability, ISO 27001 A.12.3 and A.17, and HIPAA 164.308(a)(7) all require documented BCP / DR with periodic testing. The audit package is structured to be handed to an auditor as evidence directly.

Can you help us design a DR strategy from scratch?

Yes. We run a separate Resilience Design engagement for companies that want DR-as-a-program rather than DR-as-a-finding list. This audit is usually the prerequisite.

Start with a free Cloud Health Check.

A scoped-down CloudCheck 360° of your current environment. Delivered in five business days, no commitment.
