There is a specific flavour of cloud outage that almost never shows up in a postmortem until the postmortem itself: the team has a documented backup strategy, a published RTO, and a DR plan signed by executives. The incident happens. The backups exist. The restore does not work. By the time anyone understands why, the window has closed.
Critical
An untested backup is a hypothesis. It becomes a backup the first time it is restored into a production-equivalent environment, against a production-equivalent workload, within the RTO you have committed to.
What “tested” actually requires
Most teams will say their backups are tested. What they mean, almost always, is one of the following:
- The backup job completes successfully.
- The backup file can be downloaded.
- The backup can be restored into a scratch environment.
None of these is a test. A real restore rehearsal has five properties:
- It starts from the detection of a simulated incident, not from a scheduled calendar event. The on-call rotation is paged. The responder finds the documentation, not the runbook author.
- The restore target is a production-equivalent environment, not a sandbox. Same VPC topology, same instance class, same IAM, same dependencies.
- The restore is timed end-to-end, from page to first successful write, against the stated RTO.
- Workload is driven against the restored environment until the restored data’s correctness is verified at the application layer, not just the database layer.
- A second person reviews the evidence, because the person doing the restore is the worst-positioned person to notice missing steps.
Miss any one of these and you have an exercise, not a rehearsal.
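The five properties can be treated as a literal checklist. The sketch below is illustrative, not a real tool: the `RehearsalEvidence` fields and `is_rehearsal` function are made-up names that encode the properties above, so a drill only counts as a rehearsal if every one holds, including the measured time against the RTO.

```python
from __future__ import annotations
from dataclasses import dataclass

# Hypothetical evidence record for one restore rehearsal.
# Field names are illustrative, not from any standard tooling.
@dataclass
class RehearsalEvidence:
    started_from_page: bool           # on-call was paged; not a calendar event
    target_is_prod_equivalent: bool   # same VPC topology, instance class, IAM
    minutes_page_to_first_write: float
    verified_at_app_layer: bool       # workload driven, correctness confirmed
    second_reviewer: str | None       # independent review of the evidence

def is_rehearsal(e: RehearsalEvidence, rto_minutes: float) -> bool:
    """True only if all five properties hold -- otherwise it was an exercise."""
    return (
        e.started_from_page
        and e.target_is_prod_equivalent
        and e.minutes_page_to_first_write <= rto_minutes
        and e.verified_at_app_layer
        and e.second_reviewer is not None
    )
```

The useful part is the conjunction: a drill that passes four of five properties returns False, which matches the rule that missing any one of them downgrades the rehearsal to an exercise.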
The three failure modes an untested restore hides
1. The backup exists, but the restore dependency doesn’t
RDS snapshots restore into a VPC. If the VPC no longer exists — decommissioned, CIDR reused, security groups renamed — the restore fails with a networking error that takes twenty minutes to diagnose under pressure. Twenty minutes is a third of a one-hour RTO. This pattern generalizes: backups have environmental dependencies, and those dependencies drift.
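Dependency drift can be caught before the incident with a pre-flight check: record the environment a backup depends on at backup time, then diff it against what exists now. This is a minimal sketch; the dict shape and key names are assumptions, and in practice you would populate both sides from the cloud provider's APIs (e.g. boto3 for AWS).

```python
# Hedged sketch of a restore pre-flight check. The keys below (vpc_id,
# db_subnet_group, kms_key_id, security_groups) are illustrative stand-ins
# for whatever your restore actually depends on.
def restore_preflight(recorded: dict, current: dict) -> list:
    """Return human-readable drift findings that would break a restore."""
    findings = []
    for key in ("vpc_id", "db_subnet_group", "kms_key_id"):
        if recorded.get(key) != current.get(key):
            findings.append(
                f"{key} drifted: {recorded.get(key)} -> {current.get(key)}"
            )
    # Security groups the snapshot was taken against that no longer exist.
    missing = set(recorded.get("security_groups", [])) - set(
        current.get("security_groups", [])
    )
    for sg in sorted(missing):
        findings.append(f"security group missing: {sg}")
    return findings
```

Run on a schedule, an empty findings list means the restore's environmental assumptions still hold; a non-empty one is a DR backlog item found on a Tuesday afternoon instead of mid-incident.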
2. The backup exists, but the access path to restore doesn’t
A common discovery: the IAM role that performs backups has write access, but no role in the account has the permissions to restore. Restore is a separate set of IAM actions. During a real incident, you find this out when the first restore command comes back access-denied, run by an engineer whose role was never granted those actions.
Common miss
Break-glass IAM roles for DR are often created, tested once, then left with a two-year-old key that gets rotated out of existence by an unrelated policy. The first time anyone tries to assume the role in a real incident is when they learn it’s been dead for six months.
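The backup-versus-restore permission gap is easy to check mechanically by diffing what a role is granted against what a restore needs. The IAM action names below are real AWS actions for the RDS case; the example role policy and the exact set of required actions are assumptions you would replace with your own.

```python
# Actions an RDS restore typically needs. These are real IAM action names,
# but the set you need depends on your setup -- treat this as a starting list.
RESTORE_ACTIONS = {
    "rds:RestoreDBInstanceFromDBSnapshot",
    "rds:DescribeDBSnapshots",
    "ec2:DescribeSecurityGroups",
}

def missing_restore_actions(granted: set) -> set:
    """Return restore actions the role lacks; empty set means restore can run."""
    return RESTORE_ACTIONS - granted

# Hypothetical backup role: it can create and list snapshots,
# which is exactly the trap described above -- backups succeed, restores fail.
backup_role = {"rds:CreateDBSnapshot", "rds:DescribeDBSnapshots"}
```

The same diff, run continuously against the break-glass role, also catches the dead-credential problem: a role that exists on paper but cannot actually perform the restore actions fails this check the day it drifts, not six months later.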
3. The backup exists, but the application can’t use it
The database restored correctly. The schema is a version behind the application running in the standby region. Or the application’s encryption key is in the primary region’s KMS, which is the region that just failed. Or the Redis cache was assumed-persistent and wasn’t. Each of these has extended a real outage by a day or more.
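This is why verification has to happen at the application layer. A minimal sketch of a post-restore smoke check follows; the specific checks, function name, and parameters are illustrative, chosen to match the two failure modes above (schema skew and a region-locked encryption key).

```python
# Hypothetical application-layer smoke check, run after the database reports
# a successful restore. It asks whether the app can use the data,
# not whether the database came up.
def post_restore_smoke(restored_schema_version: int,
                       app_required_version: int,
                       kms_key_region: str,
                       standby_region: str) -> list:
    """Return failures that make the restored data unusable by the app."""
    failures = []
    if restored_schema_version < app_required_version:
        failures.append(
            f"schema v{restored_schema_version} behind app "
            f"(needs v{app_required_version})"
        )
    if kms_key_region != standby_region:
        failures.append(
            f"encryption key lives in {kms_key_region}, "
            f"app runs in {standby_region}"
        )
    return failures
```

In a real rehearsal this would be followed by driving actual workload — the first successful write that the RTO clock stops on — but even a check this small would have caught both of the day-long outages described above before declaring the restore done.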
What a good rehearsal cadence looks like
A tiered approach that matches incident severity:
- Monthly — single-service restore drills (one database, one object store). Scripted, automated, no human in the loop. Goal: catch regressions in backup integrity before they accumulate.
- Quarterly — full application restore into a DR region with a human responder, timed against RTO. Goal: exercise the runbook and the on-call chain, not just the technology.
- Annually — full business continuity drill including communications, customer notification, and stakeholder updates. Goal: surface the organizational gaps, which are always the biggest.
The quarterly drill is the one most teams skip and the one that pays for itself the fastest. The annual drill feels more important and almost always uncovers less, because it exercises things that drift slowly (leadership awareness, comms templates) rather than things that drift quickly (infrastructure, IAM, runbooks).
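The monthly tier is the one that can be fully scripted. One way to make "backup integrity" concrete with no human in the loop: record a content digest at backup time, restore into scratch, recompute the digest over the restored rows, and compare. The sketch below assumes you can enumerate rows as tuples; the function names are made up for illustration.

```python
import hashlib

# Minimal sketch of the automated monthly integrity drill:
# compare a content digest of restored rows against the digest
# recorded when the backup was taken. Data shapes are illustrative.
def table_digest(rows: list) -> str:
    """Order-independent SHA-256 digest of a table's rows."""
    h = hashlib.sha256()
    for row in sorted(rows):        # sort so row order doesn't matter
        h.update(repr(row).encode())
    return h.hexdigest()

def drill_passes(recorded_digest: str, restored_rows: list) -> bool:
    """True if the restored data matches what was backed up."""
    return table_digest(restored_rows) == recorded_digest
```

A failed comparison is exactly the regression the monthly tier exists to catch: a backup that completes successfully but no longer restores the data it claims to contain.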
What to change in the next quarter
- Pick one production service. Schedule a full restore rehearsal into a production-equivalent environment, with the on-call responder running it. Time it end-to-end.
- If the measured RTO exceeds the published RTO, treat the measured number as the real one — in documentation and in commitments — until you have closed the gap.
- Document the three things that slowed the restore down. Those are your DR backlog for the next quarter.
- Repeat with a different service. The failure modes are service-specific; a rehearsal on one service does not certify another.
The uncomfortable truth about DR programs is that the ones that work are the ones that have been tested against a real failure. You cannot schedule that. You can schedule the rehearsal that produces the same evidence, and that is the work.