Azure Backup, Site Recovery and Disaster Recovery
A backup you have never restored is a hope, not a plan. The most expensive lesson in disaster recovery is learning, mid-incident, that the backups completed successfully for two years and none of them actually restore. Backup is the easy half. Recovery, tested and proven, is the half that counts.
This guide covers how to build recovery on Azure that works when you need it: the difference between backup and disaster recovery, how to set objectives that match the business, the Azure services that deliver them, and the discipline of testing that separates a real plan from a false sense of security.
Decide how much recovery you actually need
Not every workload deserves the same protection, and treating them as if they do wastes money on the trivial and under-protects the critical. Two numbers drive every decision:
RPO (Recovery Point Objective) is how much data you can afford to lose, measured in time. An RPO of one hour means you can tolerate losing up to an hour of data. It dictates how often you back up or replicate.
RTO (Recovery Time Objective) is how long you can afford to be down. An RTO of four hours means the service must be back within four hours of failure. It dictates the recovery architecture you need.
Set these per workload, driven by the business, not by IT preference. Getting RPO right for Azure disaster recovery is the foundation, because an RPO you cannot actually meet is just a number in a document. A tier-1 payments service and an internal reporting tool have very different objectives, and your architecture and spend should reflect that.
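As a minimal sketch of how per-workload objectives can be written down and checked, the snippet below records an RPO and RTO for two workloads and tests whether a given backup or replication interval can meet the RPO. The workload names and target values are hypothetical placeholders, and the check ignores how long a backup job itself takes to finish.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    """Recovery objectives for one workload (illustrative values only)."""
    rpo: timedelta  # maximum tolerable data loss, measured in time
    rto: timedelta  # maximum tolerable downtime


# Hypothetical workloads and targets; real ones come from the business.
OBJECTIVES = {
    "payments-api": Objectives(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
    "internal-reporting": Objectives(rpo=timedelta(hours=24), rto=timedelta(hours=48)),
}


def meets_rpo(workload: str, protection_interval: timedelta) -> bool:
    """Worst-case data loss is roughly one full interval between recovery points."""
    return protection_interval <= OBJECTIVES[workload].rpo


# An hourly backup cannot meet a five-minute RPO; a 12-hourly one meets a 24-hour RPO.
print(meets_rpo("payments-api", timedelta(hours=1)))         # False
print(meets_rpo("internal-reporting", timedelta(hours=12)))  # True
```

Keeping the objectives in one place like this makes it obvious when a schedule and its stated RPO have drifted apart.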
Backup is table stakes
Azure Backup protects your data: VMs, SQL databases, file shares, and on-premises workloads, managed through a Recovery Services vault. The mechanics are well solved. The discipline is in the parts teams skip.
Backups must alert on failure. A backup job that silently fails is worse than no backup, because it gives false confidence. Configuring Azure Backup failure alerts is the difference between knowing your recovery point is current and assuming it. Automate the routine through runbooks so backup scheduling and validation are consistent rather than manual. Then lock down who can modify or delete backups. RBAC for Azure Backup matters more than it looks: in a ransomware scenario, the attacker's first target is often the backups, so the ability to delete them should be tightly held and, ideally, backed by soft delete and multi-user authorisation.
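The alerting rule described above can be sketched in a few lines: raise an alert for every failed job, and also for any protected item that has no recent successful backup, which is the case a silent failure leaves behind. This is a sketch under stated assumptions, not the Azure Backup API: the BackupJob record, the status strings and the item names are all hypothetical, and in practice the job data would come from your vault's job history or monitoring export.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BackupJob:
    item: str          # protected item name (hypothetical)
    status: str        # "Completed" or "Failed" in this sketch
    finished: datetime


def backup_alerts(jobs, expected_items, now, max_age):
    """Return alert messages for failed jobs and items with no recent success."""
    alerts = []
    last_success = {}
    for job in jobs:
        if job.status == "Failed":
            alerts.append(f"{job.item}: job failed at {job.finished:%Y-%m-%d %H:%M}")
        elif job.status == "Completed":
            previous = last_success.get(job.item)
            if previous is None or job.finished > previous:
                last_success[job.item] = job.finished
    # Checking every expected item catches jobs that never ran at all.
    for item in expected_items:
        latest = last_success.get(item)
        if latest is None:
            alerts.append(f"{item}: no successful backup on record")
        elif now - latest > max_age:
            alerts.append(f"{item}: last successful backup is older than {max_age}")
    return alerts


now = datetime(2024, 1, 10, 9, 0)
jobs = [
    BackupJob("vm-web-01", "Completed", datetime(2024, 1, 10, 2, 0)),
    BackupJob("sql-orders", "Failed", datetime(2024, 1, 10, 2, 30)),
    BackupJob("sql-orders", "Completed", datetime(2024, 1, 8, 2, 30)),
]
for message in backup_alerts(jobs, ["vm-web-01", "sql-orders", "fileshare-hr"],
                             now, max_age=timedelta(hours=26)):
    print(message)
```

The second loop is the important design choice: it starts from the list of items that should be protected rather than from the jobs that ran, so an item whose schedule was removed or never configured still produces an alert.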
For regulated workloads, retention is a compliance requirement, not just an operational one. Retention policies have to match the mandated period for your sector, and you have to be able to evidence that they do.
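One way to evidence that retention matches the mandate is to compare configured retention against the required period for each workload and report any shortfall. The sketch below assumes both are available as simple mappings of workload name to days. The names and numbers are placeholders for illustration, not regulatory guidance.

```python
def retention_gaps(configured_days, mandated_days):
    """Return {workload: shortfall_in_days} where retention is too short or missing."""
    gaps = {}
    for workload, required in mandated_days.items():
        actual = configured_days.get(workload, 0)  # no policy counts as zero days
        if actual < required:
            gaps[workload] = required - actual
    return gaps


# Placeholder values only; take real periods from your sector's rules.
configured = {"claims-db": 2555, "audit-logs": 90}
mandated = {"claims-db": 2555, "audit-logs": 365, "patient-records": 3650}

print(retention_gaps(configured, mandated))
# {'audit-logs': 275, 'patient-records': 3650}
```

An empty result, stored with a timestamp each time the check runs, is the kind of record that can be shown to an auditor.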
Disaster recovery is more than backup
Backup restores data. Disaster recovery restores a running service. The difference is the gap between "we have the data" and "customers can use the service again," and for anything important that gap matters.
Azure Site Recovery handles this by replicating workloads to a secondary region and orchestrating failover. When the primary region or datacentre fails, you fail over to the replica and keep running. The failover patterns for high availability in Azure cover the architectural choices, from active-passive replication to fuller active-active designs, and which one you need follows directly from the RTO you set.
The decision between availability zones (protection within a region) and multi-region (protection against a whole region failing) is the same trade-off as everywhere in resilience: more protection, more cost, more complexity. Zones cover the common failure cases cheaply. Multi-region covers the rare catastrophic ones at higher cost. Match the choice to the risk and the RTO, not to a blanket policy.
Test it, or you do not have it
This is the section teams skip and regret. A disaster recovery plan that has never been tested is a document, not a capability.
Testing failover in Azure Site Recovery without disrupting production is built into the service: a test failover spins up the replica in an isolated network so you can validate that it actually comes up, the application actually works, and the RTO is actually achievable, all without touching the live service. Run it on a schedule. The first test almost always finds something: a missing dependency, a DNS assumption, a configuration that did not replicate. Far better to find it in a test than in an incident.
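A test failover is most useful when its outcome is recorded in a form that can be compared with the RTO. The following sketch assumes you note when the test started, when the service was validated as working, and the result of each validation check. The FailoverTest class, the check names and the timings are hypothetical, and nothing here calls Azure Site Recovery itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class FailoverTest:
    workload: str
    rto: timedelta
    started: datetime
    service_validated: datetime
    checks: dict = field(default_factory=dict)  # check name -> passed?

    @property
    def elapsed(self) -> timedelta:
        """Time from starting the failover to a validated, working service."""
        return self.service_validated - self.started

    def findings(self) -> list:
        """List every failed check, plus an RTO breach if there was one."""
        issues = [f"check failed: {name}" for name, ok in self.checks.items() if not ok]
        if self.elapsed > self.rto:
            issues.append(f"RTO exceeded: took {self.elapsed}, target {self.rto}")
        return issues

    def passed(self) -> bool:
        return not self.findings()


test = FailoverTest(
    workload="payments-api",
    rto=timedelta(hours=1),
    started=datetime(2024, 3, 1, 10, 0),
    service_validated=datetime(2024, 3, 1, 11, 20),
    checks={"vm boots": True, "dns resolves": False, "app health endpoint": True},
)
print(test.passed())
for issue in test.findings():
    print(issue)
```

Recording the failed checks alongside the timing keeps the typical first-test findings, such as a DNS assumption or a missing dependency, attached to the evidence rather than lost in a chat thread.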
For regulated businesses, tested recovery is increasingly mandated. Under the EU's Digital Operational Resilience Act (DORA) and the FCA's operational resilience rules, you are expected to show that you can recover within tolerance under severe but plausible scenarios. A test failover with recorded evidence goes directly towards that, which turns DR testing from a good habit into a compliance output.
Operate it: monitoring and evidence
Recovery capability is not set-and-forget. Backups drift, replication lags, configurations change. You need continuous visibility that backups are completing, replication is healthy, and recovery points are current. That means monitoring backup job status, Site Recovery replication health, and RPO compliance as live signals with alerting, not as something you check after an incident.
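RPO compliance as a live signal comes down to one question: was there ever a stretch longer than the RPO without a recovery point, including the stretch from the latest point to now? The sketch below assumes you can obtain recovery point timestamps for a workload from your monitoring data. The function name and the sample times are illustrative.

```python
from datetime import datetime, timedelta


def rpo_breaches(recovery_points, rpo, now):
    """Return (start, end) windows where the gap between recovery points exceeded the RPO."""
    points = sorted(recovery_points)
    if not points:
        raise ValueError("no recovery points: the workload is unprotected")
    breaches = []
    # Pair each point with the next one; the last point is paired with `now`
    # so a stalled backup or lagging replication shows up as an open breach.
    for earlier, later in zip(points, points[1:] + [now]):
        if later - earlier > rpo:
            breaches.append((earlier, later))
    return breaches


points = [
    datetime(2024, 5, 1, 0, 0),
    datetime(2024, 5, 1, 1, 0),
    datetime(2024, 5, 1, 4, 0),
]
now = datetime(2024, 5, 1, 4, 30)

for start, end in rpo_breaches(points, rpo=timedelta(hours=2), now=now):
    print(f"RPO breached between {start} and {end}")
```

Run continuously, the same check serves both purposes described here: a non-empty result is an alert, and the history of empty results is the record of compliance.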
For regulated workloads, that monitoring is also your evidence. Being able to show an auditor a continuous record that recovery points were met and DR tests were run and passed is what turns operational resilience from a claim into a proven fact.
Where Critical Cloud comes in
Building recovery that works, then proving it works on a schedule while operating the platform day to day, is what we do for businesses where downtime and data loss are not options. We run backup and disaster recovery for financial services, insurance, healthcare and public sector workloads, with tested failover and evidenced recovery built into the service. As the world's first Powered by Datadog accredited partner, we monitor backup health, replication status and RPO compliance as live signals, so a failed backup is an alert, not a discovery. If proving your recovery works is something your team never quite gets to, see how Critical Support works.