RPO Best Practices for Azure Disaster Recovery
Recovery Point Objective (RPO) defines how much data loss is acceptable in a disaster: if your RPO is 4 hours, losing up to 4 hours of data is within the plan. If your actual recovery returns you to a state that is 8 hours old, you have missed your RPO.
Most Azure DR deployments set an RPO based on a business requirement, then implement a backup schedule and assume the two are aligned. They often are not. A backup that runs every 4 hours and takes 15 minutes to complete does not guarantee a 4-hour RPO. If a failure occurs 3 minutes before a scheduled backup, you recover to a point 3 hours 57 minutes before the failure, which is within RPO. But if the failure hits while a backup is still running, the last usable point is the previous backup and the loss can exceed 4 hours. If a backup fails outright and you fall back to the previous cycle, your actual data loss could be nearly 8 hours. This guide covers how to design for your RPO requirement, not just target it.
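The arithmetic behind that example can be sketched as follows. This is a minimal illustration, not an Azure API: the function name and the model are assumptions. It assumes each backup captures data as of the moment it starts and only becomes usable once it completes.

```python
from datetime import timedelta


def worst_case_data_loss(
    interval: timedelta,
    backup_duration: timedelta = timedelta(0),
    failed_cycles: int = 0,
) -> timedelta:
    """Upper bound on data loss for a periodic backup schedule.

    Assumed model (hypothetical, for illustration only):
    - a backup captures the data as of its start time;
    - a backup is usable only after it finishes;
    - `failed_cycles` consecutive backups before the failure were unusable.
    """
    if failed_cycles < 0:
        raise ValueError("failed_cycles must be zero or positive")
    # Worst case: the failure lands just before the in-progress backup finishes,
    # so the newest usable point is one full interval plus the run time old,
    # plus one more interval for every unusable cycle.
    return interval * (failed_cycles + 1) + backup_duration


every_4h = timedelta(hours=4)
run_time = timedelta(minutes=15)

print(worst_case_data_loss(every_4h, run_time))                   # healthy schedule
print(worst_case_data_loss(every_4h, run_time, failed_cycles=1))  # one failed backup
```

Under these assumptions a 4-hour schedule only supports an RPO of a little over 4 hours when every backup succeeds, and roughly double that after a single failed cycle.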
Understand what drives RPO in Azure
RPO is determined by the replication frequency and method of your data layer. The compute layer (VMs, containers, app services) should be treated as stateless from a DR perspective wherever it can be redeployed: what matters is the last-known-good state of your data.
Azure Backup for VMs creates recovery points on a policy schedule: once daily under the Standard policy, and as often as every 4 hours under the Enhanced policy on supported VM types. The minimum RPO achievable with Azure Backup for VMs is therefore about 4 hours (using the most frequent Enhanced policy schedule). For shorter RPO, Backup alone is insufficient.
Azure Site Recovery for VM replication maintains continuous replication of VM changes to a recovery region. It creates crash-consistent recovery points every 5 minutes and application-consistent points on a configurable schedule (every 4 hours is a common setting). Recovery points are retained per your configuration (24 hours by default). The minimum RPO achievable with ASR for Azure-to-Azure replication is approximately 5 minutes. For most regulated VM workloads, ASR is the right mechanism for RPO requirements below 1 hour.
Azure SQL Database and Cosmos DB handle replication differently from VMs: they replicate the data store directly rather than the VM disk.
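SQL Database automated backups include transaction log backups roughly every 10 minutes (the exact frequency varies with compute size and database activity), which bounds how recent a point-in-time restore can be. Auto-failover groups replicate to a secondary region using asynchronous geo-replication, so data loss on a forced failover is typically limited to the last few seconds of writes rather than zero. The synchronous replication in the Business Critical tier protects against the loss of a replica within a region, not against the loss of the region.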
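Cosmos DB accounts replicated across regions let you choose a consistency level. Strong consistency, which requires a single write region, guarantees zero data loss on regional failover. Bounded staleness caps the lag at a configured number of operations or a configured time interval, whichever is reached first; for multi-region accounts the minimums are 100,000 operations and 300 seconds.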
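To see what a bounded staleness setting means in time terms, the two limits can be compared at a given write rate. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the write rate is assumed constant, and the lag is assumed to be capped by whichever limit is reached first. The numbers are illustrative, not read from any account.

```python
def bounded_staleness_lag_seconds(
    max_lag_operations: int,
    max_lag_seconds: float,
    writes_per_second: float,
) -> float:
    """Approximate worst-case replication lag, in seconds, under bounded staleness.

    Hypothetical helper: assumes a steady write rate and that the lag is
    capped by the operation limit or the time limit, whichever comes first.
    """
    if writes_per_second <= 0:
        # With no writes the operation limit is never reached,
        # so only the time limit applies.
        return float(max_lag_seconds)
    seconds_to_hit_operation_limit = max_lag_operations / writes_per_second
    return min(float(max_lag_seconds), seconds_to_hit_operation_limit)


# Illustrative settings: 100,000 operations or 300 seconds.
for rate in (100, 1_000, 10_000):
    lag = bounded_staleness_lag_seconds(100_000, 300, rate)
    print(f"{rate} writes/s -> up to {lag:.0f} s of lag")
```

At low write rates the time limit dominates; at high write rates the operation limit is reached first and the window of potential data loss is shorter.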
Matching RPO requirements to replication mechanisms
| RPO requirement | Recommended mechanism | Service |
|---|---|---|
| Sub-second to 5 minutes | Database-native replication | SQL auto-failover groups (asynchronous, typically seconds of lag), Cosmos DB multi-region with strong consistency |
| 5-15 minutes | Continuous replication | Azure Site Recovery (VM), SQL auto-failover groups |
| 15 minutes to 1 hour | Continuous replication and snapshots | ASR plus application-consistent snapshots |
| 1-4 hours | Frequent backup | Azure Backup Enhanced Policy (every 4 hours), SQL point-in-time restore |
| 4-24 hours | Standard backup | Azure Backup standard policy, Azure Database standard backup |
Apply the correct mechanism to each tier of your application independently. A web tier might tolerate a 4-hour RPO (stateless compute; only configuration matters), while the database tier requires a 5-minute RPO (all business data lives here). Design replication at the data layer, not the whole-application layer.
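One way to apply the table per tier is to classify each tier's RPO target into a row independently. The sketch below is a hypothetical helper, not part of any Azure SDK. The band labels are copied from the first column of the table, and a target that sits exactly on a boundary is assumed to fall into the stricter row.

```python
from typing import Dict, List, Tuple

# (upper bound in minutes, band label from the first column of the table)
RPO_BANDS: List[Tuple[float, str]] = [
    (5, "Sub-second to 5 minutes"),
    (15, "5-15 minutes"),
    (60, "15 minutes to 1 hour"),
    (240, "1-4 hours"),
    (1440, "4-24 hours"),
]


def rpo_band(target_minutes: float) -> str:
    """Return the table row that covers an RPO target given in minutes."""
    if target_minutes <= 0:
        raise ValueError("RPO target must be greater than zero")
    for upper_bound, label in RPO_BANDS:
        if target_minutes <= upper_bound:
            return label
    # Targets looser than 24 hours are still satisfied by the last row.
    return RPO_BANDS[-1][1]


def bands_by_tier(targets: Dict[str, float]) -> Dict[str, str]:
    """Classify every application tier independently of the others."""
    return {tier: rpo_band(minutes) for tier, minutes in targets.items()}


# Hypothetical application: web tier tolerates 4 hours, database needs 5 minutes.
print(bands_by_tier({"web": 240, "database": 5}))
```

Keeping the classification per tier makes it obvious when a single application spans several rows of the table and therefore needs more than one mechanism.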
Application-consistent vs crash-consistent recovery points
ASR creates two types of recovery points:
Crash-consistent (every 5 minutes): captured without coordinating with the application. Equivalent to pulling the power cord and reading the disk. The data is consistent at the block level, but application-level transactions may be in mid-flight and anything held only in memory is not captured. SQL Server will run crash recovery on a crash-consistent point, which rolls back incomplete transactions; work that had not been written to disk at capture time is lost.
Application-consistent (configurable, typically every 1-4 hours): captured using VSS (Volume Shadow Copy Service) on Windows or custom scripts on Linux to quiesce the application before snapshotting. All in-flight transactions are committed or rolled back before the snapshot. Recovery from an application-consistent point requires no additional recovery steps.
For databases, always verify that the application-consistent snapshot frequency aligns with your RPO. An ASR configuration with crash-consistent recovery every 5 minutes and application-consistent every 4 hours provides a 5-minute RPO for crash-consistent points, but your actual clean recovery point may be up to 4 hours old. For databases that require clean recovery, the application-consistent frequency is the effective RPO.
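The rule in the paragraph above reduces to a small decision. This is a minimal sketch with a hypothetical function name; the intervals are whatever your replication policy is configured with.

```python
from datetime import timedelta


def effective_rpo(
    crash_consistent_interval: timedelta,
    app_consistent_interval: timedelta,
    needs_clean_recovery: bool,
) -> timedelta:
    """Return the interval that actually bounds data loss for a workload.

    Hypothetical helper: a workload that must restart from a clean,
    application-consistent point is bounded by that schedule; otherwise
    the more frequent of the two recovery point types applies.
    """
    if needs_clean_recovery:
        return app_consistent_interval
    return min(crash_consistent_interval, app_consistent_interval)


crash = timedelta(minutes=5)
app = timedelta(hours=4)

print(effective_rpo(crash, app, needs_clean_recovery=False))  # stateless or crash-tolerant tier
print(effective_rpo(crash, app, needs_clean_recovery=True))   # database needing clean recovery
```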
Test your actual RPO, not your theoretical one
The only way to know your actual RPO is to run a test failover and measure the recovery point age. ASR test failovers can be run without affecting production. Check the recovery point age at the time of test failover against your RPO target.
Measure RPO over time, not just at the moment of the first test. Replication lag varies with I/O load on the protected workload. A VM generating high write I/O (a busy SQL Server, a log-heavy application) will show more replication lag than the same VM at idle. Test during representative load conditions.
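Recovery point age is the failover time minus the timestamp of the recovery point you failed over to. A minimal sketch for recording it across several tests follows; the function names and the sample timestamps are hypothetical, and in practice the two times would come from your test failover records.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple


def recovery_point_age(recovery_point_time: datetime, failover_time: datetime) -> timedelta:
    """Age of the recovery point at the moment the test failover was started."""
    if failover_time < recovery_point_time:
        raise ValueError("recovery point is newer than the failover time")
    return failover_time - recovery_point_time


def worst_observed_age(tests: Iterable[Tuple[datetime, datetime]]) -> timedelta:
    """Largest recovery point age across (recovery_point_time, failover_time) pairs."""
    return max(recovery_point_age(rp, fo) for rp, fo in tests)


utc = timezone.utc
# Hypothetical records: one test at idle, one under representative load.
tests = [
    (datetime(2024, 3, 4, 9, 56, tzinfo=utc), datetime(2024, 3, 4, 10, 0, tzinfo=utc)),
    (datetime(2024, 3, 11, 13, 48, tzinfo=utc), datetime(2024, 3, 11, 14, 0, tzinfo=utc)),
]

target = timedelta(minutes=5)
worst = worst_observed_age(tests)
print(f"worst observed age: {worst}, within target: {worst <= target}")
```

Comparing the worst observed age against the target, rather than the average, reflects the point that a test at idle can pass while the same workload under load does not.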
Track the RPO Health metric in Azure Monitor for ASR-protected items. This metric shows the current lag between the latest recovery point and the current time. Alert when RPO Health exceeds your target threshold: a sustained breach indicates replication is falling behind and your actual RPO is worse than planned.
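```kusto
// ASR replicated-item records collected through the vault's diagnostic settings
AzureDiagnostics
| where ResourceType == "VAULTS"
| where Category == "AzureSiteRecoveryReplicatedItems"
// Convert the reported RPO from seconds to minutes
| extend rpoMinutes = todouble(rpoInSeconds_d) / 60
// Keep only samples that breach the target (15 minutes here)
| where rpoMinutes > 15
| summarize max(rpoMinutes) by ReplicatedItemFriendlyName_s, bin(TimeGenerated, 5m)
```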
Replace 15 with your RPO target in minutes.
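A single sample above the threshold may be noise, while a run of them suggests replication is falling behind. The sketch below separates the two. The function name, the sample values, and the choice of three consecutive samples are all assumptions for illustration; the input is assumed to be a time-ordered list of RPO readings in minutes, for example exported from a query like the one above.

```python
from typing import List, Sequence, Tuple


def sustained_breaches(
    rpo_minutes: Sequence[float],
    threshold_minutes: float,
    min_consecutive: int = 3,
) -> List[Tuple[int, int]]:
    """Find runs of consecutive samples above the threshold.

    Returns (start_index, end_index) pairs, inclusive, for every run that is
    at least `min_consecutive` samples long. Shorter spikes are ignored.
    """
    runs: List[Tuple[int, int]] = []
    run_start = None
    for index, value in enumerate(rpo_minutes):
        if value > threshold_minutes:
            if run_start is None:
                run_start = index  # a new run begins here
        else:
            if run_start is not None and index - run_start >= min_consecutive:
                runs.append((run_start, index - 1))
            run_start = None
    # Close a run that is still open at the end of the series.
    if run_start is not None and len(rpo_minutes) - run_start >= min_consecutive:
        runs.append((run_start, len(rpo_minutes) - 1))
    return runs


# Hypothetical 5-minute samples of RPO in minutes.
samples = [4, 18, 6, 16, 20, 25, 7, 17, 19]
print(sustained_breaches(samples, threshold_minutes=15))
```

With these samples the isolated spike at index 1 and the two-sample run at the end are ignored, and only the three-sample run in the middle is reported.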
RPO for regulated businesses: evidence requirements
Under DORA and FCA operational resilience requirements, financial services firms must demonstrate that their impact tolerances (including data loss tolerances) are met and tested. Simply setting an RPO in a DR policy document is not sufficient. You need:
- Evidence that the replication mechanism achieves the target RPO under load (test failover records with recovery point age documented)
- Monitoring that shows RPO health is maintained continuously, not just at test time
- A process for responding when RPO health degrades (replication falling behind)
- Annual or more frequent DR exercise records demonstrating the RPO was met
Build these requirements into your DR governance before an audit asks for them, not after.
Where Critical Cloud comes in
RPO compliance is a technical, operational, and governance problem simultaneously. We design Azure DR architectures with the correct replication mechanisms per workload tier, run scheduled failover exercises, and produce the evidence documentation that satisfies FCA, DORA, and PCI DSS requirements. As the world's first Powered by Datadog accredited partner, we monitor replication lag and RPO health as live operational signals, so a degrading RPO is an alert, not a post-incident discovery. See how Critical Support works.