How to Test Failover in Azure Site Recovery

An untested disaster recovery plan is a hypothesis. You believe it will work because the configuration looks correct. You do not know it will work until you run it. Azure Site Recovery's test failover capability runs a live validation of your DR configuration in an isolated network, with no impact on production replication or protected workloads. There is no excuse for not running it.

This guide covers how to execute a test failover correctly, what to validate during the test, and how to use the results as evidence for compliance frameworks.

How test failover works

ASR test failover creates a copy of the protected VM(s) in the recovery region using the selected recovery point. The copy is connected to a test virtual network (which you specify) that is isolated from production. Production replication continues unaffected throughout the test.

After the test, you validate the failed-over VMs (confirm they boot, the application starts, and the application state is as expected for the selected recovery point). Then you clean up the test: ASR deletes the test VMs and releases the resources. The test creates no persistent changes to production.

The key parameter is the recovery point you test against. Options:

  • Latest processed: Uses the most recently processed recovery point. No time is spent processing pending data, so it gives a low RTO, but the point may not be application-consistent. (The separate Latest option processes pending data first and is the one that gives the lowest RPO.)
  • Latest app-consistent: Uses the most recent application-consistent recovery point. This is the point most likely to result in a clean application start without manual recovery steps.
  • Custom recovery point: Choose a specific point in time for historical validation.

For most DR tests, use Latest app-consistent to validate the clean recovery scenario.

Prerequisites before running a test

Verify replication health: Before running a test failover, confirm that protected items show Healthy replication status in the vault. A test failover using an unhealthy replication state validates a broken DR setup, not a working one.

Confirm the test network exists: You need an Azure virtual network in the recovery region designated for test failovers. This network should be isolated from production (no peering to production VNets) to ensure test VMs cannot communicate with production systems. Create a dedicated test VNet if one does not already exist.

Check recovery region capacity: Test failover provisions VMs in the recovery region. For large estates, confirm you have sufficient quota in the recovery region for the VM sizes used by the protected items being tested.
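The capacity check is simple arithmetic and can be scripted ahead of the test. The sketch below is illustrative only: the size-to-vCPU table, the function names, and the quota figures are assumptions, and you would fill them in from your own subscription's usage data. It checks the total regional vCPU quota only; Azure also enforces per-VM-family quotas, which need the same comparison.

```python
from collections import Counter

# Illustrative vCPU counts per VM size; confirm against the published size specs.
VCPUS_PER_SIZE = {
    "Standard_D2s_v3": 2,
    "Standard_D4s_v3": 4,
    "Standard_E8s_v3": 8,
}


def vcpus_required(vm_sizes):
    """Total vCPUs needed to start one test VM per protected item."""
    counts = Counter(vm_sizes)
    return sum(VCPUS_PER_SIZE[size] * n for size, n in counts.items())


def has_capacity(vm_sizes, quota_limit, quota_in_use):
    """True if the remaining vCPU quota covers every test VM."""
    return vcpus_required(vm_sizes) <= quota_limit - quota_in_use


if __name__ == "__main__":
    protected = ["Standard_D2s_v3", "Standard_D2s_v3", "Standard_E8s_v3"]
    # Hypothetical quota numbers for the recovery region.
    print(vcpus_required(protected), has_capacity(protected, 20, 4))
```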

Prepare validation scripts or runbooks: Know in advance what you will test during the failover window. A test that proves the VMs boot but does not confirm the application works is an incomplete validation.

Run the test failover

In the Azure Portal, open the Recovery Services vault. Individual VMs are listed under Replicated Items; recovery plans are listed under Recovery Plans (Site Recovery). Select the VM or recovery plan to test.

For individual VMs: click Test Failover. Select the recovery point (Latest app-consistent for clean validation) and the test virtual network. Click OK.

For recovery plans (groups of VMs with defined failover order and scripts): select the recovery plan > Test Failover. Select recovery point and test VNet. The plan runs each group in order, executing any pre- and post-failover scripts defined in the plan.

ASR creates the test VMs in the recovery region. Monitor progress in the Jobs section of the vault. Test failover for a single VM typically completes in 5-15 minutes. A recovery plan with multiple VMs and scripts may take 30-60 minutes.

Validate during the test window

Once the test VMs are running, connect to them and validate:

VM boot and OS health: Confirm the VM has started successfully and the OS is responsive. RDP or SSH to the test VM to verify. Check the OS event log for errors related to the boot or the application startup.

Application start: Confirm the application process is running and has reached a healthy state. For a web application, confirm the health check endpoint returns 200. For a database, confirm it has started and completed recovery (no recovery mode on SQL Server, no crash recovery in progress on PostgreSQL).
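For the web application case, the health check can be scripted so that the result is repeatable between tests. This is a minimal sketch using only the Python standard library; the function name and the injectable opener parameter are assumptions made for illustration, and the URL you pass would be whatever health endpoint your application exposes inside the test VNet.

```python
import urllib.request


def endpoint_is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if the endpoint answers with HTTP 200 within the timeout.

    urlopen raises HTTPError for 4xx/5xx responses and URLError for connection
    failures; both are OSError subclasses, as are socket timeouts.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False
```

Run it from a machine that can reach the test VNet (a jump box in that network, for example), since the test VMs are deliberately isolated from production.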

Application functionality: Run a subset of functional tests against the application in the test environment. For a web application, confirm you can log in and execute key transactions. For a database, confirm the schema is intact and key queries return expected results.

Recovery point validation: Confirm the data state matches the expected recovery point. For scheduled DR tests, note the recovery point timestamp and the gap between that timestamp and the test execution time. This is your validated RPO for the test.
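The RPO gap described above is a timestamp subtraction, and scripting it avoids time zone mistakes in the evidence record. The following is a sketch under stated assumptions: the function names are hypothetical, and the timestamps are ones you copy from the test failover job, not values read from ASR by this code.

```python
from datetime import datetime, timedelta, timezone


def validated_rpo(recovery_point, failover_triggered):
    """Gap between the recovery point and the moment the test was triggered."""
    if recovery_point.tzinfo is None or failover_triggered.tzinfo is None:
        raise ValueError("use timezone-aware timestamps")
    gap = failover_triggered - recovery_point
    if gap < timedelta(0):
        raise ValueError("recovery point is later than the failover trigger")
    return gap


def meets_rpo_target(gap, target):
    """True if the measured gap is within the RPO target."""
    return gap <= target


if __name__ == "__main__":
    # Example values only.
    point = datetime(2024, 3, 1, 9, 42, tzinfo=timezone.utc)
    trigger = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
    gap = validated_rpo(point, trigger)
    print(gap, meets_rpo_target(gap, timedelta(minutes=30)))
```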

Network connectivity within the test: Confirm that services in the recovery environment that need to communicate with each other (app server to database, for example) can do so via the test VNet. Confirm that connections to external dependencies (DNS resolution, Azure services) work correctly.

Document all findings during the test window. Record pass/fail status for each validation item, the recovery point used, and the time elapsed from failover trigger to confirmed application health. This documentation is the DR test evidence record.
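One way to keep these findings consistent from test to test is a small structured record. The sketch below is an assumption about how such a record might look, not an ASR feature: the class name, field names, and check labels are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class DrTestRecord:
    """Pass/fail findings for one test failover (hypothetical structure)."""

    protected_item: str
    recovery_point: datetime
    failover_triggered: datetime
    app_healthy_at: Optional[datetime] = None
    checks: Dict[str, bool] = field(default_factory=dict)

    def record(self, name: str, passed: bool) -> None:
        """Store the outcome of one validation item."""
        self.checks[name] = passed

    def elapsed_to_healthy(self) -> Optional[timedelta]:
        """Time from failover trigger to confirmed application health."""
        if self.app_healthy_at is None:
            return None
        return self.app_healthy_at - self.failover_triggered

    def passed(self) -> bool:
        """True only if health was confirmed and every recorded check passed."""
        return (
            self.app_healthy_at is not None
            and bool(self.checks)
            and all(self.checks.values())
        )
```

A record with no checks, or with no confirmed health time, deliberately counts as a fail, so an incomplete validation cannot be reported as a pass.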

Clean up after the test

Once validation is complete, run cleanup. In the vault, open the replicated item or recovery plan you tested and click Cleanup test failover. Confirm the cleanup. ASR deletes the test VMs and releases all associated compute resources.

Do not leave test failover VMs running indefinitely. They incur compute charges at the same rate as production VMs, and they occupy quota in the recovery region. Build the cleanup step into your DR test runbook with a defined maximum test window.
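A defined maximum test window is easy to enforce in a runbook with a deadline check. This is a minimal sketch; the function names and the four-hour window in the example are assumptions, and the actual cleanup would still be triggered through ASR.

```python
from datetime import datetime, timedelta, timezone


def cleanup_deadline(test_started, max_window):
    """Latest time by which the test VMs should have been cleaned up."""
    return test_started + max_window


def cleanup_overdue(test_started, max_window, now):
    """True if the test has run past its allowed window."""
    return now > cleanup_deadline(test_started, max_window)


if __name__ == "__main__":
    started = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
    window = timedelta(hours=4)  # example window only
    print(cleanup_overdue(started, window, datetime.now(timezone.utc)))
```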

After cleanup, verify in the vault that replication has continued normally during the test period. Check the RPO health metric to confirm no lag was introduced by the test.

How often to run DR tests

For regulated businesses subject to FCA rules, DORA, or PCI DSS, DR testing frequency is a compliance requirement. Typical requirements:

  • FCA and DORA (financial services): Annual testing of critical services is the minimum; many firms test quarterly. The test must demonstrate recovery within the defined impact tolerance (maximum tolerable period of disruption and maximum acceptable data loss).
  • PCI DSS (cardholder data environments): Annual DR testing with documented results.
  • ISO 27001: DR plan testing at planned intervals, with results reviewed.

For operational confidence rather than compliance, monthly test failover of a subset of protected items is a reasonable cadence for production environments.

Using test results as compliance evidence

The test failover record in ASR (accessible via Jobs > Test Failover) shows the start time, completion time, and outcome. Supplement this with your internal validation documentation (the pass/fail checklist, the recovery point timestamp, and the validated RPO gap) to create a complete evidence package.

A complete DR test evidence package includes:

  • The test failover job record from ASR showing the protected item, recovery point used, and test completion status
  • The validation log showing application health was confirmed
  • The recovery point timestamp and elapsed time (demonstrating whether the RPO target was met)
  • Sign-off from the responsible owner confirming the test satisfied the DR requirements

Store this evidence for the period required by your applicable compliance framework (typically 1-3 years).
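A completeness check before filing the package catches a missing sign-off or log early. The sketch below assumes the package is assembled as a plain dictionary; the key names are hypothetical labels for the four items listed above (with the timestamp and elapsed time split into two keys), not fields exported by ASR.

```python
import json

# Hypothetical key names for the evidence items described in this section.
REQUIRED_EVIDENCE = (
    "asr_job_record",
    "validation_log",
    "recovery_point_timestamp",
    "elapsed_time",
    "owner_sign_off",
)


def missing_evidence(package):
    """Names of required items that are absent or empty."""
    return [key for key in REQUIRED_EVIDENCE if not package.get(key)]


def export_evidence(package):
    """Serialise the package to JSON, refusing to export an incomplete one."""
    missing = missing_evidence(package)
    if missing:
        raise ValueError("incomplete evidence package: " + ", ".join(missing))
    return json.dumps(package, indent=2, sort_keys=True)
```

Sorting the keys keeps successive exports comparable line by line when an auditor reviews more than one test.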

Where Critical Cloud comes in

DR testing for regulated businesses is both a technical exercise and a governance requirement. Running a test failover correctly, validating the application thoroughly during the test window, and producing the evidence documentation that satisfies FCA and DORA audit questions is part of the operational resilience service we provide. We schedule and execute DR exercises for our clients on a defined cadence, with results documented and stored. As the world's first Powered by Datadog accredited partner, we monitor ASR replication health and RPO compliance continuously. See how Critical Support works.