5 Failover Patterns for High Availability in Azure

High availability is not a feature you enable. It is an architecture decision you make for each layer of your system: compute, database, storage, networking, and DNS. The five failover patterns below each suit a different combination of RTO requirement, RPO requirement, and budget. Applying the most expensive pattern to everything is as wrong as applying the cheapest.

Pattern 1: Availability zone redundancy (within a region)

The simplest tier of high availability. Availability zones are physically separate groups of one or more datacentres within the same Azure region, connected by high-bandwidth, low-latency fibre. Deploying across zones protects against the failure of a single datacentre or zone, which is the most common form of Azure infrastructure outage.

Most Azure services support zone-redundant or zone-pinned deployment:

- Virtual Machine Scale Sets: deploy instances across zones in the scale set configuration
- Azure Kubernetes Service: deploy node pools with --zones specified
- Azure SQL Database: enable zone redundancy in the service tier settings
- App Service Premium tier: enable zone redundancy in the plan settings

Failover between zones is automatic and typically transparent. For a zone-redundant SQL database, failover completes in seconds. For AKS, pods lost with a failed zone are rescheduled onto nodes in the remaining zones.

Best for: Production workloads in a single region where the cost of multi-region is not justified and the SLA requires resilience against a single datacentre failure. This is the baseline for most regulated production environments.

Not sufficient for: Protection against a full regional outage (an incident that affects every zone in the region at once, or a region-wide service failure).

Pattern 2: Active-passive with hot standby

Two environments: a primary that handles all traffic, and a secondary that is provisioned and running but not serving traffic. When the primary fails, DNS or traffic routing switches to the secondary. The secondary is ready immediately (no provisioning delay), but until failover, it is idle.

The RTO is the time to detect the failure and switch DNS or routing. With Azure Traffic Manager configured for Priority routing and a health probe checking the primary endpoint, failover typically completes within 1-2 minutes (depending on DNS TTL, probe interval, and the number of failed probes tolerated before the endpoint is marked unhealthy).
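The dependence on DNS TTL and probe interval can be made concrete with a rough estimate. The sketch below is a simplified model, not Traffic Manager's exact behaviour: it assumes the endpoint is marked unhealthy after the tolerated failures are exhausted and that clients may hold a cached DNS answer for one full TTL. The function name and the sample values are illustrative.

```python
def estimate_failover_seconds(dns_ttl_s: int, probe_interval_s: int, tolerated_failures: int) -> int:
    """Rough upper bound, in seconds, from primary failure to clients reaching the secondary."""
    # Assumption: the endpoint is marked unhealthy only after the tolerated
    # failures are exhausted, i.e. after (tolerated_failures + 1) failed probes.
    detection_s = probe_interval_s * (tolerated_failures + 1)
    # Clients may keep using a cached DNS answer for up to one full TTL.
    return detection_s + dns_ttl_s


print(estimate_failover_seconds(dns_ttl_s=30, probe_interval_s=10, tolerated_failures=3))  # 70
print(estimate_failover_seconds(dns_ttl_s=60, probe_interval_s=30, tolerated_failures=3))  # 180
```

Under this model, a short probe interval and a short TTL keep failover inside the 1-2 minute range, while slower probing and a longer TTL push it beyond that.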

The cost: you are running two full environments. The secondary environment costs approximately the same as the primary during normal operation. For workloads where the risk of extended downtime outweighs the cost of a standby environment, this is justified.

Azure services that support active-passive patterns:

- Azure SQL Database auto-failover groups: A secondary database in another region stays in sync via continuous replication. Failover (automatic or manual) repoints the failover group's listener endpoint to the secondary, which becomes the new primary, so connection strings that use the listener do not need to change.
- Azure Cosmos DB with multi-region writes disabled: The primary region serves reads and writes; secondary regions receive replicated data and can serve reads. Manual or automatic failover promotes a secondary to the write region.
- App Service + Traffic Manager: Deploy the application to two regions. Configure Traffic Manager for Priority routing, with the secondary region as the backup.
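The Priority routing idea can be sketched in a few lines. This is an illustration of the selection rule only, not Traffic Manager's implementation; the `Endpoint` type, the endpoint names, and the convention that a lower number means a more preferred endpoint are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    priority: int   # assumption: lower number = more preferred
    healthy: bool   # result of the most recent health probes


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the most preferred healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


# Hypothetical two-region deployment: the primary has just failed its probes.
endpoints = [
    Endpoint("app-primary", priority=1, healthy=False),
    Endpoint("app-secondary", priority=2, healthy=True),
]
print(pick_endpoint(endpoints).name)  # app-secondary
```

While the primary is healthy it always wins, so the secondary stays idle until probes mark the primary unhealthy.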

Best for: Workloads where RTO requirements are in the 1-5 minute range and the cost of the standby environment is acceptable.

Pattern 3: Active-passive with warm standby

Similar to hot standby, but the secondary environment is scaled down (or partially running) during normal operation and scales up when failover is triggered. This reduces standby costs at the expense of a longer RTO.

A common pattern: the primary runs a 10-node AKS cluster. The secondary runs a 2-node cluster (enough to keep the deployment active and images cached) with autoscaling configured. On failover, the secondary scales to 10 nodes, which takes 3-5 minutes depending on VM provisioning time.
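As a back-of-the-envelope check on the saving, the compute cost of this example can be compared with running the primary alone. The sketch assumes identical node sizes and pricing in both regions and counts only cluster nodes, so databases, networking, and storage are deliberately left out.

```python
def compute_cost_multiplier(primary_nodes: int, standby_nodes: int) -> float:
    """Node cost of primary plus standby, relative to the primary alone."""
    if primary_nodes <= 0:
        raise ValueError("primary_nodes must be positive")
    return (primary_nodes + standby_nodes) / primary_nodes


print(compute_cost_multiplier(10, 2))   # warm standby from the example: 1.2
print(compute_cost_multiplier(10, 10))  # hot standby for comparison: 2.0
```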

For databases, warm standby typically means a secondary database that is read-capable but with reduced compute (a smaller SKU that can be scaled up on failover). Azure SQL Hyperscale supports this with compute scaling while the data volume remains on shared storage.

Best for: Workloads with RTO requirements in the 5-15 minute range, where the full cost of hot standby is not justified but some standby infrastructure reduces recovery time compared to cold recovery.

Pattern 4: Active-active

Both environments serve production traffic simultaneously. Workloads are distributed between regions: either split (different users or operations routed to different regions) or fully redundant (every operation handled by both regions).

Azure Traffic Manager's performance routing method routes each user to the lowest-latency region. Front Door does the same with additional caching and WAF capabilities. Both serve as the traffic distribution layer for active-active architectures.

The critical design requirement for active-active: data consistency. If users in the UK write to the UK region and users in Ireland write to the Irish region, both regions need to serve consistent reads. Azure Cosmos DB with multi-region writes handles this with configurable conflict resolution. Azure SQL Database active-active requires application-level sharding or CQRS patterns to avoid write conflicts.
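One common conflict resolution policy is last-writer-wins, where the version with the highest timestamp is kept. The sketch below illustrates that rule on plain dictionaries; it is not the Cosmos DB SDK, and the field names `ts` and `region`, along with the tie-break on region name, are assumptions made for the example.

```python
from typing import Dict, List


def resolve_last_writer_wins(versions: List[Dict]) -> Dict:
    """Pick the winning version of a document written concurrently in several regions."""
    if not versions:
        raise ValueError("no versions to resolve")
    # Highest timestamp wins. Assumption: ties are broken by region name so that
    # every region independently reaches the same answer.
    return max(versions, key=lambda v: (v["ts"], v["region"]))


# Hypothetical concurrent writes to the same document in two regions.
conflict = [
    {"region": "uksouth", "ts": 1700000005, "email": "a@example.com"},
    {"region": "northeurope", "ts": 1700000009, "email": "b@example.com"},
]
print(resolve_last_writer_wins(conflict)["email"])  # b@example.com
```

The trade-off is that the losing write is silently discarded, which is acceptable for some data (profile fields) and not for others (account balances), where a custom merge or single-writer design is usually safer.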

Active-active is the most expensive and most complex pattern. It provides the lowest RTO (failover is near-instant: once health probes mark the failed region unhealthy, the surviving region simply receives all traffic rather than a proportion) and is the right choice for tier-1 workloads where any meaningful downtime is unacceptable.

Best for: Workloads with zero or near-zero RTO requirements, globally distributed user bases where serving from the nearest region improves user experience, and businesses where downtime carries direct financial or regulatory consequences (financial services, payments, healthcare systems).

Pattern 5: Pilot light (cold standby with infrastructure-as-code)

The minimum viable DR pattern for low-criticality workloads. No secondary environment is running; instead, infrastructure is defined as code (Terraform, Bicep, ARM templates) and the deployment pipeline is tested regularly. On a failure, run the pipeline to provision the secondary environment from scratch.

The RTO is the time to provision plus the time to restore data from backup: typically 30 minutes to several hours depending on infrastructure complexity and data volume.
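That sum is easy to estimate ahead of time. The sketch below assumes provisioning and restore run one after the other and that restore throughput is roughly constant; all figures are hypothetical and should be replaced with numbers measured in a real recovery test.

```python
def pilot_light_rto_minutes(provision_min: float, data_gb: float, restore_gb_per_min: float) -> float:
    """Estimated RTO: provisioning time plus time to restore data from backup."""
    if restore_gb_per_min <= 0:
        raise ValueError("restore_gb_per_min must be positive")
    # Assumption: the restore starts only after provisioning has finished.
    return provision_min + data_gb / restore_gb_per_min


# Hypothetical workload: 25 minutes to provision, 500 GB restored at 5 GB/min.
print(pilot_light_rto_minutes(25, 500, 5))  # 125.0
```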

This is appropriate for workloads where the business can tolerate hours of downtime: internal tooling, non-production environments, batch processing workloads that can be rerun.

The critical discipline: test the pipeline regularly. A pilot light that has never been activated is not a DR plan. Run the pipeline monthly in an isolated environment to confirm it works. The first time you discover the pipeline has diverged from what is actually deployed should not be during a real incident.
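Divergence between code and reality can also be checked between test runs. The sketch below shows the idea on two plain dictionaries keyed by resource name; in practice the infrastructure-as-code tool's own plan or what-if output would normally supply this comparison. The resource names and settings are hypothetical.

```python
from typing import Any, Dict, List


def find_drift(declared: Dict[str, Any], deployed: Dict[str, Any]) -> Dict[str, List[str]]:
    """Compare resources declared in code with resources found in the environment."""
    return {
        # Declared in code but absent from the environment.
        "missing": sorted(declared.keys() - deployed.keys()),
        # Present in the environment but not declared in code.
        "unmanaged": sorted(deployed.keys() - declared.keys()),
        # Present in both, but with different settings.
        "changed": sorted(k for k in declared.keys() & deployed.keys() if declared[k] != deployed[k]),
    }


declared = {"sql-main": {"sku": "GP_Gen5_4"}, "aks-main": {"nodes": 10}}
deployed = {"sql-main": {"sku": "GP_Gen5_8"}, "aks-main": {"nodes": 10}, "redis-cache": {"sku": "C1"}}
print(find_drift(declared, deployed))
# {'missing': [], 'unmanaged': ['redis-cache'], 'changed': ['sql-main']}
```

Any non-empty list here means the pipeline would rebuild something other than what is currently running.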

Best for: Internal workloads, batch systems, and non-time-critical services where the cost of a running standby is not justified.

Matching patterns to requirements

Pattern	Typical RTO	Cost multiplier	Suitable for
Zone redundancy	Seconds	1.1-1.3x	Most production workloads
Active-passive hot	1-5 min	~2x	Regulated tier-1 workloads
Active-passive warm	5-15 min	1.3-1.6x	Important non-critical workloads
Active-active	Near-zero	2-3x	Highest-criticality, global services
Pilot light	30 min - hours	1.0x	Internal, non-critical workloads

Note: cost multipliers are illustrative. Actual costs depend on the specific services and tier configurations.
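The table can be turned into a simple first-pass selector: given a required RTO, pick the cheapest pattern that meets it. The sketch below encodes the table using the upper end of each RTO range and the lower end of each cost range. Several values are assumptions rather than figures from the table: 1 minute for "seconds", 0.1 minutes for "near-zero", 240 minutes for "hours", and the premise that pilot light backups are stored where a regional outage cannot reach them.

```python
from typing import NamedTuple, Optional


class Pattern(NamedTuple):
    name: str
    worst_rto_min: float        # upper end of the typical RTO range, in minutes
    cost_multiplier: float      # lower end of the illustrative cost range
    survives_region_loss: bool


PATTERNS = [
    Pattern("Zone redundancy", 1.0, 1.1, False),
    Pattern("Active-passive hot", 5.0, 2.0, True),
    Pattern("Active-passive warm", 15.0, 1.3, True),
    Pattern("Active-active", 0.1, 2.0, True),
    Pattern("Pilot light", 240.0, 1.0, True),
]


def cheapest_pattern(required_rto_min: float, need_region_failover: bool) -> Optional[Pattern]:
    """Cheapest pattern whose worst-case RTO fits the requirement, or None."""
    candidates = [
        p for p in PATTERNS
        if p.worst_rto_min <= required_rto_min
        and (p.survives_region_loss or not need_region_failover)
    ]
    return min(candidates, key=lambda p: p.cost_multiplier, default=None)


print(cheapest_pattern(20, need_region_failover=True).name)   # Active-passive warm
print(cheapest_pattern(2, need_region_failover=False).name)   # Zone redundancy
```

A selector like this only narrows the options; RPO, data consistency needs, and regulatory requirements still have to be checked per workload.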

Where Critical Cloud comes in

Choosing the right failover pattern per workload and then proving it works through tested recovery exercises is part of the operational resilience discipline we run for regulated businesses. Under DORA and FCA rules, having the right architecture is necessary but not sufficient: you have to evidence that recovery works within your impact tolerance. As the world's first Powered by Datadog accredited partner, we monitor failover health, replication lag, and RTO compliance as live signals. See how Critical Support works.