Self-Healing Systems in Azure: Key Design Patterns

An incident that wakes someone up at 3am and is resolved by that person restarting a service is not a good outcome. The service was unhealthy for minutes to hours. The on-call engineer lost sleep. The root cause is still unknown. A better outcome is the service detecting the problem, recovering automatically, and logging the event for analysis in the morning. That is what self-healing systems do.

This guide covers the patterns that enable Azure systems to detect and recover from failures without human intervention, and the Azure services that implement each pattern.

Health probes: detect the failure

Self-healing starts with detection. Azure services use health probes to determine whether an instance is fit to receive traffic. If it is not, the infrastructure removes it from rotation and replaces it.

Azure Kubernetes Service supports liveness and readiness probes on each container. A liveness probe that fails causes the container to be restarted. A readiness probe that fails removes the pod from the Service's endpoints until it recovers, without restarting it. Design separate probes: liveness for "is the process alive," readiness for "are all dependencies available." A container that is alive but unable to reach its database should fail readiness without triggering a restart, because a restart will not fix the database.
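As a rough illustration of that split, a container spec might declare the two probes separately. This is a sketch only: the container name, image, port, paths, and timings are all assumptions to be tuned for the real workload.

```yaml
# Hypothetical container spec fragment; every name and number here is illustrative.
containers:
  - name: orders-api
    image: example.azurecr.io/orders-api:1.0.0
    ports:
      - containerPort: 8080
    # Liveness: "is the process alive?" A failure restarts the container.
    livenessProbe:
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
      failureThreshold: 3
    # Readiness: "can it serve traffic?" A failure removes the pod from
    # the Service's endpoints but does not restart it.
    readinessProbe:
      httpGet:
        path: /healthz/ready
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
```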

Azure App Service health checks poll a configured endpoint. An instance that fails consecutive checks is removed from load balancer rotation, provided the plan runs at least two instances. An instance that stays unhealthy is eventually replaced with a new one, so the app does not keep running at reduced capacity.

Virtual Machine Scale Sets use the Application Health extension or Load Balancer health probes to detect unhealthy VMs and trigger automatic instance repairs, which replace the unhealthy instance with a new one from the base image.

The health probe endpoint itself must be representative of actual service health. An endpoint that returns 200 even when the application is broken is not a health probe; it is a monitoring gap. In the probe that gates traffic (readiness, or its equivalent on App Service and Scale Sets), check critical dependencies: database connectivity, message queue access, configuration service availability.
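A minimal, framework-agnostic sketch of such a check follows. The `ReadinessProbe` class and the dependency names are hypothetical; in a real service the result would be returned from whatever endpoint the probe polls.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs each dependency check and maps the outcome to an HTTP status code.
public static class ReadinessProbe
{
    public static async Task<(int StatusCode, List<string> Failed)> EvaluateAsync(
        IReadOnlyDictionary<string, Func<CancellationToken, Task<bool>>> checks,
        TimeSpan perCheckTimeout)
    {
        var failed = new List<string>();
        foreach (var (name, check) in checks)
        {
            // The timeout only takes effect if the check honours the token it is given.
            using var cts = new CancellationTokenSource(perCheckTimeout);
            try
            {
                // A check that returns false counts as a failure.
                if (!await check(cts.Token)) failed.Add(name);
            }
            catch (Exception)
            {
                // A check that throws (including on cancellation) also counts as a failure.
                failed.Add(name);
            }
        }

        // 200 only when every dependency passed; otherwise 503 so the instance is taken out of rotation.
        return (failed.Count == 0 ? 200 : 503, failed);
    }
}
```

Keep the checks cheap: the probe runs every few seconds on every instance, so each one should be a lightweight ping rather than a real query.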

Retry pattern: handle transient failures

Transient failures are temporary conditions that resolve themselves: a brief network interruption, a momentary database connection reset, a downstream service under momentary load. Most distributed system failures are transient. The retry pattern handles them without surfacing them to the user.

The key design decisions in a retry implementation:

What to retry. Retry on connection failures, timeout exceptions, and throttling or temporary-unavailability responses (HTTP 429 Too Many Requests, HTTP 503 Service Unavailable), honouring a Retry-After header when the response carries one. Do not retry on authentication failures, not-found responses, or business logic errors. Retrying a 401 ten times wastes resources; the result will not change.

Exponential backoff with jitter. Each retry attempt should wait longer than the previous one. Constant or linear backoff hammers a struggling dependency without giving it time to recover. Jitter (a small random offset) prevents multiple clients from retrying in synchronised waves:

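// Base delay doubles with each attempt (100 ms * 2^retryAttempt), plus 0-99 ms of random jitter.
// Random.Shared requires .NET 6 or later.
var delay = TimeSpan.FromMilliseconds(
    Math.Pow(2, retryAttempt) * 100 + Random.Shared.Next(0, 100));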

Maximum retry count. Retrying indefinitely converts a transient failure into an infinite loop that holds a thread or connection. Three to five attempts is usually appropriate; match the ceiling to the operation's criticality and your SLA.

Idempotency. Only retry operations that are safe to repeat. A database write that is not idempotent should not be retried blindly: you may execute it twice. Design writes to be idempotent where retries are required, or accept that retries may not be possible.
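The decisions above can be combined into a small helper. This is a sketch, not a replacement for a maintained library: the `Retry` class, its `IsTransient` classification, and the parameter names are assumptions, and it targets .NET 6 or later.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper illustrating bounded retries with exponential backoff and jitter.
public static class Retry
{
    // Exponential backoff: baseDelay * 2^attempt, plus random jitter of up to baseDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay) =>
        TimeSpan.FromMilliseconds(
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt)
            + Random.Shared.Next(0, (int)baseDelay.TotalMilliseconds));

    // Assumed classification: timeouts, connection-level failures (no status code), 429 and 503.
    // Adjust it to the exceptions your client actually throws.
    public static bool IsTransient(Exception ex) =>
        ex is TimeoutException
        || ex is HttpRequestException
        {
            StatusCode: null or HttpStatusCode.TooManyRequests or HttpStatusCode.ServiceUnavailable
        };

    // Runs the operation up to maxAttempts times. The caller is responsible for
    // passing only operations that are safe to repeat (idempotent).
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        TimeSpan baseDelay,
        CancellationToken ct = default)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation(ct);
            }
            catch (Exception ex) when (IsTransient(ex) && attempt < maxAttempts - 1)
            {
                // Transient failure with attempts remaining: wait, then loop again.
                await Task.Delay(BackoffDelay(attempt, baseDelay), ct);
            }
            // Non-transient failures, and the final transient failure, propagate to the caller.
        }
    }
}
```

A call might look like `await Retry.ExecuteAsync(ct => httpClient.GetStringAsync(url, ct), maxAttempts: 4, baseDelay: TimeSpan.FromMilliseconds(100))`, where `httpClient` and `url` are placeholders.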

Azure SDK clients implement retry policies internally for Azure service calls. Use them rather than reimplementing retry logic around Azure SDK calls. The built-in policies handle Azure-specific error codes and recommended backoff strategies.

Circuit breaker: stop amplifying failures

The circuit breaker pattern prevents an application from repeatedly calling a failing dependency. Without it, an unhealthy downstream service receives a constant stream of calls from all its consumers, preventing it from recovering.

The pattern uses three states:

Closed (normal operation): Calls pass through. Failures are counted.

Open (dependency is failing): Calls are rejected immediately without attempting the downstream call, returning a fast failure to the caller. The circuit remains open for a configured timeout.

Half-open (testing recovery): After the timeout, a limited number of trial calls pass through. If they succeed, the circuit closes. If they fail, it reopens for another timeout period.

In .NET, Polly implements circuit breakers:

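// Requires: using Polly; using Polly.Timeout; (Polly v7 syntax)
var circuitBreakerPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutException>()
    .Or<TimeoutRejectedException>() // thrown by Polly's own timeout policy
    .CircuitBreakerAsync(
        // Opens after 5 consecutive handled exceptions. The non-generic overload names
        // this parameter exceptionsAllowedBeforeBreaking; Policy<TResult> uses
        // handledEventsAllowedBeforeBreaking.
        exceptionsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30),
        onBreak: (exception, duration) =>
            logger.LogWarning(exception, "Circuit breaker opened for {Duration}", duration),
        onReset: () => logger.LogInformation("Circuit breaker closed"),
        onHalfOpen: () => logger.LogInformation("Circuit breaker half-open"));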

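Combine the circuit breaker with the retry policy: retries handle individual transient failures, while the circuit breaker handles sustained dependency outages. Stack them with Polly's PolicyWrap, listing policies from outermost to innermost. With retry outermost, each attempt passes through the breaker and gets its own timeout, and the breaker counts every failed attempt rather than only exhausted retry sequences:

var resilience = Policy.WrapAsync(retryPolicy, circuitBreakerPolicy, timeoutPolicy);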
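The wrap refers to a `retryPolicy` and a `timeoutPolicy` that this guide does not define. One plausible pair, assuming Polly v7 and illustrative values for the timeout, retry count, and delays, might look like this:

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Timeout;

// Assumed definitions; the numbers are illustrative, not recommendations.

// Timeout for each individual attempt when it sits innermost in the wrap.
// Polly signals a timeout with TimeoutRejectedException, not TimeoutException.
var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(10));

// Up to three retries (four attempts in total). Polly passes a 1-based attempt number,
// so the waits are roughly 200 ms, 400 ms, and 800 ms, each plus 0-99 ms of jitter.
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: attempt => TimeSpan.FromMilliseconds(
            Math.Pow(2, attempt) * 100 + Random.Shared.Next(0, 100)));
```

Polly's default timeout strategy is cooperative: it cancels the token it passes to the delegate, so execute calls in the form `resilience.ExecuteAsync(ct => httpClient.GetAsync(url, ct), CancellationToken.None)` and pass that token on to the underlying call. Here `httpClient` and `url` are placeholders.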

Bulkhead: isolate failures

The bulkhead pattern limits concurrent access to a resource to prevent a single misbehaving component from consuming all available resources and degrading other components.

In practice: if your application calls three downstream services and one is slow, threads waiting for that slow service will eventually fill your thread pool, starving calls to the other two healthy services. A bulkhead limits how many concurrent calls can go to each downstream service, so a misbehaving dependency can only consume its allocated portion of resources.

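Polly's Bulkhead policy limits concurrent executions and queues excess calls up to a configured queue depth. Calls that arrive when both the execution slots and the queue are full are rejected immediately with a BulkheadRejectedException: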

var bulkheadPolicy = Policy.BulkheadAsync(
    maxParallelization: 10,
    maxQueuingActions: 25,
    onBulkheadRejectedAsync: context =>
    {
        logger.LogWarning("Bulkhead rejected call");
        return Task.CompletedTask;
    });

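In AKS, container resource requests and limits, together with namespace resource quotas, provide bulkhead-like isolation at the container level: a misbehaving workload can consume only the CPU and memory allocated to it. Pod disruption budgets are complementary rather than a bulkhead; they cap how many replicas can be taken down at once during voluntary disruptions such as node upgrades.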
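As one illustration of the quota side, a namespace-level ResourceQuota caps what all the pods in a namespace can request in total. The namespace name and every figure below are assumptions:

```yaml
# Hypothetical quota for a "team-a" namespace; size the limits to your own workloads.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"        # total CPU all pods in the namespace may request
    requests.memory: 16Gi
    limits.cpu: "16"         # total CPU limit across all pods
    limits.memory: 32Gi
    pods: "40"               # maximum number of pods in the namespace
```

Once a quota covers CPU or memory, pods in that namespace that omit the corresponding requests or limits are rejected unless a LimitRange supplies defaults.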

Graceful degradation: maintain partial functionality

When a dependency is unavailable, the ideal behaviour is to degrade gracefully rather than fail completely. The application continues to function with reduced capability rather than returning an error.

Examples of graceful degradation:

A search function returns cached or empty results rather than an error when the search service is unavailable
A recommendation widget is hidden rather than failing the page load when the recommendations service is down
An application serves read requests from a cache when the primary database is unavailable

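Implement graceful degradation by placing a fallback policy outside the circuit breaker, so that the BrokenCircuitException thrown by an open circuit is converted into a default result instead of reaching the caller:

// Requires: using Polly; using Polly.CircuitBreaker;
// Only BrokenCircuitException is handled here; other exceptions still propagate.
var fallbackPolicy = Policy<IList<Product>>
    .Handle<BrokenCircuitException>()
    .FallbackAsync(
        fallbackValue: new List<Product>(),
        onFallbackAsync: (result, context) =>
        {
            logger.LogWarning("Recommendations service unavailable, returning empty list");
            return Task.CompletedTask;
        });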
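The cache-backed variant of degradation, serving the last good value when the source fails, can be sketched without any library. The `LastKnownGood<T>` class below is hypothetical, is not thread-safe, and catches every exception for brevity; a real implementation would narrow the catch and bound how stale a value may become.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical wrapper: remembers the last successful result and serves it when the source fails.
public sealed class LastKnownGood<T>
{
    private readonly Func<Task<T>> _source;
    private T _last;

    // defaultValue is served if the source fails before it has ever succeeded.
    public LastKnownGood(Func<Task<T>> source, T defaultValue)
    {
        _source = source;
        _last = defaultValue;
    }

    // Returns the fresh value when the source works, otherwise the last good value.
    // Degraded is true when the caller received a fallback rather than fresh data.
    public async Task<(T Value, bool Degraded)> GetAsync()
    {
        try
        {
            _last = await _source();
            return (_last, false);
        }
        catch (Exception)
        {
            return (_last, true);
        }
    }
}
```

Returning the `Degraded` flag lets the caller log the event or tell the user that the data may be stale, which keeps the degraded state visible instead of silent.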

Design which degraded states are acceptable before an incident forces the decision. Document them. Ensure the fallback behaviour is tested. An untested fallback path is likely to fail under the unusual conditions that cause it to be invoked.

Autoscaling: recover from capacity problems

Health probes handle instance failures. Autoscaling handles capacity problems: the service is running but overwhelmed.

Configure autoscale on Azure VM Scale Sets, AKS node pools, App Service plans, and Azure Container Apps based on the metric that reflects actual load rather than infrastructure utilisation. CPU and memory are often poor choices for web services: a service can have high CPU because it is doing useful work efficiently, or because it is thrashing. Queue depth, request rate, or active connections are typically better scale triggers.

Scale-in (reducing instance count) is as important as scale-out. A system that scales out under load but never scales in pays for idle capacity indefinitely. Configure scale-in policies conservatively: require sustained low utilisation before removing instances, and never scale below the minimum needed for high availability.
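The arithmetic behind a queue-depth trigger with conservative scale-in can be sketched as follows. Azure's autoscale engines evaluate rules like this for you; the `QueueScaler` class, its parameters, and the one-instance-at-a-time scale-in are assumptions chosen only to make the reasoning concrete.

```csharp
using System;

// Hypothetical illustration of queue-depth scaling; not an Azure API.
public static class QueueScaler
{
    // Instances needed so that each handles at most targetPerInstance queued messages,
    // clamped between the availability floor (min) and the cost ceiling (max).
    public static int DesiredInstances(int queueDepth, int targetPerInstance, int min, int max)
    {
        var needed = (int)Math.Ceiling(queueDepth / (double)targetPerInstance);
        return Math.Clamp(needed, min, max);
    }

    // Scale out immediately, but scale in by one instance only after the desired count
    // has stayed below the current count for the whole cooldown period.
    public static int NextInstanceCount(
        int current, int desired, TimeSpan timeBelowCurrent, TimeSpan scaleInCooldown)
    {
        if (desired >= current) return desired;
        return timeBelowCurrent >= scaleInCooldown ? current - 1 : current;
    }
}
```

For example, with 450 queued messages, a target of 100 per instance, and bounds of 2 to 10, `DesiredInstances` gives 5; an empty queue still gives 2, because the floor protects availability.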

Where Critical Cloud comes in

Building these patterns correctly and ensuring they actually work under production failure conditions, not just in controlled tests, requires operational experience with how Azure services fail. We design and operate resilient Azure systems for technology-led and regulated businesses where the cost of downtime is real. As the world's first Powered by Datadog accredited partner, we monitor circuit breaker state changes, health probe failures, and autoscale events as operational signals, so recovery is visible and the patterns are validated under production conditions. See how Critical Support works.