5 Steps to Diagnose Azure Load Balancer Traffic Problems

Traffic problems with Azure Load Balancer look the same from the outside: connections fail, time out, or hit the wrong backend. The cause is almost always one of five things. The diagnostic path is to work through them in order, from the most common to the least, rather than guessing.

This guide is that order.

Step 1: Check backend pool health and configuration

Most Azure Load Balancer problems trace back to a backend pool issue: the backends are marked unhealthy, they are not listening on the right port, or they are misconfigured in the pool itself.

Start with the health probe. The load balancer sends health probes to every backend instance at a configurable interval. If a backend fails a configurable number of consecutive probes, it is removed from rotation. Traffic to a backend pool where all instances are failing probes goes nowhere.

In the Azure Portal, navigate to your load balancer resource and select Backend pools. Click into each pool and check the Health status column. If backends show as unhealthy, the probe is failing.

Why probes fail:

  • The backend VM or container is not running the application on the probed port. A probe configured for TCP:443 fails if the application is only listening on TCP:80.
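  • The application is running but returning something other than HTTP 200 for an HTTP health probe. An HTTP or HTTPS health probe is marked healthy only when the configured path returns 200 within the probe timeout. A 404 or 500 counts as unhealthy, and so does a redirect such as 301 or 302.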
  • A Network Security Group on the VM's network interface or subnet is blocking the health probe. Health probes originate from the IP address 168.63.129.16 (Azure's infrastructure address). The default NSG rule AllowAzureLoadBalancerInBound permits this traffic, but a custom deny rule with a higher priority (a lower priority number) overrides it, and the probe is then blocked before it reaches the application. Add an inbound NSG rule allowing TCP on the probe port from the AzureLoadBalancer service tag, or from 168.63.129.16/32.
  • The guest OS firewall on the backend VM is blocking the port. Check Windows Firewall or iptables in addition to NSG rules.

Verify with a manual probe test: connect directly to the backend instance on the probe port from another VM in the same VNet. If it responds, the application is listening and the guest firewall is open, which narrows the issue to the NSG rules for the probe source or to the probe configuration (port, protocol, or path).
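The manual test above can be scripted. The following is a minimal sketch, assuming Python 3 is available on the test VM. The function names are illustrative and not part of any Azure tooling. It approximates what a TCP probe and an HTTP probe check for, and it does not reproduce the load balancer's own probe logic.

```python
import http.client
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def http_probe(host: str, port: int, path: str = "/", timeout: float = 5.0) -> bool:
    """Return True only if a GET on the path returns HTTP 200.

    http.client does not follow redirects, so a 301 or 302 is reported
    as a failure here, as is any other non-200 status.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

For example, `tcp_probe("10.0.0.4", 443)` and `http_probe("10.0.0.4", 80, "/health")` would test a backend at the hypothetical address 10.0.0.4. Both the address and the `/health` path are placeholders for your own values.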

Step 2: Review load distribution rules

If backends are healthy but traffic is not reaching them, the load balancing rules may be incorrectly configured.

Each rule specifies a frontend IP address and port, a backend pool, a backend port, and the protocol. A mismatch between the frontend port (what clients connect to) and the backend port (what the application listens on) causes the load balancer to forward traffic to a port the application is not listening on.

Common configuration errors:

Port mismatch. Frontend rule is configured for port 443, backend pool instances are listening on port 8443, but the backend port in the rule is set to 443. Traffic arrives at the VM on port 443 where nothing is listening.

Wrong protocol. A TCP rule for a UDP service, or vice versa. UDP and TCP rules are distinct in Azure Load Balancer Standard.

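Floating IP misconfiguration. Floating IP (also called Direct Server Return) is used in certain high-availability scenarios (SQL Server Always On availability groups, for example). With Floating IP enabled, the load balancer delivers packets with the frontend IP still set as the destination address, so the backend instance must have the frontend IP configured on an interface (typically a loopback) and the application must listen on it, not only on the backend's own IP. When that guest configuration is missing, packets reach the VM and are then dropped by the guest OS, so traffic appears to arrive and then disappear.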

Inbound NAT rules shadowing load balancing rules. If an inbound NAT rule covers the same frontend port as a load balancing rule, the NAT rule takes precedence. Check whether inbound NAT rules are interfering with expected distribution.
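The port and protocol mismatches described above are easy to check mechanically once you know what the application actually listens on. The sketch below is a hypothetical model: `LoadBalancingRule` and `find_rule_problems` are illustrative names, not Azure SDK types, and the rule values would have to be copied in from your own configuration.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class LoadBalancingRule:
    """Illustrative stand-in for one load balancing rule."""
    name: str
    protocol: str        # "Tcp" or "Udp"
    frontend_port: int   # what clients connect to
    backend_port: int    # where the load balancer forwards traffic


def find_rule_problems(rule: LoadBalancingRule,
                       listeners: Set[Tuple[str, int]]) -> List[str]:
    """Compare a rule with the (protocol, port) pairs the application listens on."""
    protocol = rule.protocol.lower()
    normalized = {(proto.lower(), port) for proto, port in listeners}
    if (protocol, rule.backend_port) in normalized:
        return []  # the rule forwards to a port the application serves
    if any(port == rule.backend_port for _, port in normalized):
        return [f"{rule.name}: wrong protocol, port {rule.backend_port} "
                f"is open but not for {protocol}"]
    open_ports = sorted(port for proto, port in normalized if proto == protocol)
    return [f"{rule.name}: port mismatch, rule forwards to "
            f"{protocol}/{rule.backend_port} but the application "
            f"listens on {open_ports}"]
```

With a rule forwarding 443 to 443 and an application listening only on TCP 8443, the function reports a port mismatch. Changing the rule's backend port to 8443 clears it.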

To verify rule configuration, use Connection Troubleshoot in Azure Network Watcher. Specify the source (a test VM or your client), the destination (the load balancer's frontend IP), the port, and the protocol. Network Watcher traces the path and reports where connectivity fails or succeeds.

Step 3: Investigate session persistence settings

Session persistence (also called session affinity or sticky sessions) controls whether a client is consistently routed to the same backend instance across multiple requests.

Azure Load Balancer offers three modes:

  • None (default): Each connection is distributed independently using a 5-tuple hash (source IP, source port, destination IP, destination port, protocol). Because the source port usually changes with each new connection, the same client can land on different backends from one connection to the next.
  • Client IP (2-tuple): The hash uses only the source IP and destination IP, so all connections from the same source IP to the same frontend are routed to the same backend. The ports and protocol are ignored.
  • Client IP and Protocol (3-tuple): The hash uses source IP, destination IP, and protocol, so all connections from the same source IP using the same protocol are routed to the same backend.
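To see why the three modes behave differently, it helps to model which fields feed the hash. The sketch below is an illustration only: Azure does not publish its hash function, so SHA-256 is a stand-in, and the `Flow` type, mode names, and backend names are all assumptions made for the example.

```python
import hashlib
from typing import NamedTuple, Sequence, Tuple


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


def hash_fields(flow: Flow, mode: str) -> Tuple:
    """Select the fields that take part in the hash for each persistence mode."""
    if mode == "none":                    # 5-tuple
        return tuple(flow)
    if mode == "client_ip":               # 2-tuple
        return (flow.src_ip, flow.dst_ip)
    if mode == "client_ip_protocol":      # 3-tuple
        return (flow.src_ip, flow.dst_ip, flow.protocol)
    raise ValueError(f"unknown mode: {mode}")


def pick_backend(flow: Flow, backends: Sequence[str], mode: str = "none") -> str:
    """Map a flow to a backend. SHA-256 stands in for the real, unpublished hash."""
    if not backends:
        raise ValueError("backend pool is empty")
    key = "|".join(str(field) for field in hash_fields(flow, mode)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]
```

Feeding in flows that differ only in source port shows the effect: in `client_ip` mode they all map to one backend, while in `none` mode they spread across the pool.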

Session persistence problems appear as intermittent failures for stateful applications. If your application stores session state on the backend instance (in memory, in a local file) rather than in a shared store (Redis, SQL), clients routed to a different backend will lose their session and see errors.

The correct long-term solution is to design the application to share state (use Azure Cache for Redis for session state). The short-term workaround is to enable Client IP session persistence so clients consistently hit the same backend. This is a band-aid, not a fix: it makes load distribution less even (a source IP that generates many connections, such as a corporate proxy or NAT device, pins all of them to one backend), and it does not preserve sessions if the backend instance is restarted or removed from rotation.

If you recently changed session persistence settings, existing connections use the old settings until they are torn down and re-established. A brief connection interruption forces clients to re-establish and pick up the new settings.

Step 4: Check for SNAT port exhaustion

SNAT (Source Network Address Translation) port exhaustion is a Load Balancer problem that appears intermittently and is easy to misdiagnose as a network issue or application bug.

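Azure Standard Load Balancer performs SNAT for outbound connections from backend pool members to the public internet. Each backend instance is allocated a number of SNAT ports from the load balancer's public frontend IP addresses. Each concurrent outbound connection to the same destination IP and port consumes one SNAT port, although a port can be reused for a different destination. Exhaustion therefore tends to appear when an instance opens many connections to a small number of destinations. When no allocated SNAT port is free for a destination, new outbound connections from that backend instance to it fail.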
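A quick calculation shows how few ports each instance may actually have. The sketch below assumes the figures from Azure's published documentation at the time of writing: 64,000 SNAT ports per public frontend IP, a tiered default allocation by backend pool size, and manual allocations in multiples of eight. Verify these against the current documentation before relying on them. The function names are illustrative.

```python
# Assumed figures from Azure documentation; confirm before relying on them.
PORTS_PER_FRONTEND_IP = 64_000

# (maximum pool size, default SNAT ports per instance)
DEFAULT_ALLOCATION = [
    (50, 1024),
    (100, 512),
    (200, 256),
    (400, 128),
    (800, 64),
    (1000, 32),
]


def default_ports_per_instance(pool_size: int) -> int:
    """Default SNAT ports per instance for a given backend pool size."""
    if pool_size < 1:
        raise ValueError("pool size must be at least 1")
    for max_size, ports in DEFAULT_ALLOCATION:
        if pool_size <= max_size:
            return ports
    raise ValueError("pool size exceeds the tiers modelled here")


def manual_ports_per_instance(frontend_ips: int, instances: int) -> int:
    """Largest even split of the frontend ports, rounded down to a multiple of 8."""
    if frontend_ips < 1 or instances < 1:
        raise ValueError("need at least one frontend IP and one instance")
    share = PORTS_PER_FRONTEND_IP * frontend_ips // instances
    return share - share % 8
```

Under these assumptions, a 60-instance pool on default allocation gets 512 ports per instance, while the same pool with one frontend IP and manual allocation could be given up to 1,064.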

SNAT exhaustion typically shows up as outbound connection failures from specific backend instances at unpredictable times, often correlated with load spikes. The failures affect outbound connections (to databases, external APIs, other Azure services via public endpoint) rather than inbound connections through the load balancer.

Check for SNAT exhaustion:

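In Azure Monitor, add a metric for your load balancer: SNAT Connection Count, filtered by Connection State = Failed. A sustained non-zero Failed count indicates exhaustion. To see how close each instance is to its limit, compare the Used SNAT Ports metric against Allocated SNAT Ports.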

Fix SNAT exhaustion:

  • Add more public IP addresses to the load balancer's frontend IP configuration. Each additional public IP provides more SNAT ports.
  • Use Outbound Rules on the load balancer to explicitly allocate SNAT ports per backend pool member, rather than relying on the default allocation.
  • Use a NAT Gateway instead of load balancer SNAT for outbound traffic. NAT Gateway provides a much larger SNAT port pool and scales automatically.
  • Reduce outbound connection volume: use connection pooling in your application so connections to databases and external services are reused rather than opened and closed per request.

For applications making frequent short-lived outbound connections (REST API calls, for example), connection pooling is often the most impactful fix.
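The simplest form of pooling is to keep one connection open and send many requests over it, so that each request does not consume a fresh SNAT port. The sketch below shows that idea with Python's standard library. `KeepAliveClient` is an illustrative name, and a production client would normally use a real pool with retries, which this sketch omits.

```python
import http.client
from typing import Tuple


class KeepAliveClient:
    """Send many requests to one host over a single reused connection."""

    def __init__(self, host: str, port: int = 443,
                 use_tls: bool = True, timeout: float = 10.0) -> None:
        conn_cls = (http.client.HTTPSConnection if use_tls
                    else http.client.HTTPConnection)
        # The socket is opened lazily on the first request and then kept open.
        self._conn = conn_cls(host, port, timeout=timeout)

    def get(self, path: str) -> Tuple[int, bytes]:
        """Issue a GET and return (status, body).

        No retry logic: if the server has dropped the idle connection,
        the exception propagates to the caller.
        """
        self._conn.request("GET", path)
        response = self._conn.getresponse()
        body = response.read()  # drain the body so the socket can be reused
        return response.status, body

    def close(self) -> None:
        self._conn.close()
```

Creating one client per destination at startup and reusing it for every call means a burst of requests shares one source port instead of opening one per request.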

Step 5: Validate NSG rules at every scope

If the above steps have not identified the issue, systematically review NSG rules at every scope where they apply.

NSGs in Azure can be associated at the network interface level and at the subnet level, and both apply to traffic. For inbound traffic, the subnet NSG is evaluated first and then the NIC NSG; for outbound traffic the order is reversed. Traffic must be allowed at both scopes, so traffic allowed by the subnet NSG can still be blocked by a rule on the NIC NSG.

Use the Effective Security Rules view in the Azure Portal (navigate to the VM, select Networking, then Effective security rules) to see the combined applied ruleset. This is the definitive view of what is actually allowed and blocked.
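The way the two scopes combine can be modelled in a few lines. The sketch below is a deliberately simplified model, not Azure's implementation. It matches only on source prefix and destination port, treats "no matching rule" as a deny in place of the real default rules, and uses hypothetical names (`Rule`, `nsg_allows`, `inbound_allowed`).

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    priority: int          # lower numbers are evaluated first
    source: str            # CIDR prefix, or "*" for any source
    port: Optional[int]    # destination port, or None for any port
    allow: bool


def matches(rule: Rule, src_ip: str, port: int) -> bool:
    if rule.port is not None and rule.port != port:
        return False
    if rule.source == "*":
        return True
    return ipaddress.ip_address(src_ip) in ipaddress.ip_network(rule.source)


def nsg_allows(rules: Sequence[Rule], src_ip: str, port: int) -> bool:
    """The first matching rule in priority order decides the outcome."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if matches(rule, src_ip, port):
            return rule.allow
    return False  # simplified stand-in for the default deny-all rule


def inbound_allowed(subnet_nsg: Optional[Sequence[Rule]],
                    nic_nsg: Optional[Sequence[Rule]],
                    src_ip: str, port: int) -> bool:
    """Inbound traffic must pass the subnet NSG and then the NIC NSG.

    None means no NSG is associated at that scope, so it does not filter.
    """
    for nsg in (subnet_nsg, nic_nsg):
        if nsg is not None and not nsg_allows(nsg, src_ip, port):
            return False
    return True
```

A subnet NSG that allows 168.63.129.16 on port 80 combined with a NIC NSG that denies all sources on port 80 yields a deny, which is the probe-blocking case from Step 1.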

Common NSG issues in load balancer scenarios:

  • Inbound rule blocking the probe source address (168.63.129.16) as noted in Step 1.
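  • Inbound rule allowing the load balancer frontend IP as a source but blocking traffic from the AzureLoadBalancer service tag. Azure Load Balancer preserves the client's source IP, so backends never see the frontend IP as the source of client traffic. The service tag covers the load balancer infrastructure; blocking it breaks health probes and certain internal traffic paths.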
  • Outbound rule blocking traffic from backend pool members to Azure services (storage, SQL, Key Vault) that the application depends on.

After confirming NSG rules, enable NSG Flow Logs (or VNet Flow Logs if you have migrated) and look for explicit deny entries for the traffic in question. In NSG flow logs, flows are grouped under the rule that matched them, so an entry with decision D (deny) at the timestamp of a failed connection confirms which rule is responsible.
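Once the logs are downloaded, the deny entries can be filtered with a short script. The sketch below assumes the comma-separated flow tuple layout used by NSG flow logs (timestamp, source IP, destination IP, source port, destination port, protocol, direction, decision, followed by optional state and counters). VNet flow logs use a different layout, so treat the field order as an assumption to verify. The sample tuples in the usage note are invented for illustration.

```python
from typing import Iterable, List, NamedTuple


class FlowTuple(NamedTuple):
    timestamp: int    # Unix epoch seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str     # "T" for TCP, "U" for UDP
    direction: str    # "I" inbound, "O" outbound
    decision: str     # "A" allowed, "D" denied


def parse_flow_tuple(raw: str) -> FlowTuple:
    """Parse the first eight fields of one flow tuple string."""
    parts = raw.split(",")
    if len(parts) < 8:
        raise ValueError(f"unexpected flow tuple: {raw!r}")
    return FlowTuple(int(parts[0]), parts[1], parts[2],
                     int(parts[3]), int(parts[4]),
                     parts[5], parts[6], parts[7])


def denied_flows(raw_tuples: Iterable[str], dst_port: int) -> List[FlowTuple]:
    """Return the denied flows aimed at the given destination port."""
    flows = (parse_flow_tuple(raw) for raw in raw_tuples)
    return [f for f in flows if f.decision == "D" and f.dst_port == dst_port]
```

Given the made-up tuples `"1700000000,168.63.129.16,10.0.0.4,51000,80,T,I,D,B,,,,"` and `"1700000005,10.0.0.5,10.0.0.4,51001,80,T,I,A,B,,,,"`, calling `denied_flows(samples, 80)` returns only the first, which would point to a blocked health probe.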

Where Critical Cloud comes in

Load balancer problems that mix rule configuration, probe failures, SNAT exhaustion, and NSG interactions are time-consuming to diagnose without the right tooling and clear methodology. We operate Azure networking for technology-led businesses and handle these diagnostics as a routine part of managing the platform. As the world's first Powered by Datadog accredited partner, we monitor load balancer health probe success rates, SNAT port utilisation, and NSG deny patterns as live signals, so traffic problems are identified before they become user-facing incidents.