Scaling AI Workloads on Azure: Best Practices

AI workloads do not scale like web services. A web service scales by adding more instances; each additional instance handles proportionally more requests. An AI inference workload scales by managing GPU memory (which is finite and non-shareable per request), model loading time (which makes cold starts expensive), and batching (which trades latency for throughput in ways that web service patterns do not).

Getting AI scaling right on Azure requires understanding these differences and choosing the right services and configurations for each scenario.

Choose the right compute for your AI tier

Azure provides several compute paths for AI workloads. The right choice depends on whether you are running training, fine-tuning, or inference, and at what scale.

Azure OpenAI Service: Use this for inference with OpenAI models (GPT-4, GPT-4o, embeddings, DALL-E). It is a managed API: Microsoft runs the infrastructure, and you choose between provisioned throughput units (PTUs) and token-based consumption. PTUs give guaranteed throughput at a fixed cost that does not vary with token volume; consumption is pay-per-token. For production workloads with predictable request volume, PTUs typically cost less than consumption-based pricing at scale. For variable or experimental workloads, consumption is more appropriate.

Azure Machine Learning managed online endpoints: Use these to serve custom models (fine-tuned, open-source, or proprietary) that are not available through Azure OpenAI. Managed online endpoints handle autoscaling, deployment management, and traffic routing between model versions, and they support both CPU and GPU instance types.

Azure Kubernetes Service with GPU node pools: Use this for maximum control over inference serving. If you need custom inference servers (TorchServe, TensorFlow Serving, Triton Inference Server), multi-model serving on a single GPU instance, or specific GPU SKUs that are not available on managed endpoints, AKS with GPU nodes gives you that control at the cost of managing the infrastructure yourself.

Azure Container Instances: Use this for batch or short-lived inference workloads that do not need continuous availability.

GPU memory is the primary scaling constraint

Unlike CPU memory, GPU VRAM is not easily swapped or over-committed. A model that requires 20 GB of VRAM cannot run on a 16 GB GPU. When scaling inference, the first constraint is fitting the model on the GPU.

Model quantisation reduces VRAM requirements at a small accuracy cost. A 70-billion-parameter model at 16-bit floating point needs approximately 140 GB of VRAM for its weights alone (2 bytes per parameter), which has to be spread across multiple A100 GPUs. The same model quantised to 4-bit needs approximately 35 GB for weights, which fits on a single A100-80GB, or across two A100-40GB GPUs once room for the KV cache and activations is included. For deployments where cost and scalability matter more than maximum accuracy, quantised models are often the right choice.
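
The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate under stated assumptions: it counts weights only, uses decimal gigabytes, and the 20% headroom reserved for KV cache and activations is a hypothetical default, not a measured value.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


def fits_on_gpu(num_params: float, bits_per_param: int, gpu_vram_gb: float,
                headroom_fraction: float = 0.2) -> bool:
    """True if the weights fit while leaving headroom_fraction of VRAM free.

    headroom_fraction is an illustrative assumption; real headroom depends
    on batch size, context length, and the serving framework.
    """
    usable_gb = gpu_vram_gb * (1 - headroom_fraction)
    return weight_memory_gb(num_params, bits_per_param) <= usable_gb


fp16_gb = weight_memory_gb(70e9, 16)  # 16-bit weights for a 70B model
int4_gb = weight_memory_gb(70e9, 4)   # the same model quantised to 4-bit
print(fp16_gb, int4_gb)
print(fits_on_gpu(70e9, 4, 40), fits_on_gpu(70e9, 4, 80))
```

With headroom included, the 4-bit model does not fit comfortably on one 40 GB GPU but does on one 80 GB GPU, so sizing on weights alone tends to be optimistic.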

For very large models, tensor parallelism splits the model across multiple GPUs. Azure NC A100 v4 series VMs support NVLink between GPUs for efficient multi-GPU model serving. Configure this at the serving layer (vLLM, Triton) rather than the infrastructure layer.
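
To make the idea concrete, here is a toy illustration of column-wise tensor parallelism in plain Python. It is only a sketch of the principle: each "shard" stands in for one GPU holding a slice of a layer's weight matrix, and the final concatenation stands in for the all-gather step. In practice the serving layer does this for you (vLLM, for example, exposes a `tensor_parallel_size` setting); the function names below are hypothetical.

```python
def matvec(x, weight_columns):
    """Compute x @ W, where W is given as a list of its columns."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in weight_columns]


def split_columns(weight_columns, num_shards):
    """Give each shard (stand-in for a GPU) a contiguous slice of columns."""
    if len(weight_columns) % num_shards != 0:
        raise ValueError("output dimension must divide evenly across shards")
    size = len(weight_columns) // num_shards
    return [weight_columns[i * size:(i + 1) * size] for i in range(num_shards)]


def tensor_parallel_matvec(x, weight_columns, num_shards):
    """Each shard computes its slice of the output; results are concatenated."""
    shards = split_columns(weight_columns, num_shards)
    partial_outputs = [matvec(x, shard) for shard in shards]
    # Stand-in for the all-gather that combines per-GPU outputs.
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
w_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
print(tensor_parallel_matvec(x, w_cols, num_shards=2))
```

Each shard stores only its share of the weights, which is why splitting lets a model exceed the VRAM of any single GPU. The cost is the communication step, which is where a fast GPU interconnect matters.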

Batch inference vs real-time inference

The throughput-latency trade-off is fundamental to AI scaling.

Real-time inference (synchronous, sub-second response) requires maintaining warm model instances that can respond immediately. This means always-on compute: minimum replicas must be non-zero, and cold-start time must be avoided. This is expensive but necessary for user-facing applications.

Batch inference (asynchronous, minutes to hours) allows GPU utilisation to be maximised through dynamic batching: requests are queued and processed together, filling the GPU to maximum utilisation. This dramatically reduces cost-per-inference at the expense of latency. For workloads like document processing, image classification, or bulk embedding generation, batch inference is the right model.
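
A minimal sketch of the batching policy described above, assuming a simple rule: dispatch a batch when it is full, or when a new request arrives after the oldest queued request has already waited longer than a maximum wait. The function name and parameters are hypothetical, and real inference servers implement this with timers rather than by inspecting arrival times after the fact.

```python
def form_batches(arrival_ms, max_batch_size, max_wait_ms):
    """Group request arrival times (in milliseconds) into batches.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait_ms after the batch's first request.
    """
    batches = []
    current = []
    for t in sorted(arrival_ms):
        if current and t - current[0] > max_wait_ms:
            batches.append(current)  # oldest request waited long enough
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full, dispatch immediately
            current = []
    if current:
        batches.append(current)      # flush whatever is left
    return batches


arrivals = [0, 10, 20, 30, 500, 510]
print(form_batches(arrivals, max_batch_size=3, max_wait_ms=50))
```

Raising `max_wait_ms` or `max_batch_size` produces fuller batches and better GPU utilisation, at the price of each request waiting longer before it is processed.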

Azure Machine Learning batch endpoints are designed for this pattern: trigger a batch job with a list of inputs, receive results when processing is complete. Batch endpoints scale to zero between jobs, eliminating idle compute cost.

Autoscaling for inference endpoints

Managed online endpoints in Azure ML support autoscaling via Azure Monitor autoscale rules. The key metrics for inference scaling:

Requests per second (RPS): Scale out when RPS per instance exceeds the target throughput per instance. Determine the target by load testing a single instance to find its saturation point.

GPU utilisation: Scale out when GPU utilisation stays above 80% over a sustained window rather than on a momentary spike. Scale in when it stays below 30%.

Queue depth: If requests are queuing at the endpoint, scale out immediately. A growing queue means the current instance count cannot sustain incoming request volume.

Minimum replica count should be non-zero for production endpoints where cold start time is unacceptable. Model loading can take 30-120 seconds for large models, so an endpoint that is allowed to scale to zero under low load will time out or return errors until a new instance has loaded the model.

Set a maximum replica count that reflects the GPU capacity you have provisioned (or can provision) in your Azure subscription. GPU quota is allocated per region, and increases have to be requested, so verify your quota before setting a high maximum replica count.
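
The scaling rules in this section can be combined into a single decision function. The sketch below is illustrative only: Azure Monitor autoscale rules are configured declaratively rather than written as code, and the class, function, and default values here are hypothetical apart from the 80% and 30% thresholds taken from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class EndpointMetrics:
    replicas: int           # instances currently serving
    total_rps: float        # requests per second across the endpoint
    gpu_utilisation: float  # 0.0 to 1.0, averaged over a sustained window
    queue_depth: int        # requests waiting at the endpoint


def desired_replicas(metrics, target_rps_per_replica,
                     min_replicas=1, max_replicas=8,
                     scale_out_gpu=0.80, scale_in_gpu=0.30):
    """Return the replica count the rules above would ask for."""
    # Replicas needed to keep per-instance RPS at or below the load-tested target.
    needed_for_rps = math.ceil(metrics.total_rps / target_rps_per_replica)
    overloaded = (
        metrics.queue_depth > 0
        or metrics.gpu_utilisation > scale_out_gpu
        or needed_for_rps > metrics.replicas
    )
    if overloaded:
        desired = max(metrics.replicas + 1, needed_for_rps)
    elif metrics.gpu_utilisation < scale_in_gpu and needed_for_rps < metrics.replicas:
        desired = metrics.replicas - 1  # scale in one step at a time
    else:
        desired = metrics.replicas
    # Clamp to the non-zero minimum and the quota-backed maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(EndpointMetrics(2, 50.0, 0.5, 0), target_rps_per_replica=20))
```

Scaling out jumps straight to the count the request rate requires, while scaling in removes one replica at a time. That asymmetry is a deliberate choice in this sketch, because under-provisioning a GPU endpoint costs a model load before capacity recovers.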

Content delivery and caching for inference outputs

For workloads where the same prompt generates the same output (semantic search, product descriptions, Q&A from a fixed knowledge base), caching inference outputs reduces API calls and cost.

Azure Cache for Redis is the natural caching layer for inference outputs. Cache the response for a given input (normalised to remove whitespace variations) with a TTL that reflects how quickly the underlying model or data changes. A product description generated from a fixed product catalogue can be cached for days. A news summary needs a much shorter TTL.
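
A minimal sketch of that caching pattern follows. To keep it self-contained it uses an in-memory dictionary with expiry as a stand-in for Redis; in a real deployment the `get` and `set` calls would go to the Redis client instead. All names here (`cache_key`, `TTLCache`, `cached_inference`) are hypothetical.

```python
import hashlib
import time


def cache_key(prompt, model):
    """Build a key from the model name and the whitespace-normalised prompt."""
    normalised = " ".join(prompt.split())
    digest = hashlib.sha256(f"{model}\n{normalised}".encode("utf-8")).hexdigest()
    return f"inference:{digest}"


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, self._clock() + ttl_seconds)


def cached_inference(prompt, model, cache, generate, ttl_seconds):
    """Return a cached response if present, otherwise call generate and cache it."""
    key = cache_key(prompt, model)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = generate(prompt)
    cache.set(key, result, ttl_seconds)
    return result
```

Including the model name in the key is a design choice worth keeping: it means a model upgrade naturally invalidates old responses instead of serving output from the previous version until the TTL expires.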

For embedding-based retrieval-augmented generation (RAG) systems, precompute and store embeddings for your document corpus in Azure AI Search (formerly Azure Cognitive Search) or a vector database. Recompute only when documents change, not on every query.
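
One way to recompute only when documents change is to store a content hash alongside each embedding and compare it on every indexing run. The sketch below assumes documents are held as a mapping of ID to text and that the stored hashes come from wherever you keep index metadata. The function names are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_embed(documents, stored_hashes):
    """IDs of documents that are new or whose text changed since last embedding.

    documents: {doc_id: current text}
    stored_hashes: {doc_id: hash recorded when the embedding was computed}
    """
    return sorted(
        doc_id for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )


def stale_ids(documents, stored_hashes):
    """IDs with stored embeddings whose documents no longer exist."""
    return sorted(set(stored_hashes) - set(documents))


docs = {"a": "alpha", "b": "beta"}
stored = {"a": content_hash("alpha"), "b": content_hash("old beta"),
          "c": content_hash("deleted")}
print(documents_to_embed(docs, stored), stale_ids(docs, stored))
```

Only document "b" is sent for re-embedding, and "c" is flagged for removal from the index, so embedding cost tracks the rate of change in the corpus rather than its size.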

Cost management for AI workloads

AI inference is expensive at scale. The main cost levers:

Model selection: Smaller models are faster and significantly less expensive. GPT-4o mini is priced at approximately one-fifteenth of GPT-4o per token. For tasks where a smaller model achieves acceptable quality, the cost difference is material.

Provisioned throughput: For consistent, high-volume inference workloads on Azure OpenAI, provisioned throughput units (PTUs) are typically cheaper than consumption-based tokens above a certain request rate. Run the PTU calculator with your actual request volume.

Scale to zero for non-production: Development and staging inference endpoints should scale to zero outside business hours. A GPU instance running idle overnight generates the same compute cost as one under load. Use Azure Automation runbooks to stop non-production endpoints on a schedule.

Monitor token consumption: Set alerts on token consumption for Azure OpenAI. A bug in prompt construction (accidentally sending a large context on every request) or an unexpected increase in usage will appear as a sudden spike in token consumption before it becomes a large invoice.
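
The provisioned-throughput lever above comes down to a break-even comparison, sketched here. Every price and PTU count below is a placeholder chosen to make the arithmetic easy to follow, not an Azure list price; substitute the figures from your own pricing and capacity estimate. The sketch also assumes the chosen PTU count can actually serve the peak load, which it does not check.

```python
def monthly_consumption_cost(input_tokens, output_tokens,
                             input_price_per_million, output_price_per_million):
    """Pay-per-token cost for one month of traffic."""
    return ((input_tokens / 1e6) * input_price_per_million
            + (output_tokens / 1e6) * output_price_per_million)


def cheaper_option(input_tokens, output_tokens,
                   input_price_per_million, output_price_per_million,
                   ptu_count, ptu_price_per_month):
    """Compare consumption pricing with a fixed provisioned commitment."""
    consumption = monthly_consumption_cost(
        input_tokens, output_tokens,
        input_price_per_million, output_price_per_million)
    provisioned = ptu_count * ptu_price_per_month
    choice = "provisioned" if provisioned < consumption else "consumption"
    return choice, consumption, provisioned


# Placeholder prices: 1.00 per million input tokens, 4.00 per million
# output tokens, 100.00 per PTU per month, 50 PTUs.
low_volume = cheaper_option(2_000_000_000, 500_000_000, 1.0, 4.0, 50, 100.0)
high_volume = cheaper_option(4_000_000_000, 1_000_000_000, 1.0, 4.0, 50, 100.0)
print(low_volume)
print(high_volume)
```

With these placeholder numbers the fixed commitment only pays off at the higher volume, which is the pattern to look for when you run the comparison with your real request volume.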

Where Critical Cloud comes in

AI workload operations are different from traditional web service operations. The observability signals, the cost drivers, and the failure modes are all specific to the AI inference layer. We operate Azure AI infrastructure for technology-led businesses, with Datadog monitoring covering GPU utilisation, inference latency, token consumption, and endpoint scaling events as live signals. As the world's first Powered by Datadog accredited partner, we bring the same observability rigour to AI workloads that we apply to the rest of the Azure estate. See how Critical Support works.