Observability and AI: a two-way street

Most of the conversation about AI and observability focuses on one direction: using AI to improve observability through better anomaly detection, smarter alerting and faster root cause analysis. That's real and it matters, and I'll come back to it. But the more interesting part of the relationship is the other direction: as AI becomes embedded in production workloads, it creates entirely new observability challenges that the field is only beginning to solve.

Direction one: AI making observability better

I've written separately about AIOps, so I'll keep this brief. The current generation of AI-assisted operations tools is genuinely useful in a way the previous generation wasn't. Alert correlation now reduces noise rather than adding to it, anomaly detection has false positive rates low enough that engineers trust it, and log pattern clustering surfaces issues without anyone having to write queries by hand.

The common thread: all of this works well when the underlying observability data is clean, consistent and well-structured. The AI doesn't compensate for poor instrumentation; it amplifies good instrumentation. If you have well-structured metrics, traces and logs from a platform like Datadog, AI makes that data substantially more actionable. If you have fragmented, inconsistently labelled telemetry, AI surfaces fragmented, inconsistently labelled findings. The garbage-in rule still applies.

Direction two: AI in production changing what observability needs to do

Here is where it gets more interesting. Organisations are deploying AI workloads into production: inference endpoints, embedding services, LLM-backed features, vector search, model fine-tuning pipelines. These workloads behave differently from the systems we've built observability practice around, and standard monitoring doesn't capture the failure modes that matter.

A conventional service fails in ways that are relatively easy to observe: it goes down, it gets slow, it starts erroring. An AI system can fail in ways that are much harder to detect: it returns responses that are plausible but wrong, its quality degrades gradually as a model becomes stale, it produces inconsistent results under conditions that are hard to reproduce, it costs more per request in ways that aren't visible until the bill arrives.

Traditional observability metrics (latency, error rate, throughput) don't tell you whether an LLM is hallucinating more than it was last week. They don't tell you whether retrieval quality in a RAG pipeline has degraded. They don't tell you whether your model's responses are drifting in a direction that users are noticing but your dashboards aren't.

What good AI observability looks like

The emerging practice treats AI systems as first-class observability subjects with their own telemetry requirements:

Token economics and cost per inference. What each inference costs, how that varies across request types, where the expensive paths are and whether those paths are doing proportionally more work for the extra cost.

Quality signals beyond latency. User feedback signals, downstream conversion where measurable, consistency across repeated similar queries. These are harder to capture than a latency histogram but they're the metrics that tell you whether the system is actually working.

Model versioning and drift detection. When a model changes (whether through retraining, provider update or configuration change) the observability stack should surface whether behaviour changed and in what direction.

Retrieval quality in RAG systems. For retrieval-augmented pipelines, whether the retrieved context is relevant, whether it's being used correctly, and whether freshness of the knowledge base is degrading.
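
To make the first two items more concrete, here is a minimal sketch of what the raw calculations might look like, assuming you already log token counts and responses per request. Everything in it is illustrative: the prices in PRICE_PER_1K, the InferenceRecord fields and the function names are hypothetical, not part of any vendor's API, and word-set overlap is only a crude stand-in for a real consistency measure.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from statistics import mean

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


@dataclass
class InferenceRecord:
    """One logged inference call (field names are assumptions for this sketch)."""
    request_type: str
    model_version: str
    input_tokens: int
    output_tokens: int


def cost_usd(rec: InferenceRecord) -> float:
    """Cost of a single inference from its token counts."""
    return (rec.input_tokens / 1000 * PRICE_PER_1K["input"]
            + rec.output_tokens / 1000 * PRICE_PER_1K["output"])


def mean_cost_by_request_type(records: list[InferenceRecord]) -> dict[str, float]:
    """Average cost per inference, grouped by request type."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for rec in records:
        buckets[rec.request_type].append(cost_usd(rec))
    return {rtype: mean(costs) for rtype, costs in buckets.items()}


def consistency_score(responses: list[str]) -> float:
    """Mean pairwise Jaccard overlap of the word sets of each response.

    1.0 means every response used the same vocabulary; 0.0 means no overlap.
    """
    word_sets = [set(r.lower().split()) for r in responses]
    pairs = list(combinations(word_sets, 2))
    if not pairs:
        return 1.0  # zero or one response: nothing to disagree with
    return mean(len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs)


records = [
    InferenceRecord("summarise", "v1", input_tokens=1000, output_tokens=1000),
    InferenceRecord("chat", "v1", input_tokens=200, output_tokens=100),
]
costs = mean_cost_by_request_type(records)
score = consistency_score(["the cat sat", "the cat ran"])
```

In practice these values would be emitted as custom metrics tagged with request type and model version, so that a change in either shows up next to latency and error rate instead of on the invoice.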

Datadog's AI observability capabilities are developing in exactly this direction, and it's one of the reasons we're building our AI Factory service on top of it. Instrumenting AI workloads properly requires the same rigour as instrumenting any other production system, and the cost of not doing it is invisible degradation rather than visible failure.

The organisational point

The teams deploying AI into production mostly came up through a world where observability meant infrastructure and application metrics. The tooling and practice for instrumenting AI systems are newer and less familiar. The risk is that AI workloads go into production with the operational discipline of a prototype: shipping fast, monitoring lightly, and discovering problems reactively.

The relationship between AI and observability is a genuine two-way street. AI makes observability better. AI in production requires observability to evolve. Both are true simultaneously, and the organisations that treat them as a coupled problem rather than two separate ones will be better positioned as AI workloads become a larger part of what runs in production.
