Datadog Pricing and Cost Optimisation: A Practitioner's Guide

Datadog's pricing model, and why it matters

Datadog is a pay-per-usage platform. You do not pay a flat rate for access to the software. You pay for what you send, what you index, what you retain, how many custom metric series you generate, how many traces you store. Each product (logs, APM, infrastructure, synthetics, security, cloud cost management) has its own usage dimension, and those dimensions each have a cost that grows when the volume grows.

That model is well aligned to value, in principle. If your observability platform is delivering more coverage because your business is growing, some cost growth is appropriate. The problem is that telemetry often grows faster than either business or engineering activity can explain, and it grows in ways that are largely invisible until someone looks for them. A developer adds a tag that multiplies a single metric into millions of series. Debug logging left enabled after a production incident ships for six months before anyone notices. A Kubernetes cluster's container logs get shipped in full, indexed at default retention, because that was the setup when the cluster was three nodes and no one revisited it when it became thirty.

The result is a Datadog bill that finance and engineering both struggle to explain, attributed to no particular decision and owned by no particular team. That is the underlying structure of nearly every Datadog cost problem I have encountered. It is not a pricing model problem. It is a governance problem.

The most common cost drivers

Before you can govern Datadog cost, you need to understand what is actually driving it. These are the categories responsible for the majority of unexpected spend, in roughly the order we encounter them.

Logs

Logs are, by some margin, the most common source of Datadog cost surprises. They are also the most volume-sensitive product: every service you add, every container you run, every increase in traffic adds log volume. The default configuration (ship everything, index everything, retain at the default period) works fine at low scale. It becomes expensive fast as infrastructure grows.

The most important thing to understand about Datadog log pricing is the distinction between ingestion, indexing and retention. Ingestion is the act of sending a log event to Datadog. Indexing is making it queryable: searchable in Log Explorer, available for alerting, visible in dashboards. Retention is how long you keep it queryable. Not all ingested logs need to be indexed. Not all indexed logs need the same retention period. The decision about which logs get indexed and for how long is the single most impactful cost control in the logs product, and most teams have never explicitly made it. Their indexes are on defaults and their retention is whatever Datadog set when they first configured the account.
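As a rough illustration of why the stages are worth separating, the sketch below models a month of log spend with ingestion and indexing metered independently. The unit prices are hypothetical placeholders rather than Datadog list prices, and the function name is ours; substitute the rates from your own contract.

```python
def monthly_log_cost(ingested_gb, indexed_events_millions,
                     price_per_gb_ingested, price_per_million_indexed):
    """Estimate monthly log spend from two separately metered dimensions.

    The per-million indexing price is assumed to already reflect the
    retention period chosen for the index (longer retention, higher price).
    """
    ingestion = ingested_gb * price_per_gb_ingested
    indexing = indexed_events_millions * price_per_million_indexed
    return ingestion + indexing


# Hypothetical prices, for illustration only.
INGEST_PRICE = 0.5   # per GB ingested
INDEX_PRICE = 2.0    # per million events indexed

# Same ingestion volume, two different indexing decisions.
index_everything = monthly_log_cost(5_000, 4_000, INGEST_PRICE, INDEX_PRICE)
index_a_quarter = monthly_log_cost(5_000, 1_000, INGEST_PRICE, INDEX_PRICE)
print(index_everything, index_a_quarter)
```

With these placeholder numbers, the ingestion component is identical in both cases; the whole difference comes from the indexing decision, which is the point of the paragraph above.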

Common log cost culprits: Kubernetes and container stdout/stderr at full verbosity, verbose application output at INFO or DEBUG level in production, health check and readiness probe responses, load balancer access logs retained at fifteen days across every index. Each of these is individually addressable. Together they often account for the majority of log spend.

We have a full guide to Datadog logs pricing and log cost control that covers ingestion, indexing and retention in detail, along with the specific controls (exclusion filters, retention tuning, Observability Pipelines) that address each pattern.

Custom metric cardinality

Custom metric costs in Datadog are billed by the number of distinct time series you generate, not by the number of metric names. This is the detail that catches nearly every team out.

A metric name is just a label. A time series is a metric name combined with a specific set of tag values. Every unique combination of tag values creates a separate series, so cardinality multiplies across tags rather than adding up. A metric with five tags, each taking ten possible values, creates 10⁵ potential series (a hundred thousand), not fifty. Add a tag whose values are unbounded (user_id, request_id, pod name, session identifier) and cardinality explodes.

The canonical example is a request duration metric. With service and endpoint tags it is entirely reasonable: twenty services and fifty endpoints give at most one thousand series. Add a status_code tag with ten distinct values and you are at ten thousand. Add user_id and the count scales with your user base, with at least one series per active user. At any meaningful scale, a single high-cardinality tag on a single metric can account for millions of series and a disproportionate share of your custom metric allocation.
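The arithmetic is easy to reproduce. The sketch below computes the upper bound on series as the product of distinct values per tag. The status code count (ten) and the user count (fifty thousand) are assumptions chosen for illustration; real series counts are usually below the bound, because only combinations that are actually emitted create a series.

```python
from math import prod


def max_series(tag_cardinalities):
    """Upper bound on time series: the product of distinct values per tag."""
    return prod(tag_cardinalities.values())


base = {"service": 20, "endpoint": 50}
with_status = {**base, "status_code": 10}        # assumes ten distinct codes
with_user = {**with_status, "user_id": 50_000}   # hypothetical user count

print(max_series(base))         # 1000
print(max_series(with_status))  # 10000
print(max_series(with_user))    # 500000000
```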

This is not a theoretical concern. It happens constantly, because the instrumentation decision and the cost consequence are separated by the billing cycle. A developer instruments a new service, adds useful-sounding tags, ships it. The metric series count accumulates. The usage report at the end of the month shows a number that nobody can explain to their line manager.

The fix is tag governance: knowing which tags are acceptable on custom metrics, what cardinality they carry, and who approves tags with unbounded or large value sets before they ship. It is also worth knowing that Datadog offers Metrics without Limits, a capability that lets you keep ingesting at high cardinality while indexing only the tag combinations you actually query, which addresses the use cases where you genuinely need high-cardinality instrumentation.

More on this in our guide to Datadog metrics cost optimisation and cardinality control.

APM and tracing volume

APM costs are driven by the volume of traces you ingest and retain. The most common cause of unexpectedly high APM spend is applying the same sampling rate to every service, environment and endpoint regardless of traffic volume or operational value.

A health check endpoint called every ten seconds by your load balancer does not need full-fidelity tracing. A low-risk background job processing internal queue messages does not need the same trace retention as a customer-facing checkout flow. Applying production sampling rates to staging environments, or failing to exclude monitoring traffic from trace ingestion, adds volume that serves no operational purpose.

The practical control here is error-biased and latency-biased retention: keep traces for requests that errored, requests that were slow, and a statistical sample of successful, normal-latency requests. This gives you the investigation data you actually need while eliminating the vast majority of routine traces. Set environment-appropriate rates; production is not the same as staging.
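The decision logic behind error-biased and latency-biased retention can be sketched in a few lines. This is an illustration of the rule, not Datadog's retention filter configuration; the TraceSummary type, the 500 ms threshold and the 1% baseline rate are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    is_error: bool
    duration_ms: float


def keep_trace(trace, slow_threshold_ms=500.0, baseline_rate=0.01,
               rng=random.random):
    """Keep every errored or slow trace, plus a small random sample of the rest."""
    if trace.is_error:
        return True
    if trace.duration_ms >= slow_threshold_ms:
        return True
    # Routine trace: keep only a statistical sample.
    return rng() < baseline_rate


print(keep_trace(TraceSummary(is_error=True, duration_ms=12.0)))    # True
print(keep_trace(TraceSummary(is_error=False, duration_ms=900.0)))  # True
```

Passing the random source in as a parameter keeps the rule testable; in staging you would typically lower baseline_rate rather than change the rule itself.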

Kubernetes and container scale

Kubernetes makes Datadog cost governance harder in two specific ways. First, it generates large volumes of infrastructure telemetry (container logs, control plane events, pod-level metrics) from components that are interesting to inspect occasionally but not to monitor continuously. Second, it creates ephemeral infrastructure: pods come and go, host equivalents are counted, and the agent discovers and monitors everything in scope unless you explicitly configure it not to.

The default Kubernetes agent configuration is deliberately broad: discover everything, monitor everything. That is the right default for initial setup. It is not the right configuration for a production environment at scale, where it means you are monitoring hundreds of ephemeral pods across dozens of namespaces at full fidelity, including CI workloads, build agents and namespaces whose components have no operational significance.

The fix requires a deliberate coverage review: which namespaces matter, which workloads need container log collection, which control plane components warrant ongoing monitoring versus occasional inspection. This is not about reducing coverage; it is about making coverage decisions consciously rather than accepting the default at every scale.
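One way to make the review concrete is to keep an explicit decision per namespace and flag anything running in the cluster that has never been reviewed. The sketch below assumes a hand-maintained policy dictionary; the namespace names and field names are hypothetical, and this is not Datadog Agent configuration syntax.

```python
# Hypothetical coverage policy: one explicit decision per namespace.
COVERAGE_POLICY = {
    "checkout": {"collect_logs": True, "monitor": True},
    "payments": {"collect_logs": True, "monitor": True},
    "ci-runners": {"collect_logs": False, "monitor": False},
}


def unreviewed_namespaces(observed, policy=COVERAGE_POLICY):
    """Return namespaces seen in the cluster that have no explicit decision."""
    return sorted(set(observed) - set(policy))


observed = ["checkout", "payments", "ci-runners", "build-agents", "sandbox"]
print(unreviewed_namespaces(observed))  # ['build-agents', 'sandbox']
```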

Retention defaults and product sprawl

Two underrated cost drivers that rarely get their own section in vendor documentation.

Retention defaults: Datadog's default log retention periods exist for a reason, but they are not calibrated to your operational requirements or compliance obligations. Most teams find, when they actually look, that the default retention on most of their log indexes is longer than any incident investigation they have ever conducted. Applying compliance-grade retention to operational logs because it was the default wastes money on data nobody ever queries.

Product sprawl: capabilities enabled during an evaluation and never scoped back. Synthetics tests running at frequencies set during initial setup without review. Session replay enabled across all user journeys when only the checkout flow genuinely needs it. Each represents a line item that was intentional at some point and has since been running unreviewed. A product-by-product audit (asking, for each enabled capability, whether it is delivering value at its current scope and configuration) is one of the most straightforward cost reviews you can run.

What Observability FinOps means in practice

Cloud FinOps (the practice of applying financial operations discipline to infrastructure spend) is now reasonably well established. Teams know they should attribute cloud cost to services and teams, set up anomaly alerts, review reserved instance utilisation and engage in renewal planning. Most organisations with meaningful cloud spend have some version of this practice, even if it is not called FinOps.

Observability FinOps applies the same discipline to the data layer. The logs, metrics and traces your systems generate are not free to collect, index and retain. The platform that processes them is a significant operational cost, and it behaves differently from infrastructure cost in ways that make standard FinOps practices insufficient on their own.

The key difference is that observability cost is driven by behaviour, not by provisioning. A developer adding a high-cardinality tag to a metric increases Datadog cost without touching any cloud resource. A verbose logging configuration that ships debug output in production adds cost without any infrastructure change. These costs do not appear in AWS Cost Explorer or the Azure billing portal. They are invisible to the standard cloud FinOps toolchain.

Observability FinOps closes that gap. It means having attribution: knowing which teams and services generate which proportion of your Datadog spend. It means having governance: policies and standards that prevent cardinality explosions and retention defaults from going unreviewed. It means having a cadence: weekly usage checks, monthly cost reviews with attribution data, quarterly renewal readiness assessments. And it means treating renewal as a planned event, not a crisis.
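Attribution, at its simplest, is a group-by over usage rows. The sketch below assumes you have already exported cost per row with a team label; the row shape and the "unattributed" bucket name are our assumptions, not a Datadog export format.

```python
from collections import defaultdict


def spend_share(usage_rows):
    """Return each team's share of total cost, largest first."""
    totals = defaultdict(float)
    for row in usage_rows:
        # Rows with no team label are surfaced rather than silently dropped.
        totals[row.get("team") or "unattributed"] += row["cost"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {team: cost / grand_total for team, cost in ranked}


rows = [
    {"team": "checkout", "cost": 400.0},
    {"team": "checkout", "cost": 200.0},
    {"team": "search", "cost": 300.0},
    {"team": None, "cost": 100.0},
]
print(spend_share(rows))
```

The size of the unattributed bucket is itself a useful governance metric: it shows how much spend still has no owner.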

The most expensive thing about a Datadog renewal is entering it without data. Not because Datadog is inflexible, but because without usage attribution and growth trajectory data, procurement cannot commit with confidence at the right level. They either over-commit to avoid overage risk or under-commit and face on-demand pricing. Neither is optimal.
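The trade-off can be modelled simply. The sketch below assumes a contract in which committed usage is paid for whether or not it is consumed, and anything above the commitment is billed at a higher on-demand rate. Both rates are hypothetical, and real contracts differ in how overage is measured.

```python
def period_cost(actual_usage, committed_usage, committed_rate, on_demand_rate):
    """Cost for one period: the full commitment, plus overage at the on-demand rate."""
    overage = max(0, actual_usage - committed_usage)
    return committed_usage * committed_rate + overage * on_demand_rate


# Hypothetical rates per unit of usage.
COMMITTED_RATE = 1.0
ON_DEMAND_RATE = 1.5

actual = 1_000
for commitment in (700, 1_000, 1_400):
    print(commitment, period_cost(actual, commitment, COMMITTED_RATE, ON_DEMAND_RATE))
```

With these placeholder rates, committing at actual usage costs 1000, under-committing at 700 costs 1150, and over-committing at 1400 costs 1400. The right commitment level is only knowable if you know the actual usage and where it is heading.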

We have written in detail about the Datadog FinOps operating model: what Observability FinOps means structurally, which functions need to be involved (engineering, platform, SRE, finance, procurement, security), and what the governance cadence should look like. The short version: no single team can own this effectively alone, and the cadence is weekly, monthly, quarterly.

The five levers that actually move the cost

There are ten or more potential controls across the Datadog platform. These five consistently make the largest difference.

Log indexing and retention, tuned per index

Most Datadog accounts have one or two log indexes with everything in them at the same retention period. The move to make is: separate indexes by log type (application logs, infrastructure logs, security logs), set retention appropriate to the operational purpose of each, and apply exclusion filters that drop zero-value events before they consume indexing budget. Health check responses, debug output, kube-proxy noise: none of these need to be in your indexed logs. Getting this right is a permanent cost reduction. Unlike a negotiated commitment, it does not expire at the next renewal; it stays in place unless you deliberately reverse it.
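The routing decision can be sketched as a small function: exclude zero-value events first, then assign the rest to an index with its own retention. The event fields, index names and retention values below are hypothetical; in Datadog itself these decisions are expressed as index filters, exclusion filters and per-index retention settings rather than code.

```python
# Hypothetical retention per index, in days.
RETENTION_DAYS = {"security": 30, "application": 15, "infrastructure": 3}


def is_zero_value(event):
    """Events that nobody queries or alerts on: debug output, probes, proxy noise."""
    return (
        event.get("level") == "DEBUG"
        or event.get("path") in {"/healthz", "/ready"}
        or event.get("source") == "kube-proxy"
    )


def route(event):
    """Return (index, retention_days), or None if the event is excluded."""
    if is_zero_value(event):
        return None
    log_type = event.get("type")
    if log_type not in RETENTION_DAYS:
        log_type = "application"  # explicit fallback for unclassified events
    return log_type, RETENTION_DAYS[log_type]


print(route({"type": "security", "level": "WARN"}))     # ('security', 30)
print(route({"type": "application", "level": "DEBUG"}))  # None
```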

Tag governance on custom metrics

Define which tags are permitted on custom metrics. Define what acceptable cardinality looks like. Build a review step for new instrumentation that adds tags with large or unbounded value sets. This sounds like bureaucracy, but in practice it is a small addition to whatever review process you already have for shipping new services, and it prevents the most common cause of custom metric cost growth at the point where it is cheapest to prevent it.
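A review step of this kind can be as light as a check that runs alongside code review. The sketch below assumes a hand-maintained allow-list and a list of tags known to be unbounded; both lists and the function name are illustrative.

```python
# Hypothetical governance lists, maintained by the platform team.
ALLOWED_TAGS = {"service", "env", "endpoint", "status_code", "region"}
UNBOUNDED_TAGS = {"user_id", "request_id", "session_id", "pod_name"}


def review_tags(metric_name, tags):
    """Return human-readable findings for tags that need attention."""
    findings = []
    for tag in sorted(tags):
        if tag in UNBOUNDED_TAGS:
            findings.append(
                f"{metric_name}: '{tag}' is unbounded and needs explicit approval"
            )
        elif tag not in ALLOWED_TAGS:
            findings.append(f"{metric_name}: '{tag}' is not on the allow-list")
    return findings


for finding in review_tags("request.duration", {"service", "endpoint", "user_id"}):
    print(finding)
```

An empty result means the instrumentation can ship without further discussion, which is what keeps the step cheap for the common case.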

APM sampling by service, environment and endpoint type

Configure head-based and tail-based sampling deliberately. Set environment-specific rates; staging does not need production-level trace retention. Exclude health checks, readiness probes and monitoring traffic from trace ingestion. Apply error-biased and latency-biased retention rules so that the traces you keep are the ones that matter for investigation. Each of these changes is a permanent reduction in the traces you pay to store.
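A sampling policy keyed by environment and endpoint type might be sketched as a lookup table with explicit exclusions. The categories and rates below are assumptions for illustration; in practice you would express the same policy in your tracer or Agent sampling configuration.

```python
# Endpoint types that should never generate ingested traces.
EXCLUDED_ENDPOINT_TYPES = {"health_check", "readiness_probe", "monitoring"}

# Hypothetical rates, keyed by (environment, endpoint type).
SAMPLE_RATES = {
    ("production", "customer_facing"): 0.20,
    ("production", "background"): 0.05,
    ("staging", "customer_facing"): 0.02,
}
DEFAULT_RATE = 0.01


def sample_rate(environment, endpoint_type):
    """Resolve the sampling rate for a request; excluded traffic gets 0.0."""
    if endpoint_type in EXCLUDED_ENDPOINT_TYPES:
        return 0.0
    return SAMPLE_RATES.get((environment, endpoint_type), DEFAULT_RATE)


print(sample_rate("production", "customer_facing"))  # 0.2
print(sample_rate("production", "health_check"))     # 0.0
print(sample_rate("staging", "background"))          # 0.01
```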

Observability Pipelines for pre-ingest control

Datadog Observability Pipelines let you intercept, route, filter and transform telemetry before it reaches its destination. For cost governance, the primary use is pre-ingest filtering: dropping events that would not be queried or alerted on before they consume indexing budget. The distinction between pipeline filtering and index exclusion filters matters: pipeline filtering prevents ingestion from occurring at all, which means the event never touches the ingestion meter. Index exclusion filters reduce indexing cost but the ingestion has already been counted.
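The difference between the two controls shows up in which meter each one moves. The sketch below is a simplified accounting model with hypothetical event counts, not a representation of Datadog's billing system.

```python
def metered_volume(total_events, noise_events, control):
    """Return the events counted at each meter under a given control."""
    useful = total_events - noise_events
    if control == "none":
        return {"ingested": total_events, "indexed": total_events}
    if control == "exclusion_filter":
        # Noise is still ingested; it is only kept out of the index.
        return {"ingested": total_events, "indexed": useful}
    if control == "pipeline":
        # Noise is dropped before it reaches the platform at all.
        return {"ingested": useful, "indexed": useful}
    raise ValueError(f"unknown control: {control}")


for control in ("none", "exclusion_filter", "pipeline"):
    print(control, metered_volume(1_000_000, 400_000, control))
```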

Pipelines also let you route telemetry to different destinations by type (operational logs to indexed Datadog logs, compliance archives to S3, security-relevant sources to a SIEM) and redact sensitive data before it enters the platform. For high-volume environments, the pre-ingest control that pipelines provide is one of the most cost-effective levers available. We cover this in depth in the Observability Pipelines cost control guide.

Cloud Cost Management for the other bill

There are two dimensions to cost for any team running Datadog on AWS or Azure: the cloud infrastructure cost and the Datadog platform cost. Cloud Cost Management closes the gap between them by bringing AWS and Azure spend data into the same platform as your observability data. Engineers can see cloud cost alongside performance metrics, attribute spend to services and deployments, and set anomaly alerts when cloud spend deviates from expected patterns, without switching tools.

This matters for cost governance because the two cost dimensions interact. A decision to run more Kubernetes nodes adds cloud cost and Datadog infrastructure cost simultaneously. Understanding both in the same context makes the cost of scaling decisions visible to platform teams at the point where they can act on it. More on this in the Datadog Cloud Cost Management guide.

A 30-day review framework

When a team comes to us with a Datadog cost problem, we typically structure the engagement across four weeks. This is not the only way to approach it, but it is the one that consistently produces both quick wins and durable governance.

Week one: build the baseline. Export usage reports across every product. Map spend to usage dimensions: log ingestion, indexed events, custom metric series, APM spans, host counts. Build the ground truth that most teams do not have. The exercise itself usually surfaces the first surprises: a product no one remembered enabling, a log source with volumes nobody expected, a retention setting that has been running at the default since the account was created.

Week two: identify waste and risk. Audit high-cardinality custom metrics: which metric names have the most series, which tags are driving them. Find log sources with poor signal-to-noise ratio. Review APM sampling rates by environment and endpoint type. Identify retention settings longer than operational needs. Rank findings by cost impact. Most of the value in this step comes from the ranking: there are always a small number of items responsible for a disproportionate share of the waste.
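The ranking step can be sketched as a sort followed by a cumulative cut-off: keep adding findings, largest first, until they cover a chosen share of the total. The finding names, costs and the 80% threshold are hypothetical.

```python
def shortlist_findings(findings, coverage=0.8):
    """Return the largest findings that together cover `coverage` of total cost."""
    ranked = sorted(findings, key=lambda f: f["monthly_cost"], reverse=True)
    total = sum(f["monthly_cost"] for f in ranked)
    shortlist, running = [], 0
    for finding in ranked:
        if total == 0 or running / total >= coverage:
            break
        shortlist.append(finding["name"])
        running += finding["monthly_cost"]
    return shortlist


findings = [
    {"name": "health check logs indexed", "monthly_cost": 800},
    {"name": "user_id tag on request metric", "monthly_cost": 5000},
    {"name": "staging traced at production rate", "monthly_cost": 3500},
    {"name": "unused synthetics tests", "monthly_cost": 400},
    {"name": "debug logs in one service", "monthly_cost": 300},
]
print(shortlist_findings(findings))
```

With these placeholder figures, two of the five findings account for 85% of the total, which is the pattern the paragraph above describes.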

Week three: implement controls. Apply retention changes. Configure Observability Pipelines for the noisiest sources. Set sampling rules. Enforce tag governance for services being deployed in the current sprint. Track every change with before-and-after usage metrics, so the impact of each control is visible independently.

Week four: governance and renewal preparation. Build the usage attribution dashboard, set budget alerts and cost anomaly monitors, document the tag governance policy. Produce the renewal readiness report: actual usage, growth trajectory and the product mix that is genuinely right for the next contract period. This is the document that procurement uses to enter renewal negotiations from a position of data rather than guesswork.
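For the growth trajectory, a first approximation is to fit a compound monthly growth rate to the usage history and project it forward. The sketch below uses only the first and last observations, which is a deliberate simplification; the usage figures are hypothetical and a real report should sanity-check the projection against known upcoming changes.

```python
def project_usage(history, months_ahead):
    """Project usage forward using the compound monthly growth rate of `history`."""
    if len(history) < 2 or history[0] <= 0:
        raise ValueError("need at least two observations and a positive first value")
    periods = len(history) - 1
    monthly_growth = (history[-1] / history[0]) ** (1 / periods)
    return history[-1] * monthly_growth ** months_ahead


# Hypothetical monthly usage for one product dimension.
monthly_usage = [100.0, 110.0, 121.0]
print(round(project_usage(monthly_usage, 12), 1))
```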

There is a detailed version of this in the Datadog pricing and cost optimisation guide, along with a full cost driver table and the complete set of optimisation levers.

The five functions that need to be in the room

One of the consistent patterns in Datadog cost governance failures is a single team (usually either engineering or finance) trying to own the problem alone. It does not work, because the problem spans at least five functions.

Engineering generates the telemetry. Their instrumentation decisions (which metrics to emit, how verbose to log, which tags to add) are cost decisions. Without cost visibility, those decisions are made without relevant information.

Platform and SRE set the standards: tagging conventions, agent configuration, retention policies, Observability Pipeline configuration. They are the layer where governance policy becomes engineering reality. You cannot have effective cost governance without platform ownership of the standards.

Finance needs attribution data to understand Datadog as a line item. Without it, Datadog is a cost they cannot explain to anyone above them. With it, they can manage it as a predictable operational cost aligned to business activity.

Procurement negotiates the contract. They can only commit at the right level if they have actual usage data, growth trajectory and a view of the product mix genuinely needed for the next period. Without that data, they are committing on instinct, which means either over-committing or under-committing, both of which are expensive.

Security and compliance have requirements that directly affect log retention, routing and the sources that must be ingested. Including them in the retention and routing conversation ensures compliance obligations are met without paying for retention that exceeds them.

The Observability FinOps model is the framework for coordinating across all five. We cover the stakeholder map and governance cadence in the Datadog FinOps guide.

Should you replace Datadog to reduce costs?

This comes up in almost every cost conversation, usually from someone in finance or from an engineering leader who has been looking at alternatives and noticed lower per-unit pricing on some dimensions.

My honest answer: almost never, and almost certainly not before you have completed a structured cost review.

Replacing an observability platform is expensive, risky and time-consuming in ways that are easy to underestimate from a spreadsheet comparison. Migrations consume engineering capacity that could be shipping product. You lose historical data: Datadog's continuous profiling, trace history, long-term dashboards and alert history do not migrate cleanly to alternatives. Your on-call engineers spend weeks learning new tools instead of responding to incidents. And alternative platforms, as they are actually used at scale, frequently develop their own cost growth patterns, because the underlying problem is telemetry governance, not the platform.

In our experience, most teams that feel Datadog is too expensive have not yet done a structured cost review. When they do, they typically find enough waste to bring cost meaningfully under control, often enough to change the renewal conversation from a defensive one to a constructive one about expanding into capabilities they have not yet used.

The cases where replacement makes sense are genuine: if Datadog's product set genuinely does not match your operational needs, if your usage pattern structurally disadvantages you under Datadog's pricing model, if you have exhausted the governance controls and the cost remains unjustifiable. Those situations exist. But in ten years of operating observability platforms, I have found them to be a small minority of the cases that present as "Datadog is too expensive."

Our "why is Datadog expensive" guide covers the most common causes and the step-by-step approach we recommend before considering replacement.

When to bring in a Datadog partner

You can run through most of what is in this article without external help. Datadog's own usage metering pages, usage attribution, and Metrics Summary give you the data. The controls (retention, exclusion filters, sampling rules, pipeline configuration) are all platform-native. An experienced Datadog engineer on your team can implement them.

The cases where a partner adds the most value are:

Renewal is imminent. You have ninety days or fewer before renewal and you do not have a governance baseline or usage attribution in place. A 30-day review before renewal still leaves enough time to make a meaningful difference, but it requires focused effort and someone who has done it before.
You do not know where the cost is coming from. If your team cannot tell you which services or teams account for the majority of your Datadog spend, attribution is not in place and governance cannot start until it is. Setting that up quickly requires someone who knows the tooling.
Teams are afraid to add more telemetry. When engineering teams are throttling their own instrumentation because they do not know what it will cost, the observability platform is failing its core purpose. That is a governance and attribution problem, and fixing it requires more than a cost review; it requires a governance model that makes cost predictable for the teams adding coverage.
The bill is growing and nobody is accountable for stopping it. Cost problems that have no named owner tend to persist. A partner engagement with a clear scope and clear outcome creates the accountability that is often missing.

As the world's first Powered by Datadog accredited MSP, we have run Datadog environments for customers for years. Datadog cost governance is not a consulting offering we bolted on; it is part of how we operate. When we take on a customer, we take on accountability for their Datadog spend alongside accountability for their platform reliability. Those two things are not separable in a well-run managed observability service.

The short version

Datadog costs grow because telemetry grows without governance. Logs ship in full. Metrics accumulate high-cardinality tags. Retention stays at default. Nobody has reviewed what is being ingested since the account was set up. None of this is anyone's fault in particular; it is the natural consequence of a usage-based platform in an environment that prioritises shipping product over managing operational costs.

The fix is not to replace the platform. It is to govern it. Know what you are ingesting, attribute it to the teams and services generating it, apply the controls that eliminate waste, and build the cadence that prevents it from returning. That is Observability FinOps, and it is the approach we apply every time we take on a new customer's Datadog environment.

If you want a more detailed treatment of any specific cost driver, the guides below cover each one in depth.