In the world of Site Reliability Engineering (SRE) and DevOps, few metrics are as widely used, and as widely debated, as Mean Time to Repair (MTTR).
For decades, MTTR has been a staple of incident management, offering a seemingly simple way to track system reliability and response efficiency. The logic appears sound: a lower MTTR suggests a team is getting better at fixing things, while a higher MTTR indicates potential weaknesses in incident response processes.
But in recent years, this metric has come under scrutiny. Thought leaders in resilience engineering and complex systems theory argue that MTTR is not just flawed—it might be actively misleading. During a recent episode of the Google SRE Prodcast (Dec 4, 2024), Casey Rosenthal and John Allspaw went as far as saying:
“We have the math, we have the studies, we have the thought, we have proof that MTTR is horseshit.”
This is a strong claim. But does it hold up? Should SREs and DevOps teams abandon MTTR altogether, or does it still have a place in modern incident management?
In this article, we’ll explore:
- How MTTR is traditionally defined and used
- Why critics argue it breaks down in modern, complex systems
- Which alternative metrics may serve teams better
- Whether MTTR still has any place in incident management
Traditionally, MTTR is defined as:
MTTR = Total time to repair incidents / Number of incidents
It is meant to represent the average time it takes a team to restore a system to normal operation after an incident occurs.
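As a concrete illustration, here is a minimal Python sketch of that calculation; the incident durations are invented purely for the example:

```python
from statistics import mean

# Hypothetical repair times, in minutes, for five incidents.
repair_times_minutes = [12, 7, 45, 9, 310]

# MTTR = total time to repair incidents / number of incidents,
# i.e. the arithmetic mean of the individual repair times.
mttr = mean(repair_times_minutes)

print(f"MTTR: {mttr:.1f} minutes")  # 76.6 minutes
```

Notice that a single long incident (310 minutes) already drags the average well above the other four, a theme we will return to below.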
Variations of MTTR exist, including:
- Mean Time to Repair
- Mean Time to Recovery
- Mean Time to Respond
- Mean Time to Resolve
All four share the acronym but measure subtly different intervals, which is itself a frequent source of confusion.
In many companies, MTTR is treated as a key performance indicator (KPI), often appearing in SLAs (Service Level Agreements) and SLOs (Service Level Objectives).
At first glance, MTTR seems useful because:
- It is simple to calculate and easy to report
- It gives leadership a single number to track over time
- Intuitively, a falling MTTR looks like proof that incident response is improving
For years, MTTR worked well, or at least seemed to, when systems were simpler and failures were more predictable. But in today’s distributed, cloud-native, and highly complex systems, its limitations are becoming increasingly apparent.
Despite its long history, many experts now argue that MTTR is fundamentally broken when applied to modern systems. Here’s why:
MTTR assumes that all incidents are roughly similar, but this couldn’t be further from the truth. Consider two outages: a five-minute blip resolved by an automatic restart, and a multi-hour cascading failure that requires coordination across several teams.
Averaging these together tells us almost nothing useful. The arithmetic mean doesn’t reflect real-world distributions, where most incidents are short, but a few are long tail events that dominate recovery time.
This is a well-documented statistical problem: incident resolution times follow a power-law distribution, not a normal distribution. Using an average (mean) to summarise skewed data leads to misleading conclusions.
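The distortion is easy to demonstrate with a few lines of Python; the resolution times below are synthetic, chosen to mimic a heavy-tailed distribution:

```python
from statistics import mean, median, quantiles

# Synthetic resolution times (minutes): mostly short incidents,
# plus two long-tail outages that dominate total downtime.
resolution_times = [5, 8, 4, 6, 10, 7, 9, 5, 6, 480, 720]

print(f"mean:   {mean(resolution_times):.0f} min")    # ~115 min
print(f"median: {median(resolution_times):.0f} min")  # 7 min
print(f"p90:    {quantiles(resolution_times, n=10)[-1]:.0f} min")
```

The mean lands far from anything a responder would recognise as a “typical” incident; percentiles, or simply reporting the distribution, tell a much more honest story.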
Complex outages involve decision-making, coordination, and external dependencies, not just "fixing" a technical issue.
For example: restoring service might mean waiting on a third-party provider’s fix, coordinating a rollback across several teams, or deciding whether a risky mitigation is safe to apply at all.
Focusing on MTTR can drive counterproductive behaviours, such as:
- Rushing incidents to “resolved” before the underlying problem is understood
- Gaming the numbers by reclassifying, splitting, or quietly not declaring incidents
- Discouraging thorough investigation, because careful analysis inflates the clock
Imagine two incidents: Incident A, a 30-minute outage that takes checkout down for every customer during peak traffic, and Incident B, a 30-minute outage of an internal dashboard in the middle of the night.
MTTR treats both equally, even though Incident A was far worse in terms of real-world impact.
Modern reliability engineering focuses more on impact mitigation than just time-to-fix. A resilient system absorbs failure gracefully, reducing the impact on users even if full recovery takes time.
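One simple way to make that difference visible is to weight each incident’s duration by how many users it affected, giving a rough “user-impact minutes” figure. The numbers here are hypothetical:

```python
# Two incidents with identical duration (and thus identical MTTR)
# but very different real-world impact.
incidents = [
    {"name": "A", "duration_min": 30, "affected_users": 500_000},  # checkout down at peak
    {"name": "B", "duration_min": 30, "affected_users": 40},       # internal dashboard at night
]

for inc in incidents:
    impact = inc["duration_min"] * inc["affected_users"]
    print(f"Incident {inc['name']}: {impact:,} user-impact minutes")
```

Neither number is perfect, but unlike MTTR it at least points in the direction of what users actually experienced.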
Given these challenges, what alternative metrics might work better?
- Impact mitigation: how quickly does the team reduce the impact of an incident?
- Customer experience: how are customers actually experiencing reliability?
- Organisational learning: how well is the team learning from incidents?
- Adaptive capacity: how well does the system adapt to unexpected conditions?
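None of these require exotic tooling; much of it starts with recording more than a single timestamp per incident, so that time-to-mitigate can be reported separately from time-to-full-recovery. A minimal sketch, with invented field names and example data:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    detected_at: datetime
    mitigated_at: datetime  # user impact reduced (failover, feature flag off, traffic drained)
    resolved_at: datetime   # underlying cause fully fixed

    @property
    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.detected_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at

inc = Incident(
    detected_at=datetime(2024, 12, 4, 9, 0),
    mitigated_at=datetime(2024, 12, 4, 9, 12),   # impact contained after 12 minutes
    resolved_at=datetime(2024, 12, 4, 14, 30),   # root cause patched hours later
)

print(inc.time_to_mitigate)  # 0:12:00 -> what users felt
print(inc.time_to_resolve)   # 5:30:00 -> what an MTTR-style number would report
```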
MTTR isn’t entirely useless, but it should be used with caution.
MTTR may have worked in simpler IT environments, but modern distributed, complex systems demand better metrics.
Instead of focusing on how fast we fix things, we should ask: how well do we limit the impact on users, how do customers actually experience our reliability, and how quickly do we learn and adapt?
Shifting from reactive repair to proactive resilience is the future of reliability engineering. And that future has no place for outdated metrics like MTTR.