In the world of Site Reliability Engineering (SRE) and DevOps, few metrics are as widely used, and as widely debated, as Mean Time to Repair (MTTR).
For decades, MTTR has been a staple of incident management, offering a seemingly simple way to track system reliability and response efficiency. The logic appears sound: a lower MTTR suggests a team is getting better at fixing things, while a higher MTTR indicates potential weaknesses in incident response processes.
But in recent years, this metric has come under scrutiny. Thought leaders in resilience engineering and complex systems theory argue that MTTR is not just flawed—it might be actively misleading. During a recent episode of the Google SRE Prodcast (Dec 4, 2024), Casey Rosenthal and John Allspaw went as far as saying:
“We have the math, we have the studies, we have the thought, we have proof that MTTR is horseshit.”
This is a strong claim. But does it hold up? Should SREs and DevOps teams abandon MTTR altogether, or does it still have a place in modern incident management?
In this article, we’ll explore:
- How MTTR is traditionally defined and used
- Why critics argue it breaks down in modern, complex systems
- Which alternative metrics may serve teams better
- Whether MTTR still has any place in incident management
Traditionally, MTTR is defined as:
MTTR = Total time to repair incidents / Number of incidents
It is meant to represent the average time it takes a team to restore a system to normal operation after an incident occurs.
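As a concrete illustration, here is a minimal Python sketch of that calculation; the incident durations are invented purely for the example:

```python
from statistics import mean

# Hypothetical repair times, in minutes, for five incidents.
repair_times_minutes = [12, 7, 45, 9, 310]

# MTTR = total time to repair incidents / number of incidents,
# i.e. the arithmetic mean of the individual repair times.
mttr = mean(repair_times_minutes)

print(f"MTTR: {mttr:.1f} minutes")  # 76.6 minutes
```

Notice that a single long incident (310 minutes) already drags the average well above the other four, a theme we will return to below.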
Variations of MTTR exist, including:
- Mean Time to Repair
- Mean Time to Recovery
- Mean Time to Respond
- Mean Time to Resolve
All four share the acronym but measure subtly different intervals, which is itself a frequent source of confusion.
In many companies, MTTR is treated as a key performance indicator (KPI), often appearing in SLAs (Service Level Agreements) and SLOs (Service Level Objectives).
At first glance, MTTR seems useful because:
- It is simple to calculate and easy to report
- It gives leadership a single number to track over time
- Intuitively, a falling MTTR looks like proof that incident response is improving
For years, MTTR worked well, or at least seemed to, when systems were simpler and failures were more predictable. But in today’s distributed, cloud-native, and highly complex systems, its limitations are becoming increasingly apparent.
Despite its long history, many experts now argue that MTTR is fundamentally broken when applied to modern systems. Here’s why:
MTTR assumes that all incidents are roughly similar, but this couldn’t be further from the truth. Consider two outages: a five-minute blip resolved by an automatic restart, and a multi-hour cascading failure that requires coordination across several teams.
Averaging these together tells us almost nothing useful. The arithmetic mean doesn’t reflect real-world distributions, where most incidents are short, but a few are long tail events that dominate recovery time.
This is a well-documented statistical problem: incident resolution times follow a power-law distribution, not a normal distribution. Using an average (mean) to summarise skewed data leads to misleading conclusions.
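The distortion is easy to demonstrate with a few lines of Python; the resolution times below are synthetic, chosen to mimic a heavy-tailed distribution:

```python
from statistics import mean, median, quantiles

# Synthetic resolution times (minutes): mostly short incidents,
# plus two long-tail outages that dominate total downtime.
resolution_times = [5, 8, 4, 6, 10, 7, 9, 5, 6, 480, 720]

print(f"mean:   {mean(resolution_times):.0f} min")    # ~115 min
print(f"median: {median(resolution_times):.0f} min")  # 7 min
print(f"p90:    {quantiles(resolution_times, n=10)[-1]:.0f} min")
```

The mean lands far from anything a responder would recognise as a “typical” incident; percentiles, or simply reporting the distribution, tell a much more honest story.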
Complex outages involve decision-making, coordination, and external dependencies, not just "fixing" a technical issue.
For example: restoring service might mean waiting on a third-party provider’s fix, coordinating a rollback across several teams, or deciding whether a risky mitigation is safe to apply at all.
Focusing on MTTR can drive counterproductive behaviours, such as:
- Rushing incidents to “resolved” before the underlying problem is understood
- Gaming the numbers by reclassifying, splitting, or quietly not declaring incidents
- Discouraging thorough investigation, because careful analysis inflates the clock
Imagine two incidents: Incident A, a 30-minute outage that takes checkout down for every customer during peak traffic, and Incident B, a 30-minute outage of an internal dashboard in the middle of the night.
MTTR treats both equally, even though Incident A was far worse in terms of real-world impact.
Modern reliability engineering focuses more on impact mitigation than just time-to-fix. A resilient system absorbs failure gracefully, reducing the impact on users even if full recovery takes time.
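One simple way to make that difference visible is to weight each incident’s duration by how many users it affected, giving a rough “user-impact minutes” figure. The numbers here are hypothetical:

```python
# Two incidents with identical duration (and thus identical MTTR)
# but very different real-world impact.
incidents = [
    {"name": "A", "duration_min": 30, "affected_users": 500_000},  # checkout down at peak
    {"name": "B", "duration_min": 30, "affected_users": 40},       # internal dashboard at night
]

for inc in incidents:
    impact = inc["duration_min"] * inc["affected_users"]
    print(f"Incident {inc['name']}: {impact:,} user-impact minutes")
```

Neither number is perfect, but unlike MTTR it at least points in the direction of what users actually experienced.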
Given these challenges, what alternative metrics might work better?
- Impact mitigation: how quickly does the team reduce the impact of an incident?
- Customer experience: how are customers actually experiencing reliability?
- Organisational learning: how well is the team learning from incidents?
- Adaptive capacity: how well does the system adapt to unexpected conditions?
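None of these require exotic tooling; much of it starts with recording more than a single timestamp per incident, so that time-to-mitigate can be reported separately from time-to-full-recovery. A minimal sketch, with invented field names and example data:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    detected_at: datetime
    mitigated_at: datetime  # user impact reduced (failover, feature flag off, traffic drained)
    resolved_at: datetime   # underlying cause fully fixed

    @property
    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.detected_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at

inc = Incident(
    detected_at=datetime(2024, 12, 4, 9, 0),
    mitigated_at=datetime(2024, 12, 4, 9, 12),   # impact contained after 12 minutes
    resolved_at=datetime(2024, 12, 4, 14, 30),   # root cause patched hours later
)

print(inc.time_to_mitigate)  # 0:12:00 -> what users felt
print(inc.time_to_resolve)   # 5:30:00 -> what an MTTR-style number would report
```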
MTTR isn’t entirely useless, but it should be used with caution.
MTTR may have worked in simpler IT environments, but modern distributed, complex systems demand better metrics.
Instead of focusing on how fast we fix things, we should ask: how well do we limit the impact on users, how do customers actually experience our reliability, and how quickly do we learn and adapt?
Shifting from reactive repair to proactive resilience is the future of reliability engineering. And that future has no place for outdated metrics like MTTR.