Mean time to restore (sometimes called mean time to recovery), or MTTR, describes the average time to recover from a failed deployment, incident, or service outage. It measures the time from the detection of an incident or outage until the full system functionality is restored.
MTTR is a high-level metric that indicates how quickly your system can recover from failure. It typically applies to unplanned incidents rather than service requests.
Mean Time to Restore vs. Resolve: What’s the Difference?
Mean time to restore refers to the average time it takes to recover from a product or service failure but does not include additional time taken to ensure that the incident does not happen again.
Mean time to resolve, on the other hand, is the average time needed to restore a system completely, including time to fix the problem and complete any additional work needed to prevent the issue from recurring. This may include failure detection, diagnosis, restoration, and proactive steps taken to harden the system against similar failures in the future.
As a result, mean time to resolve provides insight into the full scope required to resolve the issue beyond the actual downtime, extending the responsibility of the team beyond just fixing the issue to improving the system’s long-term performance.
How to Calculate Mean Time to Restore
Mean time to restore is calculated by adding the total downtime over a specific time period and dividing it by the total number of incidents within that time period.
MTTR = sum of all time to restore periods / number of incidents
For example, imagine that your system goes down three times within two weeks. If the first incident took two hours to restore, the second incident took four hours, and the third incident took six hours for a total of 12 hours, the MTTR for that two-week period would be:
MTTR = 12 hours of total downtime / 3 incidents
MTTR = 4 hours
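The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not part of any monitoring tool: the function name `mean_time_to_restore` and the hard-coded downtime values are assumptions chosen to mirror the example above.

```python
from datetime import timedelta


def mean_time_to_restore(downtimes):
    """Return the average downtime per incident as a timedelta.

    `downtimes` is a list of timedelta values, one per incident
    (a hypothetical input format chosen for this sketch).
    """
    if not downtimes:
        raise ValueError("at least one incident is required")
    total = sum(downtimes, timedelta())  # total downtime in the period
    return total / len(downtimes)


# Hypothetical data matching the example: 2, 4, and 6 hours of downtime.
incidents = [timedelta(hours=2), timedelta(hours=4), timedelta(hours=6)]
mttr = mean_time_to_restore(incidents)
print(mttr)  # 4:00:00
```

Raising an error on an empty list avoids a division by zero when no incidents occurred in the period, in which case MTTR is undefined rather than zero.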
What Is a Good Mean Time to Restore?
System outages and downtime heavily impact customer experience, so it’s important for MTTR to be as short as possible. A higher MTTR means the organization and its customers experience longer periods of downtime, which can lead to complaints, cancellations, and non-renewals.
A good MTTR depends heavily on how quickly you can detect a problem and identify its root cause, which is measured by the mean time to detect (MTTD). The longer it takes to identify a problem, the longer it will take you to restore the system to full operation.
A low MTTD is the key to reducing MTTR and improving other reliability metrics. If you decrease the time required to detect an issue, you also decrease the time until its resolution. Observability and continuous monitoring play an important role in alerting teams to issues quickly and reducing MTTD.
Besides monitoring, here are a few other ways to reduce MTTR:
- Develop a clearly documented incident management plan that lets teams know how to manage an incident, from the first alert to the point when the system resumes full operation.
- Use automated tools to assign responsibilities, create documents, capture analytics, and manage configurations.
- Clearly define and assign team roles and responsibilities so that everyone knows what to do when an incident occurs.
- Perform postmortems on past incidents to investigate and document the specifics of each issue, how it happened, and how to prevent it in the future.
How to Calculate Mean Time to Resolve
Mean time to resolve (MTTR) differs from mean time to restore because it includes any additional time spent on preventing similar issues from occurring in the future.
To calculate MTTR, add the total time taken to restore the system, including additional time to ensure the issue doesn’t happen again, and divide this number by the total number of incidents. Think of it like this:
MTTR = (total incident restoration time + additional time spent ensuring the issue does not recur) / number of incidents
Imagine that your system goes down twice in a 48-hour time frame. The first incident lasts for one hour and the second for two hours. The team then spends an additional three hours hardening systems to prevent the issues from recurring, for a total of six hours.
MTTR = (1 + 2 + 3) hours / 2 incidents
MTTR = 3 hours
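As a rough sketch, the calculation can be written as a small Python function. The name `mean_time_to_resolve` and its two parameters are assumptions made for illustration; in practice, follow-up work may be tracked per incident rather than as a single total.

```python
def mean_time_to_resolve(restore_hours, prevention_hours):
    """Average hours per incident, including follow-up prevention work.

    restore_hours: list of hours spent restoring service, one per incident.
    prevention_hours: total hours spent hardening the system afterward.
    """
    incident_count = len(restore_hours)
    if incident_count == 0:
        raise ValueError("at least one incident is required")
    return (sum(restore_hours) + prevention_hours) / incident_count


# Values from the example: incidents of 1 and 2 hours, plus 3 hours of hardening.
result = mean_time_to_resolve([1, 2], 3)
print(result)  # 3.0
```

Note that the prevention time is added before dividing, which is why mean time to resolve is never lower than mean time to restore for the same set of incidents.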
What Is a Good Mean Time to Resolve?
Because reducing MTTD reduces mean time to restore, the same actions also shorten the time to complete resolution (mean time to resolve).
You can also focus on how quickly the team implements preventative measures. The postmortem performed after service is restored is especially helpful here, because an in-depth analysis of the issue can reveal insights that apply to follow-up activities.
Who Should Use MTTR and When?
Overall, MTTR is a good metric for assessing the speed of your recovery process across several areas of technology. Use MTTR when you want to track and improve the average time your team takes to restore systems and services after a failure.
How to Use MTTR in Cybersecurity
MTTR in cybersecurity refers to the time it takes the team to get the system back up and running after a cybersecurity breach. In this way, it shows how fast your security team can return the system and affected customers to their normal operations.
On cybersecurity teams, the MTTR clock typically starts when the team is alerted to a system failure due to a cyberattack.
Here, the restoration process might involve several steps, including containment (to stop the spread of the threat), the actual removal of the threat, and the sanitization of components and resources necessary to restore the system to normal. Once all steps are completed, the system is considered fully restored.
How to Use MTTR in Incident Response
MTTR is a key metric in incident response because it gives insight into the impact of incidents and helps organizations evaluate whether downtime incidents are resolved quickly enough.
In incident response, MTTR is an average of the time that elapses between the reported and resolved time stamps for an issue. Automated tools not only alert teams to incidents but also help them collaborate and communicate more easily, leading to improved MTTR.
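The timestamp-based definition above can be sketched as follows. The function name `mttr_from_timestamps`, the `(reported, resolved)` tuple layout, and the sample dates are all hypothetical; a real incident tracker would expose these fields under its own names.

```python
from datetime import datetime


def mttr_from_timestamps(records):
    """Return MTTR in hours from (reported, resolved) datetime pairs."""
    if not records:
        raise ValueError("at least one incident is required")
    total_seconds = 0.0
    for reported, resolved in records:
        if resolved < reported:
            raise ValueError("resolved time precedes reported time")
        total_seconds += (resolved - reported).total_seconds()
    return total_seconds / len(records) / 3600


# Hypothetical incident records: 1.5 hours and 0.5 hours to resolve.
records = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),
]
mttr_hours = mttr_from_timestamps(records)
print(mttr_hours)  # 1.0
```

Rejecting records whose resolved time precedes the reported time guards against data-entry errors that would otherwise silently lower the average.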
Service level objectives (SLOs) and service level indicators (SLIs) can also be used to measure system reliability and availability and to approximate customer satisfaction with a product or service. When an SLO is violated, the time to restore the service is the total time taken to detect, mitigate, and resolve the problem until the service again complies with the SLO.
How to Use MTTR in DevOps
In DevOps, MTTR can represent the average time needed to restore an application after a production failure. Measuring MTTR helps teams ensure system resilience and stability, in addition to determining where the response process can be improved.
In DevOps, measuring MTTR often involves the use of monitoring systems to record the start of an incident and when it was resolved (for example, the time to roll back a change or release after it has reached production).
MTTR can also be used to evaluate the performance of a DevOps team: the lower the MTTR, the better. The Accelerate State of DevOps 2021 report identifies four performance categories for DevOps teams:
- Elite: Less than one hour
- High: Less than 24 hours
- Medium: Less than one week
- Low: More than or equal to one week
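A team could map its own MTTR onto these categories with a simple lookup. The sketch below uses the thresholds exactly as listed above; the function name `performance_category` and the choice of hours as the unit are assumptions made for illustration.

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


def performance_category(mttr_hours):
    """Map an MTTR value in hours to the category names listed above."""
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    if mttr_hours < 1:
        return "Elite"
    if mttr_hours < HOURS_PER_DAY:
        return "High"
    if mttr_hours < HOURS_PER_WEEK:
        return "Medium"
    return "Low"


for hours in (0.5, 4, 72, 168):
    print(hours, performance_category(hours))
```

With the sample values, a 4-hour MTTR falls into the High category, while an MTTR of exactly one week (168 hours) is classified as Low.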
A lower MTTR tends to go hand in hand with lower failure rates, faster delivery, and improved user satisfaction. As DevOps maturity grows, MTTR should steadily fall.
What Tools Do You Need to Monitor MTTRs?
To improve MTTR, you need to be able to detect system failures quickly. Continuous monitoring tools, such as Prometheus and Grafana, as well as popular application performance monitoring tools, such as Datadog, Splunk, and Dynatrace, can help you collect MTTR metrics.
These systems use a large amount of real-time and historical data to help you diagnose and analyze issues more quickly. However, to support their complex queries and real-time processing, you’ll need the fast, consistent performance that all-flash storage can provide.
Pure Storage offers several all-flash data storage solutions that provide massive throughput and consistent performance. FlashBlade® is a high-performance file and object storage platform that delivers the speed and performance needed for the application and monitoring tools that support faster MTTD and MTTR.
What Is the Next Metric after MTTR?
While MTTR is a powerful indicator of your ability to react to problems quickly, there are other important reliability metrics you should also be monitoring. Learn more about another critical calculation: mean time between failures (MTBF).