What are MTTx Metrics Good For?

MTTx metrics rarely tell the whole story of a system. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data.
Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics measure the average time it takes for something to happen, where the “x” can represent an event or a stage in a system’s incident response process. Yet MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data.

For each metric, trends can suggest where to focus improvement work. For example, if MTTD (mean time to detect) is increasing, you might work to improve your monitoring. But MTTx metrics alone are insufficient to identify trends in reliability.

In an experiment detailed in the ebook Incident Metrics in SRE, author Štěpán Davidovič ran simulations of multiple systems with varying incident frequencies and durations. He generated sets of hypothetical data and compared the MTTx metrics from each. The goal was to determine whether changes made to improve MTTx metrics (such as buying a tool) would be reflected in the system’s metrics.

The findings were conclusive: “MTTx metrics will probably mislead you.” As the experiment stated, “Even though in the simulation the improvement always worked, 38% of the simulations had the MTTR difference fall below zero for Company A, 40% for Company B, and 20% for Company C. Looking at the absolute change in MTTR, the probability of seeing at least a 15-minute improvement is only 49%, 50%, and 64%, respectively. Even though the product in the scenario worked and shortened incidents, the odds of detecting any improvement at all are well outside the tolerance of 10% random flukes.” This means that even if your tool or process improvement is working, you may not be able to detect it.
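To get a feel for why a real improvement can vanish in the noise, here is a minimal Python sketch of this kind of simulation. It does not reproduce the ebook's experiment or its numbers. The lognormal incident durations, the parameters `MU` and `SIGMA`, the 50 incidents per period, and the 10% reduction are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
import statistics

# Assumed shape of incident durations (minutes): lognormal, i.e. heavily skewed.
# These parameters are illustrative, not taken from the ebook.
MU = 4.0
SIGMA = 1.0


def mttr_difference(n_incidents, reduction, rng):
    """Run one trial and return MTTR(before) - MTTR(after).

    Every "after" incident is shortened by the fraction `reduction`,
    so the improvement always works by construction.
    """
    before = [rng.lognormvariate(MU, SIGMA) for _ in range(n_incidents)]
    after = [
        rng.lognormvariate(MU, SIGMA) * (1.0 - reduction)
        for _ in range(n_incidents)
    ]
    return statistics.mean(before) - statistics.mean(after)


def fraction_looking_worse(trials, n_incidents, reduction, seed=0):
    """Fraction of trials in which the observed MTTR got worse (difference < 0)."""
    rng = random.Random(seed)
    worse = sum(
        1 for _ in range(trials)
        if mttr_difference(n_incidents, reduction, rng) < 0
    )
    return worse / trials


if __name__ == "__main__":
    # Hypothetical scenario: 50 incidents per period, each one 10% shorter.
    share = fraction_looking_worse(trials=2000, n_incidents=50, reduction=0.10)
    print(f"Trials where MTTR appeared to get worse: {share:.0%}")
```

Under these assumptions, a noticeable share of trials shows MTTR moving in the wrong direction even though every single incident was shortened. Varying `n_incidents` and `SIGMA` shows how the sample size and the spread of incident durations drive that effect.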