Why Mean Time to Repair Is Not Always a Useful Security Metric
Security teams have traditionally used mean time to repair (MTTR) to measure how effectively they handle security incidents. However, variations in incident severity, team agility, and system complexity can make the metric less useful, says Courtney Nash, lead research analyst at Verica and lead author of the Verica Open Incident Database (VOID) report.
MTTR originated in manufacturing organizations as a measure of the average time required to repair a failed physical component or device. Those devices had simple, predictable operations and wear patterns that lent themselves to reasonably standard and consistent estimates of MTTR. Over time, the use of MTTR expanded to software systems, and software companies began using it as an indicator of system reliability and of team agility or effectiveness.
Unfortunately, Nash says, its variability means that MTTR could either lead to false confidence or cause unnecessary concern.
“It’s not an appropriate metric for complex software systems, in part because of the skewed distribution of duration data and because failures in such systems don’t arrive uniformly over time,” Nash says. “Each failure is inherently different, unlike issues with physical manufacturing devices.”
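Nash's point about skew is easy to demonstrate. The short sketch below is a toy example using synthetic, lognormally distributed durations, not real incident data, and it shows how the mean drifts away from the typical case:

```python
# Toy example: why an average misleads on skewed duration data.
# The durations are synthetic (lognormal), standing in for real incidents.
import random
import statistics

random.seed(42)

# Simulate 100 incident durations in minutes: most resolve quickly,
# but a long tail of incidents drags on for hours.
durations = [random.lognormvariate(mu=3.5, sigma=1.2) for _ in range(100)]

print(f"MTTR (mean):     {statistics.mean(durations):7.1f} min")
print(f"Median duration: {statistics.median(durations):7.1f} min")
print(f"90th percentile: {statistics.quantiles(durations, n=10)[-1]:7.1f} min")
```

On a long-tailed distribution like this, the mean lands well above the median, so the single MTTR figure describes almost no incident the team actually experienced.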
Moving Away From MTTR
“[MTTR] tells us little about what an incident is really like for the organization, which can vary wildly in terms of the number of people and teams involved, the level of stress, what is needed technically and organizationally to fix it, and what the team learned as a result,” Nash says.
MTTR falls victim to the oversimplification of incidents because it reduces them to a single number: the average time, says Nora Jones, CEO and co-founder of Jeli. Relying on that one average of reported times (which have themselves been shown to be unreliable) keeps organizations from seeing and addressing what's going on within the infrastructure, what's contributing to recurring incidents, and how people are responding to them.
“Incidents come in all shapes and sizes — you’ll see them span the complete range in severity, impact to customers, and resolution complexity all within one organization,” Jones explains. “You really have to look at the people and tools together and take a qualitative approach to incident analysis.”
However, Nash says moving away from MTTR isn’t an overnight shift — it’s not as simple as just swapping one metric for another.
“At the end of the day, it’s being honest about the contributing factors, and the role that people play in coming up with solutions,” she says. “It sounds simple, but it takes time, and these are the concrete activities that will build better metrics.”
Broadening the Use of Metrics
Nash says analyzing and learning from incidents is the ideal path to finding more insightful data and metrics. A team can collect things like the number of people involved hands-on in an incident, how many unique teams were involved, which tools people used, how many chat channels there were, and whether there were concurrent incidents.
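As a concrete illustration, here is a minimal sketch of what such an incident-review record could look like. The `IncidentReview` class and its field names are hypothetical, not drawn from any particular tool:

```python
# Hypothetical incident-review record capturing the signals Nash describes.
from dataclasses import dataclass, field

@dataclass
class IncidentReview:
    incident_id: str
    responders: list[str] = field(default_factory=list)    # hands-on people
    teams_involved: set[str] = field(default_factory=set)  # unique teams
    tools_used: list[str] = field(default_factory=list)
    chat_channels: int = 0
    concurrent_incidents: list[str] = field(default_factory=list)

review = IncidentReview(
    incident_id="INC-1042",
    responders=["alice", "bob", "carol"],
    teams_involved={"payments", "sre", "security"},
    tools_used=["pagerduty", "grafana", "slack"],
    chat_channels=2,
    concurrent_incidents=["INC-1041"],
)
print(f"{review.incident_id}: {len(review.responders)} responders, "
      f"{len(review.teams_involved)} teams, {review.chat_channels} channels")
```

Aggregating records like these across many incidents is what surfaces patterns, such as a team that appears in nearly every incident, that a single duration number would never reveal.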
As an organization gets better at conducting incident reviews and learning from them, it will start to see traction in things like the number of people attending post-incident review meetings, increased reading and sharing of post-incident reports, and the use of those reports in code reviews, training, and onboarding.
David Severski, senior security data scientist at the Cyentia Institute, says that when working on the Verizon DBIR, Cyentia created and released the Vocabulary for Event Recording and Incident Sharing (VERIS) to expand the types of metrics used to measure an incident.
“It defines data points we think are important to collect on security incidents,” he says. “We still use this basic template in Cyentia research with some updates, for example identifying ATT&CK TTPs utilized.”
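For illustration only, a record in that spirit might look like the sketch below. The field layout is a hypothetical simplification echoing VERIS's actor/action/asset/attribute structure rather than the actual schema, though the ATT&CK technique IDs shown (T1566, Phishing; T1078, Valid Accounts) are real:

```python
# Hypothetical, simplified incident record; not the actual VERIS schema.
import json

incident = {
    "incident_id": "2024-0173",
    "actor": "external",
    "action": "social",
    "asset": "user workstation",
    "attribute": "confidentiality",
    "attack_ttps": ["T1566", "T1078"],  # real ATT&CK technique IDs
    "discovery_method": "reported by user",
}
print(json.dumps(incident, indent=2))
```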
Metrics for measuring an incident are not one-size-fits-all across organization sizes and types. “Teams understand where they are today, assess where their priorities are within their current constraints, and understand their focus metrics might even evolve over time as their organization develops and scales,” Jones says.
Additionally, it's about shifting focus to learnings and continuously improving based on them: for example, assessing trends and whether things are moving in the right direction over time, as opposed to relying on single-point-in-time metrics.
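A minimal sketch of that trend-oriented view, using hypothetical quarterly counts of post-incident report reads rather than any real data:

```python
# Track a learning metric as a trend, not a point-in-time number.
# The quarterly counts of report reads are hypothetical.
reads_per_quarter = {"Q1": 14, "Q2": 22, "Q3": 31, "Q4": 45}

quarters = list(reads_per_quarter)

# Quarter-over-quarter deltas show whether engagement is trending up.
for prev, curr in zip(quarters, quarters[1:]):
    delta = reads_per_quarter[curr] - reads_per_quarter[prev]
    print(f"{prev} -> {curr}: {delta:+d} report reads")
```

Here the direction of the deltas, not any single quarter's count, is the signal worth watching.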