Overview of Incident Lifecycle in SRE

March 2, 2021

Get an understanding of best practices to follow during the incident lifecycle so you can handle critical incidents in your organization more easily.
As the saying goes, “Every problem we face is a blessing in disguise.” Along similar lines, every incident in system infrastructure helps product development and engineering teams better understand the capabilities of the system architecture. That understanding, in turn, helps organizations build a sustainable and reliable product. In this blog, let’s lay out the complexities of handling an incident in a well-structured format, with the intent of helping you handle every incident effectively.

ITIL 2011 defines an incident as: “An unplanned interruption to an IT service or reduction in the quality of an IT service or a failure of a Configuration Item that has not yet impacted an IT service [but has potential to do so].” Clearly, in order to maintain acceptable service levels, it is important to resolve incidents and restore normal service as quickly as possible.

ITIL defines a standard lifecycle of an incident. While the actual activities that occur during each phase have changed over time, it is still a good starting point for a detailed description of incidents.

Incidents are identified through reports from monitoring systems or by manual identification. Once an incident is identified, it is logged. An incident log can be used to validate that all incidents have been addressed and to identify trends. At this point, the incident is categorized by adding information such as severity, functional area, and ownership. These three activities (identification, logging, and categorization) were once the responsibility of a first-level monitoring technician; nowadays they are normally automated.

The next stage is about notifying the right people to address the incident. In many modern environments, identifying the correct responders can be a complex process. Similarly, many organizations have elaborate escalation processes for bringing in specialists or SMEs when the initial responders need help.
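The logging, categorization, and notification steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular incident management product works: the `Incident` and `RoutingRule` types, the severity labels (`sev1`, `sev2`), and the responder names are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Incident:
    summary: str
    source: str                       # "monitoring" or "manual" (how it was identified)
    severity: str = "unknown"         # hypothetical scale: "sev1" (worst) .. "sev3"
    functional_area: str = "unknown"
    owner: str = "unassigned"
    logged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class RoutingRule:
    matches: Callable[[Incident], bool]  # predicate deciding whether the rule applies
    responders: List[str]                # who gets notified when it does


# In-memory stand-in for an incident log (a real system would persist this).
incident_log: List[Incident] = []


def log_incident(summary: str, source: str) -> Incident:
    """Record a newly identified incident so it can be tracked and trended."""
    incident = Incident(summary=summary, source=source)
    incident_log.append(incident)
    return incident


def categorize(incident: Incident, severity: str, functional_area: str, owner: str) -> None:
    """Attach severity, functional area, and ownership to a logged incident."""
    incident.severity = severity
    incident.functional_area = functional_area
    incident.owner = owner


def route(incident: Incident, rules: List[RoutingRule]) -> List[str]:
    """Return the responders of the first matching rule, or a default desk."""
    for rule in rules:
        if rule.matches(incident):
            return rule.responders
    return ["service-desk"]  # hypothetical fallback when no rule matches


# Hypothetical rules, evaluated in order: severity first, then functional area.
RULES = [
    RoutingRule(lambda i: i.severity == "sev1", ["on-call-primary", "incident-commander"]),
    RoutingRule(lambda i: i.functional_area == "database", ["dba-on-call"]),
]

incident = log_incident("Checkout latency above threshold", source="monitoring")
categorize(incident, severity="sev2", functional_area="database", owner="payments-team")
print(route(incident, RULES))
```

Evaluating rules in order and stopping at the first match keeps the behavior easy to reason about. A real escalation policy would likely add timeouts and multiple tiers, which are left out here.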
Modern incident management systems can reduce turnaround times by using rules to automate notification and escalation.

Once notified, incident responders gather information about the incident using observability tools. In addition to the current state of the system, RCAs of similar past incidents can be valuable sources of data. This information is used to build a hypothesis about the probable cause of the incident and to decide on a fix.

The responder team applies the proposed fix and, typically, observes the system for a little while to confirm that the incident has been resolved. In practice, it often takes several iterations of trial and error before an incident is resolved. Each trial provides more information with which to refine the hypothesis and formulate better fixes.

This description of the phases of an incident gives the impression of a structured, systematic engineering process that is calmly applied by experts. However, reality is rarely so neat and clean. Incidents, particularly major ones, are more akin to a battle than to an engineering process. Everyone is under pressure, failure has catastrophic consequences, and there is never enough information to understand what is really happening.