Originally published on Medium.
The world’s technology giants, like PayPal and Alphabet, invest huge amounts of resources into the scalability and stability of their environments, and even they are not immune to critical incidents. All popular services are susceptible to failure, but they are prepared to deal with it effectively.
Even when you employ passionate developers and experienced DevOps engineers, there’s still no guarantee that your environments will be free of incidents. Preventing everything from breaking is unrealistic, and considering your ecosystem flawless is dangerously arrogant. The reality is that we human beings are inherently flawed. We make mistakes, and the world would be a very boring place if that weren’t the case.
Technical teams need to be adaptive, pragmatic, and creative in order to deal with these incidents effectively. The ability to classify and prioritise is crucial, and juggling skills will be properly challenged.
But ultimately, it is the ecosystem’s responsibility to drive the Incident Management Process.
What is an incident?
In our world, an incident is an event that has compromised the state of a production environment. Such events have a significant negative impact on the environment and directly affect the end users and systems that rely on it.
Incident-related situations
- Downtime — the environment hasn’t been available for a certain period of time
- Loss of data — some information has been fully or partially lost, damaged, or corrupted
- Security breach — some confidential information has been exposed
- Bulk communication — an application has sent bulk email messages or other communication by mistake
- Interruption of normal operation — some critical functionality stopped working
Who is affected
- End users — the first who suffer from an incident
- Business owners — might lose money and reputation if the application they are responsible for is not functioning as intended
- Technical team — responsible for technical incidents
Incidents are not
- General bugs in the software
- Performance issues
- Enhancement requests blocking the go-live of projects
Examples of incidents
- A production database is accidentally deleted by a member of the technical team
- A bug in a 3rd party service consumes all the resources of the production server and brings it to a halt
- The renewal of an SSL certificate goes wrong and the application is unreachable
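The definition above can be sketched as a small classifier. This is only an illustration: the `IncidentType` enum, the `NOT_INCIDENTS` labels, and the `is_incident` helper are hypothetical names chosen here, not part of any tool mentioned in this article.

```python
from enum import Enum


class IncidentType(Enum):
    """The incident-related situations listed above."""
    DOWNTIME = "downtime"
    DATA_LOSS = "loss of data"
    SECURITY_BREACH = "security breach"
    BULK_COMMUNICATION = "bulk communication"
    INTERRUPTION = "interruption of normal operation"


# Hypothetical labels for the events this article says are NOT incidents.
NOT_INCIDENTS = {"general bug", "performance issue", "enhancement request"}


def is_incident(label: str, in_production: bool) -> bool:
    """Return True only for a listed situation that hit production."""
    if not in_production or label in NOT_INCIDENTS:
        return False
    return label in {incident_type.value for incident_type in IncidentType}
```

For example, `is_incident("downtime", True)` is treated as an incident, while a performance issue, or downtime on a non-production environment, is not.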
Incident Management Process Flow
There should be a constant stream of communication to the relevant parties (end users, business owners, other technical teams, etc). Each phase in the incident management process marks progress and should be communicated effectively to instil confidence and reduce panic.
Identify — Classify and prioritise
The relevant internal members (technical and non-technical) are responsible for classifying and prioritising the incident.
Assign — Create the issue
An issue should be created to track the incident and assign the member/s responsible to investigate and resolve it.
Investigate — Identify the problem/s
Dive into the environment and find the problem. This diagnostic phase helps to classify the severity of the problem and leads to the solution. Communication is essential at this point: the end users and business owners should be notified that the problem has been identified. Be careful about providing an estimated turnaround time at this point, because the solution might have unexpected consequences.
Resolve — Solve the problem/s
Solve the problem and complete the necessary tests to ensure that the environment is back in a stable state. If the solution did not restore the environment to a stable state, the end users and business owners should still be notified of this and assured that the investigation phase will commence again. Avoid communicating failed solutions.
Final communication to the relevant stakeholders should only be considered if the team is confident that all possible problems have been identified and resolved and that the application has been successfully restored to a stable state.
Report — Draft an incident report
Document the findings in an incident report to ensure that all the details about the incident are captured. In some cases it might be necessary to do a thorough impact analysis, especially when there is a risk of data loss.
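The flow above can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the `Phase` enum, the `TRANSITIONS` table, the `Incident` class, and its `notify` hook are hypothetical names, and the only backward step modelled is the one described above, where a failed solution sends the team from Resolve back to Investigate.

```python
from enum import Enum


class Phase(Enum):
    IDENTIFY = "Identify"
    ASSIGN = "Assign"
    INVESTIGATE = "Investigate"
    RESOLVE = "Resolve"
    REPORT = "Report"


# Allowed transitions. Resolve may fall back to Investigate when a
# solution fails to restore the environment to a stable state.
TRANSITIONS = {
    Phase.IDENTIFY: {Phase.ASSIGN},
    Phase.ASSIGN: {Phase.INVESTIGATE},
    Phase.INVESTIGATE: {Phase.RESOLVE},
    Phase.RESOLVE: {Phase.INVESTIGATE, Phase.REPORT},
    Phase.REPORT: set(),
}


class Incident:
    def __init__(self, title, notify=print):
        self.title = title
        self.phase = Phase.IDENTIFY
        # Hypothetical hook for sending updates to the relevant parties.
        self.notify = notify

    def advance(self, new_phase):
        """Move to the next phase and communicate the progress."""
        if new_phase not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"Cannot move from {self.phase.value} to {new_phase.value}"
            )
        self.phase = new_phase
        self.notify(f"[{self.title}] now in phase: {new_phase.value}")
```

Every call to `advance` triggers a notification, which mirrors the idea that each phase marks progress worth communicating. In practice the wording and audience of each message would differ per phase.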
The Incident Report
Drafting the incident report is the most important phase in the incident management process, since it promotes transparency and forces continuous improvement.
It has nothing to do with drafting lengthy documentation that no one ever reads. The team should decide when a report is useful.
- Provides a comprehensive report to all the stakeholders
- Promotes transparency between technical and non-technical members
- Discovers problem identification formulas and solution techniques
- Identifies and prioritises frequently occurring or related problems
- Fosters continuous improvement
All incident reports should follow a predefined structure, ensuring consistency and making the drafting exercise very simple.
Summary
A short description of the incident, longer than a tweet but shorter than a blog post. It should be limited to the essential information about the incident, and should answer the following questions: What caused it? How was it resolved? What was the impact?
Timeline
A detailed report of the exact times of all the events and communications related to the incident.
Root cause
A detailed description of the actual problem that caused the incident, with as many details as possible. Include logs, code snippets, screenshots, etc.
Resolution and recovery
A detailed description of all the actions taken by all the parties involved in diagnosing and resolving the incident, including the actions that were unsuccessful.
Corrective and preventive measures
List all measures that should be considered to prevent similar incidents in the future. Don’t plan the measures or implement them, just document them.
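As a rough sketch, the predefined structure could be captured in a small data class so that every report has the same sections in the same order. The `IncidentReport` class, its `render` method, and the field names `summary`, `timeline`, `root_cause`, and `measures` are labels chosen here for illustration, and the sample values in the usage note are made up.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    summary: str
    timeline: list = field(default_factory=list)  # (time, event) pairs
    root_cause: str = ""
    resolution_and_recovery: str = ""
    measures: list = field(default_factory=list)  # documented, not planned

    def render(self) -> str:
        """Render the sections as plain text in a fixed order."""
        lines = ["Summary", self.summary, "", "Timeline"]
        lines += [f"{time} {event}" for time, event in self.timeline]
        lines += ["", "Root cause", self.root_cause]
        lines += ["", "Resolution and recovery", self.resolution_and_recovery]
        lines += ["", "Corrective and preventive measures"]
        lines += [f"- {measure}" for measure in self.measures]
        return "\n".join(lines)
```

A report for the SSL example earlier might be built with `IncidentReport(summary="...", timeline=[("10:02", "Monitoring alert fired")], ...)` and shared by calling `render()`. A fixed template like this keeps the drafting exercise simple, because the author only fills in the sections.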
A formal structure will help a lot to ensure incidents are dealt with effectively, but it is definitely not a flawless process. There are some guidelines that should be considered.
A good recipe does not always make for a good meal.
#1 Track everything
This is a very useful skill in any problem-solving scenario, as it holds many obvious and hidden benefits. In these cases less is not more; more is more. Each event should be tracked with all the available details, including internal and external players, exact times, debugging information, and reference links. Even a live video recording with a voice-over by Sir David Attenborough could be useful.
Fortunately, most communication channels will help a lot when reflecting on the events. Try to stick to the tracked channels and make notes about unmonitored calls and conversations.
- Avoid the recursive spiral of revisiting failed solutions
- Identify events that caused other problems
- Draft the incident report effortlessly
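One way to picture this kind of tracking is a simple append-only event log. The sketch below is an assumption-laden illustration, not a recommendation of a specific tool: `TrackedEvent`, `IncidentLog`, and their fields are hypothetical names, and a real team would more likely rely on its issue tracker or chat history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TrackedEvent:
    at: datetime
    actor: str
    description: str
    successful: bool = True
    links: list = field(default_factory=list)


class IncidentLog:
    def __init__(self):
        self.events = []

    def track(self, actor, description, successful=True, links=None, at=None):
        """Record an event; defaults to the current UTC time."""
        event = TrackedEvent(
            at=at or datetime.now(timezone.utc),
            actor=actor,
            description=description,
            successful=successful,
            links=list(links or []),
        )
        self.events.append(event)
        return event

    def failed_attempts(self):
        """Solutions already tried without success, so nobody revisits them."""
        return [e.description for e in self.events if not e.successful]

    def timeline(self):
        """Events in chronological order, ready for the incident report."""
        return [
            f"{e.at.isoformat(timespec='seconds')} {e.actor}: {e.description}"
            for e in sorted(self.events, key=lambda e: e.at)
        ]
```

Keeping failed attempts in the same log as everything else is a deliberate choice here: it supports the first benefit above and feeds directly into the report's timeline.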
#2 Be completely transparent
Don’t exaggerate or conceal anything. Even the wrong turns should be tracked and reported on, otherwise you won’t be able to reflect on the journey. Details that are omitted might encumber not only the solutions, but also the subsequent corrective and preventive measures.
#3 Don’t point fingers
The focus should be on solving the problem and not on placing blame. The goal is to identify, solve, and prevent. These events are mostly caused by poor systems or processes and have multiple layers of responsibility. Identifying the flaws in the system rather than in the individuals will go a long way towards preventing such events in the future.
#4 Don’t draft the incident report prematurely
It might feel natural to draft the report while the incident is being resolved, but that should be avoided. While in the midst of problem solving, your perspective will be influenced, and that might lead to inaccurate and biased findings.
The report should not influence the solution, rather the other way around.