The purpose of a root cause analysis in either the IT operations or information security fields is to gain insight into the source(s) of a problem with the goal of preventing recurrence. A root cause analysis should be performed after an incident has been resolved, not during it. While the incident is in progress, nobody should be distracted: the primary focus of all involved should be on restoring service and eliminating business impact. Once business operations have returned to normal, the next steps are to collect any relevant information and hold a debrief in preparation for a formal root cause analysis.
Organizations often use the 5 Whys method to determine how an incident occurred. Asking "why?" several times in succession helps drill down to what caused a problem rather than simply restating the problem itself.
Example of the 5 Whys method:
The company file and print server was infected with a worm <- Why?
The server was not patched with the latest Microsoft patches <- Why?
The automated server that deploys the patches has been broken for a month and is not operational <- Why?
The change to upgrade it 5 weeks ago was unsuccessful and no additional action was taken to correct it <- Why?
The change to the server was not properly planned or documented and the engineers were unaware that the upgrade activity had occurred <- Why?
Proper change control processes were not followed
The only time you are likely to hear more whys is in a car with several small children who you are trying to explain something to.
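The chain in the example above can also be recorded as structured data so that it is easy to store and review later. The following is a minimal Python sketch, not a standard tool: the class name FiveWhys and its methods are hypothetical names chosen for illustration, and it treats the last answer recorded as the working root cause, which is an assumption that only holds if the team agrees the chain has gone deep enough.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FiveWhys:
    """A problem statement plus the ordered answers to each 'why?'."""
    problem: str
    answers: List[str] = field(default_factory=list)

    def ask_why(self, answer: str) -> "FiveWhys":
        # Record the answer to the next "why?" and allow chaining.
        self.answers.append(answer)
        return self

    def root_cause(self) -> Optional[str]:
        # Assumption: the deepest answer recorded is the working root cause.
        return self.answers[-1] if self.answers else None

    def render(self) -> str:
        # Indent each answer one level deeper than the previous one.
        lines = [f"Problem: {self.problem}"]
        for depth, answer in enumerate(self.answers, start=1):
            lines.append(f"{'  ' * depth}Why {depth}: {answer}")
        return "\n".join(lines)


# The worm example from the text, expressed as a chain.
example = (
    FiveWhys("The company file and print server was infected with a worm")
    .ask_why("The server was not patched with the latest Microsoft patches")
    .ask_why("The automated patch deployment server has been broken for a month")
    .ask_why("The upgrade change 5 weeks ago failed and was never corrected")
    .ask_why("The change was not planned or documented")
    .ask_why("Proper change control processes were not followed")
)
```

Calling example.render() produces an indented outline of the chain, and example.root_cause() returns the final answer, here the change control failure.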
Tips for conducting an effective root cause analysis
- A root cause analysis should be performed as soon after an incident as is practical, allowing time for the needed prework and for attendees to be scheduled. Extended delays increase the likelihood that the incident will not be well remembered and that the momentum to correct it will be lost.
- Conduct sufficient prework to document the incident and actions taken during the event. Review the documentation with those involved for factual accuracy.
- Schedule the root cause analysis so that key individuals are available to attend.
- A standardized form should be utilized when conducting a root cause analysis. Ideally this information will be stored in an application or database so that metrics are easily generated to allow for long term improvement tracking.
Common errors that occur in the root cause analysis process
- Failing to properly document the facts around the incident in a timely manner
- Failing to understand the difference between correlated facts and causation
- Not driving to a deep enough level and simply recording what happened vs. why it happened and how it can be prevented.
- Not tracking improvement tasks to make sure they have been completed as expected
- Not auditing the root cause analysis process for quality
Tips if root cause cannot be determined
- Determine what additional information should be collected next time and develop a process for collecting it in case the event recurs.
- Do not assign an incorrect root cause just for a false sense of completeness. Recognize that not every incident can be attributed to a root cause on the first pass, and make a plan to investigate effectively if the issue recurs.
Sample Root Cause Analysis Form
Statement of issue: Describe the problem that occurred
Chronology of events: Detail events that occurred with specific timelines and actions taken during the incident
Business Impact: Define and quantify the problem from a business perspective
Participants: Document individuals that participated in the root cause analysis
Corrective Actions: List each action with the name of the individual responsible for completing it and the date completed
Lessons Learned: Document to enable future improvements
Other areas with similar exposure: Document so the same incident does not have to be experienced multiple times in different operating areas
Contributing Causes: Items that may not be the root cause but were contributing factors that need correction
Was the incident a repeat event?
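If the form is kept in an application rather than on paper, its fields map naturally onto a small record type. The sketch below is one possible Python shape under stated assumptions: the class and field names (RootCauseAnalysis, CorrectiveAction, open_actions and so on) are hypothetical, and a corrective action with no completion date is assumed to still be open.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import List, Optional, Tuple


@dataclass
class CorrectiveAction:
    description: str
    owner: str                              # individual responsible
    date_completed: Optional[date] = None   # None is assumed to mean "still open"


@dataclass
class RootCauseAnalysis:
    statement_of_issue: str
    business_impact: str
    # Each chronology entry is (timestamp, what happened or what was done).
    chronology: List[Tuple[datetime, str]] = field(default_factory=list)
    participants: List[str] = field(default_factory=list)
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    similar_exposure: List[str] = field(default_factory=list)
    contributing_causes: List[str] = field(default_factory=list)
    repeat_event: bool = False

    def open_actions(self) -> List[CorrectiveAction]:
        # Corrective actions that have not yet been marked complete.
        return [a for a in self.corrective_actions if a.date_completed is None]


# Illustrative record only; the owners and dates are made up for the sketch.
rca = RootCauseAnalysis(
    statement_of_issue="File and print server infected with a worm",
    business_impact="File and print services unavailable",
    corrective_actions=[
        CorrectiveAction("Repair the patch deployment server", "A. Engineer",
                         date_completed=date(2020, 1, 15)),
        CorrectiveAction("Retrain staff on change control", "B. Manager"),
    ],
)
```

Here rca.open_actions() returns only the retraining task, which makes it straightforward to follow up on improvement tasks that have not been completed.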
Final thoughts on root cause analysis
If you are capturing your root cause analysis in a database it may be useful to track many other items for reporting and improvement metrics. Some of these items might include:
- Incident # (to link back to your problem management system)
- Incident Status
- Incident Start and End Time
- Location/Country/Region of incident
- Incident category (application/server/etc.)
- Service affected
- Organization owner of incident
- Type of problem
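As a rough illustration of how these tracked items turn into reporting metrics, the Python sketch below groups incident records by one of the fields and computes an average incident duration. The record type IncidentRecord, its field names, and the helper functions count_by and mean_duration_hours are hypothetical names invented for this example; a real problem management system will have its own schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class IncidentRecord:
    incident_id: str     # links back to the problem management system
    status: str
    start: datetime
    end: datetime
    location: str
    category: str        # application, server, etc.
    service: str
    owner_org: str
    problem_type: str


def count_by(records: List[IncidentRecord], attribute: str) -> Dict[str, int]:
    # Count incidents per distinct value of the given field, e.g. "category".
    return dict(Counter(getattr(r, attribute) for r in records))


def mean_duration_hours(records: List[IncidentRecord]) -> float:
    # Average time from incident start to end, in hours (0.0 if no records).
    durations = [(r.end - r.start).total_seconds() / 3600 for r in records]
    return sum(durations) / len(durations) if durations else 0.0


# Illustrative data only.
records = [
    IncidentRecord("INC-1", "closed", datetime(2020, 1, 1, 8), datetime(2020, 1, 1, 10),
                   "US", "server", "file and print", "Infrastructure", "malware"),
    IncidentRecord("INC-2", "closed", datetime(2020, 2, 1, 8), datetime(2020, 2, 1, 12),
                   "UK", "application", "email", "Messaging", "outage"),
]
```

With this sample data, count_by(records, "category") gives one server incident and one application incident, and mean_duration_hours(records) averages a two-hour and a four-hour incident.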
Effectively performing a root cause analysis is one of the most important things you can do to improve operations and drive a continuous operations improvement mindset.