How to run an incident management process

The primary purpose of an incident management process in the IT operations or security fields is to quickly restore normal service operations to minimize the impact on normal business operations. Here is a rundown of a typical incident response situation:

1. A business-critical operations incident or a security-related incident is reported to the help desk by an end user or an automated monitoring system. The help desk should get detailed information about the exact nature of the problem, including a clear problem statement of what is not working, and record the specifics in the problem log.

2. The help desk reviews the issue and the support scripts, determines the business impact, and decides whether the issue should be escalated as a high-priority item.

3. The help desk follows the documented escalation process and begins to form a system restoration team (as detailed by the application/system support script).

4. The system restoration team assembles on a designated global phone bridge, with the intent of getting everyone needed for system restoration on the line.

5. For a high-priority application problem that is not yet clearly defined, it is typical to get the end-to-end support team on the line. Typical system restoration participants are:

  • Application support team member
  • Server support team member
  • Database support team member
  • Network/Firewall team member
  • Someone who can test functionality/items as needed (often a business user)
  • Facilitator of the incident response call

6. The facts surrounding the event are discussed with the combined team so everyone is aligned on the problem that needs to be solved. The incident response facilitator should be the primary voice of the system restoration team and keep the team on track toward the primary goal: restoring normal business operations.

7. Depending on the severity of the problem, it is important to keep relevant stakeholders updated on progress and on the expected duration of the problem (if known). Communicating effectively during an incident sets proper expectations, keeps those affected informed, and helps minimize the likelihood of unneeded political escalation of the event.

8. It is best practice to keep the phone bridge open until the problem is resolved, to maintain problem-solving momentum. If the problem is expected to run too long to make that practical, define the needed update times and schedule the sessions accordingly.

9. It is important to validate that the service has been restored to normal prior to disbanding the system restoration team. This is best done by validating with an end user on the bridge.

10. Before terminating the call the team should make sure the incident diary is updated with information about what was done to resolve the problem. In addition, any information needed for the RCCA should be assembled while the incident is still fresh in everyone’s mind.
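Steps 4 and 5 amount to checking that every needed role is represented on the bridge. Here is a minimal sketch in Python, assuming a hypothetical `missing_roles` helper; the role labels paraphrase the participant list above and the attendee names are made up for illustration.

```python
from typing import Dict, List

# Roles paraphrased from the typical participant list (labels are illustrative).
REQUIRED_ROLES = [
    "application support",
    "server support",
    "database support",
    "network/firewall",
    "functional tester",
    "facilitator",
]


def missing_roles(on_bridge: Dict[str, str]) -> List[str]:
    """Return the required roles that nobody on the bridge covers yet.

    on_bridge maps a person's name to the role they are covering.
    """
    covered = set(on_bridge.values())
    return [role for role in REQUIRED_ROLES if role not in covered]


# Hypothetical attendees, for illustration only.
bridge = {
    "Alex": "facilitator",
    "Sam": "application support",
    "Priya": "database support",
}
still_needed = missing_roles(bridge)
print(still_needed)
```

The facilitator could run a check like this at the start of the call and keep paging until the list comes back empty.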

Important points about the Incident Management process

  • There is sometimes a tradeoff between quicker restoration and collecting system logs and other information needed to find a root cause of the problem. This conflict should be managed according to the likelihood of finding a true root cause (which is very desirable to prevent future problems) versus the need for faster restoration of the affected service.
  • It is important that the system restoration team facilitator be in charge of leading the assembled resources to maintain an orderly process. Too many chefs in the kitchen will not help restore service in a more timely manner.
  • Documenting the problem ticket regularly through the process is important for tracking status, communicating updates, and as a source of data for the future root cause analysis.
  • Opening a group chat room for the system restoration team is a good way to share technical information without sidetracking the phone bridge that is directing resolution of the problem. It also serves as a nice log for the problem diary and a potential source of information for the root cause analysis.
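The incident diary and the regular ticket updates described above can be as simple as a timestamped log. Here is a minimal sketch in Python, assuming hypothetical `DiaryEntry` and `IncidentDiary` classes; it is not tied to any particular ticketing or chat product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DiaryEntry:
    timestamp: datetime
    author: str
    note: str


@dataclass
class IncidentDiary:
    incident_id: str
    problem_statement: str
    entries: List[DiaryEntry] = field(default_factory=list)

    def add_entry(self, author: str, note: str,
                  when: Optional[datetime] = None) -> DiaryEntry:
        """Record an update; defaults to the current UTC time."""
        entry = DiaryEntry(when or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def chronology(self) -> List[str]:
        """Return entries oldest first, ready to paste into the RCA form."""
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return [f"{e.timestamp.isoformat()} {e.author}: {e.note}" for e in ordered]


# Hypothetical incident, for illustration only.
diary = IncidentDiary("INC-0001", "Users cannot print to the shared printers")
diary.add_entry("help desk", "Issue reported by end user; escalated")
diary.add_entry("facilitator", "Bridge opened; restoration team assembling")
for line in diary.chronology():
    print(line)
```

Keeping the entries timestamped means the chronology section of the root cause analysis can be assembled from the log rather than from memory.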

How to perform a root cause analysis

The purpose of a root cause analysis in either the IT operations or information security fields is to gain insight into the source(s) of a problem with the goal of preventing recurrence. A root cause analysis should be performed after an incident has been responded to and not during. During the incident individuals should not be distracted and the primary focus of all involved should be on the restoration of service and the elimination of business impact. Once business operations have returned to normal the next steps should be to collect any relevant information and do a debrief in preparation for a formal root cause analysis.

Organizations often use the 5 Whys method to determine how an incident occurred. Asking "why" several times helps drill down to what caused a problem rather than simply restating the problem itself.

Example of the 5 Whys method:

The company file and print server was infected with a worm <- Why?

The server was not patched with the latest Microsoft patches <- Why?

The automated server that deploys the patches has been broken for a month and is not operational <- Why?

The change to upgrade it 5 weeks ago was unsuccessful and no additional action was taken to correct it <- Why?

The change to the server was not properly planned or documented and the engineers were unaware that the upgrade activity had occurred <- Why?

Proper change control processes were not followed

The only time you are likely to hear more whys is in a car with several small children who you are trying to explain something to.
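If you record root cause analyses electronically, the chain of whys can be stored as an ordered list, with the first item being the problem and the last item the candidate root cause. Here is a minimal sketch in Python; the `format_whys` helper is a hypothetical name and the statements are shortened from the example above.

```python
from typing import List


def format_whys(chain: List[str]) -> str:
    """Render a 5 Whys chain: problem first, candidate root cause last."""
    if len(chain) < 2:
        raise ValueError("need a problem statement and at least one answer")
    lines = [f"Problem: {chain[0]}"]
    for depth, answer in enumerate(chain[1:], start=1):
        lines.append(f"Why #{depth}: {answer}")
    lines.append(f"Candidate root cause: {chain[-1]}")
    return "\n".join(lines)


# Statements shortened from the worm example above.
chain = [
    "The company file and print server was infected with a worm",
    "The server was not patched with the latest Microsoft patches",
    "The automated patch deployment server had been broken for a month",
    "The upgrade 5 weeks ago failed and nothing was done to correct it",
    "The change was not properly planned or documented",
    "Proper change control processes were not followed",
]
report = format_whys(chain)
print(report)
```

The last answer is labeled a candidate on purpose: the team still has to confirm that it caused the problem rather than merely correlating with it.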

Tips for conducting an effective root cause analysis

  • A root cause analysis should be performed as soon after an incident as is practical, allowing time for the needed prework and for attendees to be scheduled. Extended delays increase the likelihood that the incident will not be well remembered and that the momentum to correct it will be lost.
  • Conduct sufficient prework to document the incident and actions taken during the event. Review the documentation with those involved for factual accuracy.
  • Schedule the root cause analysis so that key individuals are available to attend.
  • A standardized form should be utilized when conducting a root cause analysis. Ideally this information will be stored in an application or database so that metrics are easily generated to allow for long term improvement tracking.

Common errors that occur in the root cause analysis process

  • Failing to properly document the facts around the incident in a timely manner
  • Failing to understand the difference between correlated facts and causation
  • Not driving to a deep enough level and simply recording what happened vs. why it happened and how it can be prevented.
  • Not tracking improvement tasks to make sure they have been completed as expected
  • Not auditing the root cause analysis process for quality

Tips if root cause cannot be determined

  • Determine what additional information should be collected next time and develop a process for collecting the needed information in case the event reoccurs.
  • Do not assign a root cause that is not correct just for a false sense of completeness. Recognize that not every incident can be attributed to a root cause on the first pass, and make a plan to be effective if the issue recurs.

Sample Root Cause Analysis Form

Statement of issue: Describe the problem that occurred

Chronology of events: Detail events that occurred with specific timelines and actions taken during the incident

Business Impact: Define and quantify the problem from a business perspective

Participants: Document individuals that participated in the root cause analysis

Corrective Actions: List each action with the responsible individual's name and the date completed

Lessons Learned: Document to enable future improvements

Other areas with similar exposure: Document so the same incident does not have to be experienced multiple times in different operating areas

Contributing Causes: Items that may not be the root cause but were contributing factors that need correction

Was the incident a repeat event?
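If the form is kept electronically, its fields map naturally onto a small record type. Here is a minimal sketch in Python, assuming hypothetical `RootCauseAnalysis` and `CorrectiveAction` classes; the field names mirror the form above and the sample values are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CorrectiveAction:
    description: str
    owner: str                            # individual responsible for completing it
    completed_on: Optional[date] = None   # None until the action is done


@dataclass
class RootCauseAnalysis:
    statement_of_issue: str
    chronology: List[str]
    business_impact: str
    participants: List[str]
    corrective_actions: List[CorrectiveAction] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    similar_exposure: List[str] = field(default_factory=list)
    contributing_causes: List[str] = field(default_factory=list)
    repeat_event: bool = False

    def open_actions(self) -> List[CorrectiveAction]:
        """Corrective actions that have not been completed yet."""
        return [a for a in self.corrective_actions if a.completed_on is None]


# Invented sample values, for illustration only.
rca = RootCauseAnalysis(
    statement_of_issue="File and print server infected with a worm",
    chronology=["09:00 worm detected", "11:30 server cleaned and patched"],
    business_impact="Printing unavailable for one office for 2.5 hours",
    participants=["Facilitator", "Server support", "Security"],
    corrective_actions=[
        CorrectiveAction("Repair patch deployment server", "J. Doe", date(2024, 1, 15)),
        CorrectiveAction("Audit change control compliance", "A. Smith"),
    ],
)
print([a.description for a in rca.open_actions()])
```

Tracking the completion date on each action makes it easy to follow up on improvement tasks that were agreed on but never finished.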

Final thoughts on root cause analysis

If you are capturing your root cause analysis in a database it may be useful to track many other items for reporting and improvement metrics. Some of these items might include:

  • Incident # (to link back to your problem management system)
  • Incident Status
  • Incident Start and End Time
  • Location/Country/Region of incident
  • Incident category (application/server/etc.)
  • Service affected
  • Organization owner of incident
  • Type of problem
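As one way to picture this, the items above could be columns in a single table that you query for improvement metrics. Here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the table name, column names, and sample rows are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE incident (
        incident_number  TEXT PRIMARY KEY,  -- links back to the problem management system
        status           TEXT NOT NULL,
        start_time       TEXT NOT NULL,     -- 'YYYY-MM-DD HH:MM:SS'
        end_time         TEXT,              -- NULL while the incident is open
        location         TEXT,
        category         TEXT,
        service_affected TEXT,
        owner_org        TEXT,
        problem_type     TEXT
    )
""")

# Invented sample rows, for illustration only.
rows = [
    ("INC-001", "closed", "2024-01-05 08:00:00", "2024-01-05 10:00:00",
     "US", "server", "file/print", "Infrastructure", "malware"),
    ("INC-002", "closed", "2024-02-10 13:00:00", "2024-02-10 17:00:00",
     "DE", "server", "email", "Infrastructure", "hardware"),
    ("INC-003", "closed", "2024-03-01 09:00:00", "2024-03-01 09:30:00",
     "US", "application", "order entry", "Applications", "bad release"),
    ("INC-004", "open", "2024-03-02 09:00:00", None,
     "US", "application", "order entry", "Applications", "unknown"),
]
conn.executemany("INSERT INTO incident VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)


def hours_to_restore_by_category(db: sqlite3.Connection):
    """Return (category, closed incident count, average hours to restore)."""
    query = """
        SELECT category,
               COUNT(*),
               AVG((julianday(end_time) - julianday(start_time)) * 24.0)
        FROM incident
        WHERE end_time IS NOT NULL
        GROUP BY category
        ORDER BY category
    """
    return [(cat, n, round(avg, 2)) for cat, n, avg in db.execute(query)]


metrics = hours_to_restore_by_category(conn)
for category, count, avg_hours in metrics:
    print(category, count, avg_hours)
```

Open incidents are excluded from the average because they have no end time yet; the same query can be grouped by location or service instead of category.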

Effectively performing a root cause analysis is one of the most important things you can do to improve operations and drive a continuous operations improvement mindset.