The primary purpose of an incident management process in the IT operations or security fields is to quickly restore normal service operations to minimize the impact on normal business operations. Here is a rundown of a typical incident response situation:
1. Operations business critical or security related incident is reported to the help desk by an end user or automated monitoring system. It is important for the help desk to get detailed information about the exact nature of the problem including a detailed problem statement of what is not working. The help desk should document the specifics into the problem log for documentation purposes.
2. Help desk reviews issue and support scripts and determines the business impact of the issue and if the issue should be escalated as a high priority item
3. Help desk follows documented escalation process and begins to form a system restoration team (as detailed by the application/system support script)
4. System restoration team assembles on a designated global phone bridge with intent of getting all people necessary for system restoration.
5. For a high priority application type problem without a clearly defined problem it is typical to get the end to end support team on the line. Typical system restoration participants are
- Application support team member
- Server support team member
- Database support team member
- Network/Firewall team member
- Someone who can test functionality/items as needed (often a business user)
- Facilitator of the incident response call
6. Facts surrounding the event are discussed with the combined team so everyone is aligned on the problem that needs to be solved. The incident response facilitator should be the primary voice of the system response team and keep the team on track with the primary goal to restore normal business operations
7. Depending on the severity of the problem it is important to keep relevant stakeholders updated to the progress and expected duration of the problem (if known). Communicating effectively is one of the most important things that needs to be done during an incident to set proper expectations and keep those affected informed. Effective communication is one of the key things that can be done to help minimize the likelihood of unneeded political escalation of the event.
8. It is best practice to keep the phone bridge open until the problem is resolved to maintain problem solving momentum. If the problem is expected to run too long to make that practical it is good to define the needed update times and schedule the sessions as needed.
9. It is important to validate that the service has been restored to normal prior to disbanding the system restoration team. This is best done by validating with an end user on the bridge.
10. Before terminating the call the team should make sure the incident diary is updated with information about what was done to resolve the problem. In addition, any information needed for the RCCA should be assembled while the incident is still fresh in everyone’s mind.
Important points about the Incident Management process
- There is sometimes a tradeoff between quicker restoration vs. collecting system log and other information in event to find a root cause of the problem. This conflict should be managed appropriately depending on the likelihood of finding a true root cause (which is very desirable to prevent future problems) vs. faster restoration of the affected service.
- It is important that the system restoration team facilitator be in charge of leading the assembled resources to maintain an orderly process. Too many chefs in the kitchen will not help restore service in a more timely manner.
- Documenting the problem ticket regularly through the process is important for tracking status, communicating updates, and as a source of data for the future root cause analysis.
- Opening a group chat room for the system restoration team is a good way to share technical information without sidetracking the phone bridge directing resolution of the problem. It also serves as a nice log for the problem diary and a potential source of information for the root cause analysis.