What is an incident?
Lifecycle and status
An incident is a problem or issue, generally triggered by the technical components of an organization's infrastructure. Incidents can have a widespread impact on the organization and can result in major business losses. If they are not attended to quickly, the impact keeps growing. Companies therefore need to be made aware of issues in real time through an incident response service like TaskCall, and to use its features to initiate a rapid response.
Incidents can occur on a service or can be assigned to escalation policies directly. Once alerts are received, TaskCall immediately determines who the correct on-call responders are and sends notifications to them through email, push notifications, SMS, voice calls and any integrated chat-ops tools. TaskCall keeps notifying the responders persistently and escalates to the secondary support if need be, to ensure that the incident is attended to.
Each organization can have its own guidelines on how an incident should be handled, but for those who are just getting started, here is an example of the most basic incident handling flow that responders can follow:
- Acknowledge the incident once you receive the notifications. This will let others know that you are working on the incident and that you have it under control.
- Snooze the incident if you need more time to work on it.
- Resolve the incident once you are done fixing the issue.
Of course, the above example is a very basic flow. More complex incidents can call for further actions, such as escalating, reassigning, adding responders or posting status updates to keep stakeholders informed. To learn more about the broad range of actions that can be taken, please read our documentation on incident actions.
Issues are detected by monitoring tools in your system and are passed to TaskCall through our incidents API or integrations. As soon as an alert is received, TaskCall triggers an incident based on the details of the alert. Incidents can also be triggered manually by users from the web app or mobile app, by email, or from chat-ops tools that the organization has integrated. In addition, incidents can be configured manually to trigger at a later time through pre-scheduled alerts.
The on-call responders have a few minutes to acknowledge the incident. If the incident is not acknowledged within that time, it re-triggers or escalates to the next level as per the policy. The number of minutes that TaskCall waits before re-triggering or escalating is set by the wait minutes in the escalation policy.
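The escalation timing described above can be sketched as follows. This is only an illustration, not TaskCall's implementation: the `EscalationLevel` class, the `level_to_notify` function and the rule that the last level keeps being notified once every wait has elapsed are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationLevel:
    """Hypothetical model of one level of an escalation policy."""
    responders: List[str]
    wait_minutes: int  # minutes to wait for an acknowledgement at this level


def level_to_notify(levels: List[EscalationLevel], minutes_unacknowledged: int) -> int:
    """Return the index of the level that should currently be notified.

    Each level is given its own wait_minutes before the next one takes over.
    Assumption: once every wait has elapsed, the last level keeps being notified.
    """
    elapsed = 0
    for index, level in enumerate(levels):
        elapsed += level.wait_minutes
        if minutes_unacknowledged < elapsed:
            return index
    return len(levels) - 1


# Example policy: primary on-call gets 5 minutes, secondary gets 10.
policy = [
    EscalationLevel(responders=["primary"], wait_minutes=5),
    EscalationLevel(responders=["secondary"], wait_minutes=10),
]
```

With this example policy, an incident that has gone 4 minutes without acknowledgement is still with the primary level, and from minute 5 onward it is with the secondary level.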
If the incident is acknowledged, the responder has a set number of minutes to work on it before TaskCall re-triggers the notifications for the alert. The number of wait minutes in this case is determined by the settings of the service the incident was triggered on. If the incident was not triggered on a service, the default wait time of 15 minutes is used.
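As a minimal sketch of the rule above, the function below computes when notifications would be re-triggered after an acknowledgement. The function name and its parameters are hypothetical. Only the 15-minute default comes from the text.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default from the text: used when the incident was not triggered on a service.
DEFAULT_WAIT_MINUTES = 15


def retrigger_time(acknowledged_at: datetime,
                   service_wait_minutes: Optional[int] = None) -> datetime:
    """Return the time at which notifications are re-triggered after an acknowledgement.

    service_wait_minutes is the (hypothetical) setting on the service;
    pass None when the incident was not triggered on a service.
    """
    if service_wait_minutes is None:
        wait = DEFAULT_WAIT_MINUTES
    else:
        wait = service_wait_minutes
    return acknowledged_at + timedelta(minutes=wait)
```

For example, an incident acknowledged at 12:00 with no service attached would re-trigger at 12:15, while one on a service configured with 30 wait minutes would re-trigger at 12:30.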
Once the issue that triggered the incident has been fixed, the incident should be resolved. This will stop any more notifications from being sent out. Resolving an incident cannot be undone.
An incident has exactly one of three possible statuses at any time:
- Open - An incident is in its “open” state when the current assignees or responders are yet to acknowledge it. It means that somebody still needs to look at the issue and attend to it. Incidents that were previously acknowledged but have not been resolved within the stipulated re-triggering time also revert to the “open” state. Open incidents are marked in red in the list where they are displayed.
- Acknowledged - An incident moves to this state when a responder acknowledges it. This state implies that the triggered alerts have been noticed and the incident is being worked on. Apart from acknowledging the incident directly, snoozing it also acknowledges it automatically. This status persists for a given number of minutes, after which alerts are re-triggered and the incident reverts to the “open” state as mentioned above. Acknowledged incidents are marked in yellow in the list where they are displayed.
- Resolved - Incidents that have been resolved completely are labelled with the “resolved” status. Responders should only “resolve” an incident once they are finished working on it because no notifications are sent out for the incident after it is resolved. Resolved incidents are marked in green where they are displayed.
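The three statuses and the moves between them can be pictured as a small state machine. The sketch below is illustrative only: the action names (`acknowledge`, `snooze`, `resolve`, `timeout`) are hypothetical labels, and allowing an open incident to be resolved directly is an assumption that the text does not confirm.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


# (current status, action) -> next status. Action names are hypothetical.
TRANSITIONS = {
    (Status.OPEN, "acknowledge"): Status.ACKNOWLEDGED,
    (Status.OPEN, "snooze"): Status.ACKNOWLEDGED,        # snoozing also acknowledges
    (Status.OPEN, "resolve"): Status.RESOLVED,           # assumption: direct resolve is allowed
    (Status.ACKNOWLEDGED, "snooze"): Status.ACKNOWLEDGED,
    (Status.ACKNOWLEDGED, "timeout"): Status.OPEN,       # wait minutes elapsed without resolution
    (Status.ACKNOWLEDGED, "resolve"): Status.RESOLVED,
    # No entries start from RESOLVED: resolving cannot be undone.
}


def apply_action(status: Status, action: str) -> Status:
    """Return the status that follows from applying an action, or raise ValueError."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"cannot apply {action!r} to an incident that is {status.value}")
```

Keeping the transitions in a single table makes the terminal nature of the “resolved” status explicit: any action applied to a resolved incident is rejected.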