Bot for Incident Management

During an incident you lose money every second (Amazon reportedly loses almost $220k per minute). At many companies incident handling is mostly manual and, as a result, error-prone. So the question arises: can a bot manage it? The answer is yes, at least partially. A Slack bot can drive every step of an incident: it won't forget a step (so the team follows the pre-defined playbook), and it saves a lot of time by automating some of the tasks. The specific tools I mention here may vary, but the idea behind the bot remains the same. Here are a few points to ponder while building the chat bot:

  • The workflow should move through a small set of statuses in a fixed order, such as: open, investigating, fixing, fixed, postmortem, closed. There should be at least 4 or 5 severity levels, based on the effect an incident can cause, and a postmortem should be done only for the few high-severity incidents. The workflow should also define a few roles, such as: Incident commander, Communication lead, Operations lead.

  • When a real incident happens, the bot should open the incident in a dedicated Slack channel. PagerDuty, which holds the names and emails of the responsible people, should call them to join the channel, where they get the required documents and a form to fill in. For small incidents the bot should close the incident automatically; for major issues it should notify the on-call folks to start a discussion for better management.

  • Once the form is submitted, the bot should post the incident in the global channel and ask specific channels to join for further help. It should also activate PagerDuty, publish the incident on the status page, and activate Datadog to keep notes for this event.

  • A postmortem is a collaborative process that builds a culture of honesty, learning and accountability. The bot should create a page in Confluence with a consistent naming convention and a fixed postmortem template (that includes the date of the incident, when the incident occurred, the attendees, the timeline, and the people and their roles), where everyone involved in the incident is invited to contribute.

  • Similarly, when closing the incident, the statuses in the different tools should change automatically or via simple commands, in the reverse order of how the bot created them. Once the incident is closed properly, the bot should send the status, the postmortem and almost all the details as a summary to the whole team.
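
To make the points above a bit more concrete, here is a minimal Python sketch of the core bookkeeping such a bot might keep: the status flow, the "postmortem only for high-severity incidents" rule, and undoing tool actions in reverse order on close. Everything named here (`Incident`, `register`, `run_to_close`, the 1 to 5 severity scale, and the cutoff of severity 2 for postmortems) is an assumption made for illustration. The undo actions are plain callables standing in for real Slack, PagerDuty, status page or Datadog calls, which are not shown.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Status(Enum):
    # Declared in the order the workflow moves through them.
    OPEN = "open"
    INVESTIGATING = "investigating"
    FIXING = "fixing"
    FIXED = "fixed"
    POSTMORTEM = "postmortem"
    CLOSED = "closed"


# Assumption: severity 1 is the worst, 5 the mildest,
# and only severities 1 and 2 get a postmortem.
POSTMORTEM_MAX_SEVERITY = 2


@dataclass
class Incident:
    title: str
    severity: int
    status: Status = Status.OPEN
    # (tool name, undo action) pairs, in the order the bot set them up.
    undo_stack: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)

    def needs_postmortem(self) -> bool:
        return self.severity <= POSTMORTEM_MAX_SEVERITY

    def register(self, tool: str, undo: Callable[[], None]) -> None:
        """Remember how to undo something the bot activated for this incident."""
        self.undo_stack.append((tool, undo))

    def teardown(self) -> List[str]:
        """Undo registered actions in reverse order of creation."""
        undone = []
        while self.undo_stack:
            tool, undo = self.undo_stack.pop()
            undo()
            undone.append(tool)
        return undone

    def advance(self) -> Status:
        """Move to the next status, skipping the postmortem for minor incidents."""
        if self.status is Status.CLOSED:
            raise ValueError("incident is already closed")
        order = list(Status)
        nxt = order[order.index(self.status) + 1]
        if nxt is Status.POSTMORTEM and not self.needs_postmortem():
            nxt = Status.CLOSED
        if nxt is Status.CLOSED:
            self.teardown()
        self.status = nxt
        return nxt


def run_to_close(incident: Incident) -> List[str]:
    """Drive an incident to CLOSED and return the statuses it passed through."""
    visited = [incident.status.value]
    while incident.status is not Status.CLOSED:
        visited.append(incident.advance().value)
    return visited


if __name__ == "__main__":
    log: List[str] = []
    minor = Incident("slow dashboard", severity=4)
    for tool in ("slack_channel", "pagerduty", "status_page"):
        # t=tool binds the current name; a real bot would call the tool's API here.
        minor.register(tool, lambda t=tool: log.append("undo " + t))
    print(run_to_close(minor))
    print(log)
```

Keeping the undo actions on a stack means the bot does not need a separate, hand-maintained closing checklist: whatever it activated last gets reverted first. In a real bot, `advance` would probably be triggered by a Slack command rather than a loop.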

Thanks for reading up to here 😊

Sources: Medium Blog, Sendinblue Blog