PagerDuty Inc.

31/07/2024 | News release | Distributed by Public on 31/07/2024 14:36

Reducing Coordination Costs in Incident Response

Incidents can happen anywhere at any time. They can be small, well-defined, and easily contained. They can be large, messy, and complex, like the major outage we saw recently. Or they can be somewhere in between. When incidents occur, mobilizing and coordinating responders is crucial to restoring service, protecting the customer experience, and mitigating business risks.

Beyond the impact on customers, service outages and degradations also have financial implications for an organization, ranging from lost revenue and reputational damage with customers and capital markets to potential compliance fines and penalties. Incidents are expensive! Our research shows that the average incident lasts nearly three hours at an estimated cost of $4,537 per minute, or close to $794,000 per incident. That figure doesn't even account for the damage to the company's brand reputation.
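As a rough check on those figures, dividing the per-incident cost by the per-minute cost implies an average incident of about 175 minutes, a little under three hours. The short Python sketch below reproduces that arithmetic. The function names are illustrative, and it assumes cost accrues at a constant rate per minute, which is a simplification rather than something the research states.

```python
# Figures quoted in the article.
COST_PER_MINUTE_USD = 4_537
COST_PER_INCIDENT_USD = 794_000


def implied_duration_minutes(total_cost: float, cost_per_minute: float) -> float:
    """Incident length implied by a total cost and a per-minute cost."""
    return total_cost / cost_per_minute


def incident_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimated cost of an incident lasting `minutes` (assumes a constant rate)."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    minutes = implied_duration_minutes(COST_PER_INCIDENT_USD, COST_PER_MINUTE_USD)
    print(f"Implied average duration: {minutes:.0f} minutes")
    # Under the same assumption, every 10 minutes cut from recovery
    # is worth 10 * 4,537 = 45,370 USD.
    print(f"Value of recovering 10 minutes sooner: ${incident_cost(10):,.0f}")
```

Under this simple model, any step that shortens coordination by even a few minutes has a cost that can be estimated directly.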

During an active incident, time is money and we want to decrease the impact. Where responders meet and coordinate to respond to an incident plays a key role in the efficiency of the process and the speed of the recovery.

Establish familiar locations
Incidents are inevitable, so we plan for when they happen, not if they will happen. Preparing a contingency plan for your team in advance will improve response times and build confidence in your responders.

If you have ever worked or gone to school in a public building, you've probably participated in a fire drill or some other type of emergency preparedness training. When an alarm sounds, everyone practices what they will do in a real emergency: they follow the appropriate exit routes and meet with their team or classmates in a designated location outside the building, while a coordinator ensures everyone is where they are meant to be.

Your incident response should be similar. Responders should know what to do before an incident ever occurs, including where they should meet to troubleshoot and remediate the incident. Your team will likely do this in your chat application of choice. Integrating a Slack workspace with PagerDuty gives your team all the places they need to coordinate a response, not just for responders but for the rest of the organization.

Responders coordinate in Slack
Anything that speeds up a response process, reduces friction for responders, or alleviates confusion during an incident will lower the overall costs associated with that incident. Coordinating responders using methods they are already familiar with accomplishes these goals.

For teams accustomed to working primarily in a chat environment, such as a Slack workspace, jumping into another environment solely for incident response could impede their ability to quickly engage with the incident. PagerDuty users with a Slack integration can trigger, track, escalate, and resolve incidents right from their existing Slack channels.

The following are some of the many benefits of the Slack integration:

  • Quick Incident Reporting: Streamline incident initiation through both automated telemetry and manual human observation, enabling swift response.
  • Efficient Team Coordination: Teams can use dedicated channels for specific incidents or create on-demand channels for complex, multi-team situations. A static major incident response channel ensures consistent handling of significant events.
  • Rapid Responder Addition: Quickly add subject matter experts (SMEs) and other responders directly from the Slack channel, ensuring timely involvement and minimizing misdirection.
  • Clear Role Assignments: Easily assign roles such as Incident Commander and Scribe, ensuring clarity and continuity even during long incidents or personnel changes.
  • Integrated Actions and Updates: Perform status updates and automation actions within Slack, keeping all team members informed and fostering collaborative troubleshooting.
  • Comprehensive Post-Incident Reviews: All incident data and conversations are automatically recorded, facilitating thorough post-incident reviews and improvements to future incident responses.

Communicating with stakeholders
Incidents can disrupt the entire organization, not just the responders handling the issue. Key individuals - the marketing director delaying an email campaign, or the sales engineer opting for a recorded demo over a live one - often need to stay informed even if they're not directly involved in the response.

Large incidents with a wide "blast radius" can derail productivity across the company for hours or days. While this makes for amusing xkcd comics, it's not great for your goals. It's also not a good use of time for dozens of non-responders to be idling in response channels just in case something happens.

Organizations need clear communication channels to keep all stakeholders informed during long-running incidents without disrupting response efforts. Providing regular updates in a designated location, like a status page or a dedicated Slack channel, ensures everyone is up to date without interfering with their other responsibilities. This includes executive stakeholders, who can receive active notifications about status changes, and customers, who will appreciate timely updates that alleviate concerns and reduce unnecessary support inquiries.

Linking these methods to a single Status Update in PagerDuty reduces the cognitive load on the responding team. They don't need to remember multiple locations, multiple logins, which channels to update, which email lists to inform, or any number of other distracting details.
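The idea of one update reaching every audience can be pictured as a simple fan-out. The sketch below is a hypothetical illustration and not PagerDuty's API: `StatusUpdate`, `StatusBroadcaster`, and the destination names are invented for this example, and in-memory lists stand in for a status page, a Slack channel, and an email list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StatusUpdate:
    """A single stakeholder-facing update (hypothetical shape)."""
    incident_id: str
    message: str


@dataclass
class StatusBroadcaster:
    """Hypothetical fan-out: the responder writes once, every destination receives it."""
    destinations: Dict[str, Callable[[StatusUpdate], None]] = field(default_factory=dict)

    def register(self, name: str, send: Callable[[StatusUpdate], None]) -> None:
        """Link a destination (status page, chat channel, email list) to future updates."""
        self.destinations[name] = send

    def publish(self, update: StatusUpdate) -> List[str]:
        """Send one update to all linked destinations; return the names notified."""
        notified = []
        for name, send in self.destinations.items():
            send(update)
            notified.append(name)
        return notified


if __name__ == "__main__":
    # In-memory stand-ins for real destinations.
    status_page: List[str] = []
    slack_channel: List[str] = []
    exec_email: List[str] = []

    broadcaster = StatusBroadcaster()
    broadcaster.register("status_page", lambda u: status_page.append(u.message))
    broadcaster.register("slack_channel", lambda u: slack_channel.append(u.message))
    broadcaster.register("exec_email", lambda u: exec_email.append(u.message))

    update = StatusUpdate("INC-1", "Mitigation in progress; next update in 30 minutes.")
    print(broadcaster.publish(update))
```

The design point is that destinations are linked once, ahead of time, so the responder's only task during the incident is to write the update itself.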

Information is power. An integrated incident response keeps everyone informed and ensures a smoother, more coordinated effort across your organization.

Learn more about PagerDuty's Incident Management solution.