08/06/2024 | News release | Distributed by Public on 08/06/2024 08:03
The recent global outage reminds us that identifying issues and their impact radius is just the first part of a lengthy remediation process. Incidents are inevitable; how we prepare for and learn from them is what sets teams up to respond more effectively next time. As we saw from the remediation steps taken by enterprises around the world, implementing a known fix across many environments, potentially managed by several distributed teams, can be a gargantuan challenge. Sound incident management practices, coupled with centralized automation standards, can often alleviate some of the pain of these types of incidents. This blog dives into the benefits and challenges of a centralized automation practice for IT operations.
___________________________________________________________________________
In conversations with IT and engineering practitioners and leaders across a variety of industries, one recurring theme is the challenge of implementing effective automation strategies at scale. Let's explore how organizations can strike the right balance between centralization and decentralization when it comes to automation.
The centralization versus decentralization dilemma
Across business and technical operations, there's an inherent tension between centralizing functions for standardization and control, and decentralizing for agility and innovation. This is particularly evident in the realm of automation, where teams constantly seek ways to improve efficiency and reduce manual workloads.
Centralization offers benefits such as:
On the other hand, decentralization provides:
An excellent article on this discussion by AlixPartners lays out many of the pros and cons for each approach:
The automation landscape in modern organizations
In today's IT and software development environments, automation has become ubiquitous. It spans a wide range of activities, from incident response and reliability management to provisioning and reporting. However, the decentralized nature of most organizations has led to a proliferation of diverse automation tools and practices across different teams.
This diversity stems from various factors:
In addition to the diversity of technical automation, each team may vary in the "higher level" or "business" processes surrounding these tasks or the use of automation. For example, some teams may require sign-off or approval from one or more individuals for certain tasks, while others do not. Other teams mandate that all automation is logged in an ITSM tool or that notifications are sent through chat, such as Slack or Microsoft Teams.
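To illustrate how these process differences sit around the automation itself, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: TeamPolicy, RunResult, and run_with_policy are invented names, and the ITSM and chat steps are stand-ins that only record an event rather than calling any real product.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TeamPolicy:
    """Hypothetical per-team process rules surrounding an automated task."""
    requires_approval: bool = False
    log_to_itsm: bool = False
    notify_chat: bool = False


@dataclass
class RunResult:
    executed: bool
    output: Optional[str] = None
    events: List[str] = field(default_factory=list)


def run_with_policy(task: Callable[[], str], policy: TeamPolicy,
                    approved_by: Optional[str] = None) -> RunResult:
    """Run a task under one team's rules (illustrative, not a real API)."""
    events: List[str] = []
    # Teams that require sign-off block the run until someone approves it.
    if policy.requires_approval and approved_by is None:
        events.append("blocked: approval required")
        return RunResult(executed=False, events=events)
    if approved_by is not None:
        events.append(f"approved by {approved_by}")
    output = task()
    if policy.log_to_itsm:
        events.append("itsm: run recorded")       # stand-in for an ITSM call
    if policy.notify_chat:
        events.append("chat: notification sent")  # stand-in for a chat webhook
    return RunResult(executed=True, output=output, events=events)


def restart_service() -> str:
    """Placeholder for whatever script or tool a team actually uses."""
    return "service restarted"


strict = TeamPolicy(requires_approval=True, log_to_itsm=True, notify_chat=True)
relaxed = TeamPolicy()

blocked = run_with_policy(restart_service, strict)
allowed = run_with_policy(restart_service, strict, approved_by="on-call lead")
quick = run_with_policy(restart_service, relaxed)
```

The same task behaves differently depending on which team's policy wraps it, which is the kind of variance a department-wide standard has to account for.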
While this decentralized approach can drive innovation and speed, it also presents challenges in implementing department or organization-wide standards, particularly in areas such as:
There are also prime incident-management cases where a fix needs to be implemented across all environments in the organization, such as the CrowdStrike Falcon outage in July 2024. When there are numerous environments, each with its own methods for performing operations tasks, the time to apply a fix for a service disruption grows substantially.
The impact of Generative AI on automation
Generative AI is transforming the automation landscape by significantly increasing the velocity of automation creation. Tools like GitHub Copilot, ChatGPT, and PagerDuty Advance enable users to generate scripts and playbooks rapidly, reducing development time and accelerating deployment. However, this rapid pace of innovation further amplifies the challenges and risks of decentralized automation:
Finding the right balance
Based on our observations at PagerDuty, we've found that as companies grow, they often benefit from establishing a centralized team or function focused on automation. However, the key is to strike a balance that doesn't impede individual teams' velocity, motivation, and innovation.
Here are some strategies we've seen successful organizations employ:
An orchestration platform can help implement automation standards while still giving teams the autonomy to use their own tools.
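To make the idea of a central orchestration layer concrete, here is a minimal Python sketch under stated assumptions. AutomationRegistry and its methods are hypothetical names, not any product's API, and the registered jobs are placeholders for automation that teams would write in their own tools. The sketch shows one standardized entry point, a shared audit trail, and a way to apply the same job across several environments.

```python
from typing import Callable, Dict, Iterable, List


class AutomationRegistry:
    """Hypothetical central catalog of jobs that teams author themselves."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Callable[[str], str]] = {}
        self.audit_log: List[str] = []

    def register(self, name: str, job: Callable[[str], str]) -> None:
        """Teams publish a job; how it is implemented stays up to them."""
        if name in self._jobs:
            raise ValueError(f"job already registered: {name}")
        self._jobs[name] = job

    def run(self, name: str, environment: str, requested_by: str) -> str:
        """Single standardized entry point, with every run audited."""
        if name not in self._jobs:
            raise KeyError(f"unknown job: {name}")
        self.audit_log.append(f"{requested_by} ran {name} on {environment}")
        return self._jobs[name](environment)

    def run_everywhere(self, name: str, environments: Iterable[str],
                       requested_by: str) -> Dict[str, str]:
        """Apply one known fix across many environments."""
        return {env: self.run(name, env, requested_by) for env in environments}


registry = AutomationRegistry()
# Placeholders for team-owned automation (a shell script, a playbook, etc.).
registry.register("clear-cache", lambda env: f"cache cleared on {env}")
registry.register("restart-api", lambda env: f"api restarted on {env}")

results = registry.run_everywhere("clear-cache", ["staging", "prod-eu"], "l1-responder")
```

A responder only needs to know the job name and the target environments; the team that owns the job keeps control of how it is implemented.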
A real-world example: shifting left while maintaining standards
Many of our customers are working to "shift left" by empowering their development teams to implement runbooks as part of their service ownership. At the same time, they want to provide standardized auto-remediation capabilities for their Level 1 support teams during incident response. Not only will this free up developer time to focus on high-value work, but it will also allow support teams to take action when they need to, rather than waiting on experts to acknowledge and execute.
To achieve this balance, these organizations are leveraging centralized platforms that can orchestrate automation written by dev teams while providing a standardized interface for L1 responders. This approach allows for:
Preparing for the future
By finding the right balance between centralization and autonomy in automation, organizations not only optimize their current operations but also build resilience for future technological shifts, while minimizing the risk of future cascading outages. When the next wave of innovation hits, teams will be better prepared to adopt new tools and practices within a flexible yet standardized framework.
As CIOs and technology leaders, the challenge is to create an environment that fosters innovation and agility while maintaining the necessary controls and standards to mitigate risk in the long run. By leveraging the right platforms and team structures for incident management and automation orchestration, you can achieve this balance and position your organization for long-term success in an increasingly automated world.
At PagerDuty, we're committed to helping organizations navigate these challenges and build resilient and scalable automation strategies. I encourage you to explore how our solutions can support your automation journey and help you strike the right balance for your organization.