PagerDuty Inc.

08/06/2024 | News release | Distributed by Public on 08/06/2024 08:03

Balancing Centralization and Autonomy: The Key to Automation at Scale

The recent global outage reminds us that identifying issues and their impact radius is just the first step in a lengthy remediation process. Incidents are inevitable; how we prepare for and learn from them is what sets teams up to respond more effectively next time. As we saw from the remediation steps taken by enterprises around the world, implementing a known fix across a large number of environments, potentially managed by many distributed teams, can be a gargantuan challenge. Sound incident management practices coupled with centralized automation standards can often alleviate some of the pain of these types of incidents. This blog dives into the benefits and challenges of a centralized automation practice for IT operations.
___________________________________________________________________________

When speaking with IT and Engineering practitioners and leaders across a variety of industries, one recurring theme is the challenge of implementing effective automation strategies at scale. Let's explore how organizations can strike the right balance between centralization and decentralization when it comes to automation.

The centralization versus decentralization dilemma

Across business and technical operations, there's an inherent tension between centralizing functions for standardization and control, and decentralizing for agility and innovation. This is particularly evident in the realm of automation, where teams constantly seek ways to improve efficiency and reduce manual workloads.

Centralization offers benefits such as:

  • Easier implementation of controls and guardrails
  • Holistic visibility for leadership
  • Streamlined implementation of new standards

On the other hand, decentralization provides:

  • Team autonomy to adopt specialized processes and tools
  • Greater velocity in decision-making and execution
  • Flexibility to use "best-of-breed" solutions for specific tasks

An excellent article on this discussion by AlixPartners lays out many of the pros and cons of each approach.

The automation landscape in modern organizations

In today's IT and software development environments, automation has become ubiquitous. It spans a wide range of activities, from incident response and reliability management to provisioning and reporting. However, the decentralized nature of most organizations has led to a proliferation of diverse automation tools and practices across different teams.

This diversity stems from various factors:

  • Heterogeneous technology stacks: VMs versus containers; different database systems
  • Varied skill sets and preferences among team members: some teams prefer to write automation as Python scripts, whereas others prefer Ansible playbooks
  • Distinct responsibilities and processes for different teams

In addition to the diversity of technical automation, each team may vary in the "higher level" or "business" processes that surround these tasks and the use of automation. For example, some teams require sign-off or approval from one or more individuals before certain tasks run, while others do not. Some teams mandate that all automation is logged in an ITSM tool, or that notifications are sent through chat, such as Slack or MS Teams.
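To make these process differences concrete, here is a minimal Python sketch of how per-team rules could be written down as data and checked before a task runs. The names `TeamPolicy`, `RunRecord`, and `run_with_policy` are hypothetical, and the ITSM and chat steps are stand-ins that only record an event rather than calling a real service.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TeamPolicy:
    """Hypothetical per-team rules around running automation."""
    required_approvals: int = 0   # distinct sign-offs needed before a task may run
    log_to_itsm: bool = False     # record the run in an ITSM tool
    notify_chat: bool = False     # post a message to a chat channel


@dataclass
class RunRecord:
    """Collects what happened during a run (stand-in for real integrations)."""
    events: List[str] = field(default_factory=list)


def run_with_policy(task: Callable[[], str], policy: TeamPolicy,
                    approvers: List[str], record: RunRecord) -> Optional[str]:
    """Run `task` only if the policy's approval requirement is met."""
    # Count distinct approvers so one person cannot approve twice.
    if len(set(approvers)) < policy.required_approvals:
        record.events.append("blocked: missing approvals")
        return None
    result = task()
    if policy.log_to_itsm:
        # Assumption: a real system would call an ITSM API here.
        record.events.append(f"itsm: logged result {result!r}")
    if policy.notify_chat:
        # Assumption: a real system would post to a chat webhook here.
        record.events.append(f"chat: notified about {result!r}")
    return result
```

With this shape, a team that needs two approvals and ITSM logging would pass `TeamPolicy(required_approvals=2, log_to_itsm=True)`, while a team with no such rules would pass the default `TeamPolicy()`. The task itself stays unchanged in both cases.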

While this decentralized approach can drive innovation and speed, it also presents challenges in implementing department or organization-wide standards, particularly in areas such as:

  • Self-service capabilities
  • Compliance and auditing
  • Security and access control
  • Change management and review processes
  • Integration with business systems

There are also prime incident-management cases where a fix needs to be implemented across all environments in the organization, such as the CrowdStrike Falcon agent outage in July 2024. When there are numerous environments, each with its own methods for performing operations tasks, the time to apply a fix for a service disruption grows substantially.
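As a rough illustration of why a shared entry point helps here, the Python sketch below fans one known fix out to several environments in parallel and collects a per-environment status. It assumes each team has wrapped its own method (script, playbook, or API call) in a plain callable. The function name `apply_fix_everywhere` and the environment names in the usage note are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def apply_fix_everywhere(fixers: Dict[str, Callable[[], None]],
                         max_workers: int = 8) -> Dict[str, str]:
    """Run each environment's fix in parallel and report per-environment status."""

    def attempt(name: str) -> str:
        try:
            fixers[name]()
            return "ok"
        except Exception as exc:
            # Record the failure and keep going; one broken environment
            # should not stop the fix from reaching the others.
            return f"failed: {exc}"

    names = list(fixers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `names`.
        statuses = list(pool.map(attempt, names))
    return dict(zip(names, statuses))
```

A caller might pass something like `{"vm-fleet": fix_vms, "k8s-cluster": fix_pods}` and then retry only the entries whose status starts with `failed`. The per-environment report is the point of the design: responders can see at a glance where the fix has landed and where it has not.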

The impact of Generative AI on automation

Generative AI is transforming the automation landscape by significantly increasing the velocity of automation creation. Tools like GitHub Copilot, ChatGPT, and PagerDuty Advance enable users to generate scripts and playbooks rapidly, reducing development time and accelerating deployment. However, this rapid pace of innovation further amplifies the challenges and risks of decentralized automation:

  • Negligent Attention to Security: Average business users empowered by AI may not have the same security awareness as seasoned developers, leading to potential vulnerabilities.
  • Lackluster Credential Management: AI-generated automations might not adhere to stringent credential management practices, increasing the risk of unauthorized access.
  • Increased Risk of Non-Compliance: Without proper oversight, AI-generated automations may fail to comply with data privacy regulations such as GDPR or HIPAA, leading to significant compliance risks.
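One lightweight guardrail against the credential risks listed above is to scan generated scripts for hard-coded secrets before they are accepted. The following is a minimal Python sketch of that idea, not a complete scanner. The `SECRET_PATTERNS` list and the `find_secrets` function are illustrative assumptions, and a dedicated secret-scanning tool would use a far larger rule set.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; these will miss many real-world secrets.
SECRET_PATTERNS = [
    ("hard-coded password",
     re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE)),
    ("hard-coded API key",
     re.compile(r"api[_-]?key\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE)),
    # AWS access key IDs are "AKIA" followed by 16 uppercase letters or digits.
    ("AWS access key ID", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
]


def find_secrets(script: str) -> List[Tuple[int, str]]:
    """Return (line number, finding label) pairs for suspicious lines."""
    findings = []
    for lineno, line in enumerate(script.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A script that reads its token from the environment, such as `token = os.environ["TOKEN"]`, produces no findings, while a line such as `password = "hunter2"` is flagged with its line number. A check like this could run centrally, regardless of which team or AI assistant produced the script.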

Finding the right balance

Based on our observations at PagerDuty, we've found that as companies grow, they often benefit from establishing a centralized team or function focused on automation. However, the key is to strike a balance that doesn't impede individual teams' velocity, motivation, and innovation.

Here are some strategies we've seen successful organizations employ:

  1. Establish a Center of Excellence (COE): Create a centralized team that focuses on best practices, tooling, processes, and standards for automation. This team should aim to support and enable individual teams rather than taking over all automation efforts. (See our COE ebook for more.)
  2. Develop Reusable Components: Encourage the creation and sharing of reusable automation components across the organization. This practice promotes standardization without forcing teams to abandon their preferred tools.
  3. Implement an Orchestration Layer: Utilize an automation orchestration platform that can integrate with existing tools while enforcing company-wide standards. This allows teams to continue using their preferred solutions while ensuring compliance with security, visibility, and self-service requirements. See the diagram below for a sample architecture.
  4. Promote Knowledge Sharing: Facilitate cross-team collaboration and knowledge exchange to spread best practices and innovative approaches throughout the organization.
  5. Balance Standardization and Flexibility: Identify areas where standardization is critical (e.g., security practices, compliance requirements) and areas where teams can have more autonomy (e.g., choice of scripting languages).

Figure: An orchestration platform can help implement automation standards while still giving teams the autonomy to use their own tools.
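The orchestration-layer strategy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any particular product. The `Job` and `Orchestrator` classes are hypothetical, each team's automation is represented as a plain callable, and access control is reduced to a set of allowed role names.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Job:
    name: str
    tool: str                 # free-form label, e.g. "python" or "ansible"
    run: Callable[[], str]    # team-owned implementation, in any tool they like
    allowed_roles: Set[str]   # centrally enforced access-control rule


@dataclass
class Orchestrator:
    """Hypothetical central layer: teams register jobs, standards apply to all."""
    jobs: Dict[str, Job] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def register(self, job: Job) -> None:
        self.jobs[job.name] = job

    def execute(self, job_name: str, user: str, role: str) -> str:
        job = self.jobs[job_name]
        # Standard 1: the same access check for every job, whatever the tool.
        if role not in job.allowed_roles:
            self.audit_log.append(f"DENIED {user} ({role}) -> {job_name}")
            raise PermissionError(f"{role} may not run {job_name}")
        output = job.run()
        # Standard 2: every successful run is written to one audit trail.
        self.audit_log.append(f"RAN {user} ({role}) -> {job_name} [{job.tool}]")
        return output
```

In this sketch a development team might register `Job("restart-web", "python", restart_web, {"l1", "sre"})`, and a responder would call `execute("restart-web", "dana", "l1")`. The team keeps ownership of how the job is written, while the access check and the audit trail are the same for every job.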

A real-world example: shifting left while maintaining standards

Many of our customers are working to "shift left" by empowering their development teams to implement runbooks as part of their service ownership. At the same time, they want to provide standardized auto-remediation capabilities for their Level 1 (L1) support teams during incident response. Not only does this free up developer time for high-value work, but it also allows support teams to take action when they need to, rather than waiting on experts to acknowledge and execute.

To achieve this balance, these organizations are leveraging centralized platforms that can orchestrate automation written by dev teams while providing a standardized interface for L1 responders. This approach allows for:

  • Dev team autonomy in creating and maintaining service-specific automation
  • Standardized processes for incident response
  • Improved knowledge sharing between development and operations teams

Preparing for the future

By finding the right balance between centralization and autonomy in automation, organizations not only optimize their current operations but also build resilience for future technological shifts, while reducing the risk of future cascading outages. When the next wave of innovation hits, teams will be better prepared to adopt new tools and practices within a flexible yet standardized framework.

As CIOs and technology leaders, the challenge is to create an environment that fosters innovation and agility while maintaining the necessary controls and standards to mitigate risk in the long run. By leveraging the right platforms and team structures for incident management and automation orchestration, you can achieve this balance and position your organization for long-term success in an increasingly automated world.

At PagerDuty, we're committed to helping organizations navigate these challenges and build resilient and scalable automation strategies. I encourage you to explore how our solutions can support your automation journey and help you strike the right balance for your organization.