
What’s Chaos Monkey? Its Role in Modern Testing

Chaos Monkey is an open-source tool. Its primary use is to check system reliability against random instance failures.

Chaos Monkey follows the testing concept of chaos engineering, which prepares networked systems for resilience against random and unpredictable chaotic conditions.

Let's take a deeper look.

What is Chaos Monkey?

Developed and released by Netflix, Chaos Monkey is described this way on its GitHub page:

Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures.

The tool is based on the concepts of chaos engineering, which encourages experimentation and intentionally causing incidents in order to test and ensure system reliability.

As such, it's often part of software testing and the quality assurance (QA) part of a software development pipeline or practice.

Other dev-related practices that touch on chaos engineering include site reliability engineering (SRE), performance engineering, and even platform engineering.

Traditional QA in software development

In the traditional software engineering and quality assurance (QA) approach, the functional specifications of the software design also define its behavioral attributes.

To evaluate the behavior of an isolated software system, we can compare the output of every input condition and functional parameter against a reference measurement. Various testing configurations and types can collectively - in theory - guarantee full test coverage.
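To make this concrete, here is a toy sketch (in Python, with a made-up function and input domain) of what exhaustive input/output testing against a reference can look like for a small, isolated piece of software:

```python
# A toy illustration of the traditional QA idea: for an isolated function,
# every input in its (small) domain can be checked against a reference.
# The function and its input domain are made up for illustration.

def shipping_cost(weight_kg: int) -> float:
    """Unit under test: flat rate up to 5 kg, surcharge per extra kg."""
    return 4.99 if weight_kg <= 5 else 4.99 + 1.50 * (weight_kg - 5)

def reference_cost(weight_kg: int) -> float:
    """Reference measurement taken from the functional specification."""
    return 4.99 + max(0, weight_kg - 5) * 1.50

# Exhaustive coverage of the whole input domain (1..30 kg) is feasible
# because the system is small and isolated.
for weight in range(1, 31):
    assert shipping_cost(weight) == reference_cost(weight), weight
print("All input conditions match the reference.")
```

This only works because the input domain is tiny and the component has no external dependencies - exactly the assumption that breaks down in distributed systems.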

But what happens in a large-scale, complex and distributed network environment?

Testing and QA in distributed systems

In the complex distributed systems that most organizations run, the functional specifications are not exhaustive: creating a specification that accurately maps every input and output combination for every system component, node, and server is virtually impossible.

This means that the behavior of a system component is not fully known. That's due to two primary factors:

  • The scale and complexity of the wider system infrastructure itself.
  • External parameters such as user behavior.

So how do you characterize the behavior of these systems in an environment where IT incidents can occur randomly and unpredictably?

Principles of chaos engineering

Netflix famously pioneered the discipline of Chaos Engineering with the following principles:

Define a steady-state hypothesis

Identify a reference state that characterizes the optimal working behavior of all system components. This definition can feel vague: how do you describe system behavior as optimal?

Availability metrics and dependability metrics are commonly chosen in the context of reliability engineering.

(Related reading: IT failure metrics.)
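As a rough illustration, a steady-state hypothesis can be expressed as thresholds on a handful of availability metrics. The metric names and thresholds below are hypothetical, not values prescribed by Chaos Monkey or Netflix:

```python
# A minimal sketch of a steady-state hypothesis expressed as availability
# metrics. Both metric names and thresholds are hypothetical examples.

STEADY_STATE = {
    "success_rate": 0.999,    # fraction of requests served without error
    "p99_latency_ms": 250.0,  # 99th-percentile response time
}

def meets_steady_state(observed: dict) -> bool:
    """Return True if observed metrics satisfy the steady-state hypothesis."""
    return (
        observed["success_rate"] >= STEADY_STATE["success_rate"]
        and observed["p99_latency_ms"] <= STEADY_STATE["p99_latency_ms"]
    )

# Example: a snapshot gathered from your monitoring system.
snapshot = {"success_rate": 0.9993, "p99_latency_ms": 180.0}
print(meets_steady_state(snapshot))  # True -> the system is in steady state
```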

Vary real-world incidents

A series of computing operations leads known inputs to known outputs; this is the execution path of a software operation. The traditional approach to software QA evaluates a wide variety of execution paths as part of a full test coverage strategy.

Chaos engineering employs a different approach. It injects random variations into the execution path of a software system.

How does it achieve this? The Chaos Monkey tooling injects random disruptions by terminating virtual machines (VMs) and server instances in microservices-based cloud environments.
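The snippet below is a heavily simplified sketch of that core idea: pick one instance at random from an opted-in group and terminate it. It is not Netflix's implementation (the real Chaos Monkey works with Spinnaker-managed groups and its own scheduler); the boto3 calls are real AWS APIs, but the "chaos-opt-in" tag and the dry-run default are illustrative assumptions:

```python
# A heavily simplified sketch: pick a random instance from an opted-in group
# and terminate it. NOT the Netflix implementation; the "chaos-opt-in" tag
# and dry-run default are assumptions made for this example.
import random

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def eligible_instance_ids() -> list[str]:
    """Running instances that have explicitly opted in to chaos experiments."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

def unleash_the_monkey(dry_run: bool = True) -> None:
    candidates = eligible_instance_ids()
    if not candidates:
        return  # nothing opted in; do no harm
    victim = random.choice(candidates)
    print(f"Terminating {victim} (dry_run={dry_run})")
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS reports "DryRunOperation" instead of acting.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise

if __name__ == "__main__":
    unleash_the_monkey(dry_run=True)  # flip to False only with guardrails in place
```

The explicit opt-in tag and dry-run default reflect the same guardrail thinking described in the design principles later in this article.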

Perform experiments in a production environment

Testing in the real world means replicating the production environment. The challenge here is that an internet-scale production environment cannot be replicated on a small set of testing servers.

Even if a testing environment exists that can fully reproduce the real-world production environment, the core concept of chaos engineering is to evaluate system resilience against real-world and unpredictable scenarios.

That's why this principle exists: no matter how closely your test environment resembles your production environment, chaos engineering still wants you to run experiments on prod.

Automate experiments for continuous testing

Automate experiments that run against both control groups and experimental groups, and measure how each deviates from the hypothesized steady state.

This is a continuous process, automated using tools such as Chaos Monkey, which injects system failures while ensuring that overall system operations remain viable.
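A minimal sketch of such an automated experiment, assuming a hypothetical success-rate metric pulled from your own monitoring backend, might look like this:

```python
# A minimal sketch of an automated chaos experiment: compare a control group
# (no failures injected) against an experimental group (failures injected) and
# flag any deviation from the steady-state hypothesis. The metric source and
# thresholds are placeholders, not part of Chaos Monkey itself.

STEADY_STATE_SUCCESS_RATE = 0.999  # hypothetical threshold
TOLERANCE = 0.001                  # allowed drift between the two groups

def success_rate(group: str) -> float:
    """Placeholder: fetch the request success rate for a deployment group."""
    raise NotImplementedError("query your monitoring/observability backend here")

def run_experiment(inject_failure) -> dict:
    inject_failure()  # e.g. terminate an instance in the experimental group
    control = success_rate("control")
    experimental = success_rate("experimental")
    return {
        "control_ok": control >= STEADY_STATE_SUCCESS_RATE,
        "experimental_ok": experimental >= STEADY_STATE_SUCCESS_RATE,
        "drift": abs(control - experimental),
        "hypothesis_holds": abs(control - experimental) <= TOLERANCE,
    }
```

Running this on a schedule (a cron job or CI pipeline) is what turns a one-off experiment into continuous testing.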

(Related reading: chaos testing & autonomous testing.)

The goal for Chaos Monkey: Intentional failures in production environments

The idea of introducing failures in a production environment is daunting for DevOps and QA teams - after all, they're striving to maintain maximum availability and mitigate the risk of downtime.

Chaos Monkey is, in fact, designed to limit the risks associated with testing in the production environment. That intent shows up in its design philosophy and principles:

  • Random but realistic: Chaos Monkey injects failures into the system randomly, but the distribution of generated incidents closely mirrors the distribution of real-world incidents (see the sketch after this list).
  • Manageable: The incidents are not designed to bring down the entire service. Instead, Chaos Monkey injects minimal changes into the service by killing running server instance(s). In response, a dynamic workload distribution mechanism takes over, routing traffic requests and data communication between other available servers.
  • Full coverage: The tool is designed to fully cover the code executed by a logically centralized controller - a component that would otherwise act as a single point of failure - so that its response to a failure injection is also exercised.
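For the "random but realistic" principle, one simple approach is to draw termination times from an exponential inter-arrival distribution and restrict them to working hours, so incidents arrive irregularly but engineers are around to respond. The mean interval and office-hours window below are assumptions, not values mandated by Chaos Monkey:

```python
# A hedged sketch of "random but realistic": sample termination times with
# exponential inter-arrival gaps, restricted to business hours. The parameters
# are assumed for illustration only.
import random
from datetime import datetime, timedelta

MEAN_HOURS_BETWEEN_KILLS = 16  # assumption: roughly one kill every two business days
WORK_START, WORK_END = 9, 17   # only schedule kills during office hours

def next_termination(after: datetime) -> datetime:
    """Sample the next termination time with exponential inter-arrival gaps."""
    candidate = after
    while True:
        candidate += timedelta(hours=random.expovariate(1 / MEAN_HOURS_BETWEEN_KILLS))
        if WORK_START <= candidate.hour < WORK_END and candidate.weekday() < 5:
            return candidate

print(next_termination(datetime.now()))
```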

And what does this mean in practice for users of Chaos Monkey?

Best practices for Chaos Monkey

When defining failure scenarios as part of developing a failure model, it is important to bridge the gap between the generated distribution of failure incidents and the distribution observed in the real world.

The tool itself is simple - it does not employ complex probabilistic models to mimic real-world incident trends and data distribution. You can easily simulate scenarios such as randomly terminating server instances or VMs.

These test scenarios should be based on your known performance against dependability metrics. This means that the discussion around effective use of tools such as Chaos Monkey, and reliability engineering in general, is incomplete without a discussion around monitoring and observability.

In the context of failure injection, you should continuously monitor the internal and external states of your network. In essence, you should:

  1. First, identify and fully understand what changed: system behavior as measured by metrics, dependency mappings, user experience and response (a simple before-and-after snapshot diff, as sketched below, is one starting point).
  2. Second, understand how failures and the resulting changes can be detected more efficiently.
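A minimal, hypothetical sketch of step 1 - snapshot key metrics before and after an injected failure and report what changed - could look like this (collect_metrics is a placeholder for your observability backend):

```python
# A minimal, hypothetical sketch of "understand what changed": snapshot key
# metrics before and after a failure injection and report the deltas.
import time

def collect_metrics() -> dict:
    """Placeholder: pull current metric values from your monitoring system."""
    raise NotImplementedError

def diff_after_injection(inject_failure, settle_seconds: int = 60) -> dict:
    """Return the metrics that changed across an injected failure."""
    before = collect_metrics()
    inject_failure()            # e.g. terminate a random instance
    time.sleep(settle_seconds)  # let the failure propagate and surface in metrics
    after = collect_metrics()
    # Step 1: what changed? (Tracking when your first alert fired relative to
    # the injection time also gives you a detection-latency figure for step 2.)
    return {
        metric: {"before": before[metric], "after": after[metric]}
        for metric in before
        if before[metric] != after.get(metric)
    }
```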

Finally, the use of tools such as Chaos Monkey can also prepare your organization for a cultural change: a culture that accepts mistakes by modeling test scenarios of random and unpredictable IT incidents.

(Related reading: IT change management & organizational change management.)