Comparison of chaos engineering tools for Kubernetes workloads (2023)

Kubernetes, chaos engineering, cloud native operations (CNO)

Andreas Krivas and Rafael Portela

18 August 2020

47 minute read

For most people, the word "chaos" means utter disorder and confusion. So what does it mean to engineer chaos? The distributed systems we build are increasingly complex, and their state cannot be predicted in all circumstances. From that perspective they are chaotic, so we need to test them by introducing real-world chaos and seeing whether they survive.

We would like to share our recent findings on some of the open source chaos engineering projects we specialize in. Our focus on Kubernetes-native tools means we are particularly interested in how they leverage the benefits of Kubernetes.

These benefits revolve around three main categories of components:

  • built-in components such as Deployments, Pods, and Services;
  • standardized custom components consisting of custom controllers or APIs that follow patterns that have already been established;
  • bespoke custom components such as custom controllers and APIs for which there are no standards yet.

The last two categories in particular are addressed by the Kubernetes Operator concept, so we will also look at whether each of the tools discussed comes with a Kubernetes operator.

There are dozens of tools available at different levels of maturity. We have selected four to examine in more detail: Chaos Toolkit, Pumba, Litmus, and Chaos Mesh.

We evaluated them in four specific categories: setup and management, design and variety of experiments, security, and observability. Before we get into the details, let's first review some basic principles of chaos engineering and the reasons why it's important.

What is chaos engineering and why is it important?

In software engineering, "chaos engineering" refers to the effort of identifying vulnerabilities in a system before they manifest themselves to actual users; or, put simply, breaking things on purpose to uncover hidden anomalies.


In summary, these are the main aspects of chaos engineering experiments as defined by the chaos engineering community:

  • Steady-state definition

The first important step is to define the steady state of the system: how it behaves under normal circumstances. This serves as a reference for the experiment.

  • Formulation of a hypothesis

In this step, we formulate a hypothesis about the expected behavior of the system after introducing certain faults. Usually this hypothesis follows naturally from the steady state, especially since the goal is to discover problems that have not yet been identified.

  • Introduction of real-world variables

It is important to introduce variables that represent real-world events and situations, such as an increase in traffic on an e-commerce site. The closer these variables are to real life, the more likely we are to detect real problems.


  • Refutation of the hypothesis

The point here is to evaluate whether our hypothesis is refuted, that is, to determine whether the system is not behaving as expected or deviates significantly. Any discrepancy needs to be closely monitored and reported, as it serves as the basis for fixing the identified problems.

A key difference between chaos testing and any other type of testing is that the goal is to run chaos experiments in production environments, with real production workloads and traffic. Since the intention is to discover real hidden anomalies, it is crucial that we bring the chaos to real applications.

However, this further underscores the need to define controlled, safe, and observable experiments. Improper handling of the blast radius and scale of the experiment can produce the opposite effect: uncontrollable chaos in production systems.


The chaos engineering principles above serve as guidelines for evaluating the four open source tools mentioned in the introduction. This section contains our main findings and conclusions from this research. In general they all did very well, and importantly, each of them can be useful depending on your purpose.


Management is easy

All the tools are largely Kubernetes-native in terms of installation and management. Chaos Toolkit, Litmus, and Chaos Mesh use the operator concept, while Pumba proposes a DaemonSet.

Reporting is up to the user

However, improvements are needed in terms of reporting the progress and results of experiments, as only Litmus provides a specific custom resource with relevant events and experiment results. In most cases, users must rely on their existing monitoring infrastructure.

Different purposes

Looking at how these tools perform chaos engineering experiments, we find that only Litmus and the Chaos Toolkit have a concept of an experiment based on the chaos engineering principles described in the previous section. Both provide their own definition of an experiment, allowing them to act as chaos orchestrators.

On the other hand, Pumba and Chaos Mesh focus on executing experiments, with Pumba offering a simple interface, while Chaos Mesh follows a more cloud-native approach and uses custom resource definitions to run experiments.

Experiments affect security

All the tools have similar security limitations, as they use similar features and methods under the hood to perform the experiments. If there is a noticeable difference, it is that the Chaos Toolkit and Litmus allow users to create more sophisticated experiments, whereas Pumba and Chaos Mesh are more opinionated executors, which makes them less flexible when it comes to security.

It is important to note that the experiment actions themselves determine the necessary security constraints of the experiment. For example, network latency experiments may require higher privileges, while stopping a pod is a less intrusive action. Therefore, while the tools themselves can be considered secure, users must ensure that each experiment is well-designed from a security perspective.

How each tool works

Chaos Toolkit

With a strong focus on extensibility, Chaos Toolkit aims to become the framework for creating custom chaos experiments and tooling. It covers the entire lifecycle of an experiment, allowing you to run checks (called probes) at the start of an experiment to verify the health of a target application, followed by actions against the system to cause instability, and a final check that the expected end state has been reached.

Existing packages, called driver extensions (such as the AWS driver or the Kubernetes driver), can easily be installed to provide additional actions for an expanded list of target platforms. New custom drivers can be created, or existing ones extended, to expose more types of probes and actions for experiments. There is also an ongoing Open Chaos Initiative, whose goal is to standardize chaos experiments by building on the Chaos Toolkit's open API specification.

Installation and Management

Installing the Chaos Toolkit is as easy as installing a Python package with pip install. This installs the chaos command-line utility. If you want to use additional driver extensions to enable more specific actions and probes, for example to interact with cloud services like GCP or Azure, you can similarly install the Python packages of the desired extensions. There is also a Kubernetes extension that provides actions and probes for pods, services, deployments, and other resources, but this approach involves using the command line directly to run the experiments, referencing the JSON files that describe the steps of the experiment.
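As a concrete sketch, installation and a first run from the command line might look like this (package names as published on PyPI at the time of writing; experiment.json stands in for whatever experiment definition you have prepared):

```shell
# Install the Chaos Toolkit CLI and the Kubernetes driver extension
pip install chaostoolkit chaostoolkit-kubernetes

# Validate an experiment definition without running it
chaos validate experiment.json

# Run the experiment; progress is logged to chaostoolkit.log
chaos run experiment.json
```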

Alternatively, there is a Kubernetes operator for the Chaos Toolkit with custom resource definitions (CRDs) that can be used to create Experiment resources in the cluster, allowing us to use our beloved kubectl to apply YAML manifest files. Based on these objects, and on ConfigMaps that contain the JSON entries describing the steps and actions of the experiment, the Chaos Toolkit operator creates pods that run the experiments inside the cluster, internally using the chaos command-line tool.


Definition and variety of experiments.

Designing experiments is one of the best features of the Chaos Toolkit. It uses the JSON format to define experiments clearly, using probes and actions to control the experiment. Probes are used to check the steady state of resources, for example by accessing applications or retrieving metrics, while actions are used to change the state of resources or apply chaotic behavior through an API call or command execution. It also provides the ability to roll back at the end of the experiment, which helps clean up the mess in case of errors, or clean up resources after the experiment is complete.

The following code snippet shows an example experiment definition.

"Version": "1.0.0",
"title": "System is resilient to provider failures",
"description": "Can our consumer normally survive a provider failure?",
"Sign": [
"Ideas": {
"App Name": {
"type": "environment",
"chave": "LABEL_NAME"
"namespace": {
"type": "environment",
"chave": "NAMESPACE"
"steady state hypothesis": {
"title": "Deleting the pod the app is running in",
"probes": [
"Type": "Probe",
"name": "at least 2 application mirrors must be running",
"tolerance": 3,
"offerer": {
"typo": "python",
"module": "chaosk8s.pod.probes",
"func": "count_pods",
"Arguments": {
"label_selector": "aplicativo=${app_name}",
"ns": "${space_name}"
"Method": [
"type": "action",
"name": "Exit_pod",
"offerer": {
"typo": "python",
"module": "chaosk8s.pod.actions",
"func": "terminate_pods",
"Arguments": {
"label_selector": "aplicativo=${app_name}",
"pattern_name": "${application_name}",
"ns": "${space_name}",
"edge": true,
"mode": "party",
"Quantity 1
"breaks": {
"after": 20
"returns": []

In the experiment above, the Chaos Toolkit first verifies that at least two replicas of the target application are running. If the steady-state check passes, it terminates one of the replicas using the Kubernetes driver referenced in the "module": "chaosk8s.pod.actions" field. Optionally, we can specify rollback actions in case the experiment fails and we need to clean up the mess.

Since the tool relies mainly on the presence of drivers, there are not many experiments that can be used right out of the box. Users have to define their own building blocks for their experiments using the driver extensions, which gives them a lot of freedom. There are already many generic drivers that can be used for various purposes (networking, specific cloud providers, observability, probes, and exporters, among others), but new drivers will almost certainly have to be developed for more customized use.


Security

Like any other operator running on a cluster, the Chaos Toolkit operator needs a service account with sufficient privileges to do its job, plus permissions to watch and manage its own CRDs. For example, to run a simple experiment that deletes an application pod in a specific namespace, the operator creates a Chaos Toolkit pod using a service account with sufficient privileges to delete pods. Chaos Toolkit pods can be customized by specifying pod templates in the Experiment custom resource, which may require additional roles or secrets.

Depending on which additional drivers are used, specific network access or higher privileges may be required. This modular approach makes security easy because you can choose or build drivers to suit your own needs.



Observability

Like most cloud-native tools these days, the Chaos Toolkit has a Prometheus driver for exporting metrics and experiment events. It also has an OpenTracing driver and a Humio driver. However, the tool still does not provide a standardized report of experiment results, which means you have to monitor the progress of an experiment by checking the Chaos Toolkit logs yourself.


Conclusion

The Chaos Toolkit is an open, extensible, lightweight, and well-defined chaos implementation. It relies heavily on drivers, which provide the ability to create a custom chaos tool following a structured experiment definition.

There are two things to keep in mind here. First, the Chaos Toolkit is not ideal as an out-of-the-box Kubernetes-native chaos tool that does everything from start to finish, especially when it comes to the variety of predefined experiments. It should rather be used as a skeleton or API for creating your own chaos engineering tools. Second, the duty of thorough reporting and monitoring falls on users, to accommodate their own needs and infrastructure.

Pumba

Pumba is a command-line tool for chaos testing, specifically focused on Docker containers. Pumba does not really cover the concepts of probes or experiments, at least not as procedures that can succeed or fail depending on how the target applications respond. Rather, it acts purely as a chaos injector, pulling in various Linux utilities to change the behavior of resources used by containers, such as CPU and network usage.

With an intuitive and sensible set of parameters, Pumba is easy to use from the command line and hides the underlying Linux commands well from the user. Consider the case of a host running multiple containerized applications: to find the intended containers to apply chaos to, instead of changing the behavior of the host itself, Pumba takes advantage of the API exposed by the Docker daemon running on the host machine, finding containers by name, ID, or labels when running on Kubernetes.


The user can choose from a variety of experiments related to container lifecycle management (stopping, killing, pausing, or removing a container), network manipulation between containers using network emulation (netem), an extension of traffic control (tc), and loading the target CPU with stress-ng.

Installation and Management

The Pumba command-line tool can be used by installing its binary for the respective operating system, or directly as a Docker image. In a Kubernetes cluster, a pod carrying the Pumba CLI tool can be deployed as a DaemonSet. The presence of a Pumba pod on every node makes it possible to find target applications without knowing in advance which node they are running on. (Note that you must delete the DaemonSet at the end of the experiment run.)

Because they are essentially CLI wrappers, the Pumba pods deployed and running on the cluster should be considered volatile and only active while the underlying command is running. Once the experiment is complete, or there is a manual action to stop it (for example, removing the Pumba pods), the chaos injection is undone. Network delays, for example, switch on and off like a power button: one tc command turns them on and another returns everything back to normal.
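For illustration, the underlying tc commands behave exactly like that power button. Something along these lines is what runs inside the target's network namespace (the interface name and values here are illustrative, and the commands require NET_ADMIN privileges):

```shell
# Turn the chaos on: add a 3000ms +/- 30ms delay on eth0
tc qdisc add dev eth0 root netem delay 3000ms 30ms distribution normal

# Turn it off again: remove the netem qdisc, restoring normal behavior
tc qdisc del dev eth0 root netem
```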

There is currently no option to deploy Pumba as a Kubernetes operator, which would be a more controlled way to manage experiments. The current deployment approach is difficult to manage in a multi-tenant environment, but multiple experiments can still be deployed simultaneously using different containers within the same DaemonSet.

Definition and variety of experiments.

Beyond the set of experiments mentioned above, it is not possible to add new ones without changing Pumba's source code. The experiments that kill, stop, remove, or pause containers are easy to use. It gets interesting with the network experiments, which use netem commands. There, the user has a variety of options to craft a custom netem command that can cause delays, packet loss, rate limiting, and other types of network failures.

An important note: Pumba always targets a whole specific interface (e.g. eth0) and cannot target specific ports or IPs on that interface.
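To give a feel for the interface, here is a hedged sketch of running Pumba directly from the command line (container names and durations are illustrative):

```shell
# Kill containers whose names match a regular expression
pumba kill --signal SIGKILL "re2:^test"

# Add a 3000ms +/- 30ms delay to a container's interface for 20 seconds,
# using an iproute2 helper image to run tc
pumba netem --duration 20s --tc-image gaiadocker/iproute2 \
    delay --time 3000 --jitter 30 --distribution normal mycontainer
```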

The following manifest shows how to run a Pumba experiment by deploying Pumba as a DaemonSet (the pod label selector value was lost in our notes and is reconstructed here from the comment):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pumba
spec:
  selector:
    matchLabels:
      app: pumba
  template:
    metadata:
      labels:
        app: pumba
        com.gaiaadm.pumba: "true"  # prevent Pumba from killing itself
      name: pumba
    spec:
      containers:
        # randomly add a 3000ms ± 30ms delay to the 'test-2' pod containers,
        # where the delay variation follows the 'normal' distribution
        - image: gaiaadm/pumba
          imagePullPolicy: Always
          name: pumba-delay
          args:
            - --random
            - --log-level
            - info
            - --label
            - io.kubernetes.pod.name=test-2
            - --interval
            - 30s
            - netem
            - --duration
            - 20s
            - --tc-image
            - gaiadocker/iproute2
            - delay
            - --time
            - "3000"
            - --jitter
            - "30"
            - --distribution
            - normal
          resources:
            requests:
              memory: 5M
            limits:
              memory: 20M
          volumeMounts:
            - name: docker-socket
              mountPath: /var/run/docker.sock
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: docker-socket

In this experiment, Pumba uses netem to induce latency on the target application's network interface. It does this by entering the target's network namespace and adding a delay to all IP addresses on that interface. As soon as the duration of the experiment is exceeded (in this case, after 20 seconds), the experiment ends.


Security

Unsurprisingly, the ability to manipulate network traffic or load a CPU core with powerful Linux tools requires an unusual set of privileges on the system. Ideally, we want to keep these privileges away from business applications to minimize the security attack surface. Specifically, Linux capabilities such as NET_ADMIN or SYS_ADMIN may need to be added to the spec of the pod running Pumba to grant the permissions needed to change the relevant settings.

Another important security aspect of Pumba is that it requires access to a file socket on the host node where the underlying Docker daemon exposes its HTTP API, typically the /var/run/docker.sock file. In addition, it may need to run as a privileged container. The reason is that it must query the container runtime, in this case Docker, to find the correct application containers to use as targets. It is important to keep this in mind, as these are node-level privileges granted to Pumba.


Observability

Other than reviewing the logs, Pumba does not provide any way to report the results of its experiments. It is just a runtime tool that performs specific tasks; once they are done, its job is done.


Conclusion

The Pumba CLI is efficient and easy to use. It can act as an executor for specific experiments on a Kubernetes cluster, either through a DaemonSet or just as a pod. Extensibility, manageability, and observability are minimal given the simplicity of the tool, making it less than ideal for a complicated multi-tenant environment without other supporting tools. It is definitely an interesting tool for one-off stand-alone experiments, or as part of a larger chaos platform.

Although we mentioned some security concerns due to the powerful (and potentially dangerous) Linux tools used under the hood, this is not specific to Pumba. Any chaos implementation that uses such methods will have similar considerations.


Litmus

Litmus is a complete chaos framework focused solely on Kubernetes workloads. It consists of an operator written in Go that currently uses three main CRDs to run an experiment:

  • ChaosExperiment: the definition of the experiment with default parameters.
  • ChaosEngine: binds the experiment definition to the target chaos workload. If valid, the Litmus operator starts the experiment, overriding any variables specified in the manifest file.
  • ChaosResult: displays basic information about the progress and final result of the experiment.

After a ChaosEngine object is created, Litmus creates the chaos runner pod in the target namespace. This runner orchestrates the experiment in the specified namespace and against the specified targets.


Target identification is something that sets Litmus apart. To zero in on the target, the user needs to add a specific annotation to the deployment (more workload kinds are supported: DaemonSet, StatefulSet, and DeploymentConfig). The user then needs to set the labels and fields on the ChaosEngine object (an example is shown below) to allow Litmus to find all (or some) of the pods in the target deployment.
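For example, opting a deployment in to chaos might look like this (the deployment name and namespace are illustrative; the annotation key is the one Litmus watches for):

```shell
# Allow Litmus to target this deployment
kubectl annotate deploy/my-app -n default litmuschaos.io/chaos="true"
```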

Once the operator verifies that all of the above requirements are met (correct label, annotation, ChaosExperiment object, permissions), it creates an experiment runner pod that is responsible for executing the experiment. This workflow makes it possible to limit the blast radius of an experiment, as well as concurrent experiment runs.

Installation and Management

The Litmus operator is a lightweight, stateless Go application that can be deployed as a simple Deployment object on a Kubernetes cluster. Litmus offers two ways to orchestrate an experiment. The default method restricts the experiment to a specific namespace, following the process described above. In this case, cluster administrators should consider resource utilization, since successful experiments depend on the resources available in each namespace.

Litmus also supports an admin mode, where the chaos runner and the experiment runner are created alongside the operator in the same namespace. From there, the experiment runner finds the target application and namespace to run the experiment against. Here, the focus is on centralizing the chaos resources that are created.

However, this mode has some limitations, mainly in terms of observability, since with multiple concurrent executions it is difficult to keep a clear view of all experiments, and in terms of cluster permissions, since in this case Litmus requires control not only over the workload-related cluster APIs but also over node resources, and therefore needs higher cluster privileges.

In terms of management, Litmus is easy to use. However, it requires a bit more work when it comes to finishing an experiment: labels and annotations have to be removed manually and CRs deleted, which users must automate themselves. Work is underway on Argo workflows to add that extra layer of management when orchestrating multiple end-to-end experiments.

Definition and variety of experiments.

A nice aspect of Litmus is that it provides a well-defined way to choose your own experiment executor. It uses the concept of chaos libraries, which define the packages used to run the experiment. For example, one experiment might use the Litmus-native library to kill a pod, while another uses the Pumba library to run a network experiment.

This makes Litmus a very extensible and tool-agnostic framework rather than just another chaos injection tool. Likewise, depending on the selected executor, you can define your own experiment.


Internally, Litmus currently uses an Ansible runner to define and run the experiment based on the chosen chaos library. However, there is active development to create a lighter and simpler Go runner, which the community seems to see as the way forward.

There is currently a wide variety of experiments supported by Litmus. They use Litmus-native or external tool libraries and can be found here.

Here is an example of a ChaosEngine definition that will set up and launch an experiment:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: 'chaos=true'
    appkind: 'deployment'
  # Can be true/false. If true, the appinfo checks are applied
  annotationCheck: 'true'
  # Can be active/stop. Patch to stop to abort an experiment
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: <service account>
  monitoring: false
  # Determines whether Litmus cleans up at the end of the experiment. Can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Set the chaos duration (in seconds) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Set the chaos interval (in seconds) as desired
            - name: CHAOS_INTERVAL
              value: '30'
            # Pod failures without '--force' and default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'

In this custom resource, we tell Litmus to run the pod-delete experiment with specific parameters. Litmus will try to zero in on the target using the .spec.appinfo fields, assuming the user has already applied the correct annotations and labels, as explained in the Litmus introduction. These fields specify the target's namespace, label, and object kind, and become optional if the .spec.annotationCheck field is set to false.

The engine state can be set to stop, which abruptly ends the experiment. This is an important feature that helps you take action when chaos spreads through the wider system.
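Stopping a running experiment is then a matter of patching the ChaosEngine object; a sketch using the engine from the example above:

```shell
# Abort the experiment by setting the engine state to 'stop'
kubectl patch chaosengine app-chaos -n default \
    --type merge --patch '{"spec": {"engineState": "stop"}}'
```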

Finally, we can configure the experiment using environment variables, which override the default values of the experiment definition.

Once the ChaosEngine object has been validated and created, Litmus internally creates a regular Kubernetes Job with all the necessary parameters, which runs the experiment against the target.


Security

In terms of security, Litmus requires a well-defined set of permissions for its cluster roles. Additionally, any experiment requires that the experiment-specific service account, role, and role binding objects exist in the target namespace. As discussed above, Litmus also provides a comprehensive way to identify target workloads, starting at the top-level object and going down to the pod level. This serves to limit the blast radius and ensure that chaos is only injected into the intended workloads.

Another interesting point is the permissions required by the executor pod. For network experiments (for example with the Pumba chaos library), we need the same permissions mentioned in the Pumba section above, i.e. mounting the Docker socket or adding the appropriate capabilities in the security context. As we can see, Litmus is a multifaceted framework with different layers, each needing proper attention from a security perspective.


Observability

Reporting in Litmus is primarily handled by the ChaosResult custom resource. This is a customizable object that can be extended with more details about the experiment. For the moment, however, it provides fairly basic information, mainly the state of the experiment, important events, and finally its result.
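The result can be inspected with kubectl; by convention the ChaosResult name combines the engine and experiment names (the names here match the earlier ChaosEngine example):

```shell
# Inspect the verdict and events of the pod-delete experiment
kubectl describe chaosresult app-chaos-pod-delete -n default
```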


Conclusion

Litmus is a very promising chaos engineering framework that focuses on extensibility and orchestration to create chaos for Kubernetes-native workloads. It has a vibrant and supportive community behind it, and was recently accepted into the Cloud Native Computing Foundation as a sandbox project.

It can certainly be improved in terms of reporting, but it provides the framework to extend it. It can easily work in conjunction with any other tool, and its main purpose is to act as the chaos orchestrator rather than the executor itself (although it can do the latter very efficiently as well). It is suitable for large, complicated systems that require a higher level of control and a growing variety of experiments.


Chaos Mesh

Chaos Mesh is a chaos platform designed exclusively for Kubernetes applications. It was developed by PingCAP to test the resilience of their distributed TiDB database, but is just as easy to use for other types of applications running on Kubernetes.

Like a typical operator, a controller manager pod runs as a regular deployment and is responsible for watching its own CRDs (NetworkChaos, IoChaos, StressChaos, PodChaos, KernelChaos, and TimeChaos), which users can use to create new objects that specify and start chaos experiments.

To enable the requested actions against applications, the controller may need to contact the Chaos Mesh daemon service, exposed as a DaemonSet, so that it can, for example, manipulate the network stack locally to affect target application pods running on the same physical node. For the I/O chaos type, for example when simulating errors or delays in reads and writes to file systems, application pods must share their volume mounts with a sidecar container that intercepts the file system calls. The sidecars are injected during application deployment, supported by an admission webhook.


Chaos Mesh is a good middle-ground framework to use on its own: not an orchestrator connecting different tools and extensions, but with a wide range of experiments available out of the box.

Installation and Management

As a Kubernetes operator, installation is very simple and can be done by applying a set of manifests and CRDs to a cluster. A Helm chart is also available in the project repository, which simplifies installation with Helm. In terms of administration, the Helm charts are quite easy to work with, as they are community-driven. On the other hand, the sidecar container injection and the need for a DaemonSet complicate the operation of Chaos Mesh a bit, as they can be considered quite intrusive to the cluster.
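A hedged sketch of the Helm route (the chart repository URL and release names reflect the project's conventions at the time of writing and may have changed since):

```shell
# Add the Chaos Mesh chart repository and install the operator
helm repo add chaos-mesh https://charts.chaos-mesh.org
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace chaos-testing
```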

Definition and variety of experiments.

The list of chaos types is grouped into the following categories: Network, Pod, I/O, Timing, Kernel, and Stress, each with its own CRD type. They all share a common selector input to find target groups as well as the optional duration or recurring schedule of chaos you want. Some of them, like NetworkChaos, have more options like delay, corruption, or partitioning.

An example of a NetworkChaos definition (the target label selector was lost in our notes and is reconstructed here as an illustrative value):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    labelSelectors:
      "app": "web-show"
  delay:
    latency: "90ms"
    correlation: "25"
    jitter: "90ms"
  duration: "10s"
  scheduler:
    cron: "@every 15s"

It is usually easy to follow the examples to create a YAML file for your use case. One notable exception is the disk I/O chaos: to properly configure the sidecars in app pods, additional configuration must be done beforehand using ConfigMaps, defining what the sidecar containers will do when injected during the user's application deployment.
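Applying and observing a chaos object then follows the usual kubectl workflow (file and object names match the NetworkChaos example above):

```shell
# Start the network-delay chaos and check its status
kubectl apply -f network-delay-example.yaml
kubectl describe networkchaos network-delay-example -n chaos-testing

# Stop it by deleting the object
kubectl delete networkchaos network-delay-example -n chaos-testing
```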


Security

Chaos Mesh also uses some Linux utilities to implement low-level chaos types, and it must use the Docker API on the host machine. Therefore, the daemon pods (deployed as a DaemonSet) run as privileged containers and mount the /var/run/docker.sock socket file. The controller manager pod needs permissions to manage MutatingWebhookConfigurations, plus some other expected role-based access control (RBAC) permissions, when sidecar injection is enabled.


Observability

The main project repository mentions a chaos dashboard side project, but it currently seems to work only for tests against their own database product. Building a more generic dashboard is on the roadmap. Until then, the state of chaos experiments can be monitored by examining the custom resource objects in the cluster.


Conclusion

Unlike the other high-level tools on this list, Chaos Mesh does not have a strict concept of an experiment, and it is not an orchestrator with multiple deployment options. In that sense, it functions similarly to Pumba, as a plain chaos injector. Since it is available as a Kubernetes operator with a range of chaos options based on CRD types, it is certainly an easy tool to install and use. While the documentation could be better, the list of chaos types and configuration options is pretty impressive without the need for additional tools.

Other chaos tools

These four chaos engineering tools are not the only ones available. The open source community is always creating something new and constantly contributing to existing projects.

Tools like Chaos Blade (which is quite similar to Chaos Mesh), kube-monkey, PowerfulSeal, KubeInvaders, Muxy, and Toxiproxy are also very popular, each with its own strengths and weaknesses. Beyond the open source space, there are also various commercial products for chaos engineering, the most prominent being Gremlin, a complete commercial chaos engineering platform.

In addition, various community events are gaining importance these days, such as the Failover Conference, which offered many interesting insights into the world of site resilience and chaos engineering.


Conclusion

Testing all of the above tools has convinced us that Kubernetes-native chaos engineering is here to stay. The ability to run controlled experiments that represent real-world events on production systems may seem daunting at first, but it can certainly improve the quality not only of business applications but of infrastructure systems as well.

We have identified two main categories of chaos engineering tools: chaos orchestrators, of which Litmus and the Chaos Toolkit are the best known, and chaos injectors, such as Pumba and Chaos Mesh. Chaos orchestrators aim to provide well-defined experiments based on proper chaos engineering principles. Litmus is the more complete framework while still offering extensibility, whereas the Chaos Toolkit aims to become the standard API for defining experiments.

Chaos injectors focus on executing experiments. Pumba focuses on Docker containers and gives you the ability to create multiple experiments, while Chaos Mesh makes it easy to run experiments on Kubernetes right away.

The bottom line is that chaos engineering can increase the resiliency of your production systems and uncover hidden problems that typically only show up in real-world events, wherever you are on your cloud-native journey. Depending on whether you need a runner or an orchestrator, there are many open source options available, each with their own advantages and disadvantages. The most important thing is to create chaos experiments that simulate real-world events in a well-defined, safe, and observable way.

Andreas Krivas is a Lead Cloud Native Engineer and Rafael Portela is a Cloud Native Engineer at Container Solutions. Christiaan Vermeulen, a Cloud Native consultant at the company, contributed to this article.

You can find all of our SRE and CRE information in one place. Click here.
