Kubernetes, Chaos Engineering, Cloud Native Operations, CNO
Andreas Krivas and Rafael Portela
18 August 2020
47 minutes of reading time
For most people, the word "chaos" means utter disorder and confusion. So what does it mean to engineer chaos? The distributed systems we build are increasingly complex, so their state cannot be predicted in all circumstances. From that perspective they are chaotic, and we need to test them by introducing real-world chaos and seeing whether they survive. We would like to share our recent findings about some of the open source projects that specialize in chaos engineering.

Our focus on Kubernetes-native tools means that we are particularly interested in the ones that leverage Kubernetes itself. These benefits revolve around three main categories of components, and the last of these categories in particular is addressed by the Kubernetes Operator concept, so we will also look at whether each of the tools discussed provides a Kubernetes operator.

There are dozens of tools available, at different levels of maturity. We selected four to examine in more detail: the Chaos Toolkit, Pumba, Litmus, and Chaos Mesh. We evaluated them in four specific categories: setup and management, design and variety of experiments, security, and observability. Before we get into the details, let's first review some basic principles of chaos engineering and the reasons why it is important.

What is chaos engineering, and why is it important?

In software engineering, "chaos engineering" refers to efforts to identify weaknesses in a system before they manifest themselves to actual users; or, put simply, breaking things on purpose to uncover hidden anomalies. In summary, these are the main aspects of chaos engineering experiments, as defined by the chaos engineering community:

- Define the steady state: the first important step is to define the steady state of the system, i.e. how it behaves under normal circumstances. This serves as the reference for the experiment.
- Formulate a hypothesis: in this step we formulate a hypothesis about the expected behavior of the system after certain failures are introduced. Usually this hypothesis follows naturally from the steady state, especially since the goal is to discover unidentified problems.
- Introduce real-world events: it is important to introduce variables that represent real-world events and situations, such as a spike in traffic on an e-commerce site. The closer these variables are to real life, the more likely we are to detect real problems.
- Try to refute the hypothesis: the point here is to evaluate whether our hypothesis is refuted, that is, whether the system is not behaving as expected or whether there is a significant deviation. Any discrepancy needs to be closely monitored and reported, as it serves as the basis for improving on the identified problems.
A key difference between chaos testing and any other type of testing is that the goal is to run chaos experiments in production environments, with real production workloads and traffic. Since the intention is to discover real hidden anomalies, it is crucial that we bring the chaos to real applications.
However, this further underscores the need to define controlled, safe, and observable experiments. Improper handling of the blast radius and scale of the experiment can produce the opposite effect: uncontrollable chaos in production systems.
TL;DR
The above chaos engineering principles serve as guidelines for evaluating the four open source tools mentioned in the introduction. This section contains our main findings and conclusions from this research. In general, they all did very well, and the important thing is that they can all be useful depending on their purpose.
Easy to manage

All the tools seem to be largely Kubernetes-native in terms of installation and management. The Chaos Toolkit, Litmus, and Chaos Mesh use the concept of an operator, while Pumba proposes a DaemonSet.
Reporting is up to the users
However, improvements are needed in terms of reporting the progress and results of experiments, as only Litmus provides a specific custom resource with relevant events and experiment results. In most cases, users must rely on their existing monitoring infrastructure.
Different purposes
Looking at how these tools perform chaos engineering experiments, we find that only Litmus and the Chaos Toolkit have the concept of an experiment based on the chaos engineering principles described in the previous section. Both provide their own definition of an experiment, allowing them to function as Chaos Orchestrators.
On the other hand, Pumba and Chaos Mesh focus on running experiments, with Pumba offering a simple interface, while Chaos Mesh follows a more cloud-native approach and uses custom resource definitions to run experiments.
Experiments affect security

All the tools have similar security limitations, as they use similar features and methods under the hood to perform the experiments. If there is a noticeable difference, it is that the Chaos Toolkit and Litmus allow users to create more sophisticated experiments. Pumba and Chaos Mesh are more opinionated executors, which makes them less flexible when it comes to security.
It is important to note that the experiment actions themselves determine the necessary security constraints of the experiment. For example, network latency experiments may require higher privileges, while stopping a pod is a less intrusive action. Therefore, while the tools themselves can be considered secure, users must ensure that each experiment is well-designed from a security perspective.
How each tool works
Chaos Toolkit

With a strong focus on extensibility, the Chaos Toolkit is intended to become the framework for creating custom chaos experiments and tools. It spans the entire lifecycle of an experiment, allowing you to run checks (called probes) at the beginning of an experiment to verify the health of a target application, followed by actions against the system to cause instability, and a final check that the expected end state has been reached.

Existing packages, called driver extensions (such as the AWS driver or the Kubernetes driver), can easily be installed to make additional actions available against an expanding list of target platforms. New custom drivers can be created, or existing ones extended, to expose more types of probes and actions for experiments. There is also an ongoing Open Chaos Initiative whose goal is to standardize chaos experiments using the Chaos Toolkit's open API specifications.
Installation and Management
Installing the Chaos Toolkit is as easy as installing a Python package with pip install. This installs the chaos command-line utility. If you want to use additional driver extensions to enable more specific actions and probes, for example to interact with cloud services like GCP or Azure, you can similarly install the Python packages of the desired extensions. There is also a Kubernetes extension that provides actions and probes for pods, services, deployments, and other resources. This approach involves using the command line directly to run the experiments, referencing the JSON files that describe the steps of each experiment.
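For example, assuming the upstream package names on PyPI, a minimal local workflow could look like this (experiment.json stands for whatever experiment file you have defined):

pip install chaostoolkit             # installs the chaos CLI
pip install chaostoolkit-kubernetes  # optional driver extension for Kubernetes probes and actions
chaos run experiment.json            # runs the experiment described in the JSON file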
Alternatively, there is a Kubernetes operator for the Chaos Toolkit, with custom resource definitions (CRDs) that can be used to create Experiment resources in the cluster, which lets us use our beloved kubectl to apply YAML manifest files. Based on these objects, and on ConfigMaps containing the JSON entries that describe the steps and actions of the experiment, the operator creates Chaos Toolkit pods that run the experiments inside the cluster, internally using the chaos command-line tool.
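A rough sketch of this approach follows. The exact API group, kind, and field names are defined by the operator's CRDs and may differ between versions, so treat everything apart from the embedded experiment JSON as an assumption about the shape of those resources:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-chaos-experiment        # hypothetical name
data:
  experiment.json: |
    {
      "version": "1.0.0",
      "title": "Pods stay healthy",
      "description": "Minimal probe-only example",
      "method": [
        {
          "type": "probe",
          "name": "count-app-pods",
          "provider": {
            "type": "python",
            "module": "chaosk8s.pod.probes",
            "func": "count_pods",
            "arguments": { "label_selector": "app=my-app", "ns": "default" }
          }
        }
      ]
    }
---
apiVersion: chaostoolkit.org/v1    # assumption: the operator's API group/version
kind: ChaosToolkitExperiment       # assumption: the operator's kind
metadata:
  name: my-chaos-experiment
spec:
  namespace: default               # assumption: where the experiment pod should run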
Definition and variety of experiments

Designing experiments is one of the best features of the Chaos Toolkit. It uses the JSON format to clearly define experiments, and it uses probes and actions to control them. Probes are used to check the steady state of resources, e.g. by calling applications or retrieving metrics, while actions are used to change the state of resources or apply chaotic behavior, either through an API or by executing a command. It also provides the ability to roll back at the end of the experiment, which helps clean up the mess in case of errors, or to clean up resources after the experiment has completed.
The following code snippet shows an example experiment definition.
{
  "version": "1.0.0",
  "title": "System is resilient to provider failures",
  "description": "Can our consumer gracefully survive a provider failure?",
  "tags": [
    "service",
    "kubernetes",
    "spring"
  ],
  "configuration": {
    "app_name": {
      "type": "env",
      "key": "LABEL_NAME"
    },
    "name_space": {
      "type": "env",
      "key": "NAMESPACE"
    }
  },
  "steady-state-hypothesis": {
    "title": "At least two replicas of the application are running",
    "probes": [
      {
        "type": "probe",
        "name": "at-least-two-application-replicas-must-be-running",
        "tolerance": 3,
        "provider": {
          "type": "python",
          "module": "chaosk8s.pod.probes",
          "func": "count_pods",
          "arguments": {
            "label_selector": "app=${app_name}",
            "ns": "${name_space}"
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-app-pod",
      "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
          "label_selector": "app=${app_name}",
          "name_pattern": "${app_name}",
          "ns": "${name_space}",
          "rand": true,
          "mode": "fixed",
          "qty": 1
        }
      },
      "pauses": {
        "after": 20
      }
    }
  ],
  "rollbacks": []
}
In the experiment above, the Chaos Toolkit first verifies that at least two replicas of the target application are running. After the steady-state check, the method stops one of the replicas using the Kubernetes driver referenced in the "module": "chaosk8s.pod.actions" field. Optionally, we can specify a rollback action in case the experiment fails and we need to clean up the mess.
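As an illustration, and assuming the chaostoolkit-kubernetes driver exposes a scale_deployment action (treat the module, function, and deployment name here as assumptions for the sketch), a rollback that scales the target deployment back up could replace the empty rollbacks list above:

"rollbacks": [
  {
    "type": "action",
    "name": "scale-deployment-back-up",
    "provider": {
      "type": "python",
      "module": "chaosk8s.deployment.actions",
      "func": "scale_deployment",
      "arguments": {
        "name": "my-app",
        "replicas": 3,
        "ns": "${name_space}"
      }
    }
  }
]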
Since the tool relies mainly on the presence of drivers, there are not many experiments that can be used right out of the box. It is the user who has to define their own building blocks for their experiments using the driver extensions, which gives them a lot of freedom. There are already many generic drivers that can be used for various purposes (network, specific cloud providers, observability, probes, and exporters, among others), but new drivers will almost certainly have to be developed for more customized use.
Security
Like any other operator running on a cluster, the Chaos Toolkit operator needs a service account with sufficient privileges to perform its job, plus permissions to watch and manage its own CRDs. For example, to run a simple experiment that removes an application pod in a specific namespace, the operator creates a Chaos Toolkit pod that uses a service account with sufficient privileges to delete pods. The Chaos Toolkit pods can be customized by specifying pod templates in the Experiment custom resource, which may require additional roles or secrets.
Depending on which additional drivers are used, specific network access or higher privileges may be required. This modular approach makes security easy because you can choose or build drivers to suit your own needs.
Observability

Like most cloud-native tools these days, the Chaos Toolkit has a Prometheus driver for exporting metrics and experiment events. It also has an Open Tracing driver and a Humio driver. However, the tool still does not provide a standardized report of the experiment results, which means you have to monitor the progress of an experiment by checking the Chaos Toolkit logs yourself.
Verdict

The Chaos Toolkit is an open, extensible, lightweight, and well-defined chaos implementation. It relies heavily on the use of drivers, which provide the ability to create a custom chaos tool following a structured experiment definition.
There are two things to keep in mind here. First, the Chaos Toolkit is not ideal as an out-of-the-box Kubernetes native chaos tool that does everything from start to finish, especially when it comes to defining different experiments. It should be used as a skeleton or API to create your own chaos engineering tools. Second, the duty to report and monitor thoroughly falls on the users to accommodate their own needs and infrastructure.
Pumba

Pumba is a command-line tool for chaos testing, specifically focused on Docker containers. Pumba does not really cover the concepts of probes or experiments, at least not as procedures that can succeed or fail depending on how the target applications respond. Rather, it acts specifically as a chaos injector that wraps various Linux utilities to change the behavior of resources used by containers, such as CPU and network usage.

With an intuitive and consistent set of parameters, Pumba is easy to use from the command line and hides the underlying Linux commands well from the user. Considering the case of a host running multiple containerized applications, Pumba needs to find the intended containers to apply chaos to, instead of changing the behavior of the host itself. To do this, it takes advantage of the API exposed by the Docker daemon running on the host machine to find containers by name, ID, or, when running on Kubernetes, by labels.

The user can choose from a variety of experiments related to container lifecycle management (stopping, killing, pausing, or removing a container), network manipulation between containers using network emulation (netem), an extension of traffic control (tc), and loading the target CPU with stress-ng.
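For a feel of the interface, here are two hedged command-line examples; the container name my-app and the exact values are just placeholders, and the flags mirror the DaemonSet arguments shown further below:

# kill a random container whose name matches the regular expression
pumba --random kill --signal SIGKILL "re2:^my-app"

# add a 3000ms +/- 30ms normally distributed delay to a container's traffic for 20 seconds
pumba netem --duration 20s --tc-image gaiadocker/iproute2 delay --time 3000 --jitter 30 --distribution normal my-app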
Installation and Management
The Pumba command-line tool can be used by installing its binary for the respective operating system, or directly as a Docker image. In a Kubernetes cluster, a pod wrapping the Pumba CLI tool can be deployed as a DaemonSet. The presence of a Pumba pod on every node makes it possible to find target applications without knowing in advance which node they are running on. (Note that you must delete the DaemonSet at the end of the experiment run.)

Because they are essentially CLI wrappers, the Pumba pods deployed and running on the cluster are meant to be ephemeral, active only while the underlying command is running. Once the experiment is complete, or there is a manual action to stop it (for example, removing the Pumba pods), the chaos injection is reverted. For example, network delays are switched on and off like a power button: one tc command turns them on and another returns everything to normal.

There is currently no option to deploy Pumba as a Kubernetes operator, which would be a more controlled way to manage experiments. The current deployment approach is difficult to manage in a multi-tenant environment, but multiple experiments can still be deployed simultaneously using different containers within the same DaemonSet.
Definition and variety of experiments

Beyond the set of experiments mentioned above, it is not possible to add new ones without changing Pumba's source code. The experiments that kill, stop, remove, or pause containers are easy to use. It gets interesting with the network experiments, through the use of netem commands. There, the user has a variety of options to compose a custom netem command that can cause delays, packet loss, rate limiting, and other types of network failures.

As an important note, Pumba always targets a specific interface as a whole (e.g. eth0) and cannot target specific ports or IPs on that interface.
The following manifest shows how to run a Pumba experiment by deploying Pumba as a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: pumba
spec:
  selector:
    matchLabels:
      app: pumba
  template:
    metadata:
      labels:
        app: pumba
        com.gaiaadm.pumba: "true"   # tell Pumba not to kill itself
      name: pumba
    spec:
      containers:
        # randomly add a 3000ms +/- 30ms delay to the 'test-2' pod containers every 30s for 20s,
        # where the delay variation follows the 'normal' distribution
        - image: gaiaadm/pumba
          imagePullPolicy: Always
          name: pumba-delay
          args:
            - --random
            - --log-level
            - info
            - --label
            - io.kubernetes.pod.name=test-2
            - --interval
            - 30s
            - netem
            - --duration
            - 20s
            - --tc-image
            - gaiadocker/iproute2
            - delay
            - --time
            - "3000"
            - --jitter
            - "30"
            - --distribution
            - normal
          resources:
            requests:
              cpu: 10m
              memory: 5M
            limits:
              cpu: 100m
              memory: 20M
          volumeMounts:
            - name: dockersocket
              mountPath: /var/run/docker.sock
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
In this experiment, Pumba uses netem to induce latency on the target application's network interface. This is accomplished by entering the target container's network namespace and adding a delay to all IP addresses on that interface. As soon as the duration of the experiment is exceeded (in this case, after 20 seconds), the experiment ends.
Security
Unsurprisingly, the ability to manipulate network traffic or load a CPU core with powerful Linux tools requires an unusual set of privileges on the system. Naturally, we want to keep these privileges away from business applications to minimize the security threat surface. Specifically, Linux capabilities such as NET_ADMIN or SYS_ADMIN may need to be added to the spec of the pod running Pumba to give it the permissions needed to change the appropriate settings.
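A hedged sketch of how this could be expressed inside the container entry of the DaemonSet above (whether these capabilities suffice, or whether privileged mode is needed instead, depends on the experiment and the cluster):

        - name: pumba-delay
          image: gaiaadm/pumba
          securityContext:
            capabilities:
              add:
                - NET_ADMIN   # assumed to be needed for tc/netem manipulation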
Another important security aspect of Pumba is that it requires access to a socket file on the host node where the underlying Docker daemon exposes its HTTP API, typically the /var/run/docker.sock file. In addition, it may need to run as a privileged container. The reason is that it queries the container runtime, in this case Docker, to find the right application containers for Pumba to target. It is important to keep this in mind, as these are node-level privileges granted to Pumba.
Observability

Other than reviewing the logs, Pumba does not provide any other way to report the results of the experiments. It is simply a runtime tool that performs specific tasks; once they are done, its job is done.
Verdict

The Pumba CLI is efficient and easy to use. It can act as an executor for specific experiments on a Kubernetes cluster, either as a DaemonSet or simply as a pod. Extensibility, manageability, and observability are limited, given the simplicity of the tool, making it less than ideal for a complicated multi-tenant environment without other supporting tools. It is definitely an interesting tool for one-off stand-alone experiments, or as part of a larger chaos platform.

Although we mentioned some security concerns due to the fact that some powerful (and potentially dangerous) Linux tools are used under the hood, this is not specific to Pumba. Any chaos implementation that uses such methods will have similar considerations.
Litmus
Litmus is a complete chaos framework focused solely on Kubernetes workloads. It consists of an operator written in Go that currently uses three main CRDs to run an experiment:
- ChaosExperiment: the design of the experiment, with default parameters.
- ChaosEngine: binds the experiment definition to the target workload. If valid, the Litmus operator starts the experiment, overriding any variables specified in the manifest file.
- ChaosResult: shows basic information about the progress and final result of the experiment.
After a ChaosEngine object is created, Litmus creates the chaos runner pod in the target namespace. This runner orchestrates the experiment in the specified namespace and against the specified targets.

Target identification is something that sets Litmus apart. To zero in on the target, the user needs to add a specific annotation to the deployment (more workload kinds are supported: DaemonSet, StatefulSet, and DeploymentConfig). The user then needs to set the labels and fields on the ChaosEngine object (an example is shown below) so that Litmus can find all (or some) of the pods of the target deployment.
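For example, assuming a target Deployment called my-app, opting it in for chaos looks roughly like this; the annotation is the part Litmus looks for, and the rest of the Deployment is just a minimal placeholder:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  annotations:
    litmuschaos.io/chaos: "true"   # opt this workload in for Litmus chaos
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: nginx:1.19        # placeholder application image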
Once the operator verifies that all of the above requirements are met (correct label, annotation, ChaosExperiment object, permissions), it creates an experiment runner pod that is responsible for running the experiment. This workflow makes it possible to limit the blast radius of an experiment, as well as to control concurrent experiment runs.
Installation and Management
The Litmus operator is a lightweight, stateless Go application that can be deployed to a Kubernetes cluster as a simple Deployment object. Litmus offers two ways to orchestrate an experiment. The default mode restricts the experiment to a specific namespace, following the process described above. In this case, cluster administrators should keep resource utilization in mind, since successful experiments depend on the resources available in each namespace.

Litmus also supports an admin mode, where the chaos runner and the experiment runner are created in the same namespace as the operator. From there, the experiment runner finds the target application and namespace to run the experiment against. Here the focus is on centralizing the resources created for chaos.

This mode has some limitations, however, mainly in terms of observability, since with multiple concurrent runs it is difficult to keep a clear view of all experiments, and in terms of cluster permissions, since in this case Litmus needs control not only over the workload-related APIs but also over node resources, and therefore requires higher cluster privileges.

In terms of management, Litmus is easy to use. However, it requires a bit more work when it comes to finishing an experiment: this involves manually removing labels, annotations, and custom resources, which has to be automated by the user. Work is under way on Argo workflows to add that extra layer of management when orchestrating multiple end-to-end experiments.
Definition and variety of experiments

The nice part of Litmus is that it provides a well-defined way to choose your own experiment runner. It uses the concept of chaos libraries, which define the packages used to run an experiment. For example, one experiment might use the Litmus-native library to kill a pod, while another experiment might use the Pumba library to run a network experiment.

This makes Litmus a very extensible and tool-agnostic framework, rather than just another chaos injection tool. Likewise, depending on the selected runner, you can define your own experiment.
Internally, Litmus currently uses an Ansible runner to define and run the experiment based on the chosen chaos library. However, there is active development to create a lighter and simpler Go runner, which the community seems to see as the way forward.
There is currently a wide variety of experiments supported by Litmus. They use Litmus-native or external tool libraries and can be found here.

Here is an example of a ChaosEngine definition that will set up and launch an experiment:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-chaos
  namespace: default
spec:
  appinfo:
    appns: 'default'
    applabel: 'chaos=true'
    appkind: 'deployment'
  # Can be true/false. If true, the appinfo checks are applied
  annotationCheck: 'true'
  # Can be active/stop. Patch to stop to abort an experiment
  engineState: 'active'
  auxiliaryAppInfo: ''
  chaosServiceAccount: <service-account>
  monitoring: false
  # Determines whether Litmus jobs are removed at the end of the experiment. Can be delete/retain
  jobCleanUpPolicy: 'delete'
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Set the total chaos duration (in seconds) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '30'
            # Set the chaos interval (in seconds) as desired
            - name: CHAOS_INTERVAL
              value: '30'
            # Pod failures without '--force' and with the default terminationGracePeriodSeconds
            - name: FORCE
              value: 'false'
In this custom resource, we tell Litmus to run the pod-delete experiment with specific parameters. Litmus will try to zero in on the target using the .spec.appinfo fields, assuming the user has already added the right annotations and labels, as explained in the Litmus introduction. These fields specify the namespace, label, and object kind of the target, and they can be made optional if the .spec.annotationCheck field is set to false.
The engine state can be set to stop, which abruptly ends the experiment. This is an important feature that can help you take action if the chaos spreads across the wider system.
Finally, we can configure the experiment using the environment variables, which override the default values of the experiment definition.
Once the ChaosEngine object has been validated and created, Litmus internally creates a regular Kubernetes Job with all the necessary parameters, which runs the experiment against the target.
Security
In terms of security, Litmus requires a well-defined set of permissions for its cluster roles. Additionally, a requirement of any experiment is that the experiment-specific ServiceAccount, Role, and RoleBinding objects exist in the target namespace. As discussed above, Litmus also provides a granular way to identify target workloads, starting at the top-level object and going all the way down to the pod level. This serves to limit the blast radius and ensure that chaos is only injected into the intended workloads.
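As a rough sketch of what this per-experiment RBAC can look like for a pod-delete experiment (the exact resource and verb lists vary per experiment and Litmus version, so treat the rules below as an approximation):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: default
rules:
  - apiGroups: ["", "batch", "litmuschaos.io"]
    resources: ["pods", "pods/log", "events", "jobs", "chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "get", "list", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: default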
Another interesting point is the set of permissions required by the runner pod. In the case of network experiments (for example with the Pumba chaos library), we need the same permissions mentioned above for Pumba, i.e. mounting the Docker socket or adding the appropriate capabilities in the security context. As we can see, Litmus is a multifaceted framework with different layers that need proper attention from a security perspective.
Observability

Litmus's reporting is primarily driven by the chaosresult custom resource. This is a customizable object that can be extended with more details about the experiment. For the moment, however, it provides fairly basic information, mainly the state of the experiment, the important events, and finally its verdict.
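Assuming the ChaosEngine shown earlier, the result object can then be inspected with something like the following; the name follows the <engine-name>-<experiment-name> convention, which is an assumption about the naming scheme rather than a guarantee:

kubectl get chaosresult app-chaos-pod-delete -n default -o yaml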
Verdict

Litmus appears to be a very promising chaos engineering framework that focuses on extensibility and orchestration to create chaos for Kubernetes-native workloads. It has a vibrant and supportive community behind it and was recently accepted into the Cloud Native Computing Foundation as a sandbox project.

It can certainly be improved in terms of reporting; however, it provides the framework to extend it. It can easily work in conjunction with any other tool, and its main purpose is to act as the orchestrator of chaos rather than the executor itself (although it can do that very efficiently as well). It is suitable for large and complicated systems that require a higher level of control and a growing variety of experiments.
Chaos Mesh

Chaos Mesh is a chaos platform designed exclusively for Kubernetes applications. It was developed by PingCAP to test the resiliency of its distributed TiDB database, and it is straightforward to use for other types of applications running on Kubernetes.

Like a typical operator, a controller-manager pod runs as a regular Deployment and is responsible for watching its own CRDs (NetworkChaos, IoChaos, StressChaos, PodChaos, KernelChaos, and TimeChaos), which users can use to create new objects that specify and start chaos experiments.

To enable the requested actions against applications, the controller may need to contact the Chaos Mesh daemon pods, exposed as a DaemonSet, so that it can, for example, manipulate the network stack locally to affect the target application pods running on the same physical node. For the I/O chaos types, such as simulating errors or delays when reading from and writing to file systems, the application pods have to share their volume mounts with a sidecar container that intercepts file-system calls. The sidecars are injected during the application deployment, backed by an admission webhook.
Chaos Mesh sits somewhere in between: it is a framework meant to be used on its own, not an orchestrator connecting different tools and extensions, while still offering a wide range of experiments out of the box.
Installation and Management
As a Kubernetes operator, installation is very simple and can be done by applying a set of manifests and CRDs to a cluster. A Helm chart is also available in the project repository, which simplifies installation with Helm. In terms of administration, it can be quite easy to keep using the Helm charts, as they are maintained by the community. On the other hand, the sidecar container injection and the need for a DaemonSet complicate the operation of Chaos Mesh a bit, as they can be considered quite intrusive to the cluster.
Definition and variety of experiments

The list of chaos types is grouped into the following categories: network, pod, I/O, time, kernel, and stress, each with its own CRD type. They all share a common selector input to find target pods, as well as an optional duration or recurring schedule for the chaos. Some of them, like NetworkChaos, have more options, such as delay, corruption, or partition.

An example of a NetworkChaos definition:
apiVersion: pingcap.com/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-example
  namespace: chaos-testing
spec:
  action: delay
  mode: one
  selector:
    labelSelectors:
      "app.kubernetes.io/component": "tikv"
  delay:
    latency: "90ms"
    correlation: "25"
    jitter: "90ms"
  duration: "10s"
  scheduler:
    cron: "@every 15s"
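For comparison, a pod-kill experiment only needs a PodChaos object. A minimal sketch, assuming the same API version and a hypothetical target label, could look like this:

apiVersion: pingcap.com/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      "app": "my-app"          # hypothetical target label
  scheduler:
    cron: "@every 1m"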
It is usually easy to follow the examples to create a YAML file for your use case. One notable exception is the disk I/O chaos: to properly configure the sidecars in the application pods, additional configuration must be done beforehand using ConfigMaps, which define what the sidecar containers will do when they are injected during the user's application deployment.
Security
Chaos Mesh also uses some Linux utilities to implement the low-level chaos types, and it has to use the Docker API on the host machine. Therefore, the daemon pods (deployed as a DaemonSet) run as privileged containers and mount the /var/run/docker.sock socket file. The controller-manager pod needs permissions to manage MutatingWebhookConfigurations when sidecar injection is enabled, in addition to some other expected role-based access control (RBAC) permissions.
Observability

The main project repository mentions a chaos dashboard side project, but it currently seems to work only for tests against their own database product. Building a more generic dashboard is on the roadmap. Until then, the state of chaos experiments can be monitored by examining the custom resource objects in the cluster.
Verdict

Unlike the other, higher-level tools on this list, Chaos Mesh does not have a strict concept of an experiment, and it is not an orchestrator with multiple implementation options. In that sense, it functions similarly to Pumba, as a simple chaos injector. Since it is available as a Kubernetes operator with a range of chaos options based on CRD types, it is certainly an easy tool to install and use. While the documentation could be better, the list of chaos types and configuration options is pretty impressive without the need for additional tools.
Other chaos tools
These four chaos engineering tools are not the only ones available. The open source community is always creating something new and constantly contributing to existing projects.
Tools like Chaosblade (which is quite similar to Chaos Mesh), kube-monkey, PowerfulSeal, KubeInvaders, Muxy, and Toxiproxy are also very popular and have their own strengths and weaknesses. Beyond the open source space, there are also various commercial products for chaos engineering, the most prominent being Gremlin, a complete commercial chaos engineering platform.
In addition, various community events are gaining traction these days, such as the Failover Conference, which offered many interesting insights into the world of site resiliency and chaos engineering.
Lessons learned
Testing all of the above tools has definitely shown us that Kubernetes-native chaos engineering is here to stay. The ability to run controlled experiments that represent real-world events on production systems may seem daunting at first, but it can certainly improve the quality not only of business applications but also of infrastructure systems.
We have identified two main categories of chaos engineering tools: the chaos orchestrators, of which Litmus and the Chaos Toolkit are the best known, and the chaos injectors, such as Pumba and Chaos Mesh. Chaos orchestrators aim to provide well-defined experiments that follow proper chaos engineering principles. Litmus is a more complete framework that still offers extensibility, while the Chaos Toolkit aims to become the standard API for defining experiments.

Chaos injectors focus on running the experiments. Pumba focuses on Docker containers and gives you the ability to create a variety of experiments, while Chaos Mesh makes it easy to run experiments on Kubernetes right away.
The bottom line is that, wherever you are on your cloud-native journey, chaos engineering can increase the resilience of your production systems and uncover hidden problems that typically only show up during real-world events. Depending on whether you need an injector or an orchestrator, there are many open source options available, each with its own advantages and disadvantages. The most important thing is to create chaos experiments that simulate real-world events in a well-defined, safe, and observable way.
Andreas Krivas is a Lead Cloud Native Engineer and Rafael Portela is a Cloud Native Engineer at Container Solutions. Christiaan Vermeulen, a Cloud Native consultant at the company, contributed to this article.
You can find all of our SRE and CRE information in one place; click here.