Chaos Framework is a platform for easy resilience testing in Kubernetes. It automatically generates a test scenario and executes it against your distributed app by simulating various failures.
The platform generates a test scenario consisting of several stages. Stages are executed in the order they are listed in the scenario. Each stage consists of several steps that run simultaneously, where each step represents a single failure applied to a single target. A stage can contain several different failures aimed at the same target, or the same failure aimed at several different targets.
Current status of development is shown here:
- failures
  - container CPU hog
  - container memory hog
  - container network corruption
  - container network duplication
  - container network latency
  - container network loss
  - node CPU hog
  - node IO stress
  - node memory hog
  - pod delete
  - pod IO stress
  - node restart (not yet implemented)
  - disk fill (not yet implemented)
  - node crash (not yet implemented)
  - HTTP message corruption (not yet implemented, see Muxy)
  - load testing (not yet implemented, see Gatling)
  - memory corruption, RNG starvation, DNS block (not yet implemented, see Chaos Engine)
  - a lot more
- target selection
  - target a random pod of a deployment
  - target specific percentage of pods of a deployment
  - target all pods of a deployment
  - target specific node
  - target a specific pod (can't select specific pod yet)
  - target an entire cluster (no cluster-level failures implemented yet)
- test generation
  - automatically generate test
  - preview generated test
- test monitoring
  - show test progress in a browser
- extensibility
  - set conditions before and after each stage of the test
The platform does not have vendor-specific functionality and theoretically can be used in any Kubernetes cluster. It was tested in the following environments:
- cloud providers
  - Azure Kubernetes Service
  - Digital Ocean (Kubernetes v1.19.6; doesn't work on Kubernetes v1.20.2 due to errors in Argo Workflows)
- Windows 10 (2004)
  - minikube with VirtualBox driver
  - minikube with Docker driver (WSL2) - additional setup required
  - Kubernetes cluster in Docker Desktop (WSL2) - additional setup required
- Linux (Ubuntu 20.04, Ubuntu 20.10, Fedora 34)
  - minikube with Docker driver
  - minikube with KVM driver
Note: on a platform whose Linux kernel lacks the netem module, none of the network-related failures will work! For Windows 10 WSL2, see the notes below.
On Windows 10 WSL2, none of the network-related failures will work because the default kernel lacks the netem module. You have to either create your Kubernetes cluster another way or recompile and replace the default WSL2 kernel.
See the WSL2 kernel GitHub repo and the detailed instructions on how to recompile and replace the kernel. Keep in mind that the instructions don't show how to enable the netem module; they only cover the general process of modifying the kernel. You will have to find the substring "NETEM" in the config (use CTRL+F) and change it.
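To check whether a kernel already has netem, its build configuration can be inspected. The helper below is a minimal sketch and not part of Chaos Framework: `netem_status` is a hypothetical name, and it assumes that the relevant kernel option is `CONFIG_NET_SCH_NETEM` (its name in mainline Linux) and that the kernel config is piped in on standard input.

```shell
# Reads a kernel config from stdin and reports whether netem is built in (=y)
# or available as a loadable module (=m).
# Assumption: the option is named CONFIG_NET_SCH_NETEM, as in mainline Linux.
netem_status() {
  if grep -Eq '^CONFIG_NET_SCH_NETEM=(y|m)$'; then
    echo "enabled"
  else
    echo "missing"
  fi
}
```

On a kernel that exposes its config at `/proc/config.gz` (an assumption; not every kernel does), `zcat /proc/config.gz | netem_status` prints `enabled` or `missing`. The same helper can be run against the `.config` file you edit before recompiling the WSL2 kernel.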
The platform requires you to deploy several dependencies:
- Litmus
- Argo Workflows
You can deploy them by following their official docs or by using the installation section below (recommended).
- Make sure your Kubernetes cluster is running and kubectl is installed.
- Install requirements:
  - Install the latest stable Litmus (v2.7.0):

        # Install Litmus operator.
        kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/mkdocs/docs/litmus-operator-v2.7.0.yaml
        # Install service account for Litmus.
        kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/2.7.0/mkdocs/docs/litmus-admin-rbac.yaml
        # Install generic experiments (the URL is quoted so the shell does not treat '?' as a glob).
        kubectl apply -f "https://hub.litmuschaos.io/api/chaos/2.7.0?file=charts/generic/experiments.yaml" -n litmus

  - Install the latest stable Argo (v3.3.1):

        kubectl create ns argo || echo "Namespace argo already exists."
        # Install service account and config map for Argo Workflows.
        kubectl apply -f deploy/argo.yaml -n litmus
        # Install Argo Workflows.
        kubectl apply -f https://github.com/argoproj/argo-workflows/releases/download/v3.3.1/install.yaml -n argo
        # Override auth method from 'sso' to 'server'.
        kubectl patch deploy/argo-server -n argo -p '{"spec": {"template": {"spec": {"containers": [{"name": "argo-server", "args": ["server", "--auth-mode", "server"]}]}}}}'
        # Override runtime executor from 'emissary' to 'k8sapi'.
        kubectl patch deploy/workflow-controller -n argo -p '{"spec": {"template": {"spec": {"containers": [{"name": "workflow-controller", "args": ["--configmap", "workflow-controller-configmap-k8sapi", "--executor-image", "quay.io/argoproj/argoexec:v3.3.1"]}]}}}}'

- Install the latest stable components of Chaos Framework:

      kubectl create ns chaos-app || echo "Namespace chaos-app already exists."
      kubectl create ns chaos-framework || echo "Namespace chaos-framework already exists."
      kubectl apply -f https://raw.githubusercontent.com/iskorotkov/chaos-scheduler/master/deploy/scheduler.yaml
      kubectl apply -f https://raw.githubusercontent.com/iskorotkov/chaos-workflows/master/deploy/workflows.yaml
      kubectl apply -f https://raw.githubusercontent.com/iskorotkov/chaos-frontend/master/deploy/frontend.yaml

- Install sample apps (or install yours):

      # Server.
      kubectl apply -n chaos-app -f https://raw.githubusercontent.com/iskorotkov/chaos-server/master/deploy/counter.yaml
      # Client.
      kubectl apply -n chaos-app -f https://raw.githubusercontent.com/iskorotkov/chaos-client/master/deploy/counter.yaml
- Launch test workflow:
  - Connect to Chaos Frontend:

        kubectl port-forward -n chaos-framework svc/frontend 8080:80

  - Open http://localhost:8080/ in your browser.
  - Tweak parameters and launch a workflow.
  - (Optional) Connect to Argo:

        kubectl port-forward -n argo svc/argo-server 2746:2746

  - (Optional) Open http://localhost:2746/ in your browser for more detailed info on workflows.
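Before opening the frontend, it can help to confirm that everything installed in the steps above has actually started. The following is a minimal sketch, assuming the namespaces used by the installation commands (`litmus`, `argo`, `chaos-framework`, `chaos-app`). `wait_for_namespace` and the `KUBECTL` override are hypothetical helpers introduced here for illustration, and the 180s default timeout is arbitrary.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every deployment in namespace $1 reports the Available condition.
# $2 is an optional timeout (default: 180s, an arbitrary choice).
wait_for_namespace() {
  "$KUBECTL" wait --for=condition=Available deployment --all -n "$1" --timeout="${2:-180s}"
}
```

For example, `for ns in litmus argo chaos-framework chaos-app; do wait_for_namespace "$ns"; done` returns once all deployments are available, or fails with a non-zero exit code when a timeout is hit.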
In order for your custom applications and deployments to work, make sure they meet all of the following requirements:
- Deployments and pods must have the label app=<name>. This label is used for selecting targets in the target namespace. The label key can be changed via environment variables (see Scheduler's README).
- Pods should contain a single container. Currently, failures can only be induced on the first container in a pod, so all additional containers are ignored.
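The label requirement above can be checked from the command line. This is a minimal sketch, assuming the default label key `app` and the `chaos-app` namespace from the installation steps. `missing_app_label` is a hypothetical helper, and it expects tab-separated `name<TAB>label value` lines on standard input.

```shell
# Reads "name<TAB>label-value" lines from stdin and prints the names
# whose label value is empty, i.e. resources without the required label.
missing_app_label() {
  awk -F '\t' '$2 == "" { print $1 }'
}
```

Suitable input can be produced with kubectl's JSONPath output, for example `kubectl get deploy -n chaos-app -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.app}{"\n"}{end}' | missing_app_label`. Any name it prints is a deployment the platform would not be able to select; the same check works for pods by replacing `deploy` with `pods`.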
Q: None of the failures work.
A: Check the requirements. All target deployments and pods must have the matching label app=<name>.
Q: Network-related failures don't work.
A: Check whether the netem kernel module is available.
Q: How to change Argo host/port/namespace (e.g. custom Argo deployment)?
Q: How to change Litmus namespace (e.g. custom Litmus deployment)?
Q: How to use another label for selecting targets?
Q: How to change the target namespace?
Q: How to change the duration or interval of test stages?
A: Download the Scheduler manifest, change its environment variables, and redeploy it. See Scheduler's README for more info.
Q: I try to delete a Kubernetes resource, but it won't get deleted (i.e. kubectl is stuck at deletion).
A: Edit the resource manually in your text editor and remove all finalizers. If that doesn't work (or if the resource doesn't have any finalizers), find related resources and remove their finalizers (e.g. when CRD deletion is stuck, delete all instances of this CRD).
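As an alternative to editing the resource by hand, finalizers can be cleared with a merge patch. The sketch below is one way to do it and is not taken from the Chaos Framework docs: `clear_finalizers` and the `KUBECTL` override are hypothetical helpers, and the resource kind, name, and namespace are whatever applies in your cluster.

```shell
# Allow the kubectl binary to be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Remove all finalizers from a resource via a JSON merge patch.
# $1 = resource kind, $2 = resource name, $3 = namespace (optional).
clear_finalizers() {
  if [ -n "${3:-}" ]; then
    "$KUBECTL" patch "$1" "$2" -n "$3" --type merge -p '{"metadata":{"finalizers":null}}'
  else
    "$KUBECTL" patch "$1" "$2" --type merge -p '{"metadata":{"finalizers":null}}'
  fi
}
```

Usage would look like `clear_finalizers <kind> <name> <namespace>`. Removing finalizers skips whatever cleanup the owning controller intended to run, so this is best kept for resources that are already known to be stuck.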
Q: Test workflow always fails after stage completion for no apparent reason.
A: Kubernetes <=v1.19 may cause this because of the docker driver. Upgrade your cluster to >=v1.20 and use the containerd runtime (it should be used by default), then check whether the error occurs again.
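To see which container runtime each node uses, the node status can be queried. This is a minimal sketch: `non_containerd_nodes` is a hypothetical helper that expects tab-separated `node<TAB>runtime` lines on standard input, where the runtime has the form Kubernetes reports in `.status.nodeInfo.containerRuntimeVersion` (for example `docker://19.3.8` or `containerd://1.4.3`; the version numbers here are illustrative).

```shell
# Reads "node<TAB>runtime-version" lines from stdin and prints the names
# of nodes whose runtime is anything other than containerd.
non_containerd_nodes() {
  awk -F '\t' '$2 !~ /^containerd:\/\// { print $1 }'
}
```

Input can be produced with `kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}' | non_containerd_nodes`. Empty output means every node already runs containerd. `kubectl get nodes -o wide` shows the same information, together with the Kubernetes version, in a table.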
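Chaos Framework components:
- Scheduler (iskorotkov/chaos-scheduler)
- Workflows (iskorotkov/chaos-workflows)
- Frontend (iskorotkov/chaos-frontend)
Sample apps:
- Server (iskorotkov/chaos-server)
- Client (iskorotkov/chaos-client)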