Scheduling, Preemption and Eviction - Kubernetes
1: Kubernetes Scheduler
2: Assigning Pods to Nodes
3: Pod Overhead
4: Pod Scheduling Readiness
5: Pod Topology Spread Constraints
6: Taints and Tolerations
7: Scheduling Framework
8: Dynamic Resource Allocation
9: Scheduler Performance Tuning
10: Resource Bin Packing
11: Pod Priority and Preemption
12: Node-pressure Eviction
13: API-initiated Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the
kubelet can run them. Preemption is the process of terminating Pods with lower Priority so
that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating
one or more Pods on Nodes.
Scheduling
Kubernetes Scheduler
Assigning Pods to Nodes
Pod Overhead
Pod Topology Spread Constraints
Taints and Tolerations
Scheduling Framework
Dynamic Resource Allocation
Scheduler Performance Tuning
Resource Bin Packing for Extended Resources
Pod Scheduling Readiness
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or
involuntarily.
1 - Kubernetes Scheduler
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that
Kubelet can run them.
Scheduling overview
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that
the scheduler discovers, the scheduler becomes responsible for finding the best Node for that
Pod to run on. The scheduler reaches this placement decision taking into account the
scheduling principles described below.
If you want to understand why Pods are placed onto a particular Node, or if you're planning to
implement a custom scheduler yourself, this page will help you learn about scheduling.
kube-scheduler
kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane.
kube-scheduler is designed so that, if you want and need to, you can write your own
scheduling component and use that instead.
Kube-scheduler selects an optimal node to run newly created or not yet scheduled
(unscheduled) pods. Since containers in pods - and pods themselves - can have different
requirements, the scheduler filters out any nodes that don't meet a Pod's specific scheduling
needs. Alternatively, the API lets you specify a node for a Pod when you create it, but this is
unusual and is only done in special cases.
In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes.
If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to
place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the
feasible Nodes and picks a Node with the highest score among the feasible ones to run the
Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that need to be taken into account for scheduling decisions include individual and
collective resource requirements, hardware / software / policy constraints, affinity and anti-
affinity specifications, data locality, inter-workload interference, and so on.
kube-scheduler selects a node for the Pod in a 2-step operation:
1. Filtering
2. Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example,
the PodFitsResources filter checks whether a candidate Node has enough available resource
to meet a Pod's specific resource requests. After this step, the node list contains any suitable
Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod
placement. The scheduler assigns a score to each Node that survived filtering, basing this
score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more
than one node with equal scores, kube-scheduler selects one of these at random.
There are two supported ways to configure the filtering and scoring behavior of the
scheduler:
1. Scheduling Policies allow you to configure Predicates for filtering and Priorities for
scoring.
2. Scheduling Profiles allow you to configure Plugins that implement different scheduling
stages, including: QueueSort , Filter , Score , Bind , Reserve , Permit , and others.
You can also configure the kube-scheduler to run different profiles.
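For instance, a minimal sketch of such a configuration might look like the following (the second profile name and the choice of disabled plugin are illustrative assumptions, not defaults):
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  # The default profile, left unchanged.
  - schedulerName: default-scheduler
  # A second, hypothetical profile that Pods can opt into by setting
  # .spec.schedulerName: no-spreading-scheduler
  - schedulerName: no-spreading-scheduler
    plugins:
      score:
        disabled:
          # Example: turn off topology spread scoring in this profile only.
          - name: PodTopologySpread
A Pod selects a profile by setting .spec.schedulerName to that profile's schedulerName .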
What's next
Read about scheduler performance tuning
Read about Pod topology spread constraints
Read the reference documentation for kube-scheduler
Read the kube-scheduler config (v1beta3) reference
Learn about configuring multiple schedulers
Learn about topology management policies
Learn about Pod Overhead
Learn about scheduling of Pods that use volumes in:
Volume Topology Support
Storage Capacity Tracking
Node-specific Volume Limits
2 - Assigning Pods to Nodes
You can use any of the following methods to choose where Kubernetes schedules specific Pods:
nodeSelector field matching against node labels
Affinity and anti-affinity
nodeName field
Pod topology spread constraints
Node labels
Like many other Kubernetes objects, nodes have labels. You can attach labels manually.
Kubernetes also populates a standard set of labels on all nodes in a cluster. See Well-Known
Labels, Annotations and Taints for a list of common node labels.
Note: The value of these labels is cloud provider specific and is not guaranteed to be
reliable. For example, the value of kubernetes.io/hostname may be the same as the node
name in some environments and a different value in other environments.
Node isolation/restriction
Adding labels to nodes allows you to target Pods for scheduling on specific nodes or groups of
nodes. You can use this functionality to ensure that specific Pods only run on nodes with
certain isolation, security, or regulatory properties.
If you use labels for node isolation, choose label keys that the kubelet cannot modify. This
prevents a compromised node from setting those labels on itself so that the scheduler
schedules workloads onto the compromised node.
The NodeRestriction admission plugin prevents the kubelet from setting or modifying labels
with a node-restriction.kubernetes.io/ prefix.
1. Ensure you are using the Node authorizer and have enabled the NodeRestriction
admission plugin.
2. Add labels with the node-restriction.kubernetes.io/ prefix to your nodes, and use
those labels in your node selectors. For example, example.com.node-
restriction.kubernetes.io/fips=true or example.com.node-
restriction.kubernetes.io/pci-dss=true .
nodeSelector
nodeSelectoris the simplest recommended form of node selection constraint. You can add
the nodeSelector field to your Pod specification and specify the node labels you want the
target node to have. Kubernetes only schedules the Pod onto nodes that have each of the
labels you specify.
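For instance, assuming you have labelled some of your nodes with disktype=ssd (a label chosen for this sketch, not one Kubernetes sets automatically), a Pod that should only run on those nodes could look like:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  # Only nodes carrying every label listed here are considered.
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx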
Affinity and anti-affinity
nodeSelector is the simplest way to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. The affinity feature consists of two types of affinity:
Node affinity functions like the nodeSelector field but is more expressive and allows you to specify soft rules.
Inter-pod affinity/anti-affinity allows you to constrain Pods against labels on other Pods.
Node affinity
Node affinity is conceptually similar to nodeSelector , allowing you to constrain which nodes your Pod can be scheduled on based on node labels. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution : The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector , but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution : The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
Note: In the preceding types, IgnoredDuringExecution means that if the node labels change after Kubernetes schedules the Pod, the Pod continues to run.
You can specify node affinities using the .spec.affinity.nodeAffinity field in your Pod
spec.
pods/pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- antarctica-east1
- antarctica-west1
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: registry.k8s.io/pause:2.0
The node must have a label with the key topology.kubernetes.io/zone and the value of
that label must be either antarctica-east1 or antarctica-west1 .
The node preferably has a label with the key another-node-label-key and the value
another-node-label-value .
You can use the operator field to specify a logical operator for Kubernetes to use when
interpreting the rules. You can use In , NotIn , Exists , DoesNotExist , Gt and Lt .
NotIn and DoesNotExist allow you to define node anti-affinity behavior. Alternatively, you can use node taints to repel Pods from specific nodes.
Note:
If you specify both nodeSelector and nodeAffinity , both must be satisfied for the Pod
to be scheduled onto a node.
See Assign Pods to Nodes using Node Affinity for more information.
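As a rough sketch of how the two fields interact (the disktype label and the zone value are assumptions for this example), a Pod that sets both is only schedulable onto nodes that satisfy the nodeSelector and every required node affinity term:
apiVersion: v1
kind: Pod
metadata:
  name: with-selector-and-affinity
spec:
  nodeSelector:
    disktype: ssd                  # must match, and ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In           # ... this term must match as well
            values:
            - antarctica-east1
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0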
Node affinity weight
You can specify a weight between 1 and 100 for each instance of the preferredDuringSchedulingIgnoredDuringExecution affinity type. When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum.
The final sum is added to the score of other priority functions for the node. Nodes with the
highest total score are prioritized when the scheduler makes a scheduling decision for the
Pod.
pods/pod-with-affinity-anti-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-affinity-anti-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: label-1
operator: In
values:
- key-1
- weight: 50
preference:
matchExpressions:
- key: label-2
operator: In
values:
- key-2
containers:
- name: with-node-affinity
image: registry.k8s.io/pause:2.0
Note: If you want Kubernetes to successfully schedule the Pods in this example, you must
have existing nodes with the kubernetes.io/os=linux label.
When configuring multiple scheduling profiles, you can associate a profile with a node affinity,
which is useful if a profile only applies to a specific set of nodes. To do so, add an
addedAffinity to the args field of the NodeAffinity plugin in the scheduler configuration.
For example:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: foo-scheduler
pluginConfig:
- name: NodeAffinity
args:
addedAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: scheduler-profile
operator: In
values:
- foo
Since the addedAffinity is not visible to end users, its behavior might be unexpected to
them. Use node labels that have a clear correlation to the scheduler profile name.
Note: The DaemonSet controller, which creates Pods for DaemonSets, does not support
scheduling profiles. When the DaemonSet controller creates Pods, the default Kubernetes
scheduler places those Pods and honors any nodeAffinity rules in the DaemonSet
controller.
Inter-pod affinity and anti-affinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your Pods can be scheduled on based on the labels of Pods already running on that node, instead of the node labels.
Inter-pod affinity and anti-affinity rules take the form "this Pod should (or, in the case of anti-
affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y",
where X is a topology domain like node, rack, cloud provider zone or region, or similar and Y is
the rule Kubernetes tries to satisfy.
You express these rules (Y) as label selectors with an optional associated list of namespaces.
Pods are namespaced objects in Kubernetes, so Pod labels also implicitly have namespaces.
Any label selectors for Pod labels should specify the namespaces in which Kubernetes should
look for those labels.
You express the topology domain (X) using a topologyKey , which is the key for the node label
that the system uses to denote the domain. For examples, see Well-Known Labels,
Annotations and Taints.
Note: Inter-pod affinity and anti-affinity require substantial amount of processing which
can slow down scheduling in large clusters significantly. We do not recommend using
them in clusters larger than several hundred nodes.
Note: Pod anti-affinity requires nodes to be consistently labelled, in other words, every
node in the cluster must have an appropriate label matching topologyKey. If some or all
nodes are missing the specified topologyKey label, it can lead to unintended behavior.
Similar to node affinity, there are two types of Pod affinity and anti-affinity, as follows:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
To use inter-pod affinity, use the affinity.podAffinity field in the Pod spec. For inter-pod
anti-affinity, use the affinity.podAntiAffinity field in the Pod spec.
pods/pod-with-pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-pod-affinity
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: registry.k8s.io/pause:2.0
This example defines one Pod affinity rule and one Pod anti-affinity rule. The Pod affinity rule
uses the "hard" requiredDuringSchedulingIgnoredDuringExecution , while the anti-affinity
rule uses the "soft" preferredDuringSchedulingIgnoredDuringExecution .
The affinity rule says that the scheduler can only schedule a Pod onto a node if the node is in
the same zone as one or more existing Pods with the label security=S1 . More precisely, the
scheduler must place the Pod on a node that has the topology.kubernetes.io/zone=V label,
as long as there is at least one node in that zone that currently has one or more Pods with the
Pod label security=S1 .
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a
node that is in the same zone as one or more Pods with the label security=S2 . More
precisely, the scheduler should try to avoid placing the Pod on a node that has the
topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the security=S2 Pod label.
To get yourself more familiar with the examples of Pod affinity and anti-affinity, refer to the
design proposal.
You can use the In , NotIn , Exists and DoesNotExist values in the operator field for Pod
affinity and anti-affinity.
In principle, the topologyKey can be any allowed label key with the following exceptions for
performance and security reasons:
For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution .
Namespace selector
FEATURE STATE: Kubernetes v1.24 [stable]
You can also select matching namespaces using namespaceSelector , which is a label query
over the set of namespaces. The affinity term is applied to namespaces selected by both
namespaceSelector and the namespaces field. Note that an empty namespaceSelector ({})
matches all namespaces, while a null or empty namespaces list and null namespaceSelector
matches the namespace of the Pod where the rule is defined.
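As a sketch (the team label on namespaces and the app=web Pod label are assumptions for this example), an affinity term that sets namespaceSelector counts matching Pods from every namespace labelled team=frontend ; this fragment belongs under a Pod's .spec :
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      # Match Pods in any namespace carrying this label, not only
      # the namespace of this Pod.
      namespaceSelector:
        matchLabels:
          team: frontend
      topologyKey: topology.kubernetes.io/zone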
More practical use-cases
Inter-pod affinity and anti-affinity can be even more useful when they are used with higher level collections such as ReplicaSets, StatefulSets, Deployments, and so on.
For example: imagine a three-node cluster. You use the cluster to run a web application and
also an in-memory cache (such as Redis). For this example, also assume that latency between
the web application and the memory cache should be as low as is practical. You could use
inter-pod affinity and anti-affinity to co-locate the web servers with the cache as much as
possible.
In the following example Deployment for the Redis cache, the replicas get the label
app=store . The podAntiAffinity rule tells the scheduler to avoid placing multiple replicas
with the app=store label on a single node. This creates each cache in a separate node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-cache
spec:
selector:
matchLabels:
app: store
replicas: 3
template:
metadata:
labels:
app: store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: redis-server
image: redis:3.2-alpine
The following example Deployment for the web servers creates replicas with the label
app=web-store . The Pod affinity rule tells the scheduler to place each replica on a node that
has a Pod with the label app=store . The Pod anti-affinity rule tells the scheduler never to
place multiple app=web-store servers on a single node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 3
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: web-app
image: nginx:1.16-alpine
Creating the two preceding Deployments results in the following cluster layout, where each
web server is co-located with a cache, on three separate nodes.
The overall effect is that each cache instance is likely to be accessed by a single client that is running on the same node. This approach aims to minimize both skew (imbalanced load) and
latency.
You might have other reasons to use Pod anti-affinity. See the ZooKeeper tutorial for an
example of a StatefulSet configured with anti-affinity for high availability, using the same
technique as this example.
nodeName
nodeName is a more direct form of node selection than affinity or nodeSelector . nodeName is
a field in the Pod spec. If the nodeName field is not empty, the scheduler ignores the Pod and
the kubelet on the named node tries to place the Pod on that node. Using nodeName
overrules using nodeSelector or affinity and anti-affinity rules.
If the named node does not exist, the Pod will not run, and in some cases may be
automatically deleted.
If the named node does not have the resources to accommodate the Pod, the Pod will
fail and its reason will indicate why, for example OutOfmemory or OutOfcpu.
Node names in cloud environments are not always predictable or stable.
Note: nodeName is intended for use by custom schedulers or advanced use cases where
you need to bypass any configured schedulers. Bypassing the schedulers might lead to
failed Pods if the assigned Nodes get oversubscribed. You can use node affinity or the nodeSelector field to assign a Pod to a specific Node without bypassing the schedulers.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx
nodeName: kube-01
Pod topology spread constraints
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. You might do this to improve performance, expected availability, or overall utilization.
Read Pod topology spread constraints to learn more about how these work.
Operators
The following are all the logical operators that you can use in the operator field for
nodeAffinity and podAffinity mentioned above.
Operator      Behavior
In            The label value is present in the supplied set of strings
NotIn         The label value is not contained in the supplied set of strings
Exists        A label with this key exists on the object
DoesNotExist  No label with this key exists on the object
Gt            The supplied value will be parsed as an integer, and that integer is less than the integer that results from parsing the value of a label named by this selector
Lt            The supplied value will be parsed as an integer, and that integer is greater than the integer that results from parsing the value of a label named by this selector
Note: Gt and Lt operators will not work with non-integer values. If the given value doesn't
parse as an integer, the pod will fail to get scheduled. Also, Gt and Lt are not available for
podAffinity.
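As a small illustration of Gt (the node label key example.com/gpu-count is hypothetical), the following required node affinity term only matches nodes whose label value parses to an integer greater than 4:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/gpu-count
          operator: Gt
          values:
          - "4"    # the node's label value must be an integer greater than 4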
What's next
Read more about taints and tolerations .
Read the design docs for node affinity and for inter-pod affinity/anti-affinity.
Learn about how the topology manager takes part in node-level resource allocation
decisions.
Learn how to use nodeSelector.
Learn how to use affinity and anti-affinity.
3 - Pod Overhead
FEATURE STATE: Kubernetes v1.24 [stable]
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod. In
Kubernetes, Pod Overhead is a way to account for the resources consumed by the Pod
infrastructure on top of the container requests & limits.
In Kubernetes, the Pod's overhead is set at admission time according to the overhead
associated with the Pod's RuntimeClass.
A pod's overhead is considered in addition to the sum of container resource requests when
scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing the Pod
cgroup, and when carrying out Pod eviction ranking.
Usage example
To work with Pod overhead, you need a RuntimeClass that defines the overhead field. As an
example, you could use the following RuntimeClass definition with a virtualization container
runtime that uses around 120MiB per Pod for the virtual machine and the guest OS:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata-fc
handler: kata-fc
overhead:
podFixed:
memory: "120Mi"
cpu: "250m"
Workloads that specify the kata-fc RuntimeClass handler will take the memory and cpu overheads into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
Consider running the following example workload, test-pod :
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
runtimeClassName: kata-fc
containers:
- name: busybox-ctr
image: busybox:1.28
stdin: true
tty: true
resources:
limits:
cpu: 500m
memory: 100Mi
- name: nginx-ctr
image: nginx
resources:
limits:
cpu: 1500m
memory: 100Mi
At admission time the RuntimeClass admission controller updates the workload's PodSpec to
include the overhead as described in the RuntimeClass. If the PodSpec already has this field
defined, the Pod will be rejected. In the given example, since only the RuntimeClass name is
specified, the admission controller mutates the Pod to include an overhead .
After the RuntimeClass admission controller has made modifications, you can check the
updated Pod overhead value:
map[cpu:250m memory:120Mi]
If a ResourceQuota is defined, the sum of container requests as well as the overhead field
are counted.
When the kube-scheduler is deciding which node should run a new Pod, the scheduler
considers that Pod's overhead as well as the sum of container requests for that Pod. For this
example, the scheduler adds the requests and the overhead, then looks for a node that has
2.25 CPU and 320 MiB of memory available.
Once a Pod is scheduled to a node, the kubelet on that node creates a new cgroup for the Pod. It is within this cgroup that the underlying container runtime will create containers.
If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with
limits defined), the kubelet will set an upper limit for the pod cgroup associated with that
resource (cpu.cfs_quota_us for CPU and memory.limit_in_bytes for memory). This upper limit is
based on the sum of the container limits plus the overhead defined in the PodSpec.
For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set cpu.shares based on
the sum of container requests plus the overhead defined in the PodSpec.
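For the kata-fc example above, the arithmetic works out roughly as follows (a back-of-the-envelope sketch; the cgroup file names shown assume cgroups v1):
# CPU:    500m + 1500m (container limits) + 250m (overhead) = 2250m
#         cpu.cfs_quota_us = 2.25 * 100000 = 225000   (with cpu.cfs_period_us = 100000)
# Memory: 100Mi + 100Mi (container limits) + 120Mi (overhead) = 320Mi
#         memory.limit_in_bytes = 320 * 1024 * 1024 = 335544320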
Looking at our example, the total container requests for the workload are 2000m CPU and 200MiB of memory. Checking the node where test-pod is scheduled (for example, with kubectl describe node ) shows requests for 2250m CPU and 320MiB of memory; the requests include the Pod overhead:
Namespace   Name       CPU Requests   CPU Limits    Memory Requests   Memory Limits
---------   ----       ------------   ----------    ---------------   -------------
default     test-pod   2250m (56%)    2250m (56%)   320Mi (1%)        320Mi (1%)
From this, you can determine the cgroup path for the Pod on that node (for example, by inspecting the Pod sandbox with crictl ). The resulting cgroup path includes the Pod's pause container; the Pod level cgroup is one directory above:
"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd
Checking the memory limit on the Pod level cgroup confirms a value of 335544320 bytes, which is the expected 320 MiB (the sum of the container limits plus the overhead).
Observability
Some kube_pod_overhead_* metrics are available in kube-state-metrics to help identify when
Pod overhead is being utilized and to help observe stability of workloads running with a
defined overhead.
What's next
Learn more about RuntimeClass
Read the PodOverhead Design enhancement proposal for extra context
4 - Pod Scheduling Readiness
Pods were considered ready for scheduling once created. Kubernetes scheduler does its due
diligence to find nodes to place all pending Pods. However, in a real-world case, some Pods
may stay in a "miss-essential-resources" state for a long period. These Pods actually churn the
scheduler (and downstream integrators like Cluster AutoScaler) in an unnecessary manner.
(Figure: a Pod is created, waits in a scheduling-gated state while its scheduling gates remain, and moves on to scheduling and running once the gates are removed.)
Usage example
To mark a Pod not-ready for scheduling, you can create it with one or more scheduling gates
like this:
pods/pod-with-scheduling-gates.yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
schedulingGates:
- name: example.com/foo
- name: example.com/bar
containers:
- name: pause
image: registry.k8s.io/pause:3.6
After the Pod's creation, its status is reported as SchedulingGated , and you can check its schedulingGates field:
[{"name":"example.com/foo"},{"name":"example.com/bar"}]
To inform scheduler this Pod is ready for scheduling, you can remove its schedulingGates
entirely by re-applying a modified manifest:
pods/pod-without-scheduling-gates.yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: pause
image: registry.k8s.io/pause:3.6
If you check the schedulingGates field again, the output is expected to be empty. You can then check the Pod's latest status:
Given that test-pod doesn't request any CPU/memory resources, it's expected that this Pod's state transitions from the previous SchedulingGated to Running .
Observability
The metric scheduler_pending_pods comes with a new label "gated" to distinguish whether
a Pod has been tried scheduling but claimed as unschedulable, or explicitly marked as not
ready for scheduling. You can use scheduler_pending_pods{queue="gated"} to check the
metric result.
Mutable Pod scheduling directives
You can mutate scheduling directives of Pods while they have scheduling gates, with certain constraints. At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated directives would cause the Pods to only be able to be scheduled on a subset of the nodes that they would previously match. More concretely, the rules for updating a Pod's scheduling directives are as follows:
1. For .spec.nodeSelector , only additions are allowed. If absent, it will be allowed to be set.
2. For spec.affinity.nodeAffinity , if nil, then setting anything is allowed.
3. If NodeSelectorTerms was empty, it will be allowed to be set. If not empty, then only
additions of NodeSelectorRequirements to matchExpressions or fieldExpressions are
allowed, and no changes to existing matchExpressions and fieldExpressions will be
allowed. This is because the terms in
.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms , are ORed
while the expressions in nodeSelectorTerms[].matchExpressions and
nodeSelectorTerms[].fieldExpressions are ANDed.
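As an illustrative sketch of a permitted update (the zone value is an assumption for this example), a gated Pod that was created without a node selector can have one added, because that only narrows the set of nodes it can match:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/foo
  - name: example.com/bar
  # Added while the Pod is still gated; this tightens the scheduling
  # directives, so the update is accepted.
  nodeSelector:
    topology.kubernetes.io/zone: antarctica-east1
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6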
What's next
Read the PodSchedulingReadiness KEP for more details
5 - Pod Topology Spread Constraints
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
You can set cluster-level constraints as a default, or configure topology spread constraints for individual workloads.
Motivation
Imagine that you have a cluster of up to twenty nodes, and you want to run a workload that
automatically scales how many replicas it uses. There could be as few as two Pods or as many
as fifteen. When there are only two Pods, you'd prefer not to have both of those Pods run on
the same node: you would run the risk that a single node failure takes your workload offline.
In addition to this basic usage, there are some advanced usage examples that enable your
workloads to benefit on high availability and cluster utilization.
As you scale up and run more Pods, a different concern becomes important. Imagine that you
have three nodes running five Pods each. The nodes have enough capacity to run that many
replicas; however, the clients that interact with this workload are split across three different
datacenters (or infrastructure zones). Now you have less concern about a single node failure,
but you notice that latency is higher than you'd like, and you are paying for network costs
associated with sending network traffic between the different zones.
You decide that under normal operation you'd prefer to have a similar number of replicas
scheduled into each infrastructure zone, and you'd like the cluster to self-heal in the case that
there is a problem.
Pod topology spread constraints offer you a declarative way to configure that.
topologySpreadConstraints field
The Pod API includes a field, spec.topologySpreadConstraints . The usage of this field looks
like the following:
---
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
# Configure a topology spread constraint
topologySpreadConstraints:
- maxSkew: <integer>
minDomains: <integer> # optional; beta since v1.25
topologyKey: <string>
whenUnsatisfiable: <string>
labelSelector: <object>
matchLabelKeys: <list> # optional; beta since v1.27
nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
nodeTaintsPolicy: [Honor|Ignore] # optional; beta since v1.26
### other Pod fields go here
You can read more about this field by running kubectl explain
Pod.spec.topologySpreadConstraints or refer to scheduling section of the API reference for
Pod.
maxSkew describes the degree to which Pods may be unevenly distributed. You must specify this field and the number must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable :
if you select whenUnsatisfiable: DoNotSchedule , then maxSkew defines the maximum permitted difference between the number of matching Pods in the target topology and the global minimum (the minimum number of matching Pods in an eligible domain, or zero if the number of eligible domains is less than minDomains ).
if you select whenUnsatisfiable: ScheduleAnyway , the scheduler gives higher precedence to topologies that would help reduce the skew.
minDomains indicates a minimum number of eligible domains. This field is optional.
Note: The minDomains field is a beta field and disabled by default in 1.25. You can
enable it by enabling the MinDomainsInPodTopologySpread feature gate.
The value of minDomains must be greater than 0, when specified. You can only
specify minDomains in conjunction with whenUnsatisfiable: DoNotSchedule .
When the number of eligible domains with matching topology keys is less than
minDomains , Pod topology spread treats global minimum as 0, and then the
calculation of skew is performed. The global minimum is the minimum number of
matching Pods in an eligible domain, or zero if the number of eligible domains is
less than minDomains .
When the number of eligible domains with matching topology keys equals or is
greater than minDomains , this value has no effect on scheduling.
If you do not specify minDomains , the constraint behaves as if minDomains is 1.
topologyKey is the key of node labels. Nodes that have a label with this key and
identical values are considered to be in the same topology. We call each instance of a
topology (in other words, a <key, value> pair) a domain. The scheduler will try to put a
balanced number of pods into each domain. Also, we define an eligible domain as a
domain whose nodes meet the requirements of nodeAffinityPolicy and
nodeTaintsPolicy.
whenUnsatisfiable indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
DoNotSchedule (default) tells the scheduler not to schedule it.
ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
labelSelector is used to find matching Pods. Pods that match this label selector are
counted to determine the number of Pods in their corresponding topology domain. See
Label Selectors for more details.
matchLabelKeys is a list of pod label keys to select the pods over which spreading will
be calculated. The keys are used to lookup values from the pod labels, those key-value
labels are ANDed with labelSelector to select the group of existing pods over which
spreading will be calculated for the incoming pod. The same key is forbidden to exist in
both matchLabelKeys and labelSelector . matchLabelKeys cannot be set when
labelSelector isn't set. Keys that don't exist in the pod labels will be ignored. A null or
empty list means only match against the labelSelector .
With matchLabelKeys , you don't need to update the pod.spec between different
revisions. The controller/operator just needs to set different values to the same label key
for different revisions. The scheduler will assume the values automatically based on
matchLabelKeys . For example, if you are configuring a Deployment, you can use the
label keyed with pod-template-hash, which is added automatically by the Deployment
controller, to distinguish between different revisions in a single Deployment.
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: foo
matchLabelKeys:
- pod-template-hash
Note: The matchLabelKeys field is a beta-level field and enabled by default in 1.27.
You can disable it by disabling the MatchLabelKeysInPodTopologySpread feature
gate.
nodeAffinityPolicy indicates how we will treat Pod's nodeAffinity/nodeSelector when calculating pod topology spread skew. Options are:
Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations.
Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.
If this value is null, the behavior is equivalent to the Honor policy.
nodeTaintsPolicy indicates how we will treat node taints when calculating pod topology
spread skew. Options are:
Honor: nodes without taints, along with tainted nodes for which the incoming pod
has a toleration, are included.
Ignore: node taints are ignored. All nodes are included.
If this value is null, the behavior is equivalent to the Ignore policy.
Note: The nodeTaintsPolicy is a beta-level field and enabled by default in 1.26. You
can disable it by disabling the NodeInclusionPolicyInPodTopologySpread feature
gate.
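Putting the two policies together, a sketch of a constraint that only counts nodes the Pod could actually be placed on (nodes matching its node affinity whose taints it tolerates) looks like this fragment of a Pod spec:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  # Only consider nodes that match this Pod's nodeAffinity/nodeSelector ...
  nodeAffinityPolicy: Honor
  # ... and whose taints this Pod tolerates.
  nodeTaintsPolicy: Honor
  labelSelector:
    matchLabels:
      app: foo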
When a Pod defines more than one topologySpreadConstraint , those constraints are
combined using a logical AND operation: the kube-scheduler looks for a node for the
incoming Pod that satisfies all the configured constraints.
Node labels
Topology spread constraints rely on node labels to identify the topology domain(s) that each
node is in. For example, a node might have labels:
region: us-east-1
zone: us-east-1a
Note:
For brevity, this example doesn't use the well-known label keys
topology.kubernetes.io/zone and topology.kubernetes.io/region . However, those
registered label keys are nonetheless recommended rather than the private (unqualified)
label keys region and zone that are used here.
You can't make a reliable assumption about the meaning of a private label key between
different contexts.
Suppose you have a 4-node cluster where node1 and node2 are labelled zone: zoneA and node3 and node4 are labelled zone: zoneB .
Consistency
You should set the same Pod topology spread constraints on all pods in a group.
Usually, if you are using a workload controller such as a Deployment, the pod template takes
care of this for you. If you mix different spread constraints then Kubernetes follows the API
definition of the field; however, the behavior is more likely to become confusing and
troubleshooting is less straightforward.
You need a mechanism to ensure that all the nodes in a topology domain (such as a cloud
provider region) are labelled consistently. To avoid you needing to manually label nodes, most
clusters automatically populate well-known labels such as kubernetes.io/hostname . Check
whether your cluster supports this.
Suppose that in the 4-node cluster above, 3 Pods labelled foo: bar are located on node1 , node2 and node3 respectively.
If you want an incoming Pod to be evenly spread with existing Pods across zones, you can use
a manifest similar to:
pods/topology-spread-constraints/one-constraint.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: registry.k8s.io/pause:3.1
From that manifest, topologyKey: zone implies the even distribution will only be applied to
nodes that are labelled zone: <any value> (nodes that don't have a zone label are skipped).
The field whenUnsatisfiable: DoNotSchedule tells the scheduler to let the incoming Pod stay
pending if the scheduler can't find a way to satisfy the constraint.
If the scheduler placed this incoming Pod into zone A , the distribution of Pods would become
[3, 1] . That means the actual skew is then 2 (calculated as 3 - 1 ), which violates maxSkew:
1 . To satisfy the constraints and context for this example, the incoming Pod can only be placed onto a node in zone B (that is, onto node3 or node4 ).
You can tweak the Pod spec to meet various kinds of requirements:
Change maxSkew to a bigger value - such as 2 - so that the incoming Pod can be placed
into zone A as well.
Change topologyKey to node so as to distribute the Pods evenly across nodes instead
of zones. In the above example, if maxSkew remains 1 , the incoming Pod can only be
placed onto the node node4 .
You can combine two topology spread constraints to control the spread of Pods both by node
and by zone:
pods/topology-spread-constraints/two-constraints.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
- maxSkew: 1
topologyKey: node
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: registry.k8s.io/pause:3.1
In this case, to match the first constraint, the incoming Pod can only be placed onto nodes in
zone B ; while in terms of the second constraint, the incoming Pod can only be scheduled to
the node node4 . The scheduler only considers options that satisfy all defined constraints, so
the only valid placement is onto node node4 .
If you were to apply two-constraints.yaml (the manifest from the previous example) to a cluster where the existing Pods labelled foo: bar are distributed so that the two constraints cannot be satisfied at the same time, you would see that the Pod mypod stays in the Pending state. This happens because:
to satisfy the first constraint, the Pod mypod can only be placed into zone B ; while in terms
of the second constraint, the Pod mypod can only schedule to node node2 . The intersection
of the two constraints returns an empty set, and the scheduler cannot place the Pod.
To overcome this situation, you can either increase the value of maxSkew or modify one of the
constraints to use whenUnsatisfiable: ScheduleAnyway . Depending on circumstances, you
might also decide to delete an existing Pod manually - for example, if you are troubleshooting
why a bug-fix rollout is not making progress.
Now suppose the cluster also contains a zone C with a node node5 , and you know that zone C must be excluded. In this case, you can compose a manifest as
below, so that Pod mypod will be placed into zone B instead of zone C . Similarly,
Kubernetes also respects spec.nodeSelector .
pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: zone
operator: NotIn
values:
- zoneC
containers:
- name: pause
image: registry.k8s.io/pause:3.1
Implicit conventions
There are some implicit conventions worth noting here:
Only the Pods holding the same namespace as the incoming Pod can be matching
candidates.
The scheduler bypasses any nodes that don't have any of the topologySpreadConstraints[*].topologyKey labels present. This implies that:
1. any Pods located on those bypassed nodes do not impact maxSkew calculation - in the above example, suppose the node node1 does not have a label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into zone A .
2. the incoming Pod has no chances to be scheduled onto this kind of nodes - in the
above example, suppose a node node5 has the mistyped label zone-typo: zoneC
(and no zone label set). After node node5 joins the cluster, it will be bypassed and
Pods for this workload aren't scheduled there.
Be aware of what will happen if the incoming Pod's
topologySpreadConstraints[*].labelSelector doesn't match its own labels. In the
above example, if you remove the incoming Pod's labels, it can still be placed onto nodes
in zone B , since the constraints are still satisfied. However, after that placement, the
degree of imbalance of the cluster remains unchanged - it's still zone A having 2 Pods
labelled as foo: bar , and zone B having 1 Pod labelled as foo: bar . If this is not what
you expect, update the workload's topologySpreadConstraints[*].labelSelector to
match the labels in the pod template.
Cluster-level default constraints
You can set default topology spread constraints for a cluster. Default constraints apply only to Pods that don't define any constraints in their own .spec.topologySpreadConstraints and that belong to a Service, ReplicaSet, StatefulSet or ReplicationController. They are set as part of the PodTopologySpread plugin arguments in a scheduling profile:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
If you don't configure any cluster-level default constraints for pod topology spreading, then
kube-scheduler acts as if you specified the following default topology constraints:
defaultConstraints:
- maxSkew: 3
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: "topology.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
Also, the legacy SelectorSpread plugin, which provides an equivalent behavior, is disabled by
default.
Note:
The PodTopologySpread plugin does not score the nodes that don't have the topology
keys specified in the spreading constraints. This might result in a different default
behavior compared to the legacy SelectorSpread plugin when using the default topology
constraints.
If you don't want to use the default Pod spreading constraints for your cluster, you can
disable those defaults by setting defaultingType to List and leaving empty
defaultConstraints in the PodTopologySpread plugin configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints: []
defaultingType: List
Comparison with podAffinity and podAntiAffinity
In Kubernetes, inter-Pod affinity and anti-affinity control how Pods are scheduled in relation to one another:
podAffinity attracts Pods; you can try to pack any number of Pods into qualifying topology domain(s).
podAntiAffinity repels Pods. If you set this to requiredDuringSchedulingIgnoredDuringExecution mode then only a single Pod can be scheduled into a single topology domain; if you prefer preferredDuringSchedulingIgnoredDuringExecution then you lose the ability to enforce the constraint.
For finer control, you can specify topology spread constraints to distribute Pods across
different topology domains - to achieve either high availability or cost-saving. This can also
help on rolling update workloads and scaling out replicas smoothly.
For more context, see the Motivation section of the enhancement proposal about Pod
topology spread constraints.
Known limitations
There's no guarantee that the constraints remain satisfied when Pods are removed. For
example, scaling down a Deployment may result in imbalanced Pods distribution.
You can use a tool such as the Descheduler to rebalance the Pods distribution.
The scheduler doesn't have prior knowledge of all the zones or other topology domains
that a cluster has. They are determined from the existing nodes in the cluster. This could
lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to
zero nodes, and you're expecting the cluster to scale up, because, in this case, those
topology domains won't be considered until there is at least one node in them.
You can work around this by using a cluster autoscaling tool that is aware of Pod topology spread constraints and is also aware of the overall set of topology domains.
What's next
The blog article Introducing PodTopologySpread explains maxSkew in some detail, as
well as covering some advanced usage examples.
Read the scheduling section of the API reference for Pod.
6 - Taints and Tolerations
Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods.
Tolerations are applied to pods. Tolerations allow the scheduler to schedule pods with
matching taints. Tolerations allow scheduling but don't guarantee scheduling: the scheduler
also evaluates other parameters as part of its function.
Taints and tolerations work together to ensure that pods are not scheduled onto
inappropriate nodes. One or more taints are applied to a node; this marks that the node
should not accept any pods that do not tolerate the taints.
Concepts
You add a taint to a node using kubectl taint. For example,
kubectl taint nodes node1 key1=value1:NoSchedule
places a taint on node node1 . The taint has key key1 , value value1 , and taint effect
NoSchedule . This means that no pod will be able to schedule onto node1 unless it has a
matching toleration.
To remove the taint added by the command above, you can run:
kubectl taint nodes node1 key1=value1:NoSchedule-
You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match"
the taint created by the kubectl taint line above, and thus a pod with either toleration
would be able to schedule onto node1 :
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
tolerations:
- key: "key1"
operator: "Exists"
effect: "NoSchedule"
pods/pod-with-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
tolerations:
- key: "example-key"
operator: "Exists"
effect: "NoSchedule"
A toleration "matches" a taint if the keys are the same and the effects are the same, and:
Note:
There are two special cases:
An empty key with operator Exists matches all keys, values and effects which means
this will tolerate everything.
The above example used effect of NoSchedule . Alternatively, you can use effect of
PreferNoSchedule . This is a "preference" or "soft" version of NoSchedule -- the system will try
to avoid placing a pod that does not tolerate the taint on the node, but it is not required. The
third kind of effect is NoExecute , described later.
You can put multiple taints on the same node and multiple tolerations on the same pod. The
way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a
node's taints, then ignore the ones for which the pod has a matching toleration; the
remaining un-ignored taints have the indicated effects on the pod. In particular,
if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not
schedule the pod onto that node
if there is no un-ignored taint with effect NoSchedule but there is at least one un-
ignored taint with effect PreferNoSchedule then Kubernetes will try to not schedule the
pod onto the node
if there is at least one un-ignored taint with effect NoExecute then the pod will be
evicted from the node (if it is already running on the node), and will not be scheduled
onto the node (if it is not yet running on the node).
For example, imagine you taint a node like this:
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
And a pod has two tolerations:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
In this case, the pod will not be able to schedule onto the node, because there is no toleration
matching the third taint. But it will be able to continue running if it is already running on the
node when the taint is added, because the third taint is the only one of the three that is not
tolerated by the pod.
Normally, if a taint with effect NoExecute is added to a node, then any pods that do not
tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be
evicted. However, a toleration with NoExecute effect can specify an optional
tolerationSeconds field that dictates how long the pod will stay bound to the node after the
taint is added. For example,
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
tolerationSeconds: 3600
means that if this pod is running and a matching taint is added to the node, then the pod will
stay bound to the node for 3600 seconds, and then be evicted. If the taint is removed before
that time, the pod will not be evicted.
Example Use Cases
Taints and tolerations are a flexible way to steer pods away from nodes or to evict pods that shouldn't be running. A few of the use cases are:
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a
particular set of users, you can add a taint to those nodes (say, kubectl taint nodes
nodename dedicated=groupName:NoSchedule ) and then add a corresponding toleration to
their pods (this would be done most easily by writing a custom admission controller).
The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes
as well as any other nodes in the cluster. If you want to dedicate the nodes to them and
ensure they only use the dedicated nodes, then you should additionally add a label
similar to the taint to the same set of nodes (e.g. dedicated=groupName ), and the
admission controller should additionally add a node affinity to require that the pods can
only schedule onto nodes labeled with dedicated=groupName .
Nodes with Special Hardware: In a cluster where a small subset of nodes have
specialized hardware (for example GPUs), it is desirable to keep pods that don't need the
specialized hardware off of those nodes, thus leaving room for later-arriving pods that
do need the specialized hardware. This can be done by tainting the nodes that have the
specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or
kubectl taint nodes nodename special=true:PreferNoSchedule ) and adding a
corresponding toleration to pods that use the special hardware. As in the dedicated
nodes use case, it is probably easiest to apply the tolerations using a custom admission
controller. For example, it is recommended to use Extended Resources to represent the
special hardware, taint your special hardware nodes with the extended resource name
and run the ExtendedResourceToleration admission controller. Now, because the nodes
are tainted, no pods without the toleration will schedule on them. But when you submit
a pod that requests the extended resource, the ExtendedResourceToleration admission
controller will automatically add the correct toleration to the pod and that pod will
schedule on the special hardware nodes. This will make sure that these special
hardware nodes are dedicated for pods requesting such hardware and you don't have to
manually add tolerations to your pods.
Taint based Evictions: A per-pod-configurable eviction behavior when there are node
problems, which is described in the next section.
Taint based Evictions
The NoExecute taint effect, mentioned above, affects pods that are already running on the node as follows:
pods that do not tolerate the taint are evicted immediately
pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time; after that time elapses, the node lifecycle controller evicts the pods from the node
The node controller automatically taints a Node when certain conditions are true. The following taints are built in:
node.kubernetes.io/not-ready
node.kubernetes.io/unreachable
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure
node.kubernetes.io/network-unavailable
node.kubernetes.io/unschedulable
node.cloudprovider.kubernetes.io/uninitialized
In case a node is to be evicted, the node controller or the kubelet adds relevant taints with
NoExecute effect. If the fault condition returns to normal the kubelet or node controller can
remove the relevant taint(s).
In some cases when the node is unreachable, the API server is unable to communicate with
the kubelet on the node. The decision to delete the pods cannot be communicated to the
kubelet until communication with the API server is re-established. In the meantime, the pods
that are scheduled for deletion may continue to run on the partitioned node.
Note: The control plane limits the rate of adding new taints to nodes. This rate
limiting manages the number of evictions that are triggered when many nodes become
unreachable at once (for example: if there is a network disruption).
You can specify tolerationSeconds for a Pod to define how long that Pod stays bound to a
failing or unresponsive Node.
For example, you might want to keep an application with a lot of local state bound to the node for
a long time in the event of network partition, hoping that the partition will recover and thus
the pod eviction can be avoided. The toleration you set for that Pod might look like:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 6000
Note:
Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and
node.kubernetes.io/unreachable with tolerationSeconds=300 , unless you, or a
controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5
minutes after one of these problems is detected.
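In a Pod's spec, those automatically added tolerations look roughly like this sketch:
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300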
DaemonSet pods are created with NoExecute tolerations for the following taints with no
tolerationSeconds :
node.kubernetes.io/unreachable
node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems.
Taint Nodes by Condition
The scheduler checks taints, not node conditions, when it makes scheduling decisions. This
ensures that node conditions don't directly affect scheduling. For example, if the
DiskPressure node condition is active, the control plane adds the node.kubernetes.io/disk-
pressure taint and does not schedule new pods onto the affected node. If the
MemoryPressure node condition is active, the control plane adds the
node.kubernetes.io/memory-pressure taint.
You can ignore node conditions for newly created pods by adding the corresponding Pod
tolerations. The control plane also adds the node.kubernetes.io/memory-pressure toleration
on pods that have a QoS class other than BestEffort . This is because Kubernetes treats
pods in the Guaranteed or Burstable QoS classes (even pods with no memory request set)
as if they are able to cope with memory pressure, while new BestEffort pods are not
scheduled onto the affected node.
The DaemonSet controller automatically adds the following NoSchedule tolerations to all
daemons, to prevent DaemonSets from breaking.
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure (1.14 or later)
node.kubernetes.io/unschedulable (1.10 or later)
node.kubernetes.io/network-unavailable (host network only)
Adding these tolerations ensures backward compatibility. You can also add arbitrary
tolerations to DaemonSets.
What's next
Read about Node-pressure Eviction and how you can configure it
Read about Pod Priority
7 - Scheduling Framework
FEATURE STATE: Kubernetes v1.19 [stable]
The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It adds a
new set of "plugin" APIs to the existing scheduler. Plugins are compiled into the scheduler.
The APIs allow most scheduling features to be implemented as plugins, while keeping the
scheduling "core" lightweight and maintainable. Refer to the design proposal of the
scheduling framework for more technical information on the design of the framework.
Framework workflow
The Scheduling Framework defines a few extension points. Scheduler plugins register to be
invoked at one or more extension points. Some of these plugins can change the scheduling
decisions and some are informational only.
Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the
binding cycle.
Scheduling cycles are run serially, while binding cycles may run concurrently.
Extension points
The following picture shows the scheduling context of a Pod and the extension points that the
scheduling framework exposes. In this picture "Filter" is equivalent to "Predicate" and
"Scoring" is equivalent to "Priority function".
One plugin may register at multiple extension points to perform more complex or stateful
tasks.
PreEnqueue
These plugins are called prior to adding Pods to the internal active queue, where Pods are
marked as ready for scheduling.
Only when all PreEnqueue plugins return Success is the Pod allowed to enter the active
queue. Otherwise, it is placed in the internal list of unschedulable Pods and does not get an
Unschedulable condition.
For more details about how internal scheduler queues work, read Scheduling queue in kube-
scheduler.
QueueSort
These plugins are used to sort Pods in the scheduling queue. A queue sort plugin essentially
provides a Less(Pod1, Pod2) function. Only one queue sort plugin may be enabled at a time.
PreFilter
These plugins are used to pre-process info about the Pod, or to check certain conditions that
the cluster or the Pod must meet. If a PreFilter plugin returns an error, the scheduling cycle is
aborted.
Filter
These plugins are used to filter out nodes that cannot run the Pod. For each node, the
scheduler will call filter plugins in their configured order. If any filter plugin marks the node as
infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated
concurrently.
PostFilter
These plugins are called after Filter phase, but only when no feasible nodes were found for
the pod. Plugins are called in their configured order. If any postFilter plugin marks the node as
Schedulable , the remaining plugins will not be called. A typical PostFilter implementation is
preemption, which tries to make the pod schedulable by preempting other Pods.
PreScore
These plugins are used to perform "pre-scoring" work, which generates a sharable state for
Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.
Score
These plugins are used to rank nodes that have passed the filtering phase. The scheduler will
call each scoring plugin for each node. There will be a well defined range of integers
representing the minimum and maximum scores. After the NormalizeScore phase, the
scheduler will combine node scores from all plugins according to the configured plugin
weights.
NormalizeScore
These plugins are used to modify scores before the scheduler computes a final ranking of
Nodes. A plugin that registers for this extension point will be called with the Score results
from the same plugin. This is called once per plugin per scheduling cycle.
For example, suppose a plugin BlinkingLightScorer ranks Nodes based on how many
blinking lights they have.
However, the maximum count of blinking lights may be small compared to NodeScoreMax . To
fix this, BlinkingLightScorer should also register for this extension point.
Note: Plugins wishing to perform "pre-reserve" work should use the NormalizeScore
extension point.
Reserve
A plugin that implements the Reserve extension has two methods, namely Reserve and
Unreserve , that back two informational scheduling phases called Reserve and Unreserve,
respectively. Plugins which maintain runtime state (aka "stateful plugins") should use these
phases to be notified by the scheduler when resources on a node are being reserved and
unreserved for a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its designated node.
It exists to prevent race conditions while the scheduler waits for the bind to succeed. The
Reserve method of each Reserve plugin may succeed or fail; if one Reserve method call
fails, subsequent plugins are not executed and the Reserve phase is considered to have failed.
If the Reserve method of all plugins succeeds, the Reserve phase is considered
successful and the rest of the scheduling cycle and the binding cycle are executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this
happens, the Unreserve method of all Reserve plugins will be executed in the reverse order
of Reserve method calls. This phase exists to clean up the state associated with the reserved
Pod.
Permit
Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay
the binding to the candidate node. A permit plugin can do one of the three things:
1. approve
Once all Permit plugins approve a Pod, it is sent for binding.
2. deny
If any Permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger
the Unreserve phase in Reserve plugins.
Note: While any plugin can access the list of "waiting" Pods and approve them (see
FrameworkHandle), we expect only the permit plugins to approve binding of reserved Pods
that are in "waiting" state. Once a Pod is approved, it is sent to the PreBind phase.
PreBind
These plugins are used to perform any work required before a Pod is bound. For example, a
pre-bind plugin may provision a network volume and mount it on the target node before
allowing the Pod to run there.
If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling
queue.
Bind
These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all
PreBind plugins have completed. Each bind plugin is called in the configured order. A bind
plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle
a Pod, the remaining bind plugins are skipped.
PostBind
This is an informational extension point. Post-bind plugins are called after a Pod is
successfully bound. This is the end of a binding cycle, and can be used to clean up associated
resources.
Plugin API
There are two steps to the plugin API. First, plugins must register and get configured, then
they use the extension point interfaces. Extension point interfaces have the following form.
// ...
Plugin configuration
You can enable or disable plugins in the scheduler configuration. If you are using Kubernetes
v1.18 or later, most scheduling plugins are in use and enabled by default.
In addition to default plugins, you can also implement your own scheduling plugins and get
them configured along with default plugins. You can visit scheduler-plugins for more details.
If you are using Kubernetes v1.18 or later, you can configure a set of plugins as a scheduler
profile and then define multiple profiles to fit various kinds of workload. Learn more at
multiple profiles.
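As a rough sketch of what such a configuration might look like, the following defines two profiles: one with the default plugins and one hypothetical profile that disables all scoring plugins. The profile names are illustrative, not taken from this page:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler     # hypothetical profile name
  plugins:
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'

Pods then opt into a profile by setting .spec.schedulerName to the profile's schedulerName.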
8 - Dynamic Resource Allocation
Dynamic resource allocation is a new API for requesting and sharing resources between pods
and containers inside a pod. It is a generalization of the persistent volumes API for generic
resources. Third-party resource drivers are responsible for tracking and allocating resources.
Different kinds of resources support arbitrary parameters for defining requirements and
initialization.
API
The resource.k8s.io/v1alpha2 API group provides four new types:
ResourceClass
Defines which resource driver handles a certain kind of resource and provides common
parameters for it. ResourceClasses are created by a cluster administrator when installing a
resource driver.
ResourceClaim
Defines a particular resource instance that is required by a workload. Created by a user
(lifecycle managed manually, can be shared between different Pods) or for individual Pods
by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used
by just one Pod).
ResourceClaimTemplate
Defines the spec and some metadata for creating ResourceClaims. Created by a user
when deploying a workload.
PodSchedulingContext
Used internally by the control plane and resource drivers to coordinate pod scheduling
when ResourceClaims need to be allocated for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically
using the type defined by a CRD that was created when installing a resource driver.
The core/v1 PodSpec defines ResourceClaims that are needed for a Pod in a new
resourceClaims field. Entries in that list reference either a ResourceClaim or a
ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this PodSpec (for
example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When
referencing a ResourceClaimTemplate, each Pod gets its own instance.
The resources.claims list for container resources defines whether a container gets access to
these resource instances, which makes it possible to share resources between one or more
containers.
Here is an example for a fictional resource driver. Two ResourceClaim objects will get created
for this Pod and each container gets access to one of them.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
name: resource.example.com
driverName: resource-driver.example.com
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cat-claim-parameters
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cat-claim-parameters
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers:
  - name: container0
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: container1
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
Scheduling
In contrast to native resources (CPU, RAM) and extended resources (managed by a device
plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources
are available in a cluster or how they could be split up to satisfy the requirements of a specific
ResourceClaim. Resource drivers are responsible for that. They mark ResourceClaims as
"allocated" once resources for them are reserved. This then tells the scheduler where in the
cluster a ResourceClaim is available.
ResourceClaims can get allocated as soon as they are created ("immediate allocation"),
without considering which Pods will use them. The default is to delay allocation until a Pod
that needs the ResourceClaim gets scheduled ("wait for first consumer").
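As a hedged sketch, a user-created claim could opt into immediate allocation via the allocationMode field. This assumes the resource.k8s.io/v1alpha2 schema as the author understands it; the claim name is hypothetical, and the class name reuses the fictional driver from the example above:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: shared-cat                        # hypothetical name
spec:
  resourceClassName: resource.example.com
  # Allocate as soon as the claim is created, instead of waiting
  # for the first consuming Pod to be scheduled (the default).
  allocationMode: Immediate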
In the default wait-for-first-consumer mode, the scheduler checks all ResourceClaims needed by a Pod and creates a
PodSchedulingContext object where it informs the resource drivers responsible for those
ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource
drivers respond by excluding nodes that don't have enough of the driver's resources left.
Once the scheduler has that information, it selects one node and stores that choice in the
PodSchedulingContext object. The resource drivers then allocate their ResourceClaims so that the
resources will be available on that node. Once that is complete, the Pod gets scheduled.
As part of this process, ResourceClaims also get reserved for the Pod. Currently
ResourceClaims can either be used exclusively by a single Pod or an unlimited number of
Pods.
One key feature is that Pods do not get scheduled to a node unless all of their resources are
allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node
and then cannot run there, which is bad because such a pending Pod also blocks all other
resources like RAM or CPU that were set aside for it.
Monitoring resources
The kubelet provides a gRPC service to enable discovery of dynamic resources of running
Pods. For more information on the gRPC endpoints, see the resource allocation reporting.
Limitations
The scheduler plugin must be involved in scheduling Pods which use ResourceClaims.
Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses
to start because the ResourceClaims are not reserved or not even allocated. It may be
possible to remove this limitation in the future.
A quick check whether a Kubernetes cluster supports the feature is to list ResourceClass
objects, for example with kubectl get resourceclasses.
If your cluster supports dynamic resource allocation, the response is either a list of
ResourceClass objects or:
No resources found
If the feature is not supported, the server instead reports that it does not have a resource type
called resourceclasses.
In addition to enabling the feature in the cluster, a resource driver also has to be installed.
Please refer to the driver's documentation for details.
What's next
For more information on the design, see the Dynamic Resource Allocation KEP.
9 - Scheduler Performance Tuning
Nodes in a cluster that meet the scheduling requirements of a Pod are called feasible Nodes
for the Pod. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to
score the feasible Nodes, picking a Node with the highest score among the feasible ones to
run the Pod. The scheduler then notifies the API server about this decision in a process called
Binding.
This page explains performance tuning optimizations that are relevant for large Kubernetes
clusters.
In large clusters, you can tune the scheduler's behaviour to balance scheduling outcomes
between latency (new Pods are placed quickly) and accuracy (the scheduler rarely makes poor
placement decisions).
You configure this tuning via the kube-scheduler setting percentageOfNodesToScore . This
KubeSchedulerConfiguration setting determines a threshold for scheduling nodes in your
cluster.
To change the value, edit the kube-scheduler configuration file and then restart the scheduler.
In many cases, the configuration file can be found at /etc/kubernetes/config/kube-
scheduler.yaml .
You specify a threshold for how many nodes are enough, as a whole number percentage of all
the nodes in your cluster. The kube-scheduler converts this into an integer number of nodes.
During scheduling, if the kube-scheduler has identified enough feasible nodes to exceed the
configured percentage, the kube-scheduler stops searching for more feasible nodes and
moves on to the scoring phase.
How the scheduler iterates over Nodes describes the process in detail.
Default threshold
If you don't specify a threshold, Kubernetes calculates a figure using a linear formula that
yields 50% for a 100-node cluster and yields 10% for a 5000-node cluster. The lower bound for
the automatic value is 5%.
This means that the kube-scheduler always scores at least 5% of your cluster, no matter how
large the cluster is, unless you have explicitly set percentageOfNodesToScore to be smaller
than 5.
If you want the scheduler to score all nodes in your cluster, set percentageOfNodesToScore to
100.
Example
Below is an example configuration that sets percentageOfNodesToScore to 50%.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider

...

percentageOfNodesToScore: 50
Tuning percentageOfNodesToScore
percentageOfNodesToScore must be a value between 1 and 100 with the default value being
calculated based on the cluster size. There is also a hardcoded minimum value of 50 nodes.
Note:
In clusters with fewer than 50 feasible nodes, the scheduler still checks all the nodes
because there are not enough feasible nodes to stop the scheduler's search early.
In a small cluster, if you set a low value for percentageOfNodesToScore , your change will
have little or no effect, for a similar reason.
If your cluster has several hundred Nodes or fewer, leave this configuration option at its
default value. Making changes is unlikely to improve the scheduler's performance
significantly.
An important detail to consider when setting this value is that when a smaller number of
nodes in a cluster are checked for feasibility, some nodes are not sent to be scored for a given
Pod. As a result, a Node which could possibly score a higher value for running the given Pod
might not even be passed to the scoring phase. This would result in a less than ideal
placement of the Pod.
You should avoid setting percentageOfNodesToScore very low so that kube-scheduler does
not make frequent, poor Pod placement decisions. Avoid setting the percentage to anything
below 10%, unless the scheduler's throughput is critical for your application and the score of
nodes is not important. In other words, you prefer to run the Pod on any Node as long as it is
feasible.
In order to give all the Nodes in a cluster a fair chance of being considered for running Pods,
the scheduler iterates over the nodes in a round robin fashion. You can imagine that Nodes
are in an array. The scheduler starts from the start of the array and checks feasibility of the
nodes until it finds enough Nodes as specified by percentageOfNodesToScore . For the next
Pod, the scheduler continues from the point in the Node array that it stopped at when
checking feasibility of Nodes for the previous Pod.
If Nodes are in multiple zones, the scheduler iterates over Nodes in various zones to ensure
that Nodes from different zones are considered in the feasibility checks. As an example,
consider six nodes in two zones:

Zone 1: Node 1, Node 2, Node 3, Node 4
Zone 2: Node 5, Node 6

The scheduler evaluates feasibility of the nodes in this order: Node 1, Node 5, Node 2, Node 6,
Node 3, Node 4. After going over all the Nodes, it goes back to Node 1.
What's next
Check the kube-scheduler configuration reference (v1beta3)
10 - Resource Bin Packing

You can configure the kube-scheduler to favor bin packing of resources by using the
MostAllocated or RequestedToCapacityRatio scoring strategies of the NodeResourcesFit plugin.
To set the MostAllocated strategy for the NodeResourcesFit plugin, use a scheduler
configuration similar to the following:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        - name: intel.com/foo
          weight: 3
        - name: intel.com/bar
          weight: 3
        type: MostAllocated
    name: NodeResourcesFit
To learn more about other parameters and their default configuration, see the API
documentation for NodeResourcesFitArgs .
Below is an example configuration that sets the bin packing behavior for extended resources
intel.com/foo and intel.com/bar using the requestedToCapacityRatio field.
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: intel.com/foo
          weight: 3
        - name: intel.com/bar
          weight: 3
        requestedToCapacityRatio:
          shape:
          - utilization: 0
            score: 0
          - utilization: 100
            score: 10
        type: RequestedToCapacityRatio
    name: NodeResourcesFit
To learn more about other parameters and their default configuration, see the API
documentation for NodeResourcesFitArgs .
The shape field in requestedToCapacityRatio specifies the behavior of the RequestedToCapacityRatio scoring function:

shape:
- utilization: 0
  score: 0
- utilization: 100
  score: 10
The above arguments give the node a score of 0 if utilization is 0% and a score of 10 if
utilization is 100%, thus enabling bin-packing behavior. To enable least-requested behavior
instead, the score values must be reversed as follows:
shape:
- utilization: 0
  score: 10
- utilization: 100
  score: 0
resources is an optional parameter that defaults to:

resources:
- name: cpu
  weight: 1
- name: memory
  weight: 1
It can be extended to include extended resources as follows:

resources:
- name: intel.com/foo
  weight: 5
- name: cpu
  weight: 3
- name: memory
  weight: 1
The weight parameter is optional and is set to 1 if not specified. Also, the weight cannot be
set to a negative value.
As a worked example of how a node's score is computed with this strategy, consider the following.
Requested resources:
intel.com/foo : 2
memory: 256MB
cpu: 2
Resource weights:
intel.com/foo : 5
memory: 1
cpu: 3
Node 1 spec:
Available:
intel.com/foo: 4
memory: 1 GB
cpu: 8
Used:
intel.com/foo: 1
memory: 256MB
cpu: 1
Node score:
intel.com/foo = resourceScoringFunction((2+1), 4)
              = (100 - ((4-3)*100/4))
              = (100 - 25)
              = 75                        # requested + used = 75% * available
              = rawScoringFunction(75)
              = 7                         # floor(75/10)

memory        = resourceScoringFunction((256+256), 1024)
              = (100 - ((1024-512)*100/1024))
              = 50                        # requested + used = 50% * available
              = rawScoringFunction(50)
              = 5                         # floor(50/10)

cpu           = resourceScoringFunction((2+1), 8)
              = (100 - ((8-3)*100/8))
              = 37.5                      # requested + used = 37.5% * available
              = rawScoringFunction(37.5)
              = 3                         # floor(37.5/10)

NodeScore     = ((7 * 5) + (5 * 1) + (3 * 3)) / (5 + 1 + 3)
              = 5
Node 2 spec:
Available:
intel.com/foo: 8
memory: 1GB
cpu: 8
Used:
intel.com/foo: 2
memory: 512MB
cpu: 6
Node score:
intel.com/foo = resourceScoringFunction((2+2), 8)
              = (100 - ((8-4)*100/8))
              = (100 - 50)
              = 50
              = rawScoringFunction(50)
              = 5

memory        = resourceScoringFunction((256+512), 1024)
              = (100 - ((1024-768)*100/1024))
              = 75
              = rawScoringFunction(75)
              = 7

cpu           = resourceScoringFunction((2+6), 8)
              = (100 - ((8-8)*100/8))
              = 100
              = rawScoringFunction(100)
              = 10

NodeScore     = ((5 * 5) + (7 * 1) + (10 * 3)) / (5 + 1 + 3)
              = 7
What's next
Read more about the scheduling framework
Read more about scheduler configuration
11 - Pod Priority and Preemption
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a
Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make
scheduling of the pending Pod possible.
Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the
highest possible priorities, causing other Pods to be evicted/not get scheduled. An
administrator can use ResourceQuota to prevent users from creating pods at high
priorities.
To use priority and preemption:

1. Add one or more PriorityClasses.

2. Create Pods with priorityClassName set to one of the added PriorityClasses. Of course
you do not need to create the Pods directly; normally you would add
priorityClassName to the Pod template of a collection object like a Deployment, as sketched below.
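For instance, a Deployment whose Pods should use a high-priority class might look like the following sketch. The Deployment name, labels, and image are hypothetical; the priority class name must match a PriorityClass that already exists in the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      priorityClassName: high-priority    # must reference an existing PriorityClass
      containers:
      - name: api
        image: registry.k8s.io/pause:3.9  # placeholder image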
PriorityClass
A PriorityClass is a non-namespaced object that defines a mapping from a priority class name
to the integer value of the priority. The name is specified in the name field of the PriorityClass
object's metadata. The value is specified in the required value field. The higher the value, the
higher the priority. The name of a PriorityClass object must be a valid DNS subdomain name,
and it cannot be prefixed with system- .
A PriorityClass object can have any 32-bit integer value smaller than or equal to 1 billion. This
means that the range of values for a PriorityClass object is from -2147483648 to 1000000000
inclusive. Larger numbers are reserved for built-in PriorityClasses that represent critical
system Pods. A cluster admin should create one PriorityClass object for each such mapping
that they want.
PriorityClass also has two optional fields: globalDefault and description . The
globalDefault field indicates that the value of this PriorityClass should be used for Pods
without a priorityClassName . Only one PriorityClass with globalDefault set to true can
exist in the system. If there is no PriorityClass with globalDefault set, the priority of Pods
with no priorityClassName is zero.
The description field is an arbitrary string. It is meant to tell users of the cluster when they
should use this PriorityClass.
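Putting the two optional fields together, a cluster-wide default PriorityClass might look like this sketch; the name and value are hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority                  # hypothetical name
value: 1000
globalDefault: true                       # at most one PriorityClass may set this to true
description: "Default priority for Pods that do not set priorityClassName."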
Addition of a PriorityClass with globalDefault set to true does not change the
priorities of existing Pods. The value of such a PriorityClass is used only for Pods created
after the PriorityClass is added.
If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass
remain unchanged, but you cannot create more Pods that use the name of the deleted
PriorityClass.
Example PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
Non-preempting PriorityClass
FEATURE STATE: Kubernetes v1.24 [stable]
Pods with preemptionPolicy: Never will be placed in the scheduling queue ahead of lower-
priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be
scheduled will stay in the scheduling queue, until sufficient resources are free, and it can be
scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This
means that if the scheduler tries these pods and they cannot be scheduled, they will be
retried with lower frequency, allowing other pods with lower priority to be scheduled before
them.
An example use case is for data science workloads. A user may submit a job that they want to
be prioritized above other workloads, but do not wish to discard existing work by preempting
running pods. The high priority job with preemptionPolicy: Never will be scheduled ahead of
other queued pods, as soon as sufficient cluster resources "naturally" become free.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
Pod priority
After you have one or more PriorityClasses, you can create Pods that specify one of those
PriorityClass names in their specifications. The priority admission controller uses the
priorityClassName field and populates the integer value of the priority. If the priority class is
not found, the Pod is rejected.
The following YAML is an example of a Pod configuration that uses the PriorityClass created in
the preceding example. The priority admission controller checks the specification and
resolves the priority of the Pod to 1000000.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
Preemption
When Pods are created, they go to a queue and wait to be scheduled. The scheduler picks a
Pod from the queue and tries to schedule it on a Node. If no Node is found that satisfies all
the specified requirements of the Pod, preemption logic is triggered for the pending Pod. Let's
call the pending Pod P. Preemption logic tries to find a Node where removal of one or more
Pods with lower priority than P would enable P to be scheduled on that Node. If such a Node
is found, one or more lower priority Pods get evicted from the Node. After the Pods are gone,
P can be scheduled on the Node.
Please note that Pod P is not necessarily scheduled to the "nominated Node". The scheduler
always tries the "nominated Node" before iterating over any other nodes. After victim Pods
are preempted, they get their graceful termination period. If another node becomes available
while the scheduler is waiting for the victim Pods to terminate, the scheduler may use the other node
to schedule Pod P. As a result, the nominatedNodeName and nodeName fields of the Pod spec are not always
the same. Also, if the scheduler preempts Pods on Node N, but then a Pod with higher priority than
Pod P arrives, the scheduler may give Node N to the new higher-priority Pod. In such a case, the
scheduler clears the nominatedNodeName of Pod P. By doing this, the scheduler makes Pod P eligible
to preempt Pods on another Node.
Limitations of preemption

Graceful termination of preemption victims

When Pods are preempted, the victims get their graceful termination period to finish their work
and exit. As victims exit or get terminated, the scheduler can schedule Pods in the pending queue.
Therefore, there is usually a time gap between the point that the scheduler preempts victims and
the time that Pod P is scheduled. In order to minimize this gap, you can set the graceful
termination period of lower priority Pods to zero or a small number.
Note: Preemption does not necessarily remove all lower-priority Pods. If the pending Pod
can be scheduled by removing fewer than all of the lower-priority Pods, then only a portion of
the lower-priority Pods are removed. Even so, the answer to the question "if all lower-priority
Pods were removed from the Node, could the pending Pod be scheduled there?" must be yes. If the
answer is no, the Node is not considered for preemption.
If a pending Pod has inter-pod affinity to one or more of the lower-priority Pods on the Node,
the inter-Pod affinity rule cannot be satisfied in the absence of those lower-priority Pods. In
this case, the scheduler does not preempt any Pods on the Node. Instead, it looks for another
Node. The scheduler might find a suitable Node or it might not. There is no guarantee that the
pending Pod can be scheduled.
Our recommended solution for this problem is to create inter-Pod affinity only towards equal
or higher priority Pods.
Cross node preemption

Suppose the pending Pod P is being considered for Node N, and P could become feasible on N only
if a Pod on a different Node were preempted. For example, Pod P might have zone-wide anti-affinity
with a Pod Q that runs on another Node in the same zone as Node N, and there are no other cases of
anti-affinity between Pod P and other Pods in the zone. In order to schedule Pod P on Node N,
Pod Q could be preempted, but the scheduler does not perform cross-node preemption. So, Pod P will
be deemed unschedulable on Node N.

If Pod Q were removed from its Node, the Pod anti-affinity violation would be gone, and Pod P
could possibly be scheduled on Node N.
We may consider adding cross Node preemption in future versions if there is enough demand
and if we find an algorithm with reasonable performance.
Troubleshooting
Pod priority and pre-emption can have unwanted side effects. Here are some examples of
potential problems and ways to deal with them.
If you accidentally give certain Pods a high priority, these unintentionally high-priority Pods
may cause preemption in your cluster. To address the problem, you can change the priorityClassName
for those Pods to use lower priority classes, or leave that field empty. An empty
priorityClassName is resolved to zero by default.
When a Pod is preempted, there will be events recorded for the preempted Pod. Preemption
should happen only when a cluster does not have enough resources for a Pod. In such cases,
preemption happens only when the priority of the pending Pod (preemptor) is higher than the
victim Pods. Preemption must not happen when there is no pending Pod, or when the
pending Pods have equal or lower priority than the victims. If preemption happens in such
scenarios, please file an issue.
While the preemptor Pod is waiting for the victims to go away, a higher priority Pod may be
created that fits on the same Node. In this case, the scheduler will schedule the higher priority
Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the place of a Pod with
a lower priority.
The scheduler tries to find nodes that can run a pending Pod. If no node is found, the
scheduler tries to remove Pods with lower priority from an arbitrary node in order to make
room for the pending pod. If a node with low priority Pods is not feasible to run the pending
Pod, the scheduler may choose another node with higher priority Pods (compared to the Pods
on the other node) for preemption. The victims must still have lower priority than the
preemptor Pod.
When there are multiple nodes available for preemption, the scheduler tries to choose the
node with a set of Pods with lowest priority. However, if such Pods have PodDisruptionBudget
that would be violated if they are preempted then the scheduler may choose another node
with higher priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the
scheduler chooses a node with the lowest priority.
The kubelet uses Priority to determine pod order for node-pressure eviction. You can use the
QoS class to estimate the order in which pods are most likely to get evicted. The kubelet ranks
pods for eviction based on the following factors:
kubelet node-pressure eviction does not evict Pods when their usage does not exceed their
requests. If a Pod with lower priority is not exceeding its requests, it won't be evicted. Another
Pod with higher priority that exceeds its requests may be evicted.
What's next
Read about using ResourceQuotas in connection with PriorityClasses: limit Priority Class
consumption by default
Learn about Pod Disruption
Learn about API-initiated Eviction
Learn about Node-pressure Eviction
12 - Node-pressure Eviction
Node-pressure eviction is the process by which the kubelet proactively terminates pods to
reclaim resources on nodes.
The kubelet monitors resources like memory, disk space, and filesystem inodes on your
cluster's nodes. When one or more of these resources reach specific consumption levels, the
kubelet can proactively fail one or more pods on the node to reclaim resources and prevent
starvation.
During a node-pressure eviction, the kubelet sets the PodPhase for the selected pods to
Failed . This terminates the pods.
The kubelet does not respect your configured PodDisruptionBudget or the pod's
terminationGracePeriodSeconds . If you use soft eviction thresholds, the kubelet respects
your configured eviction-max-pod-grace-period . If you use hard eviction thresholds, it uses
a 0s grace period for termination.
If the pods are managed by a workload resource (such as StatefulSet or Deployment) that
replaces failed pods, the control plane or kube-controller-manager creates new pods in
place of the evicted pods.
Note: The kubelet attempts to reclaim node-level resources before it terminates end-user
pods. For example, it removes unused container images when disk resources are starved.
The kubelet uses various parameters to make eviction decisions, like the following:
Eviction signals
Eviction thresholds
Monitoring intervals
Eviction signals
Eviction signals are the current state of a particular resource at a specific point in time.
Kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction
thresholds, which are the minimum amount of the resource that should be available on the
node.
The kubelet uses the following eviction signals:

Eviction Signal      Description
memory.available     memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available     nodefs.available := node.stats.fs.available
nodefs.inodesFree    nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available    imagefs.available := node.stats.runtime.imagefs.available
imagefs.inodesFree   imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
pid.available        pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc
In this table, the Description column shows how kubelet gets the value of the signal. Each
signal supports either a percentage or a literal value. Kubelet calculates the percentage value
relative to the total capacity associated with the signal.
The value for memory.available is derived from the cgroupfs instead of tools like free -m .
This is important because free -m does not work in a container, and if users use the node
allocatable feature, out of resource decisions are made local to the end user Pod part of the
cgroup hierarchy as well as the root node. This script reproduces the same set of steps that
the kubelet performs to calculate memory.available . The kubelet excludes inactive_file (i.e. #
of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that
memory is reclaimable under pressure.
The kubelet uses the following filesystem partitions for these measurements:
1. nodefs: The node's main filesystem, used for local disk volumes, emptyDir, log storage,
and more. For example, nodefs contains /var/lib/kubelet/ .
2. imagefs : An optional filesystem that container runtimes use to store container images
and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not
support other configurations.
Eviction thresholds
You can specify custom eviction thresholds for the kubelet to use when it makes eviction
decisions.
Eviction thresholds have the form [eviction-signal][operator][quantity], where quantity is the
eviction threshold amount, such as 1Gi . The value of quantity must match the quantity
representation used by Kubernetes. You can use either literal values or percentages ( % ).

For example, if a node has 10Gi of total memory and you want to trigger eviction if the
available memory falls below 1Gi , you can define the eviction threshold as either
memory.available<10% or memory.available<1Gi (you cannot use both).
You can specify both a soft eviction threshold grace period and a maximum allowed pod
termination grace period for kubelet to use during evictions. If you specify a maximum
allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the
two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills
evicted pods immediately without graceful termination.
You can configure soft eviction thresholds, their required grace periods, and the maximum
allowed pod termination grace period using kubelet flags or the kubelet configuration file;
a configuration-file sketch follows.
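A minimal sketch of soft eviction settings in a kubelet configuration file, assuming the kubelet.config.k8s.io/v1beta1 field names as the author understands them; the threshold and grace-period values are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"        # soft threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"        # how long the threshold must hold before eviction
evictionMaxPodGracePeriod: 60      # cap on pod termination grace period during soft eviction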
You can use the eviction-hard flag to configure a set of hard eviction thresholds, such as
memory.available<1Gi . The kubelet has the following default hard eviction thresholds:
memory.available<100Mi
nodefs.available<10%
imagefs.available<15%
These default hard eviction thresholds apply only if none of them is changed. If you change the
value of any threshold, the default values of the other thresholds are not inherited; they are
set to zero. To provide custom values, you must provide all of the thresholds.
Node conditions
The kubelet reports node conditions to reflect that the node is under pressure because a hard
or soft eviction threshold has been met, independent of configured grace periods.
Node Condition    Eviction Signal                                                                 Description
MemoryPressure    memory.available                                                                Available memory on the node has satisfied an eviction threshold
DiskPressure      nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree   Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold
PIDPressure       pid.available                                                                   Available process identifiers on the (Linux) node have fallen below an eviction threshold
The kubelet updates the node conditions based on the configured --node-status-update-
frequency , which defaults to 10s .
When a DiskPressure node condition is reported, the kubelet reclaims node-level resources
based on the filesystems on the node.
With imagefs
If the node has a dedicated imagefs filesystem for container runtimes to use, the kubelet
does the following:
If the nodefs filesystem meets the eviction thresholds, the kubelet garbage collects
dead pods and containers.
If the imagefs filesystem meets the eviction thresholds, the kubelet deletes all unused
images.
Without imagefs

If the node only has a nodefs filesystem that meets eviction thresholds, the kubelet frees up
disk space in the following order:

1. Garbage collect dead pods and containers.
2. Delete unused images.

Pod selection for kubelet eviction

The kubelet uses the following parameters to determine the pod eviction order:

1. Whether the pod's resource usage exceeds requests
2. Pod Priority
3. The pod's resource usage relative to requests

As a result, the kubelet ranks and evicts pods in the following order:
1. BestEffort or Burstable pods where the usage exceeds requests. These pods are
evicted based on their Priority and then by how much their usage level exceeds the
request.
2. Guaranteed pods and Burstable pods where the usage is less than requests are
evicted last, based on their Priority.
Note: The kubelet does not use the pod's QoS class to determine the eviction order. You
can use the QoS class to estimate the most likely pod eviction order when reclaiming
resources like memory. QoS does not apply to EphemeralStorage requests, so the above
scenario will not apply if the node is, for example, under DiskPressure.
Guaranteed pods are guaranteed only when requests and limits are specified for all the
containers and they are equal. These pods will never be evicted because of another pod's
resource consumption. If a system daemon (such as kubelet and journald ) is consuming
more resources than were reserved via system-reserved or kube-reserved allocations, and
the node only has Guaranteed or Burstable pods using less resources than requests left on
it, then the kubelet must choose to evict one of these pods to preserve node stability and to
limit the impact of resource starvation on other pods. In this case, it will choose to evict pods
of lowest Priority first.
When the kubelet evicts pods in response to inode or PID starvation, it uses the Priority to
determine the eviction order, because inodes and PIDs have no requests.
The kubelet sorts pods differently based on whether the node has a dedicated imagefs
filesystem:
With imagefs
If nodefs is triggering evictions, the kubelet sorts pods based on nodefs usage ( local
volumes + logs of all containers ).
If imagefs is triggering evictions, the kubelet sorts pods based on the writable layer usage of
all containers.
Without imagefs
If nodefs is triggering evictions, the kubelet sorts pods based on their total disk usage ( local
volumes + logs & writable layer of all containers )
You can use the --eviction-minimum-reclaim flag or a kubelet config file to configure a
minimum reclaim amount for each resource. When the kubelet notices that a resource is
starved, it continues to reclaim that resource until it reclaims the quantity you specify.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
In this example, if the nodefs.available signal meets the eviction threshold, the kubelet
reclaims the resource until the signal reaches the threshold of 1Gi , and then continues to
reclaim the minimum amount of 500Mi , until the signal reaches 1.5Gi .
Similarly, the kubelet reclaims the imagefs resource until the imagefs.available signal
reaches 102Gi .
The kubelet sets an oom_score_adj value for each container based on the QoS class of the pod.

Quality of Service   oom_score_adj
Guaranteed           -997
Burstable            min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
BestEffort           1000
Note: The kubelet also sets an oom_score_adj value of -997 for containers in Pods that
have system-node-critical Priority.
If the kubelet can't reclaim memory before a node experiences OOM, the oom_killer
calculates an oom_score based on the percentage of memory it's using on the node, and then
adds the oom_score_adj to get an effective oom_score for each container. It then kills the
container with the highest score.
This means that containers in low QoS pods that consume a large amount of memory relative
to their scheduling requests are killed first.
Unlike pod eviction, if a container is OOM killed, the kubelet can restart it based on its
RestartPolicy .
Best practices
The following sections describe best practices for eviction configuration.
Consider the following scenario: a node has 10Gi of memory, and the operator wants to evict Pods
at 95% memory utilization to reduce the incidence of system OOMs. The kubelet can be launched
with the following settings:
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
In this configuration, the --system-reserved flag reserves 1.5Gi of memory for the system,
which is 10% of the total memory + the eviction threshold amount .
The node can reach the eviction threshold if a pod is using more than its request, or if the
system is using more than 1Gi of memory, which makes the memory.available signal fall
below 500Mi and triggers the threshold.
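Expressed as a kubelet configuration file instead of flags, the same reservation might look like this sketch; the values simply mirror the flags above:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"      # hard eviction threshold
systemReserved:
  memory: "1.5Gi"                # memory reserved for system daemons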
DaemonSet
Pod Priority is a major factor in making eviction decisions. If you do not want the kubelet to
evict pods that belong to a DaemonSet , give those pods a high enough priorityClass in the
pod spec. You can also use a lower priorityClass or the default to only allow DaemonSet
pods to run when there are enough resources.
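A hedged sketch of that approach: define a PriorityClass and reference it from the DaemonSet's Pod template. All names and the image are hypothetical placeholders:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical        # hypothetical name
value: 100000
globalDefault: false
description: "High priority for node-level agents."
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      priorityClassName: daemonset-critical
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9   # placeholder image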
Known issues
The following sections describe known issues related to out of resource handling.
kubelet may not observe memory pressure right away

By default, the kubelet polls for memory usage at a regular interval; if memory consumption
rises sharply within that window, the kubelet may not notice the pressure in time and the OOM
killer may still be invoked. You can use the --kernel-memcg-notification flag to enable the
memcg notification API on the kubelet to get notified immediately when a threshold is crossed.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a
viable workaround for this issue is to use the --kube-reserved and --system-reserved flags
to allocate memory for the system.
active_file memory is not considered as available memory

The kubelet treats active_file memory (file-backed memory on the active LRU list) as not
reclaimable, so workloads that make intensive use of block-backed local storage can cause the
kubelet to observe high memory use and trigger memory-pressure eviction. You can work around
that behavior by setting the memory limit and memory request to the same value for containers
likely to perform intensive I/O activity. You will need to estimate or measure an optimal
memory limit value for that container.
What's next
Learn about API-initiated Eviction
Learn about Pod Priority and Preemption
Learn about PodDisruptionBudgets
Learn about Quality of Service (QoS)
Check out the Eviction API
13 - API-initiated Eviction
API-initiated eviction is the process by which you use the Eviction API to create an Eviction
object that triggers graceful pod termination.
You can request eviction by calling the Eviction API directly, or programmatically using a client
of the API server, like the kubectl drain command. This creates an Eviction object, which
causes the API server to terminate the Pod.
Using the API to create an Eviction object for a Pod is like performing a policy-controlled
DELETE operation on the Pod.
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
Alternatively, you can attempt an eviction operation by accessing the API directly, for example
by using curl or wget to POST an Eviction object like the one above to the Pod's eviction
subresource. The API server responds in one of the following ways:

200 OK : the eviction is allowed, the Eviction subresource is created, and the Pod is
deleted, similar to sending a DELETE request to the Pod URL.

429 Too Many Requests : the eviction is not currently allowed because of the configured
PodDisruptionBudget. You may be able to attempt the eviction again later. You might
also see this response because of API rate limiting.

500 Internal Server Error : the eviction is not allowed because there is a
misconfiguration, like if multiple PodDisruptionBudgets reference the same Pod.
If the Pod you want to evict isn't part of a workload that has a PodDisruptionBudget, the API
server always returns 200 OK and allows the eviction.
If the API server allows the eviction, the Pod is deleted as follows:
1. The Pod resource in the API server is updated with a deletion timestamp, after which
the API server considers the Pod resource to be terminated. The Pod resource is also
marked with the configured grace period.
2. The kubelet on the node where the local Pod is running notices that the Pod resource is
marked for termination and starts to gracefully shut down the local Pod.
3. While the kubelet is shutting the Pod down, the control plane removes the Pod from
Endpoint and EndpointSlice objects. As a result, controllers no longer consider the Pod
as a valid object.
4. After the grace period for the Pod expires, the kubelet forcefully terminates the local
Pod.
5. The kubelet tells the API server to remove the Pod resource.
6. The API server deletes the Pod resource.
In some cases, applications may enter a broken state where the Eviction API only returns 429 or
500 responses, for example if the replacement Pods created by a workload controller never become
Ready. If you notice evictions that appear stuck, you can try one of the following:

Abort or pause the automated operation causing the issue. Investigate the stuck
application before you restart the operation.

Wait a while, then directly delete the Pod from your cluster control plane instead of
using the Eviction API.
What's next
Learn how to protect your applications with a Pod Disruption Budget.
Learn about Node-pressure Eviction.
Learn about Pod Priority and Preemption.