Scheduling, Preemption and Eviction - Kubernetes
1: Kubernetes Scheduler
2: Assigning Pods to Nodes
3: Pod Overhead
4: Pod Scheduling Readiness
5: Pod Topology Spread Constraints
6: Taints and Tolerations
7: Scheduling Framework
8: Dynamic Resource Allocation
9: Scheduler Performance Tuning
10: Resource Bin Packing
11: Pod Priority and Preemption
12: Node-pressure Eviction
13: API-initiated Eviction
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the
kubelet can run them. Preemption is the process of terminating Pods with lower Priority so
that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating
one or more Pods on Nodes.
Scheduling
Kubernetes Scheduler
Assigning Pods to Nodes
Pod Overhead
Pod Topology Spread Constraints
Taints and Tolerations
Scheduling Framework
Dynamic Resource Allocation
Scheduler Performance Tuning
Resource Bin Packing for Extended Resources
Pod Scheduling Readiness
Pod Disruption
Pod disruption is the process by which Pods on Nodes are terminated either voluntarily or
involuntarily.
1 - Kubernetes Scheduler
In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that
Kubelet can run them.
Scheduling overview
A scheduler watches for newly created Pods that have no Node assigned. For every Pod that
the scheduler discovers, the scheduler becomes responsible for finding the best Node for that
Pod to run on. The scheduler reaches this placement decision taking into account the
scheduling principles described below.
If you want to understand why Pods are placed onto a particular Node, or if you're planning to
implement a custom scheduler yourself, this page will help you learn about scheduling.
kube-scheduler
kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane.
kube-scheduler is designed so that, if you want and need to, you can write your own
scheduling component and use that instead.
Kube-scheduler selects an optimal node to run newly created or not yet scheduled
(unscheduled) pods. Since containers in pods - and pods themselves - can have different
requirements, the scheduler filters out any nodes that don't meet a Pod's specific scheduling
needs. Alternatively, the API lets you specify a node for a Pod when you create it, but this is
unusual and is only done in special cases.
In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes.
If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to
place it.
The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the
feasible Nodes and picks a Node with the highest score among the feasible ones to run the
Pod. The scheduler then notifies the API server about this decision in a process called binding.
Factors that need to be taken into account for scheduling decisions include individual and
collective resource requirements, hardware / software / policy constraints, affinity and anti-
affinity specifications, data locality, inter-workload interference, and so on.
kube-scheduler selects a node for the Pod in a 2-step operation:
1. Filtering
2. Scoring
The filtering step finds the set of Nodes where it's feasible to schedule the Pod. For example,
the PodFitsResources filter checks whether a candidate Node has enough available resource
to meet a Pod's specific resource requests. After this step, the node list contains any suitable
Nodes; often, there will be more than one. If the list is empty, that Pod isn't (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod
placement. The scheduler assigns a score to each Node that survived filtering, basing this
score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more
than one node with equal scores, kube-scheduler selects one of these at random.
There are two supported ways to configure the filtering and scoring behavior of the
scheduler:
1. Scheduling Policies allow you to configure Predicates for filtering and Priorities for
scoring.
2. Scheduling Profiles allow you to configure Plugins that implement different scheduling
stages, including: QueueSort , Filter , Score , Bind , Reserve , Permit , and others.
You can also configure the kube-scheduler to run different profiles.
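For instance, a minimal sketch of such a configuration might look like the following (the second profile name and the choice of disabled plugin are illustrative assumptions, not defaults):
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
  # The default profile, left unchanged.
  - schedulerName: default-scheduler
  # A second, hypothetical profile that Pods can opt into by setting
  # .spec.schedulerName: no-spreading-scheduler
  - schedulerName: no-spreading-scheduler
    plugins:
      score:
        disabled:
          # Example: turn off topology spread scoring in this profile only.
          - name: PodTopologySpread
A Pod selects a profile by setting .spec.schedulerName to that profile's schedulerName .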
What's next
Read about scheduler performance tuning
Read about Pod topology spread constraints
Read the reference documentation for kube-scheduler
Read the kube-scheduler config (v1beta3) reference
Learn about configuring multiple schedulers
Learn about topology management policies
Learn about Pod Overhead
Learn about scheduling of Pods that use volumes in:
Volume Topology Support
Storage Capacity Tracking
Node-specific Volume Limits
2 - Assigning Pods to Nodes
You can use any of the following methods to choose where Kubernetes schedules specific Pods:
nodeSelector field matching against node labels
Affinity and anti-affinity
nodeName field
Pod topology spread constraints
Node labels
Like many other Kubernetes objects, nodes have labels. You can attach labels manually.
Kubernetes also populates a standard set of labels on all nodes in a cluster. See Well-Known
Labels, Annotations and Taints for a list of common node labels.
Note: The value of these labels is cloud provider specific and is not guaranteed to be
reliable. For example, the value of kubernetes.io/hostname may be the same as the node
name in some environments and a different value in other environments.
Node isolation/restriction
Adding labels to nodes allows you to target Pods for scheduling on specific nodes or groups of
nodes. You can use this functionality to ensure that specific Pods only run on nodes with
certain isolation, security, or regulatory properties.
If you use labels for node isolation, choose label keys that the kubelet cannot modify. This
prevents a compromised node from setting those labels on itself so that the scheduler
schedules workloads onto the compromised node.
The NodeRestriction admission plugin prevents the kubelet from setting or modifying labels
with a node-restriction.kubernetes.io/ prefix.
1. Ensure you are using the Node authorizer and have enabled the NodeRestriction
admission plugin.
2. Add labels with the node-restriction.kubernetes.io/ prefix to your nodes, and use
those labels in your node selectors. For example, example.com.node-
restriction.kubernetes.io/fips=true or example.com.node-
restriction.kubernetes.io/pci-dss=true .
nodeSelector
nodeSelectoris the simplest recommended form of node selection constraint. You can add
the nodeSelector field to your Pod specification and specify the node labels you want the
target node to have. Kubernetes only schedules the Pod onto nodes that have each of the
labels you specify.
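For instance, assuming you have labelled some of your nodes with disktype=ssd (a label chosen for this sketch, not one Kubernetes sets automatically), a Pod that should only run on those nodes could look like:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  # Only nodes carrying every label listed here are considered.
  nodeSelector:
    disktype: ssd
  containers:
  - name: nginx
    image: nginx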
Affinity and anti-affinity
nodeSelector is the simplest way to constrain Pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define. The affinity feature consists of two types of affinity:
Node affinity functions like the nodeSelector field but is more expressive and allows you to specify soft rules.
Inter-pod affinity/anti-affinity allows you to constrain Pods against labels on other Pods.
Node affinity
Node affinity is conceptually similar to nodeSelector , allowing you to constrain which nodes your Pod can be scheduled on based on node labels. There are two types of node affinity:
requiredDuringSchedulingIgnoredDuringExecution : The scheduler can't schedule the Pod unless the rule is met. This functions like nodeSelector , but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution : The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.
Note: In the preceding types, IgnoredDuringExecution means that if the node labels change after Kubernetes schedules the Pod, the Pod continues to run.
You can specify node affinities using the .spec.affinity.nodeAffinity field in your Pod
spec.
pods/pod-with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- antarctica-east1
- antarctica-west1
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: registry.k8s.io/pause:2.0
The node must have a label with the key topology.kubernetes.io/zone and the value of
that label must be either antarctica-east1 or antarctica-west1 .
The node preferably has a label with the key another-node-label-key and the value
another-node-label-value .
You can use the operator field to specify a logical operator for Kubernetes to use when
interpreting the rules. You can use In , NotIn , Exists , DoesNotExist , Gt and Lt .
NotIn and DoesNotExist allow you to define node anti-affinity behavior. Alternatively, you can use node taints to repel Pods from specific nodes.
Note:
If you specify both nodeSelector and nodeAffinity , both must be satisfied for the Pod
to be scheduled onto a node.
See Assign Pods to Nodes using Node Affinity for more information.
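As a rough sketch of how the two fields interact (the disktype label and the zone value are assumptions for this example), a Pod that sets both is only schedulable onto nodes that satisfy the nodeSelector and every required node affinity term:
apiVersion: v1
kind: Pod
metadata:
  name: with-selector-and-affinity
spec:
  nodeSelector:
    disktype: ssd                  # must match, and ...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In           # ... this term must match as well
            values:
            - antarctica-east1
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0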
Node affinity weight
You can specify a weight between 1 and 100 for each instance of the preferredDuringSchedulingIgnoredDuringExecution affinity type. When the scheduler finds nodes that meet all the other scheduling requirements of the Pod, the scheduler iterates through every preferred rule that the node satisfies and adds the value of the weight for that expression to a sum.
The final sum is added to the score of other priority functions for the node. Nodes with the
highest total score are prioritized when the scheduler makes a scheduling decision for the
Pod.
pods/pod-with-affinity-anti-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-affinity-anti-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: label-1
operator: In
values:
- key-1
- weight: 50
preference:
matchExpressions:
- key: label-2
operator: In
values:
- key-2
containers:
- name: with-node-affinity
image: registry.k8s.io/pause:2.0
Note: If you want Kubernetes to successfully schedule the Pods in this example, you must
have existing nodes with the kubernetes.io/os=linux label.
When configuring multiple scheduling profiles, you can associate a profile with a node affinity,
which is useful if a profile only applies to a specific set of nodes. To do so, add an
addedAffinity to the args field of the NodeAffinity plugin in the scheduler configuration.
For example:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: foo-scheduler
pluginConfig:
- name: NodeAffinity
args:
addedAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: scheduler-profile
operator: In
values:
- foo
Since the addedAffinity is not visible to end users, its behavior might be unexpected to
them. Use node labels that have a clear correlation to the scheduler profile name.
Note: The DaemonSet controller, which creates Pods for DaemonSets, does not support
scheduling profiles. When the DaemonSet controller creates Pods, the default Kubernetes
scheduler places those Pods and honors any nodeAffinity rules in the DaemonSet
controller.
Inter-pod affinity and anti-affinity
Inter-pod affinity and anti-affinity allow you to constrain which nodes your Pods can be scheduled on based on the labels of Pods already running on that node, instead of the node labels.
Inter-pod affinity and anti-affinity rules take the form "this Pod should (or, in the case of anti-
affinity, should not) run in an X if that X is already running one or more Pods that meet rule Y",
where X is a topology domain like node, rack, cloud provider zone or region, or similar and Y is
the rule Kubernetes tries to satisfy.
You express these rules (Y) as label selectors with an optional associated list of namespaces.
Pods are namespaced objects in Kubernetes, so Pod labels also implicitly have namespaces.
Any label selectors for Pod labels should specify the namespaces in which Kubernetes should
look for those labels.
You express the topology domain (X) using a topologyKey , which is the key for the node label
that the system uses to denote the domain. For examples, see Well-Known Labels,
Annotations and Taints.
Note: Inter-pod affinity and anti-affinity require substantial amount of processing which
can slow down scheduling in large clusters significantly. We do not recommend using
them in clusters larger than several hundred nodes.
Note: Pod anti-affinity requires nodes to be consistently labelled, in other words, every
node in the cluster must have an appropriate label matching topologyKey. If some or all
nodes are missing the specified topologyKey label, it can lead to unintended behavior.
Similar to node affinity, there are two types of Pod affinity and anti-affinity, as follows:
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
To use inter-pod affinity, use the affinity.podAffinity field in the Pod spec. For inter-pod
anti-affinity, use the affinity.podAntiAffinity field in the Pod spec.
pods/pod-with-pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
name: with-pod-affinity
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: registry.k8s.io/pause:2.0
This example defines one Pod affinity rule and one Pod anti-affinity rule. The Pod affinity rule
uses the "hard" requiredDuringSchedulingIgnoredDuringExecution , while the anti-affinity
rule uses the "soft" preferredDuringSchedulingIgnoredDuringExecution .
The affinity rule says that the scheduler can only schedule a Pod onto a node if the node is in
the same zone as one or more existing Pods with the label security=S1 . More precisely, the
scheduler must place the Pod on a node that has the topology.kubernetes.io/zone=V label,
as long as there is at least one node in that zone that currently has one or more Pods with the
Pod label security=S1 .
The anti-affinity rule says that the scheduler should try to avoid scheduling the Pod onto a
node that is in the same zone as one or more Pods with the label security=S2 . More
precisely, the scheduler should try to avoid placing the Pod on a node that has the
topology.kubernetes.io/zone=R label if there are other nodes in the same zone currently running Pods with the security=S2 Pod label.
To get yourself more familiar with the examples of Pod affinity and anti-affinity, refer to the
design proposal.
You can use the In , NotIn , Exists and DoesNotExist values in the operator field for Pod
affinity and anti-affinity.
In principle, the topologyKey can be any allowed label key with the following exceptions for
performance and security reasons:
For Pod affinity and anti-affinity, an empty topologyKey field is not allowed in both
requiredDuringSchedulingIgnoredDuringExecution and
preferredDuringSchedulingIgnoredDuringExecution .
Namespace selector
FEATURE STATE: Kubernetes v1.24 [stable]
You can also select matching namespaces using namespaceSelector , which is a label query
over the set of namespaces. The affinity term is applied to namespaces selected by both
namespaceSelector and the namespaces field. Note that an empty namespaceSelector ({})
matches all namespaces, while a null or empty namespaces list and null namespaceSelector
matches the namespace of the Pod where the rule is defined.
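As a sketch (the team label on namespaces and the app=web Pod label are assumptions for this example), an affinity term that sets namespaceSelector counts matching Pods from every namespace labelled team=frontend ; this fragment belongs under a Pod's .spec :
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      # Match Pods in any namespace carrying this label, not only
      # the namespace of this Pod.
      namespaceSelector:
        matchLabels:
          team: frontend
      topologyKey: topology.kubernetes.io/zone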
More practical use-cases
Inter-pod affinity and anti-affinity can be even more useful when they are used with higher level collections such as ReplicaSets, StatefulSets, Deployments, and so on.
For example: imagine a three-node cluster. You use the cluster to run a web application and
also an in-memory cache (such as Redis). For this example, also assume that latency between
the web application and the memory cache should be as low as is practical. You could use
inter-pod affinity and anti-affinity to co-locate the web servers with the cache as much as
possible.
In the following example Deployment for the Redis cache, the replicas get the label
app=store . The podAntiAffinity rule tells the scheduler to avoid placing multiple replicas
with the app=store label on a single node. This creates each cache in a separate node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-cache
spec:
selector:
matchLabels:
app: store
replicas: 3
template:
metadata:
labels:
app: store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: redis-server
image: redis:3.2-alpine
The following example Deployment for the web servers creates replicas with the label
app=web-store . The Pod affinity rule tells the scheduler to place each replica on a node that
has a Pod with the label app=store . The Pod anti-affinity rule tells the scheduler never to
place multiple app=web-store servers on a single node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 3
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: web-app
image: nginx:1.16-alpine
Creating the two preceding Deployments results in the following cluster layout, where each
web server is co-located with a cache, on three separate nodes.
The overall effect is that each cache instance is likely to be accessed by a single client that is running on the same node. This approach aims to minimize both skew (imbalanced load) and
latency.
You might have other reasons to use Pod anti-affinity. See the ZooKeeper tutorial for an
example of a StatefulSet configured with anti-affinity for high availability, using the same
technique as this example.
nodeName
nodeName is a more direct form of node selection than affinity or nodeSelector . nodeName is
a field in the Pod spec. If the nodeName field is not empty, the scheduler ignores the Pod and
the kubelet on the named node tries to place the Pod on that node. Using nodeName
overrules using nodeSelector or affinity and anti-affinity rules.
If the named node does not exist, the Pod will not run, and in some cases may be
automatically deleted.
If the named node does not have the resources to accommodate the Pod, the Pod will
fail and its reason will indicate why, for example OutOfmemory or OutOfcpu.
Node names in cloud environments are not always predictable or stable.
Note: nodeName is intended for use by custom schedulers or advanced use cases where
you need to bypass any configured schedulers. Bypassing the schedulers might lead to
failed Pods if the assigned Nodes get oversubscribed. You can use node affinity or the nodeSelector field to assign a Pod to a specific Node without bypassing the schedulers.
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
containers:
- name: nginx
image: nginx
nodeName: kube-01
Pod topology spread constraints
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. You might do this to improve performance, expected availability, or overall utilization.
Read Pod topology spread constraints to learn more about how these work.
Operators
The following are all the logical operators that you can use in the operator field for
nodeAffinity and podAffinity mentioned above.
Operator      Behavior
In            The label value is present in the supplied set of strings
NotIn         The label value is not contained in the supplied set of strings
Exists        A label with this key exists on the object
DoesNotExist  No label with this key exists on the object
Gt            The supplied value will be parsed as an integer, and that integer is less than the integer that results from parsing the value of a label named by this selector
Lt            The supplied value will be parsed as an integer, and that integer is greater than the integer that results from parsing the value of a label named by this selector
Note: Gt and Lt operators will not work with non-integer values. If the given value doesn't
parse as an integer, the pod will fail to get scheduled. Also, Gt and Lt are not available for
podAffinity.
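As a small illustration of Gt (the node label key example.com/gpu-count is hypothetical), the following required node affinity term only matches nodes whose label value parses to an integer greater than 4:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.com/gpu-count
          operator: Gt
          values:
          - "4"    # the node's label value must be an integer greater than 4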
What's next
Read more about taints and tolerations .
Read the design docs for node affinity and for inter-pod affinity/anti-affinity.
Learn about how the topology manager takes part in node-level resource allocation
decisions.
Learn how to use nodeSelector.
Learn how to use affinity and anti-affinity.
3 - Pod Overhead
FEATURE STATE: Kubernetes v1.24 [stable]
When you run a Pod on a Node, the Pod itself takes an amount of system resources. These
resources are additional to the resources needed to run the container(s) inside the Pod. In
Kubernetes, Pod Overhead is a way to account for the resources consumed by the Pod
infrastructure on top of the container requests & limits.
In Kubernetes, the Pod's overhead is set at admission time according to the overhead
associated with the Pod's RuntimeClass.
A pod's overhead is considered in addition to the sum of container resource requests when
scheduling a Pod. Similarly, the kubelet will include the Pod overhead when sizing the Pod
cgroup, and when carrying out Pod eviction ranking.
Usage example
To work with Pod overhead, you need a RuntimeClass that defines the overhead field. As an
example, you could use the following RuntimeClass definition with a virtualization container
runtime that uses around 120MiB per Pod for the virtual machine and the guest OS:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
name: kata-fc
handler: kata-fc
overhead:
podFixed:
memory: "120Mi"
cpu: "250m"
Workloads that specify the kata-fc RuntimeClass handler will take the memory and cpu overheads into account for resource quota calculations, node scheduling, and Pod cgroup sizing.
Consider running the following example workload, test-pod :
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
runtimeClassName: kata-fc
containers:
- name: busybox-ctr
image: busybox:1.28
stdin: true
tty: true
resources:
limits:
cpu: 500m
memory: 100Mi
- name: nginx-ctr
image: nginx
resources:
limits:
cpu: 1500m
memory: 100Mi
At admission time the RuntimeClass admission controller updates the workload's PodSpec to
include the overhead as described in the RuntimeClass. If the PodSpec already has this field
defined, the Pod will be rejected. In the given example, since only the RuntimeClass name is
specified, the admission controller mutates the Pod to include an overhead .
After the RuntimeClass admission controller has made modifications, you can check the
updated Pod overhead value:
map[cpu:250m memory:120Mi]
If a ResourceQuota is defined, the sum of container requests as well as the overhead field
are counted.
When the kube-scheduler is deciding which node should run a new Pod, the scheduler
considers that Pod's overhead as well as the sum of container requests for that Pod. For this
example, the scheduler adds the requests and the overhead, then looks for a node that has
2.25 CPU and 320 MiB of memory available.
Once a Pod is scheduled to a node, the kubelet on that node creates a new cgroup for the Pod. It is within this cgroup that the underlying container runtime will create containers.
If the resource has a limit defined for each container (Guaranteed QoS or Burstable QoS with
limits defined), the kubelet will set an upper limit for the pod cgroup associated with that
resource (cpu.cfs_quota_us for CPU and memory.limit_in_bytes for memory). This upper limit is
based on the sum of the container limits plus the overhead defined in the PodSpec.
For CPU, if the Pod is Guaranteed or Burstable QoS, the kubelet will set cpu.shares based on
the sum of container requests plus the overhead defined in the PodSpec.
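For the kata-fc example above, the arithmetic works out roughly as follows (a back-of-the-envelope sketch; the cgroup file names shown assume cgroups v1):
# CPU:    500m + 1500m (container limits) + 250m (overhead) = 2250m
#         cpu.cfs_quota_us = 2.25 * 100000 = 225000   (with cpu.cfs_period_us = 100000)
# Memory: 100Mi + 100Mi (container limits) + 120Mi (overhead) = 320Mi
#         memory.limit_in_bytes = 320 * 1024 * 1024 = 335544320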
Looking at our example, the total container requests for the workload are 2000m CPU and 200MiB of memory. Checking the node where test-pod is scheduled (for example, with kubectl describe node ) shows requests for 2250m CPU and 320MiB of memory; the requests include the Pod overhead:
Namespace   Name       CPU Requests   CPU Limits    Memory Requests   Memory Limits
---------   ----       ------------   ----------    ---------------   -------------
default     test-pod   2250m (56%)    2250m (56%)   320Mi (1%)        320Mi (1%)
From this, you can determine the cgroup path for the Pod on that node (for example, by inspecting the Pod sandbox with crictl ). The resulting cgroup path includes the Pod's pause container; the Pod level cgroup is one directory above:
"cgroupsPath": "/kubepods/podd7f4b509-cf94-4951-9417-d1087c92a5b2/7ccf55aee35dd
Checking the memory limit on the Pod level cgroup confirms a value of 335544320 bytes, which is the expected 320 MiB (the sum of the container limits plus the overhead).
Observability
Some kube_pod_overhead_* metrics are available in kube-state-metrics to help identify when
Pod overhead is being utilized and to help observe stability of workloads running with a
defined overhead.
What's next
Learn more about RuntimeClass
Read the PodOverhead Design enhancement proposal for extra context
4 - Pod Scheduling Readiness
Pods were considered ready for scheduling once created. Kubernetes scheduler does its due
diligence to find nodes to place all pending Pods. However, in a real-world case, some Pods
may stay in a "miss-essential-resources" state for a long period. These Pods actually churn the
scheduler (and downstream integrators like Cluster AutoScaler) in an unnecessary manner.
(Figure: a Pod is created, waits in a scheduling-gated state while its scheduling gates remain, and moves on to scheduling and running once the gates are removed.)
Usage example
To mark a Pod not-ready for scheduling, you can create it with one or more scheduling gates
like this:
pods/pod-with-scheduling-gates.yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
schedulingGates:
- name: example.com/foo
- name: example.com/bar
containers:
- name: pause
image: registry.k8s.io/pause:3.6
After the Pod's creation, its status is reported as SchedulingGated , and you can check its schedulingGates field:
[{"name":"example.com/foo"},{"name":"example.com/bar"}]
To inform scheduler this Pod is ready for scheduling, you can remove its schedulingGates
entirely by re-applying a modified manifest:
pods/pod-without-scheduling-gates.yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: pause
image: registry.k8s.io/pause:3.6
If you check the schedulingGates field again, the output is expected to be empty. You can then check the Pod's latest status:
Given that test-pod doesn't request any CPU/memory resources, it's expected that this Pod's state transitions from the previous SchedulingGated to Running .
Observability
The metric scheduler_pending_pods comes with a new label "gated" to distinguish whether
a Pod has been tried scheduling but claimed as unschedulable, or explicitly marked as not
ready for scheduling. You can use scheduler_pending_pods{queue="gated"} to check the
metric result.
Mutable Pod scheduling directives
You can mutate scheduling directives of Pods while they have scheduling gates, with certain constraints. At a high level, you can only tighten the scheduling directives of a Pod. In other words, the updated directives would cause the Pods to only be able to be scheduled on a subset of the nodes that they would previously match. More concretely, the rules for updating a Pod's scheduling directives are as follows:
1. For .spec.nodeSelector , only additions are allowed. If absent, it will be allowed to be set.
2. For spec.affinity.nodeAffinity , if nil, then setting anything is allowed.
3. If NodeSelectorTerms was empty, it will be allowed to be set. If not empty, then only
additions of NodeSelectorRequirements to matchExpressions or fieldExpressions are
allowed, and no changes to existing matchExpressions and fieldExpressions will be
allowed. This is because the terms in
.requiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms , are ORed
while the expressions in nodeSelectorTerms[].matchExpressions and
nodeSelectorTerms[].fieldExpressions are ANDed.
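As an illustrative sketch of a permitted update (the zone value is an assumption for this example), a gated Pod that was created without a node selector can have one added, because that only narrows the set of nodes it can match:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  schedulingGates:
  - name: example.com/foo
  - name: example.com/bar
  # Added while the Pod is still gated; this tightens the scheduling
  # directives, so the update is accepted.
  nodeSelector:
    topology.kubernetes.io/zone: antarctica-east1
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.6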
What's next
Read the PodSchedulingReadiness KEP for more details
5 - Pod Topology Spread Constraints
You can use topology spread constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
You can set cluster-level constraints as a default, or configure topology spread constraints for individual workloads.
Motivation
Imagine that you have a cluster of up to twenty nodes, and you want to run a workload that
automatically scales how many replicas it uses. There could be as few as two Pods or as many
as fifteen. When there are only two Pods, you'd prefer not to have both of those Pods run on
the same node: you would run the risk that a single node failure takes your workload offline.
In addition to this basic usage, there are some advanced usage examples that enable your
workloads to benefit on high availability and cluster utilization.
As you scale up and run more Pods, a different concern becomes important. Imagine that you
have three nodes running five Pods each. The nodes have enough capacity to run that many
replicas; however, the clients that interact with this workload are split across three different
datacenters (or infrastructure zones). Now you have less concern about a single node failure,
but you notice that latency is higher than you'd like, and you are paying for network costs
associated with sending network traffic between the different zones.
You decide that under normal operation you'd prefer to have a similar number of replicas
scheduled into each infrastructure zone, and you'd like the cluster to self-heal in the case that
there is a problem.
Pod topology spread constraints offer you a declarative way to configure that.
topologySpreadConstraints field
The Pod API includes a field, spec.topologySpreadConstraints . The usage of this field looks
like the following:
---
apiVersion: v1
kind: Pod
metadata:
name: example-pod
spec:
# Configure a topology spread constraint
topologySpreadConstraints:
- maxSkew: <integer>
minDomains: <integer> # optional; beta since v1.25
topologyKey: <string>
whenUnsatisfiable: <string>
labelSelector: <object>
matchLabelKeys: <list> # optional; beta since v1.27
nodeAffinityPolicy: [Honor|Ignore] # optional; beta since v1.26
nodeTaintsPolicy: [Honor|Ignore] # optional; beta since v1.26
### other Pod fields go here
You can read more about this field by running kubectl explain
Pod.spec.topologySpreadConstraints or refer to scheduling section of the API reference for
Pod.
maxSkew describes the degree to which Pods may be unevenly distributed. You must specify this field and the number must be greater than zero. Its semantics differ according to the value of whenUnsatisfiable :
if you select whenUnsatisfiable: DoNotSchedule , then maxSkew defines the maximum permitted difference between the number of matching Pods in the target topology and the global minimum (the minimum number of matching Pods in an eligible domain, or zero if the number of eligible domains is less than minDomains ).
if you select whenUnsatisfiable: ScheduleAnyway , the scheduler gives higher precedence to topologies that would help reduce the skew.
minDomains indicates a minimum number of eligible domains. This field is optional.
Note: The minDomains field is a beta field and disabled by default in 1.25. You can
enable it by enabling the MinDomainsInPodTopologySpread feature gate.
The value of minDomains must be greater than 0, when specified. You can only
specify minDomains in conjunction with whenUnsatisfiable: DoNotSchedule .
When the number of eligible domains with matching topology keys is less than
minDomains , Pod topology spread treats global minimum as 0, and then the
calculation of skew is performed. The global minimum is the minimum number of
matching Pods in an eligible domain, or zero if the number of eligible domains is
less than minDomains .
When the number of eligible domains with matching topology keys equals or is
greater than minDomains , this value has no effect on scheduling.
If you do not specify minDomains , the constraint behaves as if minDomains is 1.
topologyKey is the key of node labels. Nodes that have a label with this key and
identical values are considered to be in the same topology. We call each instance of a
topology (in other words, a <key, value> pair) a domain. The scheduler will try to put a
balanced number of pods into each domain. Also, we define an eligible domain as a
domain whose nodes meet the requirements of nodeAffinityPolicy and
nodeTaintsPolicy.
whenUnsatisfiable indicates how to deal with a Pod if it doesn't satisfy the spread constraint:
DoNotSchedule (default) tells the scheduler not to schedule it.
ScheduleAnyway tells the scheduler to still schedule it while prioritizing nodes that minimize the skew.
labelSelector is used to find matching Pods. Pods that match this label selector are
counted to determine the number of Pods in their corresponding topology domain. See
Label Selectors for more details.
matchLabelKeys is a list of pod label keys to select the pods over which spreading will
be calculated. The keys are used to lookup values from the pod labels, those key-value
labels are ANDed with labelSelector to select the group of existing pods over which
spreading will be calculated for the incoming pod. The same key is forbidden to exist in
both matchLabelKeys and labelSelector . matchLabelKeys cannot be set when
labelSelector isn't set. Keys that don't exist in the pod labels will be ignored. A null or
empty list means only match against the labelSelector .
With matchLabelKeys , you don't need to update the pod.spec between different
revisions. The controller/operator just needs to set different values to the same label key
for different revisions. The scheduler will assume the values automatically based on
matchLabelKeys . For example, if you are configuring a Deployment, you can use the
label keyed with pod-template-hash, which is added automatically by the Deployment
controller, to distinguish between different revisions in a single Deployment.
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: foo
matchLabelKeys:
- pod-template-hash
Note: The matchLabelKeys field is a beta-level field and enabled by default in 1.27.
You can disable it by disabling the MatchLabelKeysInPodTopologySpread feature
gate.
nodeAffinityPolicy indicates how we will treat Pod's nodeAffinity/nodeSelector when calculating pod topology spread skew. Options are:
Honor: only nodes matching nodeAffinity/nodeSelector are included in the calculations.
Ignore: nodeAffinity/nodeSelector are ignored. All nodes are included in the calculations.
If this value is null, the behavior is equivalent to the Honor policy.
nodeTaintsPolicy indicates how we will treat node taints when calculating pod topology
spread skew. Options are:
Honor: nodes without taints, along with tainted nodes for which the incoming pod
has a toleration, are included.
Ignore: node taints are ignored. All nodes are included.
If this value is null, the behavior is equivalent to the Ignore policy.
Note: The nodeTaintsPolicy is a beta-level field and enabled by default in 1.26. You
can disable it by disabling the NodeInclusionPolicyInPodTopologySpread feature
gate.
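Putting the two policies together, a sketch of a constraint that only counts nodes the Pod could actually be placed on (nodes matching its node affinity whose taints it tolerates) looks like this fragment of a Pod spec:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  # Only consider nodes that match this Pod's nodeAffinity/nodeSelector ...
  nodeAffinityPolicy: Honor
  # ... and whose taints this Pod tolerates.
  nodeTaintsPolicy: Honor
  labelSelector:
    matchLabels:
      app: foo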
When a Pod defines more than one topologySpreadConstraint , those constraints are
combined using a logical AND operation: the kube-scheduler looks for a node for the
incoming Pod that satisfies all the configured constraints.
Node labels
Topology spread constraints rely on node labels to identify the topology domain(s) that each
node is in. For example, a node might have labels:
region: us-east-1
zone: us-east-1a
Note:
For brevity, this example doesn't use the well-known label keys
topology.kubernetes.io/zone and topology.kubernetes.io/region . However, those
registered label keys are nonetheless recommended rather than the private (unqualified)
label keys region and zone that are used here.
You can't make a reliable assumption about the meaning of a private label key between
different contexts.
Suppose you have a 4-node cluster where node1 and node2 are labelled zone: zoneA and node3 and node4 are labelled zone: zoneB .
Consistency
You should set the same Pod topology spread constraints on all pods in a group.
Usually, if you are using a workload controller such as a Deployment, the pod template takes
care of this for you. If you mix different spread constraints then Kubernetes follows the API
definition of the field; however, the behavior is more likely to become confusing and
troubleshooting is less straightforward.
You need a mechanism to ensure that all the nodes in a topology domain (such as a cloud
provider region) are labelled consistently. To avoid you needing to manually label nodes, most
clusters automatically populate well-known labels such as kubernetes.io/hostname . Check
whether your cluster supports this.
Suppose that in the 4-node cluster above, 3 Pods labelled foo: bar are located on node1 , node2 and node3 respectively.
If you want an incoming Pod to be evenly spread with existing Pods across zones, you can use
a manifest similar to:
pods/topology-spread-constraints/one-constraint.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: registry.k8s.io/pause:3.1
From that manifest, topologyKey: zone implies the even distribution will only be applied to
nodes that are labelled zone: <any value> (nodes that don't have a zone label are skipped).
The field whenUnsatisfiable: DoNotSchedule tells the scheduler to let the incoming Pod stay
pending if the scheduler can't find a way to satisfy the constraint.
If the scheduler placed this incoming Pod into zone A , the distribution of Pods would become
[3, 1] . That means the actual skew is then 2 (calculated as 3 - 1 ), which violates maxSkew:
1 . To satisfy the constraints and context for this example, the incoming Pod can only be placed onto a node in zone B (that is, onto node3 or node4 ).
You can tweak the Pod spec to meet various kinds of requirements:
Change maxSkew to a bigger value - such as 2 - so that the incoming Pod can be placed
into zone A as well.
Change topologyKey to node so as to distribute the Pods evenly across nodes instead
of zones. In the above example, if maxSkew remains 1 , the incoming Pod can only be
placed onto the node node4 .
You can combine two topology spread constraints to control the spread of Pods both by node
and by zone:
pods/topology-spread-constraints/two-constraints.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
- maxSkew: 1
topologyKey: node
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
containers:
- name: pause
image: registry.k8s.io/pause:3.1
In this case, to match the first constraint, the incoming Pod can only be placed onto nodes in
zone B ; while in terms of the second constraint, the incoming Pod can only be scheduled to
the node node4 . The scheduler only considers options that satisfy all defined constraints, so
the only valid placement is onto node node4 .
If you were to apply two-constraints.yaml (the manifest from the previous example) to a cluster where the existing Pods labelled foo: bar are distributed so that the two constraints cannot be satisfied at the same time, you would see that the Pod mypod stays in the Pending state. This happens because:
to satisfy the first constraint, the Pod mypod can only be placed into zone B ; while in terms
of the second constraint, the Pod mypod can only schedule to node node2 . The intersection
of the two constraints returns an empty set, and the scheduler cannot place the Pod.
To overcome this situation, you can either increase the value of maxSkew or modify one of the
constraints to use whenUnsatisfiable: ScheduleAnyway . Depending on circumstances, you
might also decide to delete an existing Pod manually - for example, if you are troubleshooting
why a bug-fix rollout is not making progress.
Now suppose the cluster also contains a zone C with a node node5 , and you know that zone C must be excluded. In this case, you can compose a manifest as
below, so that Pod mypod will be placed into zone B instead of zone C . Similarly,
Kubernetes also respects spec.nodeSelector .
pods/topology-spread-constraints/one-constraint-with-nodeaffinity.yaml
kind: Pod
apiVersion: v1
metadata:
name: mypod
labels:
foo: bar
spec:
topologySpreadConstraints:
- maxSkew: 1
topologyKey: zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
foo: bar
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: zone
operator: NotIn
values:
- zoneC
containers:
- name: pause
image: registry.k8s.io/pause:3.1
Implicit conventions
There are some implicit conventions worth noting here:
Only the Pods holding the same namespace as the incoming Pod can be matching
candidates.
The scheduler bypasses any nodes that don't have any of the topologySpreadConstraints[*].topologyKey labels present. This implies that:
1. any Pods located on those bypassed nodes do not impact maxSkew calculation - in the above example, suppose the node node1 does not have a label "zone", then the 2 Pods will be disregarded, hence the incoming Pod will be scheduled into zone A .
2. the incoming Pod has no chances to be scheduled onto this kind of nodes - in the
above example, suppose a node node5 has the mistyped label zone-typo: zoneC
(and no zone label set). After node node5 joins the cluster, it will be bypassed and
Pods for this workload aren't scheduled there.
Be aware of what will happen if the incoming Pod's
topologySpreadConstraints[*].labelSelector doesn't match its own labels. In the
above example, if you remove the incoming Pod's labels, it can still be placed onto nodes
in zone B , since the constraints are still satisfied. However, after that placement, the
degree of imbalance of the cluster remains unchanged - it's still zone A having 2 Pods
labelled as foo: bar , and zone B having 1 Pod labelled as foo: bar . If this is not what
you expect, update the workload's topologySpreadConstraints[*].labelSelector to
match the labels in the pod template.
Cluster-level default constraints
You can set default topology spread constraints for a cluster. Default constraints apply only to Pods that don't define any constraints in their own .spec.topologySpreadConstraints and that belong to a Service, ReplicaSet, StatefulSet or ReplicationController. They are set as part of the PodTopologySpread plugin arguments in a scheduling profile:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: ScheduleAnyway
defaultingType: List
If you don't configure any cluster-level default constraints for pod topology spreading, then
kube-scheduler acts as if you specified the following default topology constraints:
defaultConstraints:
- maxSkew: 3
topologyKey: "kubernetes.io/hostname"
whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
topologyKey: "topology.kubernetes.io/zone"
whenUnsatisfiable: ScheduleAnyway
Also, the legacy SelectorSpread plugin, which provides an equivalent behavior, is disabled by
default.
Note:
The PodTopologySpread plugin does not score the nodes that don't have the topology
keys specified in the spreading constraints. This might result in a different default
behavior compared to the legacy SelectorSpread plugin when using the default topology
constraints.
If you don't want to use the default Pod spreading constraints for your cluster, you can
disable those defaults by setting defaultingType to List and leaving empty
defaultConstraints in the PodTopologySpread plugin configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
pluginConfig:
- name: PodTopologySpread
args:
defaultConstraints: []
defaultingType: List
Comparison with podAffinity and podAntiAffinity
In Kubernetes, inter-Pod affinity and anti-affinity control how Pods are scheduled in relation to one another:
podAffinity attracts Pods; you can try to pack any number of Pods into qualifying topology domain(s).
podAntiAffinity repels Pods. If you set this to requiredDuringSchedulingIgnoredDuringExecution mode then only a single Pod can be scheduled into a single topology domain; if you prefer preferredDuringSchedulingIgnoredDuringExecution then you lose the ability to enforce the constraint.
For finer control, you can specify topology spread constraints to distribute Pods across
different topology domains - to achieve either high availability or cost-saving. This can also
help on rolling update workloads and scaling out replicas smoothly.
For more context, see the Motivation section of the enhancement proposal about Pod
topology spread constraints.
Known limitations
There's no guarantee that the constraints remain satisfied when Pods are removed. For
example, scaling down a Deployment may result in imbalanced Pods distribution.
You can use a tool such as the Descheduler to rebalance the Pods distribution.
The scheduler doesn't have prior knowledge of all the zones or other topology domains
that a cluster has. They are determined from the existing nodes in the cluster. This could
lead to a problem in autoscaled clusters, when a node pool (or node group) is scaled to
zero nodes, and you're expecting the cluster to scale up, because, in this case, those
topology domains won't be considered until there is at least one node in them.
You can work around this by using a cluster autoscaling tool that is aware of Pod topology spread constraints and is also aware of the overall set of topology domains.
What's next
The blog article Introducing PodTopologySpread explains maxSkew in some detail, as
well as covering some advanced usage examples.
Read the scheduling section of the API reference for Pod.
6 - Taints and Tolerations
Node affinity is a property of Pods that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite -- they allow a node to repel a set of pods.
Tolerations are applied to pods. Tolerations allow the scheduler to schedule pods with
matching taints. Tolerations allow scheduling but don't guarantee scheduling: the scheduler
also evaluates other parameters as part of its function.
Taints and tolerations work together to ensure that pods are not scheduled onto
inappropriate nodes. One or more taints are applied to a node; this marks that the node
should not accept any pods that do not tolerate the taints.
Concepts
You add a taint to a node using kubectl taint. For example,
kubectl taint nodes node1 key1=value1:NoSchedule
places a taint on node node1 . The taint has key key1 , value value1 , and taint effect
NoSchedule . This means that no pod will be able to schedule onto node1 unless it has a
matching toleration.
To remove the taint added by the command above, you can run:
kubectl taint nodes node1 key1=value1:NoSchedule-
You specify a toleration for a pod in the PodSpec. Both of the following tolerations "match"
the taint created by the kubectl taint line above, and thus a pod with either toleration
would be able to schedule onto node1 :
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
tolerations:
- key: "key1"
operator: "Exists"
effect: "NoSchedule"
pods/pod-with-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
name: nginx
labels:
env: test
spec:
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
tolerations:
- key: "example-key"
operator: "Exists"
effect: "NoSchedule"
A toleration "matches" a taint if the keys are the same and the effects are the same, and:
Note:
There are two special cases:
An empty key with operator Exists matches all keys, values and effects which means
this will tolerate everything.
The above example used effect of NoSchedule . Alternatively, you can use effect of
PreferNoSchedule . This is a "preference" or "soft" version of NoSchedule -- the system will try
to avoid placing a pod that does not tolerate the taint on the node, but it is not required. The
third kind of effect is NoExecute , described later.
You can put multiple taints on the same node and multiple tolerations on the same pod. The
way Kubernetes processes multiple taints and tolerations is like a filter: start with all of a
node's taints, then ignore the ones for which the pod has a matching toleration; the
remaining un-ignored taints have the indicated effects on the pod. In particular,
if there is at least one un-ignored taint with effect NoSchedule then Kubernetes will not
schedule the pod onto that node
if there is no un-ignored taint with effect NoSchedule but there is at least one un-
ignored taint with effect PreferNoSchedule then Kubernetes will try to not schedule the
pod onto the node
if there is at least one un-ignored taint with effect NoExecute then the pod will be
evicted from the node (if it is already running on the node), and will not be scheduled
onto the node (if it is not yet running on the node).
For example, imagine you taint a node like this:
kubectl taint nodes node1 key1=value1:NoSchedule
kubectl taint nodes node1 key1=value1:NoExecute
kubectl taint nodes node1 key2=value2:NoSchedule
And a pod has two tolerations:
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
In this case, the pod will not be able to schedule onto the node, because there is no toleration
matching the third taint. But it will be able to continue running if it is already running on the
node when the taint is added, because the third taint is the only one of the three that is not
tolerated by the pod.
Normally, if a taint with effect NoExecute is added to a node, then any pods that do not
tolerate the taint will be evicted immediately, and pods that do tolerate the taint will never be
evicted. However, a toleration with NoExecute effect can specify an optional
tolerationSeconds field that dictates how long the pod will stay bound to the node after the
taint is added. For example,
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoExecute"
tolerationSeconds: 3600
means that if this pod is running and a matching taint is added to the node, then the pod will
stay bound to the node for 3600 seconds, and then be evicted. If the taint is removed before
that time, the pod will not be evicted.
Example Use Cases
Taints and tolerations are a flexible way to steer pods away from nodes or to evict pods that shouldn't be running. A few of the use cases are:
Dedicated Nodes: If you want to dedicate a set of nodes for exclusive use by a
particular set of users, you can add a taint to those nodes (say, kubectl taint nodes
nodename dedicated=groupName:NoSchedule ) and then add a corresponding toleration to
their pods (this would be done most easily by writing a custom admission controller).
The pods with the tolerations will then be allowed to use the tainted (dedicated) nodes
as well as any other nodes in the cluster. If you want to dedicate the nodes to them and
ensure they only use the dedicated nodes, then you should additionally add a label
similar to the taint to the same set of nodes (e.g. dedicated=groupName ), and the
admission controller should additionally add a node affinity to require that the pods can
only schedule onto nodes labeled with dedicated=groupName .
Nodes with Special Hardware: In a cluster where a small subset of nodes have
specialized hardware (for example GPUs), it is desirable to keep pods that don't need the
specialized hardware off of those nodes, thus leaving room for later-arriving pods that
do need the specialized hardware. This can be done by tainting the nodes that have the
specialized hardware (e.g. kubectl taint nodes nodename special=true:NoSchedule or
kubectl taint nodes nodename special=true:PreferNoSchedule ) and adding a
corresponding toleration to pods that use the special hardware. As in the dedicated
nodes use case, it is probably easiest to apply the tolerations using a custom admission
controller. For example, it is recommended to use Extended Resources to represent the
special hardware, taint your special hardware nodes with the extended resource name
and run the ExtendedResourceToleration admission controller. Now, because the nodes
are tainted, no pods without the toleration will schedule on them. But when you submit
a pod that requests the extended resource, the ExtendedResourceToleration admission
controller will automatically add the correct toleration to the pod and that pod will
schedule on the special hardware nodes. This will make sure that these special
hardware nodes are dedicated for pods requesting such hardware and you don't have to
manually add tolerations to your pods.
Taint based Evictions: A per-pod-configurable eviction behavior when there are node
problems, which is described in the next section.
Taint based Evictions
The NoExecute taint effect, mentioned above, affects pods that are already running on the node as follows:
pods that do not tolerate the taint are evicted immediately
pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever
pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time; after that time elapses, the node lifecycle controller evicts the pods from the node
The node controller automatically taints a Node when certain conditions are true. The following taints are built in:
node.kubernetes.io/not-ready
node.kubernetes.io/unreachable
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure
node.kubernetes.io/network-unavailable
node.kubernetes.io/unschedulable
node.cloudprovider.kubernetes.io/uninitialized
In case a node is to be evicted, the node controller or the kubelet adds relevant taints with
NoExecute effect. If the fault condition returns to normal the kubelet or node controller can
remove the relevant taint(s).
In some cases when the node is unreachable, the API server is unable to communicate with
the kubelet on the node. The decision to delete the pods cannot be communicated to the
kubelet until communication with the API server is re-established. In the meantime, the pods
that are scheduled for deletion may continue to run on the partitioned node.
Note: The control plane limits the rate of adding new taints to nodes. This rate
limiting manages the number of evictions that are triggered when many nodes become
unreachable at once (for example: if there is a network disruption).
You can specify tolerationSeconds for a Pod to define how long that Pod stays bound to a
failing or unresponsive Node.
For example, you might want to keep an application with a lot of local state bound to the node for
a long time in the event of network partition, hoping that the partition will recover and thus
the pod eviction can be avoided. The toleration you set for that Pod might look like:
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute"
tolerationSeconds: 6000
Note:
Kubernetes automatically adds a toleration for node.kubernetes.io/not-ready and
node.kubernetes.io/unreachable with tolerationSeconds=300 , unless you, or a
controller, set those tolerations explicitly.
These automatically-added tolerations mean that Pods remain bound to Nodes for 5
minutes after one of these problems is detected.
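In a Pod's spec, those automatically added tolerations look roughly like this sketch:
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300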
DaemonSet pods are created with NoExecute tolerations for the following taints with no
tolerationSeconds :
node.kubernetes.io/unreachable
node.kubernetes.io/not-ready
This ensures that DaemonSet pods are never evicted due to these problems.
Taint Nodes by Condition
The scheduler checks taints, not node conditions, when it makes scheduling decisions. This
ensures that node conditions don't directly affect scheduling. For example, if the
DiskPressure node condition is active, the control plane adds the node.kubernetes.io/disk-
pressure taint and does not schedule new pods onto the affected node. If the
MemoryPressure node condition is active, the control plane adds the
node.kubernetes.io/memory-pressure taint.
You can ignore node conditions for newly created pods by adding the corresponding Pod
tolerations. The control plane also adds the node.kubernetes.io/memory-pressure toleration
on pods that have a QoS class other than BestEffort . This is because Kubernetes treats
pods in the Guaranteed or Burstable QoS classes (even pods with no memory request set)
as if they are able to cope with memory pressure, while new BestEffort pods are not
scheduled onto the affected node.
The DaemonSet controller automatically adds the following NoSchedule tolerations to all
daemons, to prevent DaemonSets from breaking.
node.kubernetes.io/memory-pressure
node.kubernetes.io/disk-pressure
node.kubernetes.io/pid-pressure (1.14 or later)
node.kubernetes.io/unschedulable (1.10 or later)
node.kubernetes.io/network-unavailable (host network only)
Adding these tolerations ensures backward compatibility. You can also add arbitrary
tolerations to DaemonSets.
What's next
Read about Node-pressure Eviction and how you can configure it
Read about Pod Priority
7 - Scheduling Framework
FEATURE STATE: Kubernetes v1.19 [stable]
The scheduling framework is a pluggable architecture for the Kubernetes scheduler. It adds a
new set of "plugin" APIs to the existing scheduler. Plugins are compiled into the scheduler.
The APIs allow most scheduling features to be implemented as plugins, while keeping the
scheduling "core" lightweight and maintainable. Refer to the design proposal of the
scheduling framework for more technical information on the design of the framework.
Framework workflow
The Scheduling Framework defines a few extension points. Scheduler plugins register to be
invoked at one or more extension points. Some of these plugins can change the scheduling
decisions and some are informational only.
Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the
binding cycle.
Scheduling cycles are run serially, while binding cycles may run concurrently.
Extension points
The following picture shows the scheduling context of a Pod and the extension points that the
scheduling framework exposes. In this picture "Filter" is equivalent to "Predicate" and
"Scoring" is equivalent to "Priority function".
One plugin may register at multiple extension points to perform more complex or stateful
tasks.
PreEnqueue
These plugins are called prior to adding Pods to the internal active queue, where Pods are
marked as ready for scheduling.
Only when all PreEnqueue plugins return Success is the Pod allowed to enter the active
queue. Otherwise, it is placed in the internal list of unschedulable Pods and does not get an
Unschedulable condition.
For more details about how internal scheduler queues work, read Scheduling queue in kube-
scheduler.
QueueSort
These plugins are used to sort Pods in the scheduling queue. A queue sort plugin essentially
provides a Less(Pod1, Pod2) function. Only one queue sort plugin may be enabled at a time.
PreFilter
These plugins are used to pre-process info about the Pod, or to check certain conditions that
the cluster or the Pod must meet. If a PreFilter plugin returns an error, the scheduling cycle is
aborted.
Filter
These plugins are used to filter out nodes that cannot run the Pod. For each node, the
scheduler will call filter plugins in their configured order. If any filter plugin marks the node as
infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated
concurrently.
PostFilter
These plugins are called after Filter phase, but only when no feasible nodes were found for
the pod. Plugins are called in their configured order. If any postFilter plugin marks the node as
Schedulable , the remaining plugins will not be called. A typical PostFilter implementation is
preemption, which tries to make the pod schedulable by preempting other Pods.
PreScore
These plugins are used to perform "pre-scoring" work, which generates a sharable state for
Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.
Score
These plugins are used to rank nodes that have passed the filtering phase. The scheduler will
call each scoring plugin for each node. There will be a well defined range of integers
representing the minimum and maximum scores. After the NormalizeScore phase, the
scheduler will combine node scores from all plugins according to the configured plugin
weights.
NormalizeScore
These plugins are used to modify scores before the scheduler computes a final ranking of
Nodes. A plugin that registers for this extension point will be called with the Score results
from the same plugin. This is called once per plugin per scheduling cycle.
For example, suppose a plugin BlinkingLightScorer ranks Nodes based on how many
blinking lights they have.
However, the maximum count of blinking lights may be small compared to NodeScoreMax . To
fix this, BlinkingLightScorer should also register for this extension point.
Note: Plugins wishing to perform "pre-reserve" work should use the NormalizeScore
extension point.
Reserve
A plugin that implements the Reserve extension has two methods, namely Reserve and
Unreserve , that back two informational scheduling phases called Reserve and Unreserve,
respectively. Plugins which maintain runtime state (aka "stateful plugins") should use these
phases to be notified by the scheduler when resources on a node are being reserved and
unreserved for a given Pod.
The Reserve phase happens before the scheduler actually binds a Pod to its designated node.
It exists to prevent race conditions while the scheduler waits for the bind to succeed. The
Reserve method of each Reserve plugin may succeed or fail; if one Reserve method call
fails, subsequent plugins are not executed and the Reserve phase is considered to have failed.
If the Reserve method of all plugins succeeds, the Reserve phase is considered
successful and the rest of the scheduling cycle and the binding cycle are executed.
The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this
happens, the Unreserve method of all Reserve plugins will be executed in the reverse order
of Reserve method calls. This phase exists to clean up the state associated with the reserved
Pod.
Permit
Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay
the binding to the candidate node. A permit plugin can do one of the three things:
1. approve
Once all Permit plugins approve a Pod, it is sent for binding.
2. deny
If any Permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger
the Unreserve phase in Reserve plugins.
Note: While any plugin can access the list of "waiting" Pods and approve them (see
FrameworkHandle), we expect only the permit plugins to approve binding of reserved Pods
that are in "waiting" state. Once a Pod is approved, it is sent to the PreBind phase.
PreBind
These plugins are used to perform any work required before a Pod is bound. For example, a
pre-bind plugin may provision a network volume and mount it on the target node before
allowing the Pod to run there.
If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling
queue.
Bind
These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all
PreBind plugins have completed. Each bind plugin is called in the configured order. A bind
plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle
a Pod, the remaining bind plugins are skipped.
PostBind
This is an informational extension point. Post-bind plugins are called after a Pod is
successfully bound. This is the end of a binding cycle, and can be used to clean up associated
resources.
Plugin API
There are two steps to the plugin API. First, plugins must register and get configured, then
they use the extension point interfaces. Extension point interfaces have the following form.
// ...
Plugin configuration
You can enable or disable plugins in the scheduler configuration. If you are using Kubernetes
v1.18 or later, most scheduling plugins are in use and enabled by default.
In addition to default plugins, you can also implement your own scheduling plugins and get
them configured along with default plugins. You can visit scheduler-plugins for more details.
If you are using Kubernetes v1.18 or later, you can configure a set of plugins as a scheduler
profile and then define multiple profiles to fit various kinds of workload. Learn more at
multiple profiles.
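As a rough sketch of what such a configuration might look like, the following defines two profiles: one with the default plugins and one hypothetical profile that disables all scoring plugins. The profile names are illustrative, not taken from this page:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler     # hypothetical profile name
  plugins:
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'

Pods then opt into a profile by setting .spec.schedulerName to the profile's schedulerName.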
8 - Dynamic Resource Allocation
Dynamic resource allocation is a new API for requesting and sharing resources between pods
and containers inside a pod. It is a generalization of the persistent volumes API for generic
resources. Third-party resource drivers are responsible for tracking and allocating resources.
Different kinds of resources support arbitrary parameters for defining requirements and
initialization.
API
The resource.k8s.io/v1alpha2 API group provides four new types:
ResourceClass
Defines which resource driver handles a certain kind of resource and provides common
parameters for it. ResourceClasses are created by a cluster administrator when installing a
resource driver.
ResourceClaim
Defines a particular resource instance that is required by a workload. Created by a user
(lifecycle managed manually, can be shared between different Pods) or for individual Pods
by the control plane based on a ResourceClaimTemplate (automatic lifecycle, typically used
by just one Pod).
ResourceClaimTemplate
Defines the spec and some metadata for creating ResourceClaims. Created by a user
when deploying a workload.
PodSchedulingContext
Used internally by the control plane and resource drivers to coordinate pod scheduling
when ResourceClaims need to be allocated for a Pod.
Parameters for ResourceClass and ResourceClaim are stored in separate objects, typically
using the type defined by a CRD that was created when installing a resource driver.
The core/v1 PodSpec defines ResourceClaims that are needed for a Pod in a new
resourceClaims field. Entries in that list reference either a ResourceClaim or a
ResourceClaimTemplate. When referencing a ResourceClaim, all Pods using this PodSpec (for
example, inside a Deployment or StatefulSet) share the same ResourceClaim instance. When
referencing a ResourceClaimTemplate, each Pod gets its own instance.
The resources.claims list for container resources defines whether a container gets access to
these resource instances, which makes it possible to share resources between one or more
containers.
Here is an example for a fictional resource driver. Two ResourceClaim objects will get created
for this Pod and each container gets access to one of them.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClass
name: resource.example.com
driverName: resource-driver.example.com
---
apiVersion: cats.resource.example.com/v1
kind: ClaimParameters
name: large-black-cat-claim-parameters
spec:
  color: black
  size: large
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: large-black-cat-claim-template
spec:
  spec:
    resourceClassName: resource.example.com
    parametersRef:
      apiGroup: cats.resource.example.com
      kind: ClaimParameters
      name: large-black-cat-claim-parameters
---
apiVersion: v1
kind: Pod
metadata:
  name: pod-with-cats
spec:
  containers:
  - name: container0
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-0
  - name: container1
    image: ubuntu:20.04
    command: ["sleep", "9999"]
    resources:
      claims:
      - name: cat-1
  resourceClaims:
  - name: cat-0
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
  - name: cat-1
    source:
      resourceClaimTemplateName: large-black-cat-claim-template
Scheduling
In contrast to native resources (CPU, RAM) and extended resources (managed by a device
plugin, advertised by kubelet), the scheduler has no knowledge of what dynamic resources
are available in a cluster or how they could be split up to satisfy the requirements of a specific
ResourceClaim. Resource drivers are responsible for that. They mark ResourceClaims as
"allocated" once resources for them are reserved. This then tells the scheduler where in the
cluster a ResourceClaim is available.
ResourceClaims can get allocated as soon as they are created ("immediate allocation"),
without considering which Pods will use them. The default is to delay allocation until a Pod
that needs the ResourceClaim gets scheduled ("wait for first consumer").
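As a hedged sketch, a user-created claim could opt into immediate allocation via the allocationMode field. This assumes the resource.k8s.io/v1alpha2 schema as the author understands it; the claim name is hypothetical, and the class name reuses the fictional driver from the example above:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: shared-cat                        # hypothetical name
spec:
  resourceClassName: resource.example.com
  # Allocate as soon as the claim is created, instead of waiting
  # for the first consuming Pod to be scheduled (the default).
  allocationMode: Immediate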
In the default wait-for-first-consumer mode, the scheduler checks all ResourceClaims needed by a Pod and creates a
PodSchedulingContext object where it informs the resource drivers responsible for those
ResourceClaims about nodes that the scheduler considers suitable for the Pod. The resource
drivers respond by excluding nodes that don't have enough of the driver's resources left.
Once the scheduler has that information, it selects one node and stores that choice in the
PodSchedulingContext object. The resource drivers then allocate their ResourceClaims so that the
resources will be available on that node. Once that is complete, the Pod gets scheduled.
As part of this process, ResourceClaims also get reserved for the Pod. Currently
ResourceClaims can either be used exclusively by a single Pod or an unlimited number of
Pods.
One key feature is that Pods do not get scheduled to a node unless all of their resources are
allocated and reserved. This avoids the scenario where a Pod gets scheduled onto one node
and then cannot run there, which is bad because such a pending Pod also blocks all other
resources like RAM or CPU that were set aside for it.
Monitoring resources
The kubelet provides a gRPC service to enable discovery of dynamic resources of running
Pods. For more information on the gRPC endpoints, see the resource allocation reporting.
Limitations
The scheduler plugin must be involved in scheduling Pods which use ResourceClaims.
Bypassing the scheduler by setting the nodeName field leads to Pods that the kubelet refuses
to start because the ResourceClaims are not reserved or not even allocated. It may be
possible to remove this limitation in the future.
A quick check whether a Kubernetes cluster supports the feature is to list ResourceClass
objects, for example with kubectl get resourceclasses.
If your cluster supports dynamic resource allocation, the response is either a list of
ResourceClass objects or:
No resources found
If the feature is not supported, the server instead reports that it does not have a resource type
called resourceclasses.
In addition to enabling the feature in the cluster, a resource driver also has to be installed.
Please refer to the driver's documentation for details.
What's next
For more information on the design, see the Dynamic Resource Allocation KEP.
9 - Scheduler Performance Tuning
Nodes in a cluster that meet the scheduling requirements of a Pod are called feasible Nodes
for the Pod. The scheduler finds feasible Nodes for a Pod and then runs a set of functions to
score the feasible Nodes, picking a Node with the highest score among the feasible ones to
run the Pod. The scheduler then notifies the API server about this decision in a process called
Binding.
This page explains performance tuning optimizations that are relevant for large Kubernetes
clusters.
In large clusters, you can tune the scheduler's behaviour to balance scheduling outcomes
between latency (new Pods are placed quickly) and accuracy (the scheduler rarely makes poor
placement decisions).
You configure this tuning via the kube-scheduler setting percentageOfNodesToScore . This
KubeSchedulerConfiguration setting determines a threshold for scheduling nodes in your
cluster.
To change the value, edit the kube-scheduler configuration file and then restart the scheduler.
In many cases, the configuration file can be found at /etc/kubernetes/config/kube-
scheduler.yaml .
You specify a threshold for how many nodes are enough, as a whole number percentage of all
the nodes in your cluster. The kube-scheduler converts this into an integer number of nodes.
During scheduling, if the kube-scheduler has identified enough feasible nodes to exceed the
configured percentage, the kube-scheduler stops searching for more feasible nodes and
moves on to the scoring phase.
How the scheduler iterates over Nodes describes the process in detail.
Default threshold
If you don't specify a threshold, Kubernetes calculates a figure using a linear formula that
yields 50% for a 100-node cluster and yields 10% for a 5000-node cluster. The lower bound for
the automatic value is 5%.
This means that the kube-scheduler always scores at least 5% of your cluster, no matter how
large the cluster is, unless you have explicitly set percentageOfNodesToScore to be smaller
than 5.
If you want the scheduler to score all nodes in your cluster, set percentageOfNodesToScore to
100.
Example
Below is an example configuration that sets percentageOfNodesToScore to 50%.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider

...

percentageOfNodesToScore: 50
Tuning percentageOfNodesToScore
percentageOfNodesToScore must be a value between 1 and 100 with the default value being
calculated based on the cluster size. There is also a hardcoded minimum value of 50 nodes.
Note:
In clusters with fewer than 50 feasible nodes, the scheduler still checks all the nodes
because there are not enough feasible nodes to stop the scheduler's search early.
In a small cluster, if you set a low value for percentageOfNodesToScore , your change will
have little or no effect, for a similar reason.
If your cluster has several hundred Nodes or fewer, leave this configuration option at its
default value. Making changes is unlikely to improve the scheduler's performance
significantly.
An important detail to consider when setting this value is that when a smaller number of
nodes in a cluster are checked for feasibility, some nodes are not sent to be scored for a given
Pod. As a result, a Node which could possibly score a higher value for running the given Pod
might not even be passed to the scoring phase. This would result in a less than ideal
placement of the Pod.
You should avoid setting percentageOfNodesToScore very low so that kube-scheduler does
not make frequent, poor Pod placement decisions. Avoid setting the percentage to anything
below 10%, unless the scheduler's throughput is critical for your application and the score of
nodes is not important. In other words, you prefer to run the Pod on any Node as long as it is
feasible.
In order to give all the Nodes in a cluster a fair chance of being considered for running Pods,
the scheduler iterates over the nodes in a round robin fashion. You can imagine that Nodes
are in an array. The scheduler starts from the start of the array and checks feasibility of the
nodes until it finds enough Nodes as specified by percentageOfNodesToScore . For the next
Pod, the scheduler continues from the point in the Node array that it stopped at when
checking feasibility of Nodes for the previous Pod.
If Nodes are in multiple zones, the scheduler iterates over Nodes in various zones to ensure
that Nodes from different zones are considered in the feasibility checks. As an example,
consider six nodes in two zones:

Zone 1: Node 1, Node 2, Node 3, Node 4
Zone 2: Node 5, Node 6

The scheduler evaluates feasibility of the nodes in this order: Node 1, Node 5, Node 2, Node 6,
Node 3, Node 4. After going over all the Nodes, it goes back to Node 1.
What's next
Check the kube-scheduler configuration reference (v1beta3)
10 - Resource Bin Packing

You can configure the kube-scheduler to favor bin packing of resources by using the
MostAllocated or RequestedToCapacityRatio scoring strategies of the NodeResourcesFit plugin.
To set the MostAllocated strategy for the NodeResourcesFit plugin, use a scheduler
configuration similar to the following:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
        - name: intel.com/foo
          weight: 3
        - name: intel.com/bar
          weight: 3
        type: MostAllocated
    name: NodeResourcesFit
To learn more about other parameters and their default configuration, see the API
documentation for NodeResourcesFitArgs .
Below is an example configuration that sets the bin packing behavior for extended resources
intel.com/foo and intel.com/bar using the requestedToCapacityRatio field.
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - args:
      scoringStrategy:
        resources:
        - name: intel.com/foo
          weight: 3
        - name: intel.com/bar
          weight: 3
        requestedToCapacityRatio:
          shape:
          - utilization: 0
            score: 0
          - utilization: 100
            score: 10
        type: RequestedToCapacityRatio
    name: NodeResourcesFit
To learn more about other parameters and their default configuration, see the API
documentation for NodeResourcesFitArgs .
The shape field in requestedToCapacityRatio specifies the behavior of the RequestedToCapacityRatio scoring function:

shape:
- utilization: 0
  score: 0
- utilization: 100
  score: 10
The above arguments give the node a score of 0 if utilization is 0% and a score of 10 if
utilization is 100%, thus enabling bin-packing behavior. To enable least-requested behavior
instead, the score values must be reversed as follows:
shape:
- utilization: 0
  score: 10
- utilization: 100
  score: 0
resources is an optional parameter that defaults to:

resources:
- name: cpu
  weight: 1
- name: memory
  weight: 1
It can be extended to include extended resources as follows:

resources:
- name: intel.com/foo
  weight: 5
- name: cpu
  weight: 3
- name: memory
  weight: 1
The weight parameter is optional and is set to 1 if not specified. Also, the weight cannot be
set to a negative value.
As a worked example of how a node's score is computed with this strategy, consider the following.
Requested resources:
intel.com/foo : 2
memory: 256MB
cpu: 2
Resource weights:
intel.com/foo : 5
memory: 1
cpu: 3
Node 1 spec:
Available:
intel.com/foo: 4
memory: 1 GB
cpu: 8
Used:
intel.com/foo: 1
memory: 256MB
cpu: 1
Node score:
intel.com/foo = resourceScoringFunction((2+1), 4)
              = (100 - ((4-3)*100/4))
              = (100 - 25)
              = 75                        # requested + used = 75% * available
              = rawScoringFunction(75)
              = 7                         # floor(75/10)

memory        = resourceScoringFunction((256+256), 1024)
              = (100 - ((1024-512)*100/1024))
              = 50                        # requested + used = 50% * available
              = rawScoringFunction(50)
              = 5                         # floor(50/10)

cpu           = resourceScoringFunction((2+1), 8)
              = (100 - ((8-3)*100/8))
              = 37.5                      # requested + used = 37.5% * available
              = rawScoringFunction(37.5)
              = 3                         # floor(37.5/10)

NodeScore     = ((7 * 5) + (5 * 1) + (3 * 3)) / (5 + 1 + 3)
              = 5
Node 2 spec:
Available:
intel.com/foo: 8
memory: 1GB
cpu: 8
Used:
intel.com/foo: 2
memory: 512MB
cpu: 6
Node score:
intel.com/foo = resourceScoringFunction((2+2), 8)
              = (100 - ((8-4)*100/8))
              = (100 - 50)
              = 50
              = rawScoringFunction(50)
              = 5

memory        = resourceScoringFunction((256+512), 1024)
              = (100 - ((1024-768)*100/1024))
              = 75
              = rawScoringFunction(75)
              = 7

cpu           = resourceScoringFunction((2+6), 8)
              = (100 - ((8-8)*100/8))
              = 100
              = rawScoringFunction(100)
              = 10

NodeScore     = ((5 * 5) + (7 * 1) + (10 * 3)) / (5 + 1 + 3)
              = 7
What's next
Read more about the scheduling framework
Read more about scheduler configuration
11 - Pod Priority and Preemption
Pods can have priority. Priority indicates the importance of a Pod relative to other Pods. If a
Pod cannot be scheduled, the scheduler tries to preempt (evict) lower priority Pods to make
scheduling of the pending Pod possible.
Warning:
In a cluster where not all users are trusted, a malicious user could create Pods at the
highest possible priorities, causing other Pods to be evicted/not get scheduled. An
administrator can use ResourceQuota to prevent users from creating pods at high
priorities.
To use priority and preemption:

1. Add one or more PriorityClasses.

2. Create Pods with priorityClassName set to one of the added PriorityClasses. Of course
you do not need to create the Pods directly; normally you would add
priorityClassName to the Pod template of a collection object like a Deployment, as sketched below.
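For instance, a Deployment whose Pods should use a high-priority class might look like the following sketch. The Deployment name, labels, and image are hypothetical; the priority class name must match a PriorityClass that already exists in the cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server                        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      priorityClassName: high-priority    # must reference an existing PriorityClass
      containers:
      - name: api
        image: registry.k8s.io/pause:3.9  # placeholder image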
PriorityClass
A PriorityClass is a non-namespaced object that defines a mapping from a priority class name
to the integer value of the priority. The name is specified in the name field of the PriorityClass
object's metadata. The value is specified in the required value field. The higher the value, the
higher the priority. The name of a PriorityClass object must be a valid DNS subdomain name,
and it cannot be prefixed with system- .
A PriorityClass object can have any 32-bit integer value smaller than or equal to 1 billion. This
means that the range of values for a PriorityClass object is from -2147483648 to 1000000000
inclusive. Larger numbers are reserved for built-in PriorityClasses that represent critical
system Pods. A cluster admin should create one PriorityClass object for each such mapping
that they want.
PriorityClass also has two optional fields: globalDefault and description . The
globalDefault field indicates that the value of this PriorityClass should be used for Pods
without a priorityClassName . Only one PriorityClass with globalDefault set to true can
exist in the system. If there is no PriorityClass with globalDefault set, the priority of Pods
with no priorityClassName is zero.
The description field is an arbitrary string. It is meant to tell users of the cluster when they
should use this PriorityClass.
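Putting the two optional fields together, a cluster-wide default PriorityClass might look like this sketch; the name and value are hypothetical:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority                  # hypothetical name
value: 1000
globalDefault: true                       # at most one PriorityClass may set this to true
description: "Default priority for Pods that do not set priorityClassName."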
Addition of a PriorityClass with globalDefault set to true does not change the
priorities of existing Pods. The value of such a PriorityClass is used only for Pods created
after the PriorityClass is added.
If you delete a PriorityClass, existing Pods that use the name of the deleted PriorityClass
remain unchanged, but you cannot create more Pods that use the name of the deleted
PriorityClass.
Example PriorityClass
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
Non-preempting PriorityClass
FEATURE STATE: Kubernetes v1.24 [stable]
Pods with preemptionPolicy: Never will be placed in the scheduling queue ahead of lower-
priority pods, but they cannot preempt other pods. A non-preempting pod waiting to be
scheduled will stay in the scheduling queue, until sufficient resources are free, and it can be
scheduled. Non-preempting pods, like other pods, are subject to scheduler back-off. This
means that if the scheduler tries these pods and they cannot be scheduled, they will be
retried with lower frequency, allowing other pods with lower priority to be scheduled before
them.
An example use case is for data science workloads. A user may submit a job that they want to
be prioritized above other workloads, but do not wish to discard existing work by preempting
running pods. The high priority job with preemptionPolicy: Never will be scheduled ahead of
other queued pods, as soon as sufficient cluster resources "naturally" become free.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not cause other pods to be preempted."
Pod priority
After you have one or more PriorityClasses, you can create Pods that specify one of those
PriorityClass names in their specifications. The priority admission controller uses the
priorityClassName field and populates the integer value of the priority. If the priority class is
not found, the Pod is rejected.
The following YAML is an example of a Pod configuration that uses the PriorityClass created in
the preceding example. The priority admission controller checks the specification and
resolves the priority of the Pod to 1000000.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
Preemption
When Pods are created, they go to a queue and wait to be scheduled. The scheduler picks a
Pod from the queue and tries to schedule it on a Node. If no Node is found that satisfies all
the specified requirements of the Pod, preemption logic is triggered for the pending Pod. Let's
call the pending Pod P. Preemption logic tries to find a Node where removal of one or more
Pods with lower priority than P would enable P to be scheduled on that Node. If such a Node
is found, one or more lower priority Pods get evicted from the Node. After the Pods are gone,
P can be scheduled on the Node.
Please note that Pod P is not necessarily scheduled to the "nominated Node". The scheduler
always tries the "nominated Node" before iterating over any other nodes. After victim Pods
are preempted, they get their graceful termination period. If another node becomes available
while the scheduler is waiting for the victim Pods to terminate, the scheduler may use the other node
to schedule Pod P. As a result, the nominatedNodeName and nodeName fields of the Pod spec are not always
the same. Also, if the scheduler preempts Pods on Node N, but then a Pod with higher priority than
Pod P arrives, the scheduler may give Node N to the new higher-priority Pod. In such a case, the
scheduler clears the nominatedNodeName of Pod P. By doing this, the scheduler makes Pod P eligible
to preempt Pods on another Node.
Limitations of preemption

Graceful termination of preemption victims

When Pods are preempted, the victims get their graceful termination period to finish their work
and exit. As victims exit or get terminated, the scheduler can schedule Pods in the pending queue.
Therefore, there is usually a time gap between the point that the scheduler preempts victims and
the time that Pod P is scheduled. In order to minimize this gap, you can set the graceful
termination period of lower priority Pods to zero or a small number.
Note: Preemption does not necessarily remove all lower-priority Pods. If the pending Pod
can be scheduled by removing fewer than all of the lower-priority Pods, then only a portion of
the lower-priority Pods are removed. Even so, the answer to the question "if all lower-priority
Pods were removed from the Node, could the pending Pod be scheduled there?" must be yes. If the
answer is no, the Node is not considered for preemption.
If a pending Pod has inter-pod affinity to one or more of the lower-priority Pods on the Node,
the inter-Pod affinity rule cannot be satisfied in the absence of those lower-priority Pods. In
this case, the scheduler does not preempt any Pods on the Node. Instead, it looks for another
Node. The scheduler might find a suitable Node or it might not. There is no guarantee that the
pending Pod can be scheduled.
Our recommended solution for this problem is to create inter-Pod affinity only towards equal
or higher priority Pods.
Cross node preemption

Suppose the pending Pod P is being considered for Node N, and P could become feasible on N only
if a Pod on a different Node were preempted. For example, Pod P might have zone-wide anti-affinity
with a Pod Q that runs on another Node in the same zone as Node N, and there are no other cases of
anti-affinity between Pod P and other Pods in the zone. In order to schedule Pod P on Node N,
Pod Q could be preempted, but the scheduler does not perform cross-node preemption. So, Pod P will
be deemed unschedulable on Node N.

If Pod Q were removed from its Node, the Pod anti-affinity violation would be gone, and Pod P
could possibly be scheduled on Node N.
We may consider adding cross Node preemption in future versions if there is enough demand
and if we find an algorithm with reasonable performance.
Troubleshooting
Pod priority and pre-emption can have unwanted side effects. Here are some examples of
potential problems and ways to deal with them.
If you accidentally give certain Pods a high priority, these unintentionally high-priority Pods
may cause preemption in your cluster. To address the problem, you can change the priorityClassName
for those Pods to use lower priority classes, or leave that field empty. An empty
priorityClassName is resolved to zero by default.
When a Pod is preempted, there will be events recorded for the preempted Pod. Preemption
should happen only when a cluster does not have enough resources for a Pod. In such cases,
preemption happens only when the priority of the pending Pod (preemptor) is higher than the
victim Pods. Preemption must not happen when there is no pending Pod, or when the
pending Pods have equal or lower priority than the victims. If preemption happens in such
scenarios, please file an issue.
While the preemptor Pod is waiting for the victims to go away, a higher priority Pod may be
created that fits on the same Node. In this case, the scheduler will schedule the higher priority
Pod instead of the preemptor.
This is expected behavior: the Pod with the higher priority should take the place of a Pod with
a lower priority.
The scheduler tries to find nodes that can run a pending Pod. If no node is found, the
scheduler tries to remove Pods with lower priority from an arbitrary node in order to make
room for the pending pod. If a node with low priority Pods is not feasible to run the pending
Pod, the scheduler may choose another node with higher priority Pods (compared to the Pods
on the other node) for preemption. The victims must still have lower priority than the
preemptor Pod.
When there are multiple nodes available for preemption, the scheduler tries to choose the
node with a set of Pods with lowest priority. However, if such Pods have PodDisruptionBudget
that would be violated if they are preempted then the scheduler may choose another node
with higher priority Pods.
When multiple nodes exist for preemption and none of the above scenarios apply, the
scheduler chooses a node with the lowest priority.
The kubelet uses Priority to determine pod order for node-pressure eviction. You can use the
QoS class to estimate the order in which pods are most likely to get evicted. The kubelet ranks
pods for eviction based on the following factors:
kubelet node-pressure eviction does not evict Pods when their usage does not exceed their
requests. If a Pod with lower priority is not exceeding its requests, it won't be evicted. Another
Pod with higher priority that exceeds its requests may be evicted.
What's next
Read about using ResourceQuotas in connection with PriorityClasses: limit Priority Class
consumption by default
Learn about Pod Disruption
Learn about API-initiated Eviction
Learn about Node-pressure Eviction
12 - Node-pressure Eviction
Node-pressure eviction is the process by which the kubelet proactively terminates pods to
reclaim resources on nodes.
The kubelet monitors resources like memory, disk space, and filesystem inodes on your
cluster's nodes. When one or more of these resources reach specific consumption levels, the
kubelet can proactively fail one or more pods on the node to reclaim resources and prevent
starvation.
During a node-pressure eviction, the kubelet sets the PodPhase for the selected pods to
Failed . This terminates the pods.
The kubelet does not respect your configured PodDisruptionBudget or the pod's
terminationGracePeriodSeconds . If you use soft eviction thresholds, the kubelet respects
your configured eviction-max-pod-grace-period . If you use hard eviction thresholds, it uses
a 0s grace period for termination.
If the pods are managed by a workload resource (such as StatefulSet or Deployment) that
replaces failed pods, the control plane or kube-controller-manager creates new pods in
place of the evicted pods.
Note: The kubelet attempts to reclaim node-level resources before it terminates end-user
pods. For example, it removes unused container images when disk resources are starved.
The kubelet uses various parameters to make eviction decisions, like the following:
Eviction signals
Eviction thresholds
Monitoring intervals
Eviction signals
Eviction signals are the current state of a particular resource at a specific point in time.
Kubelet uses eviction signals to make eviction decisions by comparing the signals to eviction
thresholds, which are the minimum amount of the resource that should be available on the
node.
The kubelet uses the following eviction signals:

Eviction Signal      Description
memory.available     memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available     nodefs.available := node.stats.fs.available
nodefs.inodesFree    nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available    imagefs.available := node.stats.runtime.imagefs.available
imagefs.inodesFree   imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
pid.available        pid.available := node.stats.rlimit.maxpid - node.stats.rlimit.curproc
In this table, the Description column shows how kubelet gets the value of the signal. Each
signal supports either a percentage or a literal value. Kubelet calculates the percentage value
relative to the total capacity associated with the signal.
The value for memory.available is derived from the cgroupfs instead of tools like free -m .
This is important because free -m does not work in a container, and if users use the node
allocatable feature, out of resource decisions are made local to the end user Pod part of the
cgroup hierarchy as well as the root node. This script reproduces the same set of steps that
the kubelet performs to calculate memory.available . The kubelet excludes inactive_file (i.e. #
of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that
memory is reclaimable under pressure.
The kubelet uses the following filesystem partitions for these measurements:
1. nodefs: The node's main filesystem, used for local disk volumes, emptyDir, log storage,
and more. For example, nodefs contains /var/lib/kubelet/ .
2. imagefs : An optional filesystem that container runtimes use to store container images
and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not
support other configurations.
Eviction thresholds
You can specify custom eviction thresholds for the kubelet to use when it makes eviction
decisions.
Eviction thresholds have the form [eviction-signal][operator][quantity], where quantity is the
eviction threshold amount, such as 1Gi . The value of quantity must match the quantity
representation used by Kubernetes. You can use either literal values or percentages ( % ).

For example, if a node has 10Gi of total memory and you want to trigger eviction if the
available memory falls below 1Gi , you can define the eviction threshold as either
memory.available<10% or memory.available<1Gi (you cannot use both).
You can specify both a soft eviction threshold grace period and a maximum allowed pod
termination grace period for kubelet to use during evictions. If you specify a maximum
allowed grace period and the soft eviction threshold is met, the kubelet uses the lesser of the
two grace periods. If you do not specify a maximum allowed grace period, the kubelet kills
evicted pods immediately without graceful termination.
You can configure soft eviction thresholds, their required grace periods, and the maximum
allowed pod termination grace period using kubelet flags or the kubelet configuration file;
a configuration-file sketch follows.
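A minimal sketch of soft eviction settings in a kubelet configuration file, assuming the kubelet.config.k8s.io/v1beta1 field names as the author understands them; the threshold and grace-period values are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "1.5Gi"        # soft threshold
evictionSoftGracePeriod:
  memory.available: "1m30s"        # how long the threshold must hold before eviction
evictionMaxPodGracePeriod: 60      # cap on pod termination grace period during soft eviction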
You can use the eviction-hard flag to configure a set of hard eviction thresholds, such as
memory.available<1Gi . The kubelet has the following default hard eviction thresholds:
memory.available<100Mi
nodefs.available<10%
imagefs.available<15%
These default hard eviction thresholds apply only if none of them is changed. If you change the
value of any threshold, the default values of the other thresholds are not inherited; they are
set to zero. To provide custom values, you must provide all of the thresholds.
Node conditions
The kubelet reports node conditions to reflect that the node is under pressure because a hard
or soft eviction threshold has been met, independent of configured grace periods.
Node Condition    Eviction Signal                                                                 Description
MemoryPressure    memory.available                                                                Available memory on the node has satisfied an eviction threshold
DiskPressure      nodefs.available, nodefs.inodesFree, imagefs.available, or imagefs.inodesFree   Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold
PIDPressure       pid.available                                                                   Available process identifiers on the (Linux) node have fallen below an eviction threshold
The kubelet updates the node conditions based on the configured --node-status-update-
frequency , which defaults to 10s .
When a DiskPressure node condition is reported, the kubelet reclaims node-level resources
based on the filesystems on the node.
With imagefs
If the node has a dedicated imagefs filesystem for container runtimes to use, the kubelet
does the following:
If the nodefs filesystem meets the eviction thresholds, the kubelet garbage collects
dead pods and containers.
If the imagefs filesystem meets the eviction thresholds, the kubelet deletes all unused
images.
Without imagefs

If the node only has a nodefs filesystem that meets eviction thresholds, the kubelet frees up
disk space in the following order:

1. Garbage collect dead pods and containers.
2. Delete unused images.

Pod selection for kubelet eviction

The kubelet uses the following parameters to determine the pod eviction order:

1. Whether the pod's resource usage exceeds requests
2. Pod Priority
3. The pod's resource usage relative to requests

As a result, the kubelet ranks and evicts pods in the following order:
1. BestEffort or Burstable pods where the usage exceeds requests. These pods are
evicted based on their Priority and then by how much their usage level exceeds the
request.
2. Guaranteed pods and Burstable pods where the usage is less than requests are
evicted last, based on their Priority.
Note: The kubelet does not use the pod's QoS class to determine the eviction order. You
can use the QoS class to estimate the most likely pod eviction order when reclaiming
resources like memory. QoS does not apply to EphemeralStorage requests, so the above
scenario will not apply if the node is, for example, under DiskPressure.
Guaranteed pods are guaranteed only when requests and limits are specified for all the
containers and they are equal. These pods will never be evicted because of another pod's
resource consumption. If a system daemon (such as kubelet and journald ) is consuming
more resources than were reserved via system-reserved or kube-reserved allocations, and
the node only has Guaranteed or Burstable pods using less resources than requests left on
it, then the kubelet must choose to evict one of these pods to preserve node stability and to
limit the impact of resource starvation on other pods. In this case, it will choose to evict pods
of lowest Priority first.
When the kubelet evicts pods in response to inode or PID starvation, it uses the Priority to
determine the eviction order, because inodes and PIDs have no requests.
The kubelet sorts pods differently based on whether the node has a dedicated imagefs
filesystem:
With imagefs
If nodefs is triggering evictions, the kubelet sorts pods based on nodefs usage ( local
volumes + logs of all containers ).
If imagefs is triggering evictions, the kubelet sorts pods based on the writable layer usage of
all containers.
Without imagefs
If nodefs is triggering evictions, the kubelet sorts pods based on their total disk usage ( local
volumes + logs & writable layer of all containers )
You can use the --eviction-minimum-reclaim flag or a kubelet config file to configure a
minimum reclaim amount for each resource. When the kubelet notices that a resource is
starved, it continues to reclaim that resource until it reclaims the quantity you specify.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"
  nodefs.available: "1Gi"
  imagefs.available: "100Gi"
evictionMinimumReclaim:
  memory.available: "0Mi"
  nodefs.available: "500Mi"
  imagefs.available: "2Gi"
In this example, if the nodefs.available signal meets the eviction threshold, the kubelet
reclaims the resource until the signal reaches the threshold of 1Gi , and then continues to
reclaim the minimum amount of 500Mi , until the signal reaches 1.5Gi .
Similarly, the kubelet reclaims the imagefs resource until the imagefs.available signal
reaches 102Gi .
The kubelet sets an oom_score_adj value for each container based on the QoS class of the pod.

Quality of Service   oom_score_adj
Guaranteed           -997
Burstable            min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
BestEffort           1000
Note: The kubelet also sets an oom_score_adj value of -997 for containers in Pods that
have system-node-critical Priority.
If the kubelet can't reclaim memory before a node experiences OOM, the oom_killer
calculates an oom_score based on the percentage of memory it's using on the node, and then
adds the oom_score_adj to get an effective oom_score for each container. It then kills the
container with the highest score.
This means that containers in low QoS pods that consume a large amount of memory relative
to their scheduling requests are killed first.
Unlike pod eviction, if a container is OOM killed, the kubelet can restart it based on its
RestartPolicy .
Best practices
The following sections describe best practices for eviction configuration.
Consider the following scenario: a node has 10Gi of memory, and the operator wants to evict Pods
at 95% memory utilization to reduce the incidence of system OOMs. The kubelet can be launched
with the following settings:
--eviction-hard=memory.available<500Mi
--system-reserved=memory=1.5Gi
In this configuration, the --system-reserved flag reserves 1.5Gi of memory for the system,
which is 10% of the total memory + the eviction threshold amount .
The node can reach the eviction threshold if a pod is using more than its request, or if the
system is using more than 1Gi of memory, which makes the memory.available signal fall
below 500Mi and triggers the threshold.
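Expressed as a kubelet configuration file instead of flags, the same reservation might look like this sketch; the values simply mirror the flags above:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"      # hard eviction threshold
systemReserved:
  memory: "1.5Gi"                # memory reserved for system daemons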
DaemonSet
Pod Priority is a major factor in making eviction decisions. If you do not want the kubelet to
evict pods that belong to a DaemonSet , give those pods a high enough priorityClass in the
pod spec. You can also use a lower priorityClass or the default to only allow DaemonSet
pods to run when there are enough resources.
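A hedged sketch of that approach: define a PriorityClass and reference it from the DaemonSet's Pod template. All names and the image are hypothetical placeholders:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical        # hypothetical name
value: 100000
globalDefault: false
description: "High priority for node-level agents."
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent                # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      priorityClassName: daemonset-critical
      containers:
      - name: agent
        image: registry.k8s.io/pause:3.9   # placeholder image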
Known issues
The following sections describe known issues related to out of resource handling.
kubelet may not observe memory pressure right away

By default, the kubelet polls for memory usage at a regular interval; if memory consumption
rises sharply within that window, the kubelet may not notice the pressure in time and the OOM
killer may still be invoked. You can use the --kernel-memcg-notification flag to enable the
memcg notification API on the kubelet to get notified immediately when a threshold is crossed.
If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a
viable workaround for this issue is to use the --kube-reserved and --system-reserved flags
to allocate memory for the system.
active_file memory is not considered as available memory

The kubelet treats active_file memory (file-backed memory on the active LRU list) as not
reclaimable, so workloads that make intensive use of block-backed local storage can cause the
kubelet to observe high memory use and trigger memory-pressure eviction. You can work around
that behavior by setting the memory limit and memory request to the same value for containers
likely to perform intensive I/O activity. You will need to estimate or measure an optimal
memory limit value for that container.
What's next
Learn about API-initiated Eviction
Learn about Pod Priority and Preemption
Learn about PodDisruptionBudgets
Learn about Quality of Service (QoS)
Check out the Eviction API
13 - API-initiated Eviction
API-initiated eviction is the process by which you use the Eviction API to create an Eviction
object that triggers graceful pod termination.
You can request eviction by calling the Eviction API directly, or programmatically using a client
of the API server, like the kubectl drain command. This creates an Eviction object, which
causes the API server to terminate the Pod.
Using the API to create an Eviction object for a Pod is like performing a policy-controlled
DELETE operation on the Pod.
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {
    "name": "quux",
    "namespace": "default"
  }
}
Alternatively, you can attempt an eviction operation by accessing the API directly, for example
by using curl or wget to POST an Eviction object like the one above to the Pod's eviction
subresource. The API server responds in one of the following ways:

200 OK : the eviction is allowed, the Eviction subresource is created, and the Pod is
deleted, similar to sending a DELETE request to the Pod URL.

429 Too Many Requests : the eviction is not currently allowed because of the configured
PodDisruptionBudget. You may be able to attempt the eviction again later. You might
also see this response because of API rate limiting.

500 Internal Server Error : the eviction is not allowed because there is a
misconfiguration, like if multiple PodDisruptionBudgets reference the same Pod.
If the Pod you want to evict isn't part of a workload that has a PodDisruptionBudget, the API
server always returns 200 OK and allows the eviction.
If the API server allows the eviction, the Pod is deleted as follows:
1. The Pod resource in the API server is updated with a deletion timestamp, after which
the API server considers the Pod resource to be terminated. The Pod resource is also
marked with the configured grace period.
2. The kubelet on the node where the local Pod is running notices that the Pod resource is
marked for termination and starts to gracefully shut down the local Pod.
3. While the kubelet is shutting the Pod down, the control plane removes the Pod from
Endpoint and EndpointSlice objects. As a result, controllers no longer consider the Pod
as a valid object.
4. After the grace period for the Pod expires, the kubelet forcefully terminates the local
Pod.
5. The kubelet tells the API server to remove the Pod resource.
6. The API server deletes the Pod resource.
In some cases, applications may enter a broken state where the Eviction API only returns 429 or
500 responses, for example if the replacement Pods created by a workload controller never become
Ready. If you notice evictions that appear stuck, you can try one of the following:

Abort or pause the automated operation causing the issue. Investigate the stuck
application before you restart the operation.

Wait a while, then directly delete the Pod from your cluster control plane instead of
using the Eviction API.
What's next
Learn how to protect your applications with a Pod Disruption Budget.
Learn about Node-pressure Eviction.
Learn about Pod Priority and Preemption.