Default scheduler rules

Check whether a node has adequate hardware resources
Check whether a node is running out of resources, i.e. reporting memory or disk pressure conditions
Check whether a pod is assigned to a specific node by name
Check whether a node has a label matching the node selector in the pod spec
Check whether a pod requests to bind to a specific host port and, if so, whether the node has that port available
Check whether a pod requests a certain type of volume to be mounted and whether other pods are using the same volume
Check whether a pod tolerates the taints of the node, e.g. master nodes are tainted with "NoSchedule"
Check whether a pod specifies pod or node affinity rules and whether scheduling the pod would break these rules
If more than one node could run the pod, the scheduler prioritises the nodes and chooses the best one. If several nodes have the same priority, it chooses among them in round-robin fashion.
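
Several of the checks above correspond directly to fields in the pod spec. The manifest below is a minimal sketch, not taken from the original notes: the pod name, port number and request sizes are hypothetical, and it assumes a node labelled `share-type=dedicated` and tainted `type=workers:NoSchedule`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo          # hypothetical name
spec:
  nodeSelector:                  # the node must carry this label
    share-type: dedicated
  tolerations:                   # allows nodes tainted type=workers:NoSchedule
  - key: type
    operator: Equal
    value: workers
    effect: NoSchedule
  containers:
  - name: main
    image: busybox
    args: ["sleep", "999"]
    ports:
    - containerPort: 8080
      hostPort: 8080             # the node must have host port 8080 free
    resources:
      requests:                  # the node must have this much unreserved capacity
        cpu: 100m
        memory: 32Mi
```

Setting `spec.nodeName` instead of `nodeSelector` assigns the pod to the named node directly.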

Label nodes

kubectl label node worker1.acme.com share-type=dedicated

YAML for the deployment to include the node affinity rules:

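apiVersion: apps/v1       # extensions/v1beta1 no longer serves Deployments since Kubernetes 1.16
kind: Deployment
metadata:
  name: pref
spec:
  replicas: 5
  selector:               # required by apps/v1, must match the template labels
    matchLabels:
      app: pref
  template: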
    metadata:
      labels:
        app: pref
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution: # soft rule, applied at scheduling time only; pods already running on the node are not affected
          - weight: 80
            preference:
              matchExpressions:
              - key: availability-zone
                operator: In
                values:
                - zone1
          - weight: 20              # 4 times lower weight than the availability-zone preference
            preference:
              matchExpressions:
              - key: share-type     #label key
                operator: In
                values:
                - dedicated         #label value
      containers:
      - args:
        - sleep
        - "999"
        image: busybox:1.28.4
        name: main
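
The rules above are soft preferences, so pods can still land on nodes without the labels. If the label must match, a hard rule can be used instead. The fragment below is a sketch of that variant, assuming the same `share-type=dedicated` label; it would replace the `affinity` block in the pod template.

```yaml
# Fragment of a pod template spec (hard node affinity variant).
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard rule: no match, no scheduling
      nodeSelectorTerms:
      - matchExpressions:
        - key: share-type
          operator: In
          values:
          - dedicated
```

With a hard rule, pods stay Pending when no node carries the label.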

Cluster nodes capacity and resources

Check node's capacity

kubectl describe nodes worker-2.acme.com | grep -A 20 Capacity:

...(output omitted)...

Capacity:
 cpu:                2
 ephemeral-storage:  20263528Ki
 hugepages-2Mi:      0
 memory:             4044936Ki
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  18674867374
 hugepages-2Mi:      0
 memory:             3942536Ki
 pods:               110
System Info:
 Machine ID:                 ******c49b4bed31684a******
 System UUID:                ******-D110-CB50-EAA3-*******
 Boot ID:                    ****8-be21-45ca-b86c-311a479******
 Kernel Version:             4.4.0-1087-aws
 OS Image:                   Ubuntu 16.04.6 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.13.10
 Kube-Proxy Version:         v1.13.10
PodCIDR:                     10.100.1.0/24
Non-terminated Pods:         (3 in total)
 Namespace                  Name                           CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
 ---------                  ----                           ------------  ----------  ---------------  -------------  ---
 default                    busybox                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
 kube-system                kube-flannel-ds-amd64-p7c7m    100m (5%)     100m (5%)   50Mi (2%)        50Mi (2%)      5d9h
 kube-system                kube-proxy-27dbb               0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d9h

Allocated resources:
 (Total limits may be over 100 percent, i.e., overcommitted.)
 Resource           Requests   Limits
 --------           --------   ------
 cpu                100m (5%)  100m (5%)
 memory             50Mi (2%)  50Mi (2%)
 ephemeral-storage  0 (0%)     0 (0%)

...(output omitted)...


Resource quotas | Resource request and limit

resource request: what a pod is guaranteed to get; it is the amount of resources needed to run a container. A pod is only scheduled on a node that can provide that amount.
resource limit: ensures that a container never goes above the specified value; a container may use resources up to the limit and is then restricted.

Exceeding a memory limit makes the container process a candidate for OOM killing.
A process basically can't exceed its CPU quota and will never get evicted for trying to use more CPU time than allocated. The system enforces the quota in the kernel CPU scheduler (not the Kubernetes scheduler), so the process just gets throttled at the limit.

Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to:

1 AWS vCPU, 1 GCP Core, 1 Azure vCore, 1 IBM vCPU
1 Hyperthread on a bare-metal Intel processor with Hyperthreading
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

Resource types (compressible and incompressible resources)

CPU (compressible resource) - once the limit has been reached, Kubernetes starts throttling the CPU of the container process(es). The pod won't get terminated or evicted.
Memory (incompressible resource) - once the limit has been reached, the container gets terminated (OOM-killed).

Units (k8s docs)

memory 64Mi - measured in bytes; it means 64 mebibytes
cpu 250m - measured in cores; it means 250 millicpu, or 0.25 of a CPU core

Note: 1 MiB = 2²⁰ bytes = 1048576 bytes = 1024 kibibytes
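
The same quantity can be written in more than one notation. The fragment below is a small illustration using the values from the list above; it is a sketch of a container `resources` section, not a complete manifest.

```yaml
# Fragment of a container spec; the comments show equivalent notations.
resources:
  requests:
    cpu: 250m        # same as cpu: "0.25"
    memory: 64Mi     # same as memory: "67108864" (64 * 1048576 bytes)
```

Note that `64M` (without the `i`) is a decimal unit and means 64000000 bytes, which is slightly less than `64Mi`.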

Requests and limits

spec:
  containers:
  - name: hello-world-container
    image: paulbouwer/hello-kubernetes:1.5
    resources:
      limits:
        cpu: "0.9"    # Throttling if tries to use more
        memory: 512Mi # OOM kill if tries to use more
      requests:
        cpu: "500m"   # Info for scheduling and Docker. Chances for
        memory: 256Mi # eviction increase if we use more than requested.

Schedule a pod with a resource request on a specific node

apiVersion: v1
kind: Pod
metadata:
  name: resource-pod1
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-2.acme.com"
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: busybox-dd
    resources:
      requests:
        cpu: 800m     # millicores; 800m is 40% of this 2-CPU node, so a second pod requesting 2000m will not fit
        memory: 20Mi  # mebibytes

Create a pod and watch resource request balance changing

kubectl apply -f resource-pod1.yml
watch -d 'kubectl describe nodes worker-2.acme.com  | grep -A 25 Non-terminated'
Non-terminated Pods: (6 in total)
  Namespace          Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------          ----                                ------------  ----------  ---------------  -------------  ---
  default            busybox                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  default            nginx-loadbalancer-86bb844fb7-bl5fs 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  default         -->resource-pod1<--                    800m (40%)    0 (0%)      20Mi (0%)        0 (0%)         6m7s
  kube-system        kube-flannel-ds-amd64-97hvr         100m (5%)     100m (5%)   50Mi (1%)        50Mi (1%)      14d
  kube-system        kube-proxy-fxl6f                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  rbac1              test-f57db4bfd-ghshj                0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                900m (45%)  100m (5%) # <- resources balance
  memory             70Mi (1%)   50Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)

Deploying another pod that requests 2000m of CPU then fails with a scheduling error, which shows up when describing the pod

kubectl describe pod pod2-2000mi-cpu
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  44s (x10 over 4m57s)  default-scheduler  0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient cpu.
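
The manifest of the second pod is not included in these notes. A pod that would reproduce the error might look like the sketch below; the node selector, image and memory request are assumptions modelled on resource-pod1, and only the pod name and the 2000m CPU request come from the text above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod2-2000mi-cpu
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-2.acme.com"   # assumed, as in resource-pod1
  containers:
  - image: busybox                                # assumed image
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      requests:
        cpu: 2000m     # node allocatable is 2 CPUs (2000m) and 900m is already requested
        memory: 20Mi   # assumed value
```

The request does not fit because 900m + 2000m exceeds the 2000m allocatable on the node.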

Limits YAML. Unlike requests, the sum of limits can go above the total CPU and memory of the node (overcommitment). If overcommitted memory is actually exhausted, K8s will detect it and kill a pod. Be aware that containers within pods are not aware of the limits set on them, which can be seen by running the top command inside a container.

apiVersion: v1
kind: Pod
metadata:
  name: pod-limit-resources
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      limits:
        cpu: 2        # by default requests are equal to limits if not specified
        memory: 40Mi

QoS

BestEffort - If no requests and no limits are set, neither for memory nor for CPU, in any of the containers.
Guaranteed - If limits == requests are set for both memory and CPU on all containers. (Requests default to limits, so it's enough to set limits.)
Burstable - In all other cases. E.g.: limits are different than requests for some container, or limits are simply unset for another.
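
The three classes can be illustrated with one single-container pod each. This is a sketch with hypothetical pod names and arbitrary resource values; only the presence or absence of requests and limits matters.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed           # hypothetical name
spec:
  containers:
  - name: main
    image: busybox
    args: ["sleep", "999"]
    resources:
      limits:                    # requests default to these values -> Guaranteed
        cpu: 500m
        memory: 128Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable            # hypothetical name
spec:
  containers:
  - name: main
    image: busybox
    args: ["sleep", "999"]
    resources:
      requests:                  # a request without matching limits -> Burstable
        memory: 64Mi
---
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort           # hypothetical name
spec:
  containers:
  - name: main
    image: busybox
    args: ["sleep", "999"]       # no resources section at all -> BestEffort
```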

The QoS class of a pod can be checked with

kubectl describe pod xxx | grep QoS

OOM scores are adjusted by Kubernetes, so that the system evicts unruly pods in the following order:

BestEffort
Burstable
Guaranteed
Kubelet, Docker

Kubernetes pod to node scheduling

Taints and tolerations

Visual Guide to Node Selectors, Taints and Tolerations

Taints and tolerations restrict which pods can be scheduled on which nodes. By default pods have no tolerations set up.

Taint effects (what happens to intolerant pods, i.e. pods with no matching toleration in .spec.tolerations)

NoSchedule - the pod simply won't be scheduled on the node
PreferNoSchedule - the system will try to avoid placing the pod on the node, but it's not guaranteed
NoExecute - the pod won't be scheduled, and pods already running on the node will be evicted if they don't tolerate the taint; this applies to pods that were scheduled before the taint was applied to the node

# Add taint on a node
kubectl taint nodes <node-name> <key>=<value>:<taint-effect>
kubectl taint nodes node1        type=workers:NoSchedule
                              
# Add toleration to a pod
apiVersion: v1
kind: Pod
metadata:
  name: worker-pod
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: type
    operator: Equal
    value: workers
    effect: NoSchedule
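
Tolerations can also match on the key alone, and a NoExecute toleration can be time-limited. The fragment below is a sketch reusing the `type=workers` taint from the example above; the 300 second value is an arbitrary assumption.

```yaml
# Fragment of a pod spec.
tolerations:
- key: type
  operator: Exists          # matches the "type" taint with any value
  effect: NoSchedule
- key: type
  operator: Equal
  value: workers
  effect: NoExecute
  tolerationSeconds: 300    # pod is evicted 300s after the taint is added
```

Without `tolerationSeconds`, a pod that tolerates a NoExecute taint stays on the node indefinitely.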

`nodeAffinity` and `podAntiAffinity`

This scenario prefers scheduling pods on nodes labelled "node.kubernetes.io/lifecycle=normal" (nodeAffinity) but still places some pods on other nodes (podAntiAffinity), even though those nodes do not match the nodeAffinity. The aim is to avoid running all pods on the same node (e.g. when only one node matches the nodeAffinity) while other nodes are available.

kubectl apply -f <(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-normal-10-podantiaffinity
  labels:
    app: nginx
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 75% # it was 25% 
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        version: "1.6"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values:
                - normal
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app # security
                  operator: In
                  values:
                  - nginx
              topologyKey: node.kubernetes.io/lifecycle # topology.kubernetes.io/zone
EOF
) --dry-run=server
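
In the manifest above, the topology domain for the anti-affinity is the lifecycle label, so pods are spread across lifecycle groups rather than across individual nodes. To spread across individual nodes, the hostname label can be used as the topology key. The fragment below is a sketch of that variant and was not part of the tests described here.

```yaml
# Fragment of a pod template spec (variant of the podAntiAffinity block above).
podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 100
    podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - nginx
      topologyKey: kubernetes.io/hostname   # every node is its own topology domain
```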

Testing and results

With the setup above, using nodeAffinity:preferredDuringSchedulingIgnoredDuringExecution on `spot` and podAntiAffinity:preferredDuringSchedulingIgnoredDuringExecution where the label `nginx` matches, with 5 replicas. A: with 2x `normal` and 1x `spot` nodes in the setup, the scheduler created 4 pods on the `spot` node (nodeAffinity match) and 1 pod on a `normal` node (podAntiAffinity match), which is what I'd expect.

A similar scenario: 12 nodes and a deployment of 10 replicas, where 2 nodes have the preferred label and 10 nodes do not. A: 6 pods were scheduled on the preferred nodes (3 pods on each node) and 4 on the other nodes.

Summary

Resource Type	Use Cases	Pros	Cons	Best Practices
nodeSelector	Assigning pods to nodes with specific labels	Easy to use, small changes to the PodSpec	Does not support logical operators, hard to extend with complex scheduling rules	This resource should be used only in the early versions of K8s before the introduction of node affinity
Node affinity	Implementing data locality, running pods on nodes with dedicated software	Expressive syntax with logical operators, fine-grained control over pod placement rules, support for “hard” and “soft” pod placement rules	Requires modification of existing pods to change behavior	Use a combination of “hard” and “soft” rules to cover different use cases and scenarios
Inter-pod affinity	Colocation of pods in a co-dependent service, enabling data locality	The same as for node affinity	Requires modification of existing pods to change behavior	Proper pod label management and documentation of labels used
Pod anti-affinity	Enabling high availability (via pod distribution), preventing inter-service competition for resources	Fine-grained control over inter-pod repel behavior, support for hard and soft pod anti-affinity rules	Requires modification of existing pods to change behavior	Similar to node affinity
Taints and tolerations	Nodes with dedicated software, separation of team resources, etc.	Does not require modification of existing pods, supports automatic eviction of pods without the required toleration, supports different taint effects	Does not support expressive syntax using logical operators	Be careful when applying multiple taints to a node, ensure that the pods you need have the required tolerations

DaemonSet

A DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created. In older Kubernetes versions DaemonSets did not use the scheduler to deploy pods and ignored node taints; in newer versions (1.12 onwards by default) DaemonSet pods go through the default scheduler. Some typical uses of a DaemonSet are:

running a cluster storage daemon, such as glusterd, ceph, on each node.
running a logs collection daemon on every node, such as fluentd or logstash.
running a node monitoring daemon on every node, such as Prometheus Node Exporter, collectd
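
A minimal DaemonSet for the node monitoring use case could look like the sketch below. The names, namespace and image are assumptions chosen for illustration; the toleration lets the pods also run on nodes tainted with NoSchedule, such as control plane nodes.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter              # hypothetical name
  namespace: kube-system           # assumed namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:                 # tolerate every NoSchedule taint, whatever its key
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: prom/node-exporter  # assumed image; pin a tag in practice
        ports:
        - containerPort: 9100
```

A `nodeSelector` or node affinity in the pod template restricts the DaemonSet to a subset of nodes.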

Monitor events and logs

kubectl get events --all-namespaces
kubectl get events --watch #short -w

#See scheduler logs
kubectl logs [kube_scheduler_pod_name] -n kube-system
tail -f /var/log/kube-scheduler.log #run on control plane node

Kubernetes/Scheduling
