Actions

Kubernetes/Scheduling

From Ever changing code

< Kubernetes

Default scheduler rules

  1. Identify if a node has adequate hardware resources
  2. Check if a node is running out of resources. check for memory or disk pressure conditions
  3. Check if a pod schedule is scheduled to a node by a name
  4. Check if a node has a label matching node selector in a pod spec
  5. Check if a pod is requesting to bound to a specific host port and if so, does the node have that port available
  6. Check if a pod is requesting a certain type of volume be mounted and if other pods are using the same volume
  7. Check if a pod tolerates taints of the node, eg. master nodes is tainted with "noSchedule"
  8. Check if a pod or a node affinity rules and checking if scheduling the pod would break these rules
  9. If there is more than one node could schedule a pod, the scheduler priorities the nodes and choose the best one. If they have the same priority it chooses in round-robin fashion.

Label nodes

kubectl label node worker1.acme.com share-type=dedicated


YAML for the deployment to include the node affinity rules:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pref
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: pref
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution: #all pods,but not current pod on the node
          - weight: 80
            preference:
              matchExpressions:
              - key: availability-zone
                operator: In
                values:
                - zone1
          - weight: 20              #4 time less priority then AZ
            preference:
              matchExpressions:
              - key: share-type     #label key
                operator: In
                values:
                - dedicated         #label value
      containers:
      - args:
        - sleep
        - "999"
        image: busybox:v1.28.4
        name: main

Capacity and resources

Check node's capacity

kubectl describe nodes worker-2.acme.com | grep -A 20 Capacity:
Capacity:
 cpu:                2
 ephemeral-storage:  20263528Ki
 hugepages-2Mi:      0
 memory:             4044936Ki
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  18674867374
 hugepages-2Mi:      0
 memory:             3942536Ki
 pods:               110
System Info:
 Machine ID:                 ******c49b4bed31684a******
 System UUID:                ******-D110-CB50-EAA3-*******
 Boot ID:                    ****8-be21-45ca-b86c-311a479******
 Kernel Version:             4.4.0-1087-aws
 OS Image:                   Ubuntu 16.04.6 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
...


YAML to schedule request to a specific node and request specific resources

apiVersion: v1
kind: Pod
metadata:
  name: resource-pod1
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-2.acme.com"
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: budybox-dd
    resources:
      requests:
        cpu: 800m     #mili cores -> 2000m (large for 2nd deployment)
        memory: 20Mi  #Mb


Deploy above

kubectl describe nodes worker-2.acme.com 
Non-terminated Pods:         (6 in total)
  Namespace                  Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                   ------------  ----------  ---------------  -------------  ---
  default                    busybox                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  default                    nginx-loadbalancer-86bb844fb7-bl5fs    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  default                    resource-pod1                          800m (40%)    0 (0%)      20Mi (0%)        0 (0%)         6m7s
  kube-system                kube-flannel-ds-amd64-97hvr            100m (5%)     100m (5%)   50Mi (1%)        50Mi (1%)      14d
  kube-system                kube-proxy-fxl6f                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  rbac1                      test-f57db4bfd-ghshj                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                900m (45%)  100m (5%)
  memory             70Mi (1%)   50Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)


Then deploying another pod, requesting 2000mi cpus, will end up with scheduling error when describing the pod

kubectl describe pod pod2-2000mi-cpu
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  44s (x10 over 4m57s)  default-scheduler  0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient cpu.


Limits YAML, Unlike requests, limits can go above total utilisation of CPU and memory. K8s will detect if overcommitted and kill the pod. Be aware Containers within pods are not aware of limits sets to pods, this can be seen from top command within a container.

apiVersion: v1
kind: Pod
metadata:
  name: pod-limit-resources
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      limits:
        cpu: 2        #by default requests are eq limits if not specified
        memory: 40Mi  #

Deamonset

A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created. DaemonSets do not use a scheduler to deploy pods, DS used to ignore nodes taints. Some typical uses of a DaemonSet are:

  • running a cluster storage daemon, such as glusterd, ceph, on each node.
  • running a logs collection daemon on every node, such as fluentd or logstash.
  • running a node monitoring daemon on every node, such as Prometheus Node Exporter, collectd

</source>

Monitor events and logs

kubectl get events --all-namespaces
kubectl get events --watch #short -w

#See scheduler logs
kubectl logs [kube_scheduler_pod_name] -n kube-system
tail -f /var/log/kube-scheduler.log #run on control plane node

Resources