Kubernetes/Scheduling
Default scheduler rules
- Identify whether a node has adequate hardware resources
- Check whether a node is running out of resources (memory or disk pressure conditions)
- Check whether the pod is requested to run on a specific node by name (spec.nodeName)
- Check whether a node has a label matching the node selector in the pod spec
- Check whether the pod requests a specific host port and, if so, whether the node has that port available
- Check whether the pod requests a certain type of volume to be mounted and whether other pods are already using the same volume
- Check whether the pod tolerates the taints of the node, e.g. master nodes are tainted with NoSchedule
- Check the pod's node affinity and anti-affinity rules and whether scheduling the pod would break them
- If more than one node could run the pod, the scheduler prioritizes the nodes and chooses the best one. If several nodes have the same priority, it picks among them in a round-robin fashion.
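For reference, a minimal sketch of bypassing the scheduler entirely by naming the node in the pod spec; the pod name is illustrative, the node name is taken from the examples further down:
apiVersion: v1
kind: Pod
metadata:
  name: nodename-demo           # hypothetical name
spec:
  nodeName: worker-2.acme.com   # kube-scheduler is skipped; the kubelet on this node runs the pod directly
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]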
Label nodes
kubectl label node worker1.acme.com share-type=dedicated
Deployment YAML including the node affinity rules:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: pref
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: pref
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution: # applied at scheduling time only; pods already running are not affected
          - weight: 80
            preference:
              matchExpressions:
              - key: availability-zone
                operator: In
                values:
                - zone1
          - weight: 20          # 4 times less priority than the AZ rule
            preference:
              matchExpressions:
              - key: share-type # label key
                operator: In
                values:
                - dedicated     # label value
      containers:
      - args:
        - sleep
        - "999"
        image: busybox:v1.28.4
        name: main
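A quick check of the label and of where the replicas actually landed (assuming the Deployment above has been applied):
kubectl get nodes -l share-type=dedicated
kubectl get pods -l app=pref -o wide      # the NODE column shows placement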
Cluster nodes capacity and resources
Check node's capacity
kubectl describe nodes worker-2.acme.com | grep -A 20 Capacity:
# ...(output omitted)...
Capacity:
 cpu:                2
 ephemeral-storage:  20263528Ki
 hugepages-2Mi:      0
 memory:             4044936Ki
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  18674867374
 hugepages-2Mi:      0
 memory:             3942536Ki
 pods:               110
System Info:
 Machine ID:                 ******c49b4bed31684a******
 System UUID:                ******-D110-CB50-EAA3-*******
 Boot ID:                    ****8-be21-45ca-b86c-311a479******
 Kernel Version:             4.4.0-1087-aws
 OS Image:                   Ubuntu 16.04.6 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.13.10
 Kube-Proxy Version:         v1.13.10
PodCIDR:                     10.100.1.0/24
Non-terminated Pods:         (3 in total)
  Namespace    Name                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------    ----                         ------------  ----------  ---------------  -------------  ---
  default      busybox                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         18h
  kube-system  kube-flannel-ds-amd64-p7c7m  100m (5%)     100m (5%)   50Mi (2%)        50Mi (2%)      5d9h
  kube-system  kube-proxy-27dbb             0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d9h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                100m (5%)   100m (5%)
  memory             50Mi (2%)   50Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
# ...(output omitted)...
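The same numbers can be read programmatically from the Node status fields:
kubectl get node worker-2.acme.com -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'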
Resource quotas, resource requests and limits
- resource request: what a pod is guaranteed to get; the amount of resources necessary to run a container. A pod will only be scheduled on a node that can provide that amount.
- resource limit: makes sure a container never goes above the specified value; usage is allowed up to the limit and is then restricted.
- Exceeding a memory limit makes your container process a candidate for oom-killing
- A process basically can't exceed its CPU quota and will never get evicted for trying to use more CPU time than allocated. The kernel's CPU scheduler enforces the quota, so the process simply gets throttled at the limit.
Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to:
- 1 AWS vCPU, 1 GCP Core, 1 Azure vCore, 1 IBM vCPU
- 1 Hyperthread on a bare-metal Intel processor with Hyperthreading
- CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
Resource types (compressible vs. non-compressible)
- CPU (compressible resource) - once the limit has been reached, Kubernetes starts throttling the CPU of the container process(es). The pod won't get terminated or evicted.
- Memory (non-compressible resource) - once the limit has been reached, the pod will get terminated (OOM-killed).
Units (k8s docs)
- memory: 64Mi - measured in bytes; 64Mi means 64 mebibytes
- cpu: 250m - measured in cores; 250m means 250 milliCPUs, i.e. 0.25 of a CPU core
Note: 1 MiB = 2^20 bytes = 1,048,576 bytes = 1024 kibibytes (KiB)
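As a quick illustration of the notation (values picked arbitrarily for this sketch):
resources:
  requests:
    cpu: "0.25"    # equivalent to 250m
    memory: 64Mi   # 64 * 2^20 bytes; 64M (without the i) would mean 64 * 10^6 bytes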
Requests and limits
spec:
containers:
- name: hello-world-container
image: paulbouwer/hello-kubernetes:1.5
resources:
limits:
cpu: "0.9" # Throttling if tries to use more
memory: 512Mi # OOM kill if tries to use more
requests:
cpu: "500m" # Info for scheduling and Docker. Chances for
memory: 256Mi # eviction increase if we use more than requested.
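Assuming the metrics-server add-on is installed in the cluster, actual consumption can be compared against these values; <pod-name> is a placeholder:
kubectl top pod <pod-name> --containers               # live CPU/memory usage per container
kubectl describe pod <pod-name> | grep -A 4 -E 'Limits|Requests'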
Schedule a pod with a resource request on a specific node
apiVersion: v1
kind: Pod
metadata:
  name: resource-pod1
spec:
  nodeSelector:
    kubernetes.io/hostname: "worker-2.acme.com"
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: busybox-dd
    resources:
      requests:
        cpu: 800m    # millicores; a 2nd pod requesting 2000m is used later to trigger a scheduling failure
        memory: 20Mi # mebibytes
Create the pod and watch the node's resource-request balance change
kubectl apply -f resource-pod1.yml
watch -d 'kubectl describe nodes worker-2.acme.com | grep -A 25 Non-terminated'
Non-terminated Pods:         (6 in total)
  Namespace    Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------    ----                                 ------------  ----------  ---------------  -------------  ---
  default      busybox                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         24h
  default      nginx-loadbalancer-86bb844fb7-bl5fs  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d
  default      resource-pod1                        800m (40%)    0 (0%)      20Mi (0%)        0 (0%)         6m7s   # <- the new pod
  kube-system  kube-flannel-ds-amd64-97hvr          100m (5%)     100m (5%)   50Mi (1%)        50Mi (1%)      14d
  kube-system  kube-proxy-fxl6f                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  rbac1        test-f57db4bfd-ghshj                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         12d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                900m (45%)  100m (5%)   # <- requests balance after the new pod
  memory             70Mi (1%)   50Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
Then deploying another pod that requests 2000m of CPU ends with a scheduling error, visible when describing the pod:
kubectl describe pod pod2-2000mi-cpu
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 44s (x10 over 4m57s) default-scheduler 0/3 nodes are available: 2 node(s) didn't match node selector, 3 Insufficient cpu.
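Pods that cannot be scheduled stay in the Pending phase; a field selector lists them across namespaces:
kubectl get pods --all-namespaces --field-selector=status.phase=Pending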
Limits YAML. Unlike requests, the sum of limits can exceed the node's total CPU and memory (overcommitment); if the node actually runs out of memory, Kubernetes will kill pods to recover. Be aware that containers are not aware of the limits set on their pod: the top command run inside a container still shows the node's full resources.
apiVersion: v1
kind: Pod
metadata:
  name: pod-limit-resources
spec:
  containers:
  - image: busybox
    command: ["dd", "if=/dev/zero", "of=/dev/null"]
    name: main
    resources:
      limits:
        cpu: 2        # by default requests are set equal to limits if not specified
        memory: 40Mi
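Although top inside the container shows the whole node, the enforced limits are visible in the container's cgroup files. A sketch assuming cgroup v1 paths (cgroup v2 uses different file names):
kubectl exec pod-limit-resources -- cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # ~40Mi expressed in bytes
kubectl exec pod-limit-resources -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us           # CPU quota per scheduling period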
QoS
- BestEffort - if no requests and no limits are set, neither for memory nor for CPU, in none of the containers.
- Guaranteed - if limits == requests are set for both memory and CPU on all containers (requests default to limits, so it is enough to set limits).
- Burstable - in all other cases, e.g. limits are different than requests for some container, or limits are simply unset for another.
The QoS prioritization of a pod can be debugged with
kubectl describe pod xxx | grep QoS
OOM scores are adjusted by Kubernetes, so that the system evicts unruly pods in the following order:
- BestEffort
- Burstable
- Guaranteed
- Kubelet, Docker
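A minimal sketch of a pod that would get the Guaranteed class; the pod name is illustrative, and setting only limits is enough because requests default to them:
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo     # hypothetical name
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    resources:
      limits:
        cpu: 500m
        memory: 128Mi           # requests default to these limits -> QoS class Guaranteed
The assigned class can then be read from the pod status:
kubectl get pod qos-guaranteed-demo -o jsonpath='{.status.qosClass}'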
Kubernetes pod to node scheduling
Taints and tolerations
Taints and tolerations restrict which pods can be scheduled on which nodes. By default, pods have no tolerations set up.
Taint effects (what happens to intolerant pods, i.e. pods without a matching toleration in .spec.tolerations)
- NoSchedule - the pod simply won't be scheduled
- PreferNoSchedule - the system will try to avoid placing the pod on the node, but it is not guaranteed
- NoExecute - the pod won't be scheduled, and pods already running on the node are evicted if they don't tolerate the taint; this also applies to pods that were scheduled before the taint was added to the node
# Add a taint to a node
kubectl taint nodes <node-name> <key>=<value>:<taint-effect>
kubectl taint nodes node1 type=workers:NoSchedule

# Add a toleration to a pod
apiVersion: v1
kind: Pod
metadata:
  name: worker-pod
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: type
    operator: Equal
    value: workers
    effect: NoSchedule
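To inspect a node's taints or remove one later (the trailing - removes the taint):
kubectl describe node node1 | grep Taints
kubectl taint nodes node1 type=workers:NoSchedule-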
nodeAffinity and podAntiAffinity
This scenario prefers to schedule on nodes with the label "node.kubernetes.io/lifecycle=normal", but also places some pods on other nodes (via podAntiAffinity) even when nodeAffinity does not match. This avoids all pods ending up on the same node (e.g. when only one node matches the nodeAffinity) while other nodes are available.
kubectl apply -f <(cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment-normal-10-podantiaffinity
  labels:
    app: nginx
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 75% # it was 25%
  replicas: 10
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        version: "1.6"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: node.kubernetes.io/lifecycle
                operator: In
                values:
                - normal
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - nginx
              topologyKey: node.kubernetes.io/lifecycle # topology.kubernetes.io/zone
EOF
) --dry-run=server
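After applying for real (without --dry-run), the spread can be verified by counting pods per node; NODE is the 7th column of the -o wide output:
kubectl get pods -l app=nginx -o wide --no-headers | awk '{print $7}' | sort | uniq -c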
Testing and results
- Scenario 1: nodeAffinity preferredDuringSchedulingIgnoredDuringExecution on `spot` nodes and podAntiAffinity preferredDuringSchedulingIgnoredDuringExecution on the `nginx` label, with 5 replicas, on a cluster of 2x `normal` and 1x `spot` nodes. Result: the scheduler created 4 pods on the `spot` node (matching nodeAffinity) and 1 pod on a `normal` node (due to podAntiAffinity), which is what I'd expect.
- Scenario 2: 12 nodes and a deployment of 10 replicas, where 2 nodes carry the preferred label and the other 10 do not. Result: 6 pods were scheduled on the preferred nodes (3 pods each) and 4 on other nodes.
Summary
Resource Type | Use Cases | Pros | Cons | Best Practices
--- | --- | --- | --- | ---
nodeSelector | Assigning pods to nodes with specific labels | Easy to use, small changes to the PodSpec | Does not support logical operators, hard to extend with complex scheduling rules | Should only be used with early K8s versions, before the introduction of node affinity
Node affinity | Implementing data locality, running pods on nodes with dedicated software | Expressive syntax with logical operators, fine-grained control over pod placement rules, support for "hard" and "soft" placement rules | Requires modification of existing pods to change behavior | Use a combination of "hard" and "soft" rules to cover different use cases and scenarios
Inter-pod affinity | Colocation of pods in a co-dependent service, enabling data locality | The same as for node affinity | Requires modification of existing pods to change behavior | Proper pod label management and documentation of the labels used
Pod anti-affinity | Enabling high availability (via pod distribution), preventing inter-service competition for resources | Fine-grained control over inter-pod repel behavior, support for hard and soft anti-affinity rules | Requires modification of existing pods to change behavior | Similar to node affinity
Taints and tolerations | Nodes with dedicated software, separation of team resources, etc. | Does not require modification of existing pods, supports automatic eviction of pods without the required toleration, supports different taint effects | Does not support expressive syntax using logical operators | Be careful when applying multiple taints to a node; ensure the pods you need have the required tolerations
DaemonSet
A DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them; as nodes are removed, those pods are garbage collected. Deleting a DaemonSet cleans up the pods it created. Historically, DaemonSet pods were placed by the DaemonSet controller itself rather than by the scheduler and ignored node taints; in newer versions they go through the default scheduler but automatically receive tolerations for common node taints. Some typical uses of a DaemonSet are:
- running a cluster storage daemon, such as glusterd, ceph, on each node.
- running a logs collection daemon on every node, such as fluentd or logstash.
- running a node monitoring daemon on every node, such as Prometheus Node Exporter, collectd
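A minimal DaemonSet sketch (the names and image tag are illustrative, not from this wiki):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master   # also run on (tainted) master nodes
        effect: NoSchedule
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.3.1      # hypothetical tag
        ports:
        - containerPort: 9100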
Monitor events and logs
kubectl get events --all-namespaces
kubectl get events --watch                              # short: -w
# See scheduler logs
kubectl logs [kube_scheduler_pod_name] -n kube-system
tail -f /var/log/kube-scheduler.log                     # run on a control plane node