Kubernetes/Resources and Limits
Pod CPU Throttling
- https://stackoverflow.com/questions/54099425/pod-cpu-throttling - Pod CPU Throttling
- https://github.com/kubernetes/kubernetes/issues/67577 - CFS quotas can lead to unnecessary throttling · Issue #67577 · kubernetes/kubernetes
- https://github.com/kubernetes/kubernetes/issues/51135#issuecomment-373454012 - Avoid setting CPU limits for Guaranteed pods · Issue #51135 · kubernetes/kubernetes
- https://github.com/libero/reviewer/issues/1023 - Default CPU limit leads to pods getting throttled · Issue #1023 · libero/reviewer
- https://medium.com/omio-engineering/cpu-limits-and-aggressive-throttling-in-kubernetes-c5b20bd8a718 - CPU limits and aggressive throttling in Kubernetes
From briefly perusing the above links and others, there are a few conclusions I've come to:
- CPU limits are more complicated and nuanced than memory limits under the hood
- It seems that there was a CFS (Completely Fair Scheduler) bug in the Linux kernel. These posts describe it in more detail:
- https://engineering.indeedblog.com/blog/2019/12/unthrottled-fixing-cpu-limits-in-the-cloud/ - CPU Throttling - Unthrottled: Fixing CPU Limits in the Cloud
- https://engineering.indeedblog.com/blog/2019/12/cpu-throttling-regression-fix/ - CPU Throttling - Unthrottled: How a Valid Fix Becomes a Regression
This has supposedly been patched in the EKS node AMIs:
- https://github.com/aws/containers-roadmap/issues/175 - Use kernel 4.18 in EKS and ECS Amazon Linux AMIs to solve CFS throttling issues. · Issue #175 · aws/containers-roadmap
- Fix is in Amazon Linux 2 with Kernel 4.14.154
- We're running Kernel 4.14.232+ (see the command below for checking what kernel each node actually reports)
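To confirm what the nodes are actually running, the kernel version each kubelet reports can be listed directly. A quick sketch, assuming kubectl access to the cluster (the fields come from the standard node status, so no extra tooling is needed):

```sh
# Kernel version and OS image reported in each node's status
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage
```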
However, even with this fix, unexpected CPU throttling still seems to be a common issue, e.g.:
- https://github.com/kubernetes/kubernetes/issues/97445 - CPU Throttling on Linux kernel 5.4.0-1029-aws · Issue #97445 · kubernetes/kubernetes
This article led to a Hacker News thread with quite a varied range of opinions on whether CPU limits are worth setting at all, including this interesting one:
<quote>
The core principle most readers miss is that CPU limits are tied to CPU throttling, which is markedly different than CPU time sharing. I would argue that in 99% of cases, you truly do not need or want limits.
limits cause CPU throttling, which is like running your process in a strobe light. If your quota period is 100ms, you might only be able to make progress for 10ms out of every 100ms period, regardless of whether or not there is CPU contention, just because you've exceeded your limit.
requests -> CFS time sharing. This ensures that out of a given period of time, CPU time is scheduled fairly and according to the request as a proportion of total request (it just so happens that the Kube scheduler won't schedule such that sum[requests] > capacity, but theoretically it could because requests are truly relative when it comes to how they are represented in cgroups)
</quote>
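The mechanics in that comment map directly onto cgroup files that can be read from inside a container. A minimal sketch, assuming cgroup v1 and a container with a CPU request of 250m and a CPU limit of 500m (the numbers are illustrative; on cgroup v2 the equivalents are cpu.weight, cpu.max and cpu.stat):

```sh
# Read from inside the container (cgroup v1 paths)
cat /sys/fs/cgroup/cpu/cpu.shares         # request 250m -> 256 shares (250 * 1024 / 1000): relative weight for CFS time sharing
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us  # 100000 (100ms) by default: the quota period
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us   # limit 500m -> 50000: at most 50ms of CPU time per 100ms period; -1 means no limit
cat /sys/fs/cgroup/cpu/cpu.stat           # nr_periods, nr_throttled, throttled_time: how often the quota was actually hit
```

The same counters are exported by cAdvisor/kubelet as container_cpu_cfs_throttled_periods_total and container_cpu_cfs_throttled_seconds_total, which is usually the easier way to spot throttled workloads across a cluster.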
Detect pods without resources set
```sh
# List containers in the csi-drivers namespace whose resources block is empty;
# the -B1 keeps the preceding "name:" line so you can see which container it is.
# The leading spaces in the patterns must match the YAML indentation of the
# container fields in the `kubectl get -o yaml` output.
kubectl get po -n csi-drivers -o yaml \
  | grep -e '^      resources: {}' -e '^      name:' \
  | grep '^      resources: {}' -B1
```
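A less indentation-sensitive variant of the same check, as a sketch assuming jq is available (the -A flag just widens it to all namespaces):

```sh
# Print namespace/pod: container for every container with an empty resources block
kubectl get pods -A -o json \
  | jq -r '.items[]
      | . as $pod
      | .spec.containers[]
      | select(.resources == {})
      | "\($pod.metadata.namespace)/\($pod.metadata.name): \(.name)"'
```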