Intro
Welcome to part 2 of our 'How to scale your Kubernetes Clusters' blog series. If you missed part 1, where we walked through how to use the Horizontal Pod Autoscaler to scale your Kubernetes clusters, you can check it out here. In part 2, we're going to cover how to use the Cluster Autoscaler to dynamically scale your clusters as your resource requirements change over time.
But as a quick reminder, why is this even important? Well, one of the most compelling reasons to use Kubernetes is that it can help developers and DevOps teams effortlessly scale applications to handle variable workloads and millions of requests. In practice, this means being able to quickly and automatically scale your clusters and pods up and down to handle these changes.
Okay, now that we're on the same page about why we want to do this, let's dive in.
What is the Kubernetes Cluster Autoscaler?
The Kubernetes Cluster Autoscaler is a standalone program that automatically adds or removes nodes in a cluster based on the resource requests and requirements of pods. It checks whether any pods have failed to schedule on a node due to insufficient resources; if so, it spins up another node in a node group so the pod(s) can be scheduled. It then keeps checking whether the nodes are still needed (based on utilization), and once the extra nodes are no longer needed, it removes the underlying instances they run on. It doesn't delete the Node object from Kubernetes; it only removes the underlying instance.
Let's take this step by step:
- Once the cluster autoscaler is deployed into Kubernetes (usually as a Deployment or DaemonSet), it checks for pods that need to be scheduled, by default every 10 seconds (configurable using the `--scan-interval` flag). This "checking" behavior is different from the Horizontal Pod Autoscaler because it doesn't rely on metrics. Instead of looking at metrics to determine scale-up/down behavior, the cluster autoscaler creates a watch on the Kubernetes API server to see if there are pods that haven't been scheduled (you can run this same check yourself, as shown below the list).
- If it sees that there are pods that need to be scheduled and the cluster needs more resources, it asks the cloud provider (AWS, GCP, Azure) to launch another node. Depending on your cloud provider, you can set rules or settings to automatically add and remove virtual machines. In AWS, for example, this is handled by Auto Scaling Groups.
- Kubernetes then registers this node and makes it available to the Kubernetes scheduler to assign pods to.
- Once the node is ready, the scheduler assigns the pending pods to it.
- Lastly, once the node is no longer being utilized, the cluster autoscaler removes the underlying instance and no longer allows pods to be scheduled on that node.
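If you want to see what the autoscaler sees, you can list pods stuck in Pending yourself. Here's a quick check along those lines (the pod name and namespace are placeholders):

```
# List pods stuck in Pending across all namespaces;
# these are the pods the cluster autoscaler reacts to
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Inspect why a particular pod can't be scheduled
kubectl describe pod <pod-name> -n <namespace>
```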
At a high level, the cluster autoscaler checks whether any pods need to be scheduled and then spins up a new node to schedule them. The nice thing is that all of this happens automatically, so once your team has set up and configured the autoscaler, it can be pretty hands off.
How does scaling up actually work?
We briefly mentioned above that the cluster autoscaler checks whether any pods need to be scheduled and then spins up another node accordingly, but let's go into a little more detail.
- A watch is created on the Kubernetes API server which checks all of the pods, looking for unscheduled pods every 10 seconds. As mentioned, this is the default and can be adjusted by setting the `--scan-interval` flag.
- A pod is unschedulable when the Kubernetes scheduler is unable to find a node that can accommodate it. One reason this might happen is that a pod requests more CPU than is available on any node. In that case, the scheduler sets the pod's `PodScheduled` condition to `False` with the reason `Unschedulable`. This is what triggers the cluster autoscaler to spin up a new node (a concrete example follows this list).
- The cluster autoscaler assumes that all machines within a node group have the same capacity and instance type. Just as an aside, a node group is simply what it sounds like: a group of identical nodes. The cluster autoscaler creates a template node for each node group and checks whether any of the unschedulable pods would fit on a new node.
- Lastly, the cluster autoscaler asks the cloud provider for another virtual machine of the same instance type as the node group that it is placing the node into.
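To make this concrete, here's a minimal sketch of the kind of workload that would trigger a scale-up: a Deployment whose pods request more CPU than any existing node has free. The name, image, and numbers are hypothetical, chosen only for illustration:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-hungry-app # hypothetical workload name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: cpu-hungry-app
  template:
    metadata:
      labels:
        app: cpu-hungry-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
          resources:
            requests:
              cpu: '2' # if no node has 2 full CPUs free, these pods stay Pending,
              memory: 1Gi # which is exactly what the cluster autoscaler watches for
```

As soon as some of these replicas go Pending with an `Unschedulable` reason, the autoscaler looks for a node group whose template node could fit them and asks the cloud provider for more capacity.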
Usually we see this whole process take only a few minutes, if not faster, but it can take up to 30 minutes depending on how quickly your cloud provider can get a new instance ready to join the cluster as a node. Lastly, you can set node group size limits (shown below) to prevent the cluster autoscaler from scaling up too many nodes, but there is the risk that you may then leave pods in a pending state until they can be scheduled (which may impact performance).
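Those size limits are set per node group. With the static configuration used later in this post, they are the min and max values in the `--nodes` flag (the ASG name here is a placeholder):

```
# --nodes=<min>:<max>:<asg-name>
--nodes=1:10:k8s-worker-asg-1
```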
How does scaling down actually work?
Okay, so we're pretty clear on scaling up. How about scaling down? Let's go through it:
- The cluster autoscaler also uses the `--scan-interval` flag when checking for nodes to scale down. So every 10 seconds (by default) it checks whether any nodes qualify to be scaled down. What does it mean to qualify? Well, there are three conditions: 1. the sum of the CPU and memory requests of all pods running on the node is less than 50% of the node's allocatable capacity (this is of course configurable using the `--scale-down-utilization-threshold` flag); 2. all pods running on the node can be scheduled onto another node; and 3. the node doesn't have a scale-down disabled annotation such as `"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"` (see the kubectl example below).
- Once all of those conditions are satisfied, the cluster autoscaler migrates those pods to another node that has sufficient capacity for them (or potentially multiple nodes).
- Once the migration is complete, the cluster autoscaler terminates the underlying instance for that node, and no more pods can be scheduled on it.

By default, the cluster autoscaler won't scale down any nodes that use local storage, have the `"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"` annotation set, or have kube-system pods running. (Note that the manifest later in this post sets `--skip-nodes-with-local-storage=false`, which overrides the local storage behavior.) Also, a node must be unneeded for more than 10 minutes before it is terminated (of course, configurable, via the `--scale-down-unneeded-time` flag).
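If you want to protect a specific node from being scaled down, you can apply the annotation mentioned above directly with kubectl (the node name is a placeholder):

```
# Tell the cluster autoscaler to leave this node alone
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled=true

# A trailing dash removes the annotation again
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled-
```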
Installing the Cluster Autoscaler
Let's walk through how to install the cluster autoscaler into your cluster. We're going to use AWS as our cloud provider.
First some prerequisites:
- You should have a running Kubernetes cluster.
- You should have the necessary credentials and permissions to deploy resources in your cluster. This typically means having an IAM Role with the following policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"autoscaling:DescribeTags",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeImages",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": ["*"]
}
]
}
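One way to create this policy with the AWS CLI, assuming you've saved the JSON above to a local file (the policy name here is a placeholder):

```
aws iam create-policy \
  --policy-name ClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json
```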
Lastly, create an OIDC provider. You can follow this guide to easily create an OIDC provider and authorize the cluster autoscaler to spin nodes up and down.
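If you're using eksctl, one way to associate an OIDC provider with your cluster looks like this (the cluster name is a placeholder):

```
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve
```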
Let's now take a look at a simple cluster autoscaler manifest with just one autoscaling group, and how to configure and install it. For more details, you can check out the official GitHub repo here.
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: ['']
resources: ['events', 'endpoints']
verbs: ['create', 'patch']
- apiGroups: ['']
resources: ['pods/eviction']
verbs: ['create']
- apiGroups: ['']
resources: ['pods/status']
verbs: ['update']
- apiGroups: ['']
resources: ['endpoints']
resourceNames: ['cluster-autoscaler']
verbs: ['get', 'update']
- apiGroups: ['']
resources: ['nodes']
verbs: ['watch', 'list', 'get', 'update']
- apiGroups: ['']
resources:
- 'namespaces'
- 'pods'
- 'services'
- 'replicationcontrollers'
- 'persistentvolumeclaims'
- 'persistentvolumes'
verbs: ['watch', 'list', 'get']
- apiGroups: ['extensions']
resources: ['replicasets', 'daemonsets']
verbs: ['watch', 'list', 'get']
- apiGroups: ['policy']
resources: ['poddisruptionbudgets']
verbs: ['watch', 'list']
- apiGroups: ['apps']
resources: ['statefulsets', 'replicasets', 'daemonsets']
verbs: ['watch', 'list', 'get']
- apiGroups: ['storage.k8s.io']
resources:
['storageclasses', 'csinodes', 'csidrivers', 'csistoragecapacities']
verbs: ['watch', 'list', 'get']
- apiGroups: ['batch', 'extensions']
resources: ['jobs']
verbs: ['get', 'list', 'watch', 'patch']
- apiGroups: ['coordination.k8s.io']
resources: ['leases']
verbs: ['create']
- apiGroups: ['coordination.k8s.io']
resourceNames: ['cluster-autoscaler']
resources: ['leases']
verbs: ['get', 'update']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: ['']
resources: ['configmaps']
verbs: ['create', 'list', 'watch']
- apiGroups: ['']
resources: ['configmaps']
resourceNames:
['cluster-autoscaler-status', 'cluster-autoscaler-priority-expander']
verbs: ['delete', 'get', 'update', 'watch']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
annotations:
prometheus.io/scrape: 'true'
prometheus.io/port: '8085'
spec:
priorityClassName: system-cluster-critical
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
seccompProfile:
type: RuntimeDefault
serviceAccountName: cluster-autoscaler
containers:
- image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.20.0
name: cluster-autoscaler
resources:
limits:
cpu: 100m
memory: 600Mi
requests:
cpu: 100m
memory: 600Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --nodes=1:10:k8s-worker-asg-1
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt # /etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
readOnly: true
imagePullPolicy: 'Always'
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
volumes:
- name: ssl-certs
hostPath:
path: '/etc/ssl/certs/ca-bundle.crt'
A couple of things to point out here. The first is the image tag in `image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.20.0`. Make sure that the Kubernetes version of the EKS cluster you're using matches the tag here, and if not, update the tag. Second is the `--nodes=1:10:k8s-worker-asg-1` flag, which tells the cluster autoscaler the minimum size, maximum size, and name of the Auto Scaling Group it should manage. (Alternatively, the `--node-group-auto-discovery` flag tells the cluster autoscaler to auto-discover scaling groups based on their tags, which you can set as a list of comma-separated values.) Lastly, there's the `--skip-nodes-with-system-pods` flag; make sure this is set to `true` (which is the default). This tells the cluster autoscaler to never remove nodes running kube-system pods (aside from DaemonSet pods).
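For reference, a typical auto-discovery configuration looks something like the following, using the conventional cluster-autoscaler ASG tags (replace `<cluster-name>` with your cluster's name):

```
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```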
Once we're good with this, let's save it to a file called `cluster-autoscaler.yaml`.
Cool, let's install the cluster autoscaler. We can do this by applying our `cluster-autoscaler.yaml` config file from above using the following `kubectl` command:
kubectl apply -f cluster-autoscaler.yaml
And that's it. We've successfully configured and installed the Kubernetes Cluster Autoscaler!
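To sanity-check the installation, you can confirm the deployment is up and watch its logs for scaling decisions:

```
# Confirm the autoscaler deployment is running
kubectl get deployment cluster-autoscaler -n kube-system

# Tail the logs to watch scale-up and scale-down decisions as they happen
kubectl logs -f deployment/cluster-autoscaler -n kube-system
```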
Cluster Autoscaling on Nucleus
Nucleus automates autoscaling for teams who want a more hands-off, automated approach to scaling their infrastructure. By default, when you deploy a Nucleus Environment, the cluster autoscaler comes preinstalled and configured with all of the necessary roles, role bindings, and IAM roles it needs. All you have to do is configure the min and max nodes that you'd like Nucleus to maintain. Here's what it looks like in the Nucleus dashboard:
Just set the min and max number of nodes that you want the cluster to scale to, and let Nucleus take care of the rest. Easy as that.
Wrapping up
In this blog, we've talked about what the Cluster Autoscaler is and how it can help teams automatically scale cluster nodes to ensure that your pods are always scheduled. We've also seen how to implement this both with and without Nucleus.
In the next part of this series, we'll take a look at the Vertical Pod Autoscaler and how you can scale the resources within a pod.
Until then!