How to scale your Kubernetes Clusters Part 2: Cluster Autoscaler

Evis Drenova

7 min read

Intro

Welcome to part 2 of our 'How to scale your Kubernetes Clusters' blog series. If you missed part 1, where we walked through how to use the Horizontal Pod Autoscaler to scale your Kubernetes clusters, you can check it out here. In part 2, we're going to cover how to use the Cluster Autoscaler to dynamically scale your clusters as your resource requirements change over time.

But first, a quick reminder: why is this even important? Well, one of the most compelling reasons to use Kubernetes is that it can help developers and devops teams effortlessly scale applications to handle variable workloads and millions of requests. This typically means being able to quickly and automatically scale your clusters and pods up and down to handle these changes.

Okay, now that we're on the same page about why we want to do this, let's dive in.

What is the Kubernetes Cluster Autoscaler?

The Kubernetes Cluster Autoscaler is a standalone program that automatically adds or removes nodes in a cluster based on the resource requests of pods. It checks whether any pods have failed to schedule onto a node due to insufficient resources. If so, it spins up another node in the node group so those pods can be scheduled. It also continually checks whether nodes are still needed (based on utilization), and when an extra node is no longer needed, it removes the underlying instance that the node uses. It doesn't delete the Node object from Kubernetes; it only removes the underlying instance.
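
To make that concrete, here's a minimal sketch of a pod that would fail to schedule on a small node group (the pod name and resource numbers are made up for illustration). A pod like this sitting in Pending is exactly the signal the cluster autoscaler reacts to:

apiVersion: v1
kind: Pod
metadata:
  name: big-request # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: '8' # assume no node has 8 CPUs free, so the pod stays Pending
          memory: 16Gi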

[Diagram: how the cluster autoscaler works]

Let's take this step by step:

  1. Once the cluster autoscaler is deployed into Kubernetes (usually as a Deployment or DaemonSet), it checks for pods that need to be scheduled, by default every 10 seconds; this is configurable using the --scan-interval flag (see the snippet after this list). This "checking" behavior is different from the Horizontal Pod Autoscaler because it doesn't rely on metrics. Instead of looking at metrics to determine scale up/down behavior, the cluster autoscaler creates a watch on the Kubernetes API server to see if there are pods that haven't been scheduled.
  2. If it sees that there are pods that need to be scheduled and the cluster needs more resources it will ask the cloud provider (AWS, GCP, Azure) to launch another node. Depending on your cloud provider, you can set rules or settings to be able to automatically add and remove virtual machines. For example, in AWS, this would be Auto Scaling Groups.
  3. Kubernetes would then register this node and make it available to the Kubernetes scheduler to assign pods to it.
  4. Once the node is ready, the scheduler would then assign the pending pods to the node.
  5. Lastly, once the node is no longer being utilized, the cluster autoscaler will remove the underlying instance and not allow pods to be scheduled on that node any longer.
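
As mentioned in step 1, the scan interval is configurable. Here's a minimal sketch of what that looks like in the autoscaler's container args (only the relevant flags are shown; the full Deployment manifest appears later in this post):

command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scan-interval=30s # default is 10s; a longer interval reduces API server load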

At a high level, the cluster autoscaler checks whether any pods need to be scheduled and then spins up a new node to accommodate them. The nice thing is that all of this happens automatically, so once you've set up and configured the autoscaler, your team can be pretty hands off.

How does scaling up actually work?

We briefly mentioned above that the cluster autoscaler checks whether any pods need to be scheduled and then spins up another node accordingly, but let's go into a little more detail.

  1. A watch is created on the Kubernetes API server that monitors all pods, checking for unscheduled pods every 10 seconds. As mentioned, this is the default and can be adjusted by setting the --scan-interval flag.
  2. A pod is unschedulable when the Kubernetes scheduler is unable to find a node that can accommodate it, for example because the pod requests more CPU than is available on any node. The scheduler then sets the pod's PodScheduled condition to False with the reason Unschedulable. This is what triggers the cluster autoscaler to spin up a new node (you can inspect this yourself with the kubectl commands after this list).
  3. The cluster autoscaler assumes that all machines within a nodeGroup have the same capacity and instance types. Just as an aside, a nodeGroup is simply what it sounds like: a group of identical nodes. Now, the cluster autoscaler creates template nodes for each of the nodeGroups and checks to see if any of the unschedulable pods would fit on a new node.
  4. Lastly, the cluster autoscaler would then ask the cloud provider for another virtual machine of the same instance type as the nodeGroup that it is placing that node into.
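
As referenced in step 2, you can see unschedulable pods for yourself with a couple of standard kubectl commands (substitute your own pod name):

# List pods stuck in Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Describe one of them; look for a FailedScheduling event and the
# Unschedulable reason on the PodScheduled condition
kubectl describe pod <pod-name>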

This whole process usually takes only a few minutes, if not faster, but it can take up to 30 minutes depending on how quickly your cloud provider can get a new instance ready to be registered as a node. Lastly, you can set nodeGroup size limits to prevent the cluster autoscaler from scaling up too many nodes, but there is the risk that you may then leave pods in a pending state until they can be scheduled (which may impact performance).
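
Those size limits are set per node group via the --nodes flag in the autoscaler's container args, in min:max:name form. A sketch with hypothetical ASG names:

# One --nodes flag per node group: <min>:<max>:<ASG name>
- --nodes=1:10:k8s-worker-asg-1
- --nodes=2:20:k8s-worker-asg-2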

How does scaling down actually work?

Okay, so we're pretty clear on scaling up. How about scaling down? Let's go through it:

  1. The cluster autoscaler also uses the --scan-interval flag to check for nodes that should be scaled down. So every 10 seconds (by default) it checks whether any nodes qualify to be scaled down. What does it mean to qualify? There are three conditions: 1. the sum of the CPU and memory requests of all pods running on the node is less than 50% of the node's allocatable capacity (this is of course configurable using the --scale-down-utilization-threshold flag); 2. all pods running on the node can be scheduled onto another node; and 3. the node doesn't have a scale-down disabled annotation such as "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" (you can set this yourself with the command after this list).
  2. Once all three conditions are satisfied, the cluster autoscaler migrates those pods to another node (or potentially multiple nodes) with sufficient capacity for them.
  3. Once the migration is complete, the cluster autoscaler terminates the underlying instance for that node, and no more pods can be scheduled on it.
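
If you want to exempt a particular node from scale-down (condition 3 in step 1 above), you can set the annotation directly (substitute your own node name):

kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled=true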

By default, the cluster autoscaler won't scale down nodes running pods that use local storage, nodes that have the "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" annotation set, or nodes with kube-system pods running. Also, a node must be unneeded for more than 10 minutes before it gets terminated (this, too, is configurable).
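
Both of those defaults correspond to flags on the autoscaler binary, so they can be tuned in the container args. A sketch, shown here with the default values:

- --scale-down-utilization-threshold=0.5 # scale down only below 50% of allocatable
- --scale-down-unneeded-time=10m # how long a node must be unneeded before removal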

Installing the Cluster Autoscaler

Let's walk through how to install the cluster autoscaler into your cluster. We're going to use AWS as our cloud provider.

First some prerequisites:

  1. You should have a running Kubernetes cluster.
  2. You should have the necessary credentials and permissions to deploy resources in your cluster. This typically means having an IAM Role with the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": ["*"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup",
        "ec2:DescribeImages",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
      ],
      "Resource": ["*"]
    }
  ]
}
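
One way to create this policy is with the AWS CLI; the policy name and file name below are just examples:

aws iam create-policy \
  --policy-name ClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json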

Lastly, create an OIDC provider. You can follow this guide to set one up and authorize the cluster autoscaler to spin nodes up and down.
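
If you use eksctl, associating an OIDC provider with your cluster is a one-liner (substitute your own cluster name):

eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve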

Let's now take a look at a simple cluster autoscaler manifest with just one autoscaling group, and how to configure and install it. For more details you can check out the official GitHub repo here.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: ['']
    resources: ['events', 'endpoints']
    verbs: ['create', 'patch']
  - apiGroups: ['']
    resources: ['pods/eviction']
    verbs: ['create']
  - apiGroups: ['']
    resources: ['pods/status']
    verbs: ['update']
  - apiGroups: ['']
    resources: ['endpoints']
    resourceNames: ['cluster-autoscaler']
    verbs: ['get', 'update']
  - apiGroups: ['']
    resources: ['nodes']
    verbs: ['watch', 'list', 'get', 'update']
  - apiGroups: ['']
    resources:
      - 'namespaces'
      - 'pods'
      - 'services'
      - 'replicationcontrollers'
      - 'persistentvolumeclaims'
      - 'persistentvolumes'
    verbs: ['watch', 'list', 'get']
  - apiGroups: ['extensions']
    resources: ['replicasets', 'daemonsets']
    verbs: ['watch', 'list', 'get']
  - apiGroups: ['policy']
    resources: ['poddisruptionbudgets']
    verbs: ['watch', 'list']
  - apiGroups: ['apps']
    resources: ['statefulsets', 'replicasets', 'daemonsets']
    verbs: ['watch', 'list', 'get']
  - apiGroups: ['storage.k8s.io']
    resources:
      ['storageclasses', 'csinodes', 'csidrivers', 'csistoragecapacities']
    verbs: ['watch', 'list', 'get']
  - apiGroups: ['batch', 'extensions']
    resources: ['jobs']
    verbs: ['get', 'list', 'watch', 'patch']
  - apiGroups: ['coordination.k8s.io']
    resources: ['leases']
    verbs: ['create']
  - apiGroups: ['coordination.k8s.io']
    resourceNames: ['cluster-autoscaler']
    resources: ['leases']
    verbs: ['get', 'update']
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: ['']
    resources: ['configmaps']
    verbs: ['create', 'list', 'watch']
  - apiGroups: ['']
    resources: ['configmaps']
    resourceNames:
      ['cluster-autoscaler-status', 'cluster-autoscaler-priority-expander']
    verbs: ['delete', 'get', 'update', 'watch']

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8085'
    spec:
      priorityClassName: system-cluster-critical
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccountName: cluster-autoscaler
      containers:
        - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.20.0
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 600Mi
            requests:
              cpu: 100m
              memory: 600Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:k8s-worker-asg-1
          volumeMounts:
            - name: ssl-certs
              mountPath: /etc/ssl/certs/ca-certificates.crt # /etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
              readOnly: true
          imagePullPolicy: 'Always'
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
      volumes:
        - name: ssl-certs
          hostPath:
            path: '/etc/ssl/certs/ca-bundle.crt'

A couple of things to point out here. The first is the image tag in image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.20.0. Make sure that the Kubernetes version of your EKS cluster matches this tag, and if not, update it. The second is the --nodes=1:10:k8s-worker-asg-1 flag, which registers a node group with a minimum of 1 node, a maximum of 10, and the name of its Auto Scaling Group (replace k8s-worker-asg-1 with your own ASG's name). If you'd rather not hard-code ASG names, you can swap --nodes for --node-group-auto-discovery, which tells the cluster autoscaler to discover scaling groups based on their tags (you can pass a comma-separated list of tag keys). Lastly, note that the --skip-nodes-with-system-pods flag defaults to true; leave it that way so the cluster autoscaler never deletes nodes running pods associated with kube-system.
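
For reference, here's what tag-based auto-discovery might look like in the container args if you go that route (the tag keys below follow the upstream convention; substitute your own cluster name):

- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>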

Once we're good with this, let's save it to a file called cluster-autoscaler.yaml.

Cool, let's install the cluster autoscaler. We can do this by applying our cluster-autoscaler.yaml config file from above using the following kubectl command:

kubectl apply -f cluster-autoscaler.yaml
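
To confirm it's running, you can check the pod and tail its logs (the label and deployment name match the manifest above):

kubectl -n kube-system get pods -l app=cluster-autoscaler
kubectl -n kube-system logs -f deployment/cluster-autoscaler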

And that's it. We've successfully configured and installed the Kubernetes Cluster Autoscaler!

Cluster Autoscaling on Nucleus

Nucleus automates autoscaling for teams who want a more hands-off approach to scaling their infrastructure. By default, when you deploy a Nucleus Environment, the cluster autoscaler comes preinstalled and configured with all of the necessary Roles, RoleBindings, and IAM roles it needs. All you have to do is configure the min and max number of nodes that you'd like Nucleus to maintain. Here's what it looks like in the Nucleus dashboard:

[Screenshot: cluster autoscaling settings in the Nucleus dashboard]

Just set the min and max number of nodes that you want the cluster to scale to, and let Nucleus take care of the rest. Easy as that.

Wrapping up

In this blog, we've talked about what the Cluster Autoscaler is and how it can help teams automatically scale cluster nodes to ensure that your pods are always scheduled. We've also seen how to implement it both with and without Nucleus.

In the next part of this series, we'll take a look at Vertical Pod Autoscaler and how you can scale the resources within a pod.

Until then!

