One of the challenges in working with distributed systems is that resources can be in different states at any point in time. Some resources take longer than others to initialize and respond to incoming requests, while others may need to be restarted because of an issue. As a developer, you want to make sure that the end user experience is flawless regardless of the state of your resources.
So how do you build a resilient system that is smart enough to understand the status of a resource and whether or not to send traffic to it? This is where Kubernetes Health Checks save the day.
What are Kubernetes Health Checks?
One of the key features of Kubernetes is its ability to perform health checks (or probes) on running containers to make sure that they are running properly and able to handle network requests. As a developer, you can configure these health checks and their parameters to make sure that Kubernetes is sending traffic to the healthy pods and not the unhealthy or unready ones.
Health checks are carried out by the kubelet to determine when to restart a container and/or to determine if a pod should receive traffic. Kubernetes gives developers a variety of ways to configure probes, a popular one being an HTTP request to the container.
Why are Health Checks important?
Distributed systems can be hard to manage and run since the separate components work independently. Each part of the system will keep running even after other components have failed. This makes it difficult to see which parts of your system have failed or are in a state that can’t process incoming requests, for example because a resource is still initializing and isn’t ready to receive and process requests. Using health checks, you can check the container state and decide if Kubernetes should temporarily stop sending requests to that container.
Let’s take a real-world example. Imagine you have a service that starts running as soon as the pod is alive (i.e. the process starts) but it needs some time to be ready to serve incoming requests because it needs to download content from an external source. By default, Kubernetes will start sending traffic as soon as the process inside the container starts. But you don’t actually want this because even though the process has started, your service still needs to finish downloading the content it needs to be functional.
This is where you can use the readiness probe (more on this in a minute) to tell Kubernetes to wait until the service is fully ready (i.e. process has started and content is downloaded) before it starts sending incoming requests to this service.
Let’s look at another example. Imagine that you have a service that hangs and stops responding to incoming requests. Because the process is still running, by default Kubernetes will still send traffic to the container. Using the liveness probe (more on this in a minute), Kubernetes will detect that the service is no longer serving requests and restart the container, hopefully bringing the service back to a running state.
In both examples, without health checks, it would be difficult to understand why requests aren’t processed and could mean that you have to manually restart pods. Health checks give you more control over how your services handle incoming requests and help you provide a more consistent user experience to your customers.
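The two scenarios above can be sketched as a pair of HTTP endpoints that the probes would call. This is a minimal illustration using only the Python standard library, not a prescribed implementation: the paths /healthz and /ready, port 8080, and the content_loaded flag are assumptions made for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once the (hypothetical) startup download has finished.
content_loaded = threading.Event()


def probe_status(path: str) -> int:
    """Return the HTTP status code a probe on `path` should receive."""
    if path == "/healthz":
        # Liveness: if this handler can answer at all, the process is not hung.
        return 200
    if path == "/ready":
        # Readiness: only report ready once the content has been downloaded.
        return 200 if content_loaded.is_set() else 503
    return 404


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Serve the probe endpoints; blocks the calling thread until interrupted."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

The service would call content_loaded.set() when its download completes. Until then a readiness probe pointed at /ready sees a 503 and the pod receives no traffic, while a liveness probe pointed at /healthz keeps passing as long as the process can still answer.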
What are the different types of Health Checks?
In the section above, we briefly introduced two different kinds of health checks: the readiness probe and the liveness probe. There are actually three different kinds of probes that you can configure depending on your use case.
Liveness Probe
Allows Kubernetes to check if your service is alive and working as expected. The kubelet agent that runs on each node uses liveness probes to ensure that the containers are running as expected. If a container is no longer serving incoming requests, the kubelet will intervene and restart the container. For example, if a service has a deadlock, the liveness probe can detect that it is not working as expected and the kubelet can terminate and restart the container.
Readiness Probe
Allows Kubernetes to check if your service is ready to receive incoming requests. If a readiness probe fails, Kubernetes will stop routing traffic to the pod until the probe passes again. As in the example above, if a container needs to perform some initialization tasks, the readiness probe will fail until those tasks are done and then the container will be able to receive incoming requests. A pod is considered ready when all of its containers are ready. That helps Kubernetes control which pods are used as backends for services. If not ready, a pod is removed from service load balancers.
Startup Probe
Allows Kubernetes to check if your service has been initialized successfully. If a startup probe fails, the container is killed and restarted according to the pod’s restart policy. If a container takes too long to start, liveness probes may fail repeatedly, which puts the container at risk of being restarted by the kubelet before it has finished starting. This is where the startup probe comes to the rescue. The startup probe makes liveness and readiness checks wait until it succeeds so that application startup is not interrupted. That is especially beneficial for slow-starting legacy services.
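To get a feel for how much time a startup probe buys a slow service: the kubelet retries the startup probe every periodSeconds until it has failed failureThreshold times, so the container has roughly failureThreshold × periodSeconds (plus any initialDelaySeconds) to finish starting. A small sketch of that arithmetic, with values chosen purely for illustration:

```python
def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to start before the startup probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Example values (assumptions): 30 allowed failures, probed every 10 seconds.
budget = startup_budget_seconds(failure_threshold=30, period_seconds=10)
print(f"The service has about {budget} seconds to start")
```

With these example values the service gets about 300 seconds, and the liveness probe only takes over once the startup probe has succeeded.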
As a developer, you can configure one or all of the Kubernetes health checks. Any health checks that are not configured will automatically pass. If none of the health checks are configured, then the pod will automatically start receiving traffic. In practice, most developers configure the liveness and readiness probes. The startup probe is mainly used for slower legacy services that need extra time.
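For readers curious what a probe configuration looks like underneath, here is a rough sketch that builds the probe section of a container spec as a Python dictionary and prints it as JSON, a format Kubernetes accepts for manifests alongside YAML. The field names (livenessProbe, readinessProbe, httpGet, periodSeconds, failureThreshold, initialDelaySeconds) are standard Kubernetes probe fields; the container name, image, paths, and port are hypothetical.

```python
import json


def http_probe(path: str, port: int, period_seconds: int = 10,
               failure_threshold: int = 3,
               initial_delay_seconds: int = 0) -> dict:
    """Build an HTTP probe definition; the defaults mirror Kubernetes' own."""
    return {
        "httpGet": {"path": path, "port": port},
        "initialDelaySeconds": initial_delay_seconds,
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


container = {
    "name": "web",                   # hypothetical container name
    "image": "example.com/web:1.0",  # hypothetical image
    "livenessProbe": http_probe("/healthz", 8080),
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5),
}

print(json.dumps(container, indent=2))
```

No startup probe is defined here, which matches the common case of configuring only the liveness and readiness probes.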
How do Health Checks work?
Once a health check is configured, Kubernetes will run it periodically to check the status of a pod. Generally, health checks follow the same lifecycle:
- The pod starts and the containers and their processes start.
- A health check is then invoked by the kubelet at a path specified by the developer. This is usually something like /health if you’re using an HTTP probe.
- If the check doesn’t pass, the kubelet checks whether it has hit the failure threshold (the number of times the health check is allowed to fail), and if so, the probe fails. If it hasn’t hit the failure threshold, it waits and attempts the probe again.
- If the probe passes, the pod either starts receiving traffic or moves on to another probe that the developer has configured.
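The steps above can be modeled with a short loop. This is a deliberately simplified sketch of one pass through the lifecycle, not the kubelet's actual code: run_probe and its check callback are names invented for the example, and the wait between attempts is left out.

```python
from typing import Callable


def run_probe(check: Callable[[], bool], failure_threshold: int = 3) -> bool:
    """Call `check` until it passes or has failed `failure_threshold` times in a row."""
    failures = 0
    while failures < failure_threshold:
        if check():
            return True  # probe passed
        failures += 1
        # A real kubelet would wait periodSeconds here before trying again.
    return False  # failure threshold reached, the probe fails


# A hypothetical service that answers successfully on the third attempt.
responses = iter([False, False, True])
print(run_probe(lambda: next(responses), failure_threshold=3))
```

What happens after the probe fails depends on the probe type: a failed liveness or startup probe leads to a container restart, while a failed readiness probe only takes the pod out of the traffic rotation.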
How to configure Health Checks?
Nucleus makes it easy and straightforward to configure health checks for your services without having to write YAML.
To get started, navigate to your Service Settings by clicking on your environment and then the service whose health checks you’d like to configure. Here you can find the health checks section with the readiness and liveness health checks.
Here you can easily configure your health checks and audit your service settings without having to dig through YAML files or directly query Kubernetes. Easy as that!
Health Check Best Practices
When thinking about how to configure health checks, there are a few best practices to consider:
- HTTP services shouldn’t have TCP health checks, since a TCP check marks the application as healthy as soon as it binds to the port, even though the HTTP service may not be serving requests yet.
- Servers should always have health checks.
- Avoid verifying dependencies in your health checks that aren’t necessary for the service to be running.
- Avoid health check loops. For example, if service A needs services B and C to be online but service B needs service A to be online, this creates a loop, and if one service goes down they all will.
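One way to catch the health check loops described in the last point is to write down which services each health check depends on and search that graph for a cycle. The sketch below uses a plain depth-first search; find_cycle and the service names are hypothetical and only mirror the A, B, C example above.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency loop as a list of services, or None if there is none."""
    path: List[str] = []  # services on the current depth-first path
    done = set()          # services already fully explored

    def visit(service: str) -> Optional[List[str]]:
        if service in path:
            # Reached a service that is already on the path: that is a loop.
            return path[path.index(service):] + [service]
        if service in done:
            return None
        path.append(service)
        for dep in deps.get(service, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        done.add(service)
        return None

    for service in deps:
        cycle = visit(service)
        if cycle:
            return cycle
    return None


# A's health check needs B and C, but B's health check needs A.
print(find_cycle({"A": ["B", "C"], "B": ["A"], "C": []}))
```

A result other than None means at least one of those health checks should stop verifying its dependency.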
BONUS: Termination Grace Period
While not necessarily a health check, the termination grace period can be used as an elegant way to shut down pods that aren’t being used anymore. Let’s take a deeper look.
Let’s take the scenario where a service has seen an influx of incoming requests and it spins up another pod to offload the traffic demands. Eventually that influx of requests dies down and it’s time for that pod to be destroyed. Because Pods represent processes running on nodes in the cluster, you don’t want that process to be killed right away with a KILL signal. Instead you want that pod to have a chance to complete any requests that are still in flight and then shut down gracefully without interrupting the user experience. This is where the termination grace period comes in.
In Nucleus you can easily set the termination grace period in the Service Settings. This tells the kubelet how long to wait after asking the pod to shut down before it sends a KILL signal to force the shutdown.
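On the application side, the grace period is only useful if the service reacts to the shutdown request. Kubernetes first sends the container a TERM signal and follows up with KILL once the grace period has elapsed. The sketch below shows one possible way to use that window, assuming a simple in-memory request queue; accept, drain, and the queue itself are hypothetical stand-ins for a real server.

```python
import signal
import threading
from typing import List

# Set when the TERM signal arrives, i.e. when the grace period starts.
shutting_down = threading.Event()


def handle_sigterm(signum, frame) -> None:
    """Signal handler: stop taking new work but let queued work finish."""
    shutting_down.set()


def install_handler() -> None:
    """Register the handler; call this from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def accept(queue: List[str], request: str) -> bool:
    """Queue a new request unless shutdown has begun."""
    if shutting_down.is_set():
        return False
    queue.append(request)
    return True


def drain(queue: List[str]) -> List[str]:
    """Finish every request that was accepted before shutdown began."""
    finished = []
    while queue:
        finished.append(queue.pop(0))  # stand-in for real request handling
    return finished
```

The grace period should be at least as long as the slowest request the service is expected to finish, otherwise the KILL signal arrives while drain is still running.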
The termination grace period is another level of control over your pods that Nucleus makes easy to configure.
As you build more complex services in Kubernetes, you’ll want more control over how these services run and how they fulfill incoming requests. Kubernetes health checks are a great way to provide a consistent user experience to your customers by controlling how requests are routed to pods and containers.