Pods in Kubernetes run on a cluster, which is composed of different nodes. A cluster may host many deployments, replica sets and jobs, and different types of nodes to run them. In many cases you want to run some of the pods on specific nodes, to use specific resources available on those nodes: GPUs, larger disks, extra memory or a particular operating system configuration.

Moreover, you might want only these pods to run on those nodes, and restrict all other deployments from running on them. A common use case is preventing resource starvation.

For example: suppose you have a web application that serves an API to customers. It runs on a single cluster alongside other deployments - some are batch operations, some are scheduled jobs and some are other services. You want the web application to scale out well and always have appropriate resources (without re-allocation to different nodes - a process that takes time and is problematic under extreme load). In our case, we created nodes customized for our web application's high throughput, and we didn't want other deployments and jobs to run on these specialized nodes - both to avoid starving the web app (availability and scale), and because the nodes might not fit the profile of those other workloads anyway.

Node selector

Kubernetes allows you to assign pods to nodes using a node selector. A node selector lets a pod spec choose the nodes it can run on, according to node labels.

A simple example is to select nodes with SSD disks:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
spec:
  template:
    spec:
      # Abridged: only the fields relevant to node selection are shown.
      nodeSelector:
        disktype: ssd
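
For the selector to match anything, the nodes must actually carry that label. If your nodes aren't labeled yet, you can label them yourself with kubectl (the node name below is just a placeholder):

$ kubectl label nodes my-node-1 disktype=ssd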

A more complex example: in GCP, nodes are grouped into node pools, and each node carries a label that contains the name of its node pool. So if you have a node pool of GPU nodes named dedicated-gpu-pool and you want to assign a deployment to this pool, add the following to the pod specification:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
  namespace: webapps
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: dedicated-gpu-pool

Note that cloud.google.com/gke-nodepool is just a label. You can select nodes by any label you want.
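
To see which labels your nodes already carry (and therefore which keys you can select on), list them directly:

$ kubectl get nodes --show-labels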

All done, right?

Not quite. The deployment we created will indeed run on the selected nodes, but other deployments and jobs can run on them as well. Kubernetes scheduling allows pods to be placed on any available node - so if your cluster runs many pods, they will also use the nodes you selected, while the deployment you modified can only run on those nodes. A likely result is resource starvation of our deployment - since it has nowhere else to run.
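
You can see this for yourself by listing everything scheduled on one of the selected nodes (the node name here is just a placeholder):

$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=my-node-1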

Option 1: all deployments and jobs should select nodes appropriate to them

The idea is to add a node selector to every deployment and job, so each pod runs on the nodes relevant to it.

It sounds good in theory, but it won't hold up in real-world scenarios. When a single cluster serves the entire org (or even part of it) with many applications, services and jobs, creating and selecting nodes for each one undermines the concept K8s is built upon - the abstraction between application and infrastructure - and it requires a lot of governance and limits that are hard to enforce.

Also, in an existing cluster, it will require quite a bit of work to modify all the pod specs.

Option 2: restrict pods from running on specific nodes

The idea is to opt out of the dedicated nodes in each deployment. This can be achieved with anti-affinity - in practice, node affinity with negative operators. I won't go into details, but similarly to the node selector, you can define match expressions that prevent pods from being scheduled on specific nodes according to their labels.
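
As a minimal sketch - assuming the dedicated nodes are the ones in the dedicated-gpu-pool node pool from the earlier example - the opt-out could look like this, placed under the pod spec of every deployment you want to keep off those nodes:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # Schedule only on nodes that are NOT part of the dedicated pool.
        - key: cloud.google.com/gke-nodepool
          operator: NotIn
          values:
          - dedicated-gpu-pool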

While anti-affinity will solve the problem, it's not very maintainable. First, every time you create a new deployment you need to remember to restrict the relevant nodes. This can be helped along with Helm charts, for example, but it's still error prone - since nothing forces you to use them - and you don't want to risk the production environment.

Second, every time you create dedicated nodes for a deployment that requires the described behavior (i.e. allow only this deployment to run on those nodes), you'll need to modify all the existing deployments and jobs to opt out of this pool. It is still feasible - by labeling all these special nodes, for example - but it requires good governance of the environment.

What we really want is to allow opt-in to a node (or group of nodes) according to some kind of rule. Then a pod can simply opt in for scheduling, rather than every other pod having to opt out.

That's where node taints come in

Node taints

A node taint is a property you can set on nodes (and, in GCP, on node pools); pods must opt in to it in order to be scheduled on those nodes.

On the node you define a taint - a key-value-effect trinity - and in the pod spec you define tolerations - combinations of the same trinity. When the scheduler needs to place a pod, it checks whether the pod tolerates the taints on the target node. If it does, the pod can be scheduled there; if not, it will be scheduled on a different node.

You can add taints on nodes using kubectl:

$ kubectl taint nodes my-node dedicated=special-web-app:NoSchedule
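
You can check that the taint is in place with kubectl describe, and remove it again by repeating the command with a trailing dash:

$ kubectl describe node my-node | grep Taints
$ kubectl taint nodes my-node dedicated=special-web-app:NoSchedule-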

Note that on GCP you can only add taints to a node pool when creating it - you can't add or update them later:

$ gcloud container node-pools create dedicated-web-app-pool \
    --num-nodes=4 --machine-type=n2-standard-4 --image-type=cos --preemptible \
    --disk-type=pd-ssd --disk-size=100GB --zone us-central1-a --cluster my-cluster \
    --node-taints dedicated=special-web-app:NoSchedule

There are a few different effects you can specify: NoSchedule, PreferNoSchedule and NoExecute. Here I used NoSchedule to prevent scheduling in the first place. For the full details, please refer to the docs.
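
For reference, PreferNoSchedule is only a soft preference, and NoExecute additionally evicts pods that are already running on the node unless they tolerate it. A NoExecute toleration can also bound how long the pod keeps running after the taint is added - a sketch of such a toleration:

tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "special-web-app"
  effect: "NoExecute"
  # Evict this pod 3600 seconds after a matching NoExecute taint appears.
  tolerationSeconds: 3600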

In the pod spec, define the following:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
  namespace: webapps
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "special-web-app"
        effect: "NoSchedule"

This tells the scheduler that the pod tolerates the dedicated=special-web-app taint - so it can be scheduled on the nodes we tainted earlier.

Putting it all together

Node taints are a hard restriction on the node side - i.e. a pod can only be scheduled on a node if it tolerates its taints.

A pod toleration is a soft restriction - the node does not have to carry the specific taint, but if it does, the tolerations will be checked against it. That means that if you have nodes with no taints at all, the pod can still be scheduled on them.

That's why you need both a node selector and node taints in order to fully tie the pods and the nodes to each other:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
  namespace: webapps
spec:
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: dedicated-web-app-pool
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "special-web-app"
        effect: "NoSchedule"

Summary

Kubernetes has a complex and flexible scheduling mechanism that gives us granular control over what we want to schedule and where. This is crucial in production and real-world applications, since running different workloads on the same cluster always requires some optimization.

If you are familiar with other ways to achieve the restrictions described here, please mention them in the comments section.


Photo by Michael Dziedzic on Unsplash