Kubernetes pod priorities, and the afternoon I lost to them

Priority classes are simple until they aren't. A short story and a checklist so the next person doesn't repeat it.

I’m writing this so I never have to repeat the afternoon I just had. We had a node go unhealthy in a non-prod cluster, the scheduler did its job, and within a minute roughly half of our monitoring stack had been evicted by a CronJob.

The CronJob was, technically, behaving correctly.

What happened

The CronJob’s pods ran with the cluster’s default priority class, which on this cluster was system-cluster-critical because someone had set it as the default a year ago and nobody had revisited it. When the unhealthy node went away, the remaining nodes were a tight fit, and the scheduler started preempting lower-priority pods to make room for the CronJob’s pods.

Most of our observability pods had no priorityClassName at all. A pod without one gets the value of whichever class is marked globalDefault: true at the moment the pod is created, or zero if there is none. In their case that was zero. They lost.

What “priority” actually does

Two things, both important:

  1. Scheduling order. Higher-priority pending pods are placed before lower-priority pending pods.
  2. Preemption. A pending high-priority pod can evict lower-priority running pods to make room for itself.

Number 2 is the one that gets you. It’s a verb, and it happens whenever the scheduler can’t fit a pending pod on any node as things stand, but could if it removed lower-priority pods from one of them.
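
To make that concrete, here is roughly what the scheduling-related fields look like on a pod once the API server has admitted it. This is a sketch: the pod name and image are made up, and the priority and preemptionPolicy fields are filled in from the class named in priorityClassName rather than written by hand.

```yaml
# Hypothetical pod, trimmed to the fields the scheduler cares about,
# as it would appear in the stored object after admission.
apiVersion: v1
kind: Pod
metadata:
  name: nightly-report                      # made-up name
spec:
  priorityClassName: system-cluster-critical
  priority: 2000000000                      # copied from the class's value at creation time
  preemptionPolicy: PreemptLowerPriority    # the default: this pod may preempt others
  containers:
    - name: report
      image: registry.example.com/nightly-report:1.0   # placeholder image
```

The integer in priority is what the scheduler compares, both for ordering the queue and for picking preemption victims. A pod created with no class and no global default in place should show priority: 0 there instead.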

A reasonable layering

Here’s the layering we now use. Nothing exotic, but every class is explicitly defined, and Gatekeeper / Kyverno policies make sure nothing slips through without one.

# 0 — the floor. Best-effort batch jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
description: "Best-effort batch work. First to be preempted."
---
# 1 — normal app workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-default
value: 1000
globalDefault: true
description: "Default for application workloads."
---
# 2 — production-critical apps that should preempt batch but not platform.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-critical
value: 5000
globalDefault: false
description: "Tier-0 customer-facing services."
---
# 3 — platform / observability / ingress / DNS.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 10000
globalDefault: false
description: "Cluster platform components. Preempt almost anything below."

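The two stock classes, system-cluster-critical (value 2000000000) and system-node-critical (value 2000001000), sit above all of these and are reserved for actual cluster-bring-up components. User-defined classes can’t go above 1000000000, so nothing in the layering can collide with them.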
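
For the enforcement side, here is a minimal sketch of a Kyverno policy that rejects pods with no class. It assumes Kyverno is installed. The policy and rule names are made up, and the placement of the failure-action field has moved between Kyverno versions, so check it against the version you run.

```yaml
# Hypothetical Kyverno policy: every pod must name a priority class.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-priority-class        # made-up name
spec:
  # Newer Kyverno releases put this under each rule's `validate` block instead.
  validationFailureAction: Enforce
  background: true
  rules:
    - name: check-priority-class-name # made-up name
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Set spec.priorityClassName explicitly (e.g. batch-low, app-default)."
        pattern:
          spec:
            priorityClassName: "?*"   # any non-empty string
```

Kyverno normally extends Pod rules to pod controllers such as Deployments and CronJobs. That matters here, because as far as I can tell a bare Pod may already have had the global default's name filled in by the time a validating policy sees it. Treat that as something to verify on your own cluster rather than take on faith.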

The checklist

After all this, we now run through the same five questions whenever someone adds a new workload or asks about scheduling weirdness:

  1. Does every workload have an explicit priorityClassName?
  2. Is exactly one class marked globalDefault: true, and is it the least surprising one?
  3. Are batch jobs at a priority below application pods?
  4. Are observability and ingress at a priority above application pods?
  5. Have we set preemptionPolicy: Never on the priority classes used by workloads that should queue instead of evicting others (e.g. low-priority backfills)?

If you can’t answer “yes” to all five, your cluster has a quiet preemption bug waiting for the wrong moment.
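
For question 5, a non-preempting class might look like the sketch below. The name and value are made up and are not part of the layering above. The only field doing the work is preemptionPolicy.

```yaml
# Hypothetical class for backfills that should wait rather than evict.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-backfill          # made-up name
value: 50                       # below batch-low (100) in the layering above
globalDefault: false
preemptionPolicy: Never         # stay pending until capacity frees up; never evict
description: "Backfill jobs. Wait for free capacity instead of preempting."
```

Pods using a class like this are still ordered by value in the scheduling queue, and they can still be preempted by anything with a higher priority. They just never trigger a preemption themselves.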

What I’d skip if I had the afternoon back

  • Don’t rely on globalDefault. It’s a footgun the moment a junior engineer (or, you know, me) flips it to a critical class.
  • Don’t assume “high priority” means “important.” It means “willing to evict.” Those are different.
  • Do write the priority class into your Helm chart values as a required field. If the chart can’t be installed without naming a class, nobody forgets.
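
A minimal sketch of that last point, assuming a chart with a single Deployment. The file path, the -web suffix, and the image value are hypothetical. `required` is Helm's built-in template function for failing a render when a value is missing.

```yaml
# templates/deployment.yaml (fragment of a hypothetical chart)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{ .Release.Name }}-web
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-web
    spec:
      # Rendering aborts with this message if priorityClassName is unset or empty.
      priorityClassName: {{ required "priorityClassName must be set, e.g. app-default" .Values.priorityClassName | quote }}
      containers:
        - name: web
          image: {{ .Values.image | quote }}
```

The trick only works if values.yaml does not ship a default for priorityClassName. Then an install without something like --set priorityClassName=app-critical fails at render time, long before a pod reaches the scheduler.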

That’s it. The cluster is fine, the dashboards are back, and I have a fresh appreciation for explicitly defaulting things.