Kubernetes pod priorities, and the afternoon I lost to them
Priority classes are simple until they aren't. A short story and a checklist so the next person doesn't repeat it.
I’m writing this so I never have to repeat the afternoon I just had. We had a node go unhealthy in a non-prod cluster, the scheduler did its job, and within a minute roughly half of our monitoring stack had been evicted by a CronJob.
The CronJob was, technically, behaving correctly.
What happened
The CronJob ran with the cluster’s default priority class — which, on this
cluster, was system-cluster-critical because someone had set it as the
default a year ago and nobody had revisited it. When the unhealthy node went
away, the remaining nodes were too full to fit the CronJob’s pods, so the
scheduler started preempting lower-priority pods to make room for them.
Most of our observability pods had no priorityClassName at all, so their
priority was whatever the default resolved to when they were admitted (zero,
in their case). They lost.
What “priority” actually does
Two things, both important:
- Scheduling order. Higher-priority pending pods are placed before lower-priority pending pods.
- Preemption. A pending high-priority pod can evict lower-priority running pods to make room for itself.
The second one is the one that gets you. It’s a verb, and it happens whenever the scheduler can’t find a node that fits.
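To make that concrete, here is a minimal sketch of where the priority actually lives on a workload. The pod name, image, and resource requests are placeholders I made up; the class name is one of the classes defined in the layering further down.

```yaml
# Hypothetical pod: name, image, and requests are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: metrics-agent
spec:
  # Resolved at admission time into the integer spec.priority.
  # If the named class does not exist, the pod is rejected.
  priorityClassName: platform-critical
  containers:
    - name: agent
      image: registry.example.com/metrics-agent:1.0
      resources:
        # The scheduler decides whether a pod "fits" based on requests,
        # so these numbers are what drive preemption decisions.
        requests:
          cpu: 100m
          memory: 128Mi
```

Because the integer is stamped onto the pod when it is admitted, changing a class or the global default later does not touch pods that are already running.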
A reasonable layering
Here’s the layering we now use. Nothing exotic — but explicitly defined and applied via Gatekeeper / Kyverno so nothing slips through with no class.
# 0 — the floor. Best-effort batch jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 100
globalDefault: false
description: "Best-effort batch work. First to be preempted."
---
# 1 — normal app workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-default
value: 1000
globalDefault: true
description: "Default for application workloads."
---
# 2 — production-critical apps that should preempt batch but not platform.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-critical
value: 5000
globalDefault: false
description: "Tier-0 customer-facing services."
---
# 3 — platform / observability / ingress / DNS.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 10000
globalDefault: false
description: "Cluster platform components. Preempt almost anything below."
The two stock classes (system-cluster-critical and system-node-critical)
sit above all of these and are reserved for actual cluster-bring-up
components.
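For the enforcement half, here is a rough sketch of what a Kyverno rule requiring an explicit class could look like. This is not our exact policy: the policy and rule names are made up, and field placement has shifted between Kyverno releases, so check it against the version you run. It targets workload templates rather than Pods because, as far as I can tell, the API server fills in the default class name on a Pod before validation runs, which would let an unset field pass.

```yaml
# Sketch only: assumes Kyverno is installed; all names are illustrative.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-priority-class
spec:
  # Reject non-compliant resources instead of only reporting them.
  validationFailureAction: Enforce
  background: false
  rules:
    - name: workload-needs-priority-class
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
                - DaemonSet
      validate:
        message: "Set spec.template.spec.priorityClassName explicitly."
        pattern:
          spec:
            template:
              spec:
                # "?*" matches any non-empty string.
                priorityClassName: "?*"
```

CronJobs nest their pod template one level deeper (under spec.jobTemplate), so they would need a second rule with the longer path.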
The checklist
After all this, we now run through the same five questions whenever someone adds a new workload or asks about scheduling weirdness:
- Does every workload have an explicit priorityClassName?
- Is exactly one class marked globalDefault: true, and is it the least surprising one?
- Are batch jobs at a priority below application pods?
- Are observability and ingress at a priority above application pods?
- Have we set preemptionPolicy: Never on workloads that should queue instead of evicting others (e.g. low-priority backfills)?
If you can’t answer “yes” to all five, your cluster has a quiet preemption bug waiting for the wrong moment.
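The last question is the one people tend not to have seen before, so here is a small sketch. The preemptionPolicy field lives on the PriorityClass, and workloads pick it up by naming that class. The class name, its value, the CronJob, the schedule, and the image below are all hypothetical.

```yaml
# Hypothetical non-preempting class for backfills; name and value are illustrative.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-backfill
value: 50
globalDefault: false
# Pods in this class wait for free capacity instead of evicting others.
preemptionPolicy: Never
description: "Backfill jobs. Queue for capacity; never evict running pods."
---
# Hypothetical CronJob that opts into the class above.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backfill
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          priorityClassName: batch-backfill
          restartPolicy: OnFailure
          containers:
            - name: backfill
              image: registry.example.com/backfill:1.0
              resources:
                requests:
                  cpu: 500m
                  memory: 512Mi
```

A non-preempting pod is still ordered by its priority value while it waits in the scheduling queue, and it can still be preempted by something higher. The only thing it gives up is the ability to evict.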
What I’d skip if I had the afternoon back
- Don’t rely on globalDefault. It’s a footgun the moment a junior engineer (or, you know, me) flips it to a critical class.
- Don’t assume “high priority” means “important.” It means “willing to evict.” Those are different.
- Do write the priority class into your Helm chart values as a required field. If the chart can’t be installed without naming a class, nobody forgets.
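For that last point, a minimal sketch of the Helm side, assuming a chart with a single Deployment template. The file layout and the priorityClassName value key are my own naming; required is Helm’s built-in template function that aborts rendering when a value is missing or empty.

```yaml
# values.yaml (hypothetical chart)
# Left empty on purpose so the install fails until the caller picks a class.
priorityClassName: ""

# templates/deployment.yaml (excerpt; Helm template, not plain YAML until rendered)
spec:
  template:
    spec:
      priorityClassName: {{ required "priorityClassName must be set (e.g. app-default)" .Values.priorityClassName | quote }}
```

With this in place, installing the chart without passing something like --set priorityClassName=app-default stops with that error message instead of quietly inheriting whatever the cluster default happens to be.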
That’s it. The cluster is fine, the dashboards are back, and I have a fresh appreciation for explicitly defaulting things.