Watcher conditions

Every alert reason emitted by each resource watcher, its default severity, and the exact condition that triggers it. Default severities are hardcoded in the watcher source and may be remapped with severityOverrides.

Reasons are the fourth component of the dedupe fingerprint (sha256(kind|namespace|name|reason)). They are also the values matched by the reason keys of routing, severityOverrides, inhibitions, and silences; each of those keys accepts an anchored regex.
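As a rough illustration of the two roles a reason plays, the sketch below hashes the four components joined with "|" and performs an anchored match on a reason. The hex output encoding, the UTF-8 byte encoding, and the helper names fingerprint and reason_matches are assumptions made for this example, not taken from the watcher source. Regex syntax can also differ between Python and the watcher's own engine.

```python
import hashlib
import re


def fingerprint(kind: str, namespace: str, name: str, reason: str) -> str:
    """Hash the four identity components joined with '|' (hex output is assumed)."""
    raw = "|".join([kind, namespace, name, reason])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def reason_matches(pattern: str, reason: str) -> bool:
    """Anchored match: the pattern must cover the entire reason string."""
    return re.fullmatch(pattern, reason) is not None
```

Because the match is anchored, a pattern such as Node would not match NodeNotReady, while Node.*Pressure would match all three pressure reasons.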

Pod

Kind: Pod. Container waiting/terminated states are checked first and short-circuit; per-restart alerts run only when no waiting/OOM state matched.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| CrashLoopBackOff | critical | A container status has state.waiting.reason == CrashLoopBackOff. |
| ImagePullBackOff | warning | A container status has state.waiting.reason == ImagePullBackOff. |
| ErrImagePull | warning | A container status has state.waiting.reason == ErrImagePull. |
| OOMKilled | critical | A container's lastTerminationState.terminated.reason == OOMKilled. |
| ContainerKilled | warning | A container's last termination was a non-OOM SIGKILL (exitCode == 137 or signal == 9) and the pod is not being deleted (metadata.deletionTimestamp unset). Catches liveness-probe escalation, terminationGracePeriodSeconds exceeded mid-run, and runtime force-kills. SIGKILL during normal teardown (rollout, scale-down, eviction) sets deletionTimestamp, so graceful shutdowns stay silent. |
| ContainerRestart | warning | Total restart count increased on update and is <= behavior.ignoreRestartCount; reported per container with restartCount > 0. Skipped if ignoreRestartsWithExitCodeZero is set and the last termination exit code was 0. |

All container alerts append the last-termination cause to the summary when present (e.g. - last termination: SIGKILL (exit 137) / SIGTERM (exit 143) / Error (exit 1)), so the signal/exit code is visible without opening the Container State block.
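The waiting and terminated checks in the table can be sketched as follows, using a plain dict shaped like a container status. This is an illustration only: the function name terminal_reason is hypothetical, the relative order of the OOMKilled and ContainerKilled checks is assumed, and the dict uses the JSON key lastState, which is how the Kubernetes API serializes the last termination state.

```python
WAITING_REASONS = {
    "CrashLoopBackOff": "critical",
    "ImagePullBackOff": "warning",
    "ErrImagePull": "warning",
}


def terminal_reason(pod: dict, status: dict):
    """Return (reason, default severity) for one container status, or None."""
    waiting = (status.get("state") or {}).get("waiting") or {}
    if waiting.get("reason") in WAITING_REASONS:
        return waiting["reason"], WAITING_REASONS[waiting["reason"]]

    last = (status.get("lastState") or {}).get("terminated") or {}
    if last.get("reason") == "OOMKilled":
        return "OOMKilled", "critical"

    # Non-OOM SIGKILL counts only while the pod is not being deleted.
    sigkill = last.get("exitCode") == 137 or last.get("signal") == 9
    deleting = bool((pod.get("metadata") or {}).get("deletionTimestamp"))
    if sigkill and not deleting:
        return "ContainerKilled", "warning"
    return None
```

A None result corresponds to the case where no waiting/OOM/kill state matched, which is when the per-restart check gets a chance to run.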

Initial sync skips ContainerRestart

On the informer's initial sync (AddFunc), there is no previous pod to compute a restart delta, so only terminal/waiting conditions (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled, ContainerKilled) are evaluated. ContainerRestart fires only on an UpdateFunc where the count increased.
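The restart-delta logic can be approximated as below. It is a simplified sketch: restarted_containers and total_restarts are hypothetical names, a missing previous pod stands in for the initial sync, and the behavior.ignoreRestartCount comparison is deliberately left out.

```python
def total_restarts(pod: dict) -> int:
    """Sum restartCount over all container statuses of a pod."""
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    return sum(s.get("restartCount", 0) for s in statuses)


def restarted_containers(old_pod, new_pod: dict, ignore_exit_zero: bool = False) -> list:
    """Names of containers that would be reported as ContainerRestart."""
    if old_pod is None:
        # Initial sync: there is no previous object, so no delta to compute.
        return []
    if total_restarts(new_pod) <= total_restarts(old_pod):
        return []
    names = []
    for s in (new_pod.get("status") or {}).get("containerStatuses") or []:
        if s.get("restartCount", 0) <= 0:
            continue
        last = (s.get("lastState") or {}).get("terminated") or {}
        if ignore_exit_zero and last.get("exitCode") == 0:
            continue
        names.append(s["name"])
    return names
```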

Node

Kind: Node. Condition-type alerts fire only on a status transition (the condition's status changed from the previous object).

| Reason | Default severity | Trigger |
| --- | --- | --- |
| NodeNotReady | critical | Ready condition transitions to a status other than True. |
| NodeMemoryPressure | critical | MemoryPressure condition transitions to True. |
| NodeDiskPressure | critical | DiskPressure condition transitions to True. |
| NodePIDPressure | critical | PIDPressure condition transitions to True. |
| NodeCordon | warning | spec.unschedulable becomes true (was unset/false). |

Pressure reasons are Node + the Kubernetes condition type

The pressure reasons are built as "Node" + cond.Type, yielding NodeMemoryPressure, NodeDiskPressure, and NodePIDPressure. Node alerts are disabled entirely when the chart is installed with rbac.scope: namespace, since nodes are cluster-scoped.
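A minimal sketch of the transition checks, assuming node objects represented as plain dicts. The helper names condition_status and node_reasons are hypothetical, and treating a missing Ready condition as "no alert" is an assumption of this example.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def condition_status(node: dict, cond_type: str):
    """Return the status string of the named condition, or None if absent."""
    for cond in (node.get("status") or {}).get("conditions") or []:
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def node_reasons(old: dict, new: dict) -> list:
    """Reasons raised by the transition from the old node object to the new one."""
    reasons = []
    old_ready = condition_status(old, "Ready")
    new_ready = condition_status(new, "Ready")
    if new_ready is not None and new_ready != old_ready and new_ready != "True":
        reasons.append("NodeNotReady")
    for cond_type in PRESSURE_TYPES:
        status = condition_status(new, cond_type)
        if status == "True" and status != condition_status(old, cond_type):
            reasons.append("Node" + cond_type)  # e.g. NodeDiskPressure
    was = bool((old.get("spec") or {}).get("unschedulable"))
    now = bool((new.get("spec") or {}).get("unschedulable"))
    if now and not was:
        reasons.append("NodeCordon")
    return reasons
```

Comparing an object with itself yields no reasons, which reflects the transition-only behavior: a node that stays NotReady across updates is not reported again by this check.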

Deployment

Kind: Deployment.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| DeploymentUnavailable | warning | status.unavailableReplicas > 0. |
| ProgressDeadlineExceeded | critical | A Progressing condition with reason == ProgressDeadlineExceeded. |

StatefulSet

Kind: StatefulSet.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| StatefulSetReplicasUnavailable | warning | spec.replicas is set and non-zero, status.readyReplicas < spec.replicas, and status.observedGeneration >= metadata.generation (stale-spec guard). |
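
The three parts of this trigger, including the stale-spec guard, might look like the following when written against a dict-shaped StatefulSet. The function name statefulset_unavailable and the zero defaults for missing fields are assumptions made for illustration.

```python
def statefulset_unavailable(sts: dict) -> bool:
    """True when the StatefulSetReplicasUnavailable trigger would hold."""
    spec = sts.get("spec") or {}
    status = sts.get("status") or {}
    desired = spec.get("replicas")
    if not desired:
        # spec.replicas unset or zero: nothing is expected to be ready.
        return False
    generation = (sts.get("metadata") or {}).get("generation", 0)
    if status.get("observedGeneration", 0) < generation:
        # Stale-spec guard: the controller has not yet observed the latest spec.
        return False
    return status.get("readyReplicas", 0) < desired
```

The guard keeps a freshly scaled-up StatefulSet from alerting before its status has caught up with the new spec.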

DaemonSet

Kind: DaemonSet.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| DaemonSetUnavailable | warning | status.numberUnavailable > 0. |

Job

Kind: Job.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| JobFailed | critical | A Failed condition with status True (backoffLimit hit). |

CronJob

Kind: CronJob. Evaluated only on update events (requires a previous object).

| Reason | Default severity | Trigger |
| --- | --- | --- |
| CronJobSuspended | info | spec.suspend transitions to true (was unset/false). |
| CronJobMissingSuccess | warning | A new status.lastScheduleTime arrived and the previous tick never produced a success (lastSuccessfulTime is nil or earlier than the old lastScheduleTime). |

CronJobMissingSuccess does not parse cron expressions

Detection is event-driven: each new schedule tick is an Update event, and at that moment the watcher checks whether the previous tick ever succeeded. Individual failed runs already alert as JobFailed via the Job watcher.
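The timestamp comparison can be sketched without any cron parsing, as below. Several details are assumptions of this example rather than confirmed behavior: the helper names, reading lastSuccessfulTime from the new object, and staying silent when there is no earlier tick to judge.

```python
from datetime import datetime


def parse_time(value):
    """Parse an RFC 3339 timestamp such as 2024-01-01T00:00:00Z, or return None."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def missing_success(old: dict, new: dict) -> bool:
    """True when a new tick arrived and the previous tick never succeeded."""
    old_tick = parse_time((old.get("status") or {}).get("lastScheduleTime"))
    new_tick = parse_time((new.get("status") or {}).get("lastScheduleTime"))
    if new_tick is None or new_tick == old_tick:
        return False  # no new schedule tick on this update
    if old_tick is None:
        return False  # first tick ever: there is no previous run to judge
    last_success = parse_time((new.get("status") or {}).get("lastSuccessfulTime"))
    return last_success is None or last_success < old_tick
```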

PersistentVolumeClaim

Kind: PersistentVolumeClaim.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| PVCLost | critical | status.phase == Lost. |
| PVCPending | warning | status.phase == Pending and the claim has existed longer than behavior.pvcPendingSeconds. |

PVC pending threshold falls back to 5m if non-positive

The watcher treats behavior.pvcPendingSeconds as a threshold in seconds; if the value is <= 0, it falls back to 5 minutes. (Validate() requires pvcPendingSeconds > 0, so this fallback only applies when validation is bypassed.)
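The fallback and the age comparison can be illustrated as follows. This sketch assumes the claim's age is measured from its creation time; pending_threshold and pvc_reason are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_PENDING = timedelta(minutes=5)


def pending_threshold(pvc_pending_seconds: int) -> timedelta:
    """Configured threshold, or the 5-minute fallback for non-positive values."""
    if pvc_pending_seconds <= 0:
        return DEFAULT_PENDING
    return timedelta(seconds=pvc_pending_seconds)


def pvc_reason(phase: str, created_at: datetime, now: datetime, pvc_pending_seconds: int):
    """Return the alert reason for a claim, or None when nothing fires."""
    if phase == "Lost":
        return "PVCLost"
    if phase == "Pending" and now - created_at > pending_threshold(pvc_pending_seconds):
        return "PVCPending"
    return None
```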

HorizontalPodAutoscaler

Kind: HorizontalPodAutoscaler.

| Reason | Default severity | Trigger |
| --- | --- | --- |
| HPAMaxedOut | warning | status.currentReplicas >= spec.maxReplicas and a ScalingLimited condition with status True and reason == TooManyReplicas. |

Sitting at max alone does not alert

Both conditions must hold: the HPA must be pinned at maxReplicas and the autoscaler must itself report ScalingLimited == True with reason TooManyReplicas. A workload that happens to need exactly maxReplicas does not alert.
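Expressed as a sketch over a dict-shaped HPA object, the two-part check could read as below. The function name hpa_maxed_out is hypothetical, and the handling of missing fields is an assumption.

```python
def hpa_maxed_out(hpa: dict) -> bool:
    """True only when the HPA is at max AND reports it is being limited."""
    spec = hpa.get("spec") or {}
    status = hpa.get("status") or {}
    max_replicas = spec.get("maxReplicas")
    if max_replicas is None or status.get("currentReplicas", 0) < max_replicas:
        return False
    for cond in status.get("conditions") or []:
        if (
            cond.get("type") == "ScalingLimited"
            and cond.get("status") == "True"
            and cond.get("reason") == "TooManyReplicas"
        ):
            return True
    return False
```

An HPA running at maxReplicas whose ScalingLimited condition is False returns False here, matching the case of a workload that needs exactly maxReplicas.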