Watcher conditions
This page lists every alert reason emitted by each resource watcher, its
default severity, and the exact condition that triggers it. Default severities
are hardcoded in the watcher source and can be remapped with severityOverrides.
The reason is the fourth component of the dedupe fingerprint
(sha256(kind|namespace|name|reason)) and is the value matched by the reason
keys of routing, severityOverrides, inhibitions, and silences. Those reason
keys accept an anchored regex.
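As a rough illustration of the fingerprint and of anchored reason matching, the Python sketch below hashes the four components joined by `|` and treats a pattern as matching only when it covers the whole reason. The hex encoding, the function names, and the use of Python's `re` module are assumptions made for illustration; they are not taken from the watcher source, and its regex dialect may differ.

```python
import hashlib
import re


def fingerprint(kind: str, namespace: str, name: str, reason: str) -> str:
    """Hash kind|namespace|name|reason with SHA-256 (hex output is an assumption)."""
    key = "|".join([kind, namespace, name, reason])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def reason_matches(pattern: str, reason: str) -> bool:
    """Anchored match: the pattern must cover the entire reason string."""
    return re.fullmatch(pattern, reason) is not None


crash = fingerprint("Pod", "default", "api-0", "CrashLoopBackOff")
oom = fingerprint("Pod", "default", "api-0", "OOMKilled")
print(crash != oom)  # the reason is part of the key, so the fingerprints differ

print(reason_matches("Node.*Pressure", "NodeDiskPressure"))
print(reason_matches("Node", "NodeNotReady"))  # a bare prefix does not match when anchored
```

Because the reason is part of the hashed key, two different reasons on the same object dedupe independently.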
Pod
Kind: Pod. Container waiting/terminated states are checked first and
short-circuit; per-restart alerts run only when no waiting/OOM state matched.
| Reason | Default severity | Trigger |
|---|---|---|
| CrashLoopBackOff | critical | A container status has state.waiting.reason == CrashLoopBackOff. |
| ImagePullBackOff | warning | A container status has state.waiting.reason == ImagePullBackOff. |
| ErrImagePull | warning | A container status has state.waiting.reason == ErrImagePull. |
| OOMKilled | critical | A container's lastTerminationState.terminated.reason == OOMKilled. |
| ContainerKilled | warning | A container's last termination was a non-OOM SIGKILL (exitCode == 137 or signal == 9) and the pod is not being deleted (metadata.deletionTimestamp unset). Catches liveness-probe escalation, terminationGracePeriodSeconds exceeded mid-run, and runtime force-kills. SIGKILL during normal teardown (rollout, scale-down, eviction) sets deletionTimestamp, so graceful shutdowns stay silent. |
| ContainerRestart | warning | Total restart count increased on update and is <= behavior.ignoreRestartCount; per container with restartCount > 0. Skipped if ignoreRestartsWithExitCodeZero and the last termination exit code was 0. |
All container alerts append the last-termination cause to the summary when present (e.g. - last termination: SIGKILL (exit 137) / SIGTERM (exit 143) / Error (exit 1)), so the signal/exit code is visible without opening the Container State block.
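The waiting, OOM, and SIGKILL checks in the table can be sketched as a single classification function. This is an illustrative Python sketch, not the watcher's code: container statuses are plain dicts keyed by the field names used in the table, `classify_container` is a hypothetical name, and the order of the checks is an assumption.

```python
from typing import Optional

# Default severities for the waiting reasons, copied from the table above.
WAITING_SEVERITY = {
    "CrashLoopBackOff": "critical",
    "ImagePullBackOff": "warning",
    "ErrImagePull": "warning",
}


def classify_container(status: dict, pod_deleting: bool) -> Optional[tuple]:
    """Return (reason, severity) for the first matching state, or None."""
    waiting = (status.get("state") or {}).get("waiting") or {}
    if waiting.get("reason") in WAITING_SEVERITY:
        return waiting["reason"], WAITING_SEVERITY[waiting["reason"]]

    last = (status.get("lastTerminationState") or {}).get("terminated") or {}
    if last.get("reason") == "OOMKilled":
        return "OOMKilled", "critical"

    # Non-OOM SIGKILL: exit code 137 (128 + 9) or an explicit signal 9.
    sigkill = last.get("exitCode") == 137 or last.get("signal") == 9
    if sigkill and not pod_deleting:
        # A set deletionTimestamp means normal teardown, which stays silent.
        return "ContainerKilled", "warning"
    return None


killed = {"lastTerminationState": {"terminated": {"reason": "Error", "exitCode": 137}}}
print(classify_container(killed, pod_deleting=False))
print(classify_container(killed, pod_deleting=True))
```

The `pod_deleting` flag stands in for "metadata.deletionTimestamp is set", so the same SIGKILL is reported mid-run but ignored during teardown.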
Initial sync skips ContainerRestart
On the informer's initial sync (AddFunc), there is no previous pod to
compute a restart delta, so only terminal/waiting conditions
(CrashLoopBackOff, ImagePullBackOff, ErrImagePull, OOMKilled,
ContainerKilled) are evaluated. ContainerRestart fires only on an UpdateFunc where the count
increased.
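The difference between the initial sync and an update can be sketched as a delta check that needs a previous object. The Python below is a simplified illustration: pods are plain dicts, the function names are hypothetical, and it deliberately leaves out the behavior.ignoreRestartCount and ignoreRestartsWithExitCodeZero checks described in the table.

```python
from typing import Optional


def total_restarts(pod: dict) -> int:
    """Sum restartCount over all container statuses of a pod."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return sum(cs.get("restartCount", 0) for cs in statuses)


def restart_candidates(old_pod: Optional[dict], new_pod: dict) -> list:
    """Names of containers eligible for ContainerRestart on this event."""
    if old_pod is None:
        # Initial sync (add event): there is nothing to diff against.
        return []
    if total_restarts(new_pod) <= total_restarts(old_pod):
        # The total did not increase, so there is no new restart to report.
        return []
    statuses = new_pod.get("status", {}).get("containerStatuses", [])
    return [cs["name"] for cs in statuses if cs.get("restartCount", 0) > 0]


before = {"status": {"containerStatuses": [{"name": "app", "restartCount": 1},
                                           {"name": "sidecar", "restartCount": 0}]}}
after = {"status": {"containerStatuses": [{"name": "app", "restartCount": 2},
                                          {"name": "sidecar", "restartCount": 0}]}}
print(restart_candidates(None, after))    # add event
print(restart_candidates(before, after))  # update event with a higher total
```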
Node
Kind: Node. Condition-type alerts fire only on a status transition (the
condition's status changed from the previous object).
| Reason | Default severity | Trigger |
|---|---|---|
| NodeNotReady | critical | Ready condition transitions to a status other than True. |
| NodeMemoryPressure | critical | MemoryPressure condition transitions to True. |
| NodeDiskPressure | critical | DiskPressure condition transitions to True. |
| NodePIDPressure | critical | PIDPressure condition transitions to True. |
| NodeCordon | warning | spec.unschedulable becomes true (was unset/false). |
Pressure reasons are Node + the Kubernetes condition type
The pressure reasons are built as "Node" + cond.Type, yielding
NodeMemoryPressure, NodeDiskPressure, and NodePIDPressure. Node alerts
are disabled entirely when the chart is installed with rbac.scope:
namespace, since nodes are cluster-scoped.
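A transition-based check along these lines could look like the following Python sketch. Nodes are plain dicts, the helper names are hypothetical, and the handling of a condition that is missing from the new object (treated here as "no alert") is an assumption rather than documented behavior.

```python
PRESSURE_TYPES = ("MemoryPressure", "DiskPressure", "PIDPressure")


def condition_status(node: dict, cond_type: str):
    """Return the status string of a node condition, or None if absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def node_reasons(old: dict, new: dict) -> list:
    """Reasons raised by the change from the old node object to the new one."""
    reasons = []

    old_ready = condition_status(old, "Ready")
    new_ready = condition_status(new, "Ready")
    if new_ready is not None and new_ready != old_ready and new_ready != "True":
        reasons.append("NodeNotReady")

    for cond_type in PRESSURE_TYPES:
        now_true = condition_status(new, cond_type) == "True"
        was_true = condition_status(old, cond_type) == "True"
        if now_true and not was_true:
            reasons.append("Node" + cond_type)  # e.g. NodeDiskPressure

    was_cordoned = bool(old.get("spec", {}).get("unschedulable"))
    now_cordoned = bool(new.get("spec", {}).get("unschedulable"))
    if now_cordoned and not was_cordoned:
        reasons.append("NodeCordon")
    return reasons


old = {"status": {"conditions": [{"type": "Ready", "status": "True"},
                                 {"type": "DiskPressure", "status": "False"}]}}
new = {"status": {"conditions": [{"type": "Ready", "status": "Unknown"},
                                 {"type": "DiskPressure", "status": "True"}]}}
print(node_reasons(old, new))
print(node_reasons(new, new))  # unchanged object, so no transitions
```

Comparing against the previous object is what keeps a node that stays NotReady from re-alerting on every update.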
Deployment
Kind: Deployment.
| Reason | Default severity | Trigger |
|---|---|---|
| DeploymentUnavailable | warning | status.unavailableReplicas > 0. |
| ProgressDeadlineExceeded | critical | A Progressing condition with reason == ProgressDeadlineExceeded. |
StatefulSet
Kind: StatefulSet.
| Reason | Default severity | Trigger |
|---|---|---|
| StatefulSetReplicasUnavailable | warning | spec.replicas is set and non-zero, status.readyReplicas < spec.replicas, and status.observedGeneration >= metadata.generation (stale-spec guard). |
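
The three-part StatefulSet condition, including the stale-spec guard, can be sketched as below. The object is a plain dict and `statefulset_unavailable` is a hypothetical name. Treating missing status fields as 0 is an assumption.

```python
def statefulset_unavailable(sts: dict) -> bool:
    """True when fewer replicas are ready than the observed spec asks for."""
    desired = sts.get("spec", {}).get("replicas")
    if not desired:
        return False  # replicas unset or zero: nothing is expected to be ready

    status = sts.get("status", {})
    generation = sts.get("metadata", {}).get("generation", 0)
    if status.get("observedGeneration", 0) < generation:
        # Stale-spec guard: the controller has not processed the latest spec,
        # so the status still describes an older revision.
        return False

    return status.get("readyReplicas", 0) < desired


scaling_up = {"metadata": {"generation": 4},
              "spec": {"replicas": 5},
              "status": {"observedGeneration": 3, "readyReplicas": 3}}
degraded = {"metadata": {"generation": 4},
            "spec": {"replicas": 5},
            "status": {"observedGeneration": 4, "readyReplicas": 3}}
print(statefulset_unavailable(scaling_up))
print(statefulset_unavailable(degraded))
```

The guard avoids comparing a freshly edited spec.replicas against a status that the controller has not yet refreshed.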
DaemonSet
Kind: DaemonSet.
| Reason | Default severity | Trigger |
|---|---|---|
| DaemonSetUnavailable | warning | status.numberUnavailable > 0. |
Job
Kind: Job.
| Reason | Default severity | Trigger |
|---|---|---|
| JobFailed | critical | A Failed condition with status True (backoffLimit hit). |
CronJob
Kind: CronJob. Evaluated only on update events (requires a previous object).
| Reason | Default severity | Trigger |
|---|---|---|
| CronJobSuspended | info | spec.suspend transitions to true (was unset/false). |
| CronJobMissingSuccess | warning | A new status.lastScheduleTime arrived and the previous tick never produced a success (lastSuccessfulTime is nil or earlier than the old lastScheduleTime). |
CronJobMissingSuccess does not parse cron expressions
Detection is event-driven: each new schedule tick is an Update event, and
at that moment the watcher checks whether the previous tick ever
succeeded. Individual failed runs already alert as JobFailed via the Job
watcher.
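The event-driven check can be sketched with three timestamps and no cron parsing at all. In this Python illustration the function name is hypothetical, and two details are assumptions: lastSuccessfulTime is read from the new object, and a CronJob with no previous lastScheduleTime is not alerted on.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cronjob_missing_success(old_schedule: Optional[datetime],
                            new_schedule: Optional[datetime],
                            last_success: Optional[datetime]) -> bool:
    """True when a new tick arrived and the previous tick never succeeded."""
    if new_schedule is None or new_schedule == old_schedule:
        return False  # no new schedule tick in this update
    if old_schedule is None:
        return False  # no previous tick to judge (assumption)
    # The previous tick succeeded only if a success was recorded at or after it.
    return last_success is None or last_success < old_schedule


tick1 = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
tick2 = tick1 + timedelta(hours=1)

print(cronjob_missing_success(tick1, tick2, None))
print(cronjob_missing_success(tick1, tick2, tick1 + timedelta(minutes=2)))
```

Here the first call models a previous run that never succeeded, and the second models a run that finished two minutes after its tick.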
PersistentVolumeClaim
Kind: PersistentVolumeClaim.
| Reason | Default severity | Trigger |
|---|---|---|
| PVCLost | critical | status.phase == Lost. |
| PVCPending | warning | status.phase == Pending and the claim has existed longer than behavior.pvcPendingSeconds. |
PVC pending threshold falls back to 5m if non-positive
The watcher interprets behavior.pvcPendingSeconds as a number of seconds; if
that value is <= 0, it falls back to 5 minutes. (Validate() requires
pvcPendingSeconds > 0, so this fallback applies only when validation is
bypassed.)
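The threshold fallback and the two PVC checks can be sketched as follows. This is an illustrative Python sketch with hypothetical function names. Measuring the claim's age from its creation time is an assumption about what "has existed longer than" means.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_PENDING = timedelta(minutes=5)


def pending_threshold(pvc_pending_seconds: int) -> timedelta:
    """Configured threshold, or 5 minutes when the value is non-positive."""
    if pvc_pending_seconds <= 0:
        return DEFAULT_PENDING
    return timedelta(seconds=pvc_pending_seconds)


def pvc_reason(phase: str, created_at: datetime, now: datetime,
               pvc_pending_seconds: int) -> Optional[str]:
    """Return PVCLost, PVCPending, or None for a claim."""
    if phase == "Lost":
        return "PVCLost"
    if phase == "Pending" and now - created_at > pending_threshold(pvc_pending_seconds):
        return "PVCPending"
    return None


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(pending_threshold(0))  # non-positive value, so the 5 minute default applies
print(pvc_reason("Pending", now - timedelta(minutes=10), now, 120))
print(pvc_reason("Pending", now - timedelta(seconds=30), now, 120))
```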
HorizontalPodAutoscaler
Kind: HorizontalPodAutoscaler.
| Reason | Default severity | Trigger |
|---|---|---|
| HPAMaxedOut | warning | status.currentReplicas >= spec.maxReplicas and a ScalingLimited condition with status True and reason == TooManyReplicas. |
Sitting at max alone does not alert
Both conditions must hold: the HPA must be pinned at maxReplicas and the
autoscaler must itself report ScalingLimited == True with reason
TooManyReplicas. A workload that happens to need exactly maxReplicas
does not alert.
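The two-part requirement can be sketched as a single predicate. In the Python below the HPA is a plain dict and `hpa_maxed_out` is a hypothetical name. It is an illustration of the rule as described, not the watcher's implementation.

```python
def hpa_maxed_out(hpa: dict) -> bool:
    """True only when the HPA is at max AND reports it wanted to scale further."""
    status = hpa.get("status", {})
    max_replicas = hpa.get("spec", {}).get("maxReplicas")
    if max_replicas is None or status.get("currentReplicas", 0) < max_replicas:
        return False  # not pinned at maxReplicas

    for cond in status.get("conditions", []):
        if (cond.get("type") == "ScalingLimited"
                and cond.get("status") == "True"
                and cond.get("reason") == "TooManyReplicas"):
            return True  # the autoscaler itself says the max is the limit
    return False


at_max_but_content = {"spec": {"maxReplicas": 10},
                      "status": {"currentReplicas": 10,
                                 "conditions": [{"type": "ScalingLimited",
                                                 "status": "False",
                                                 "reason": "DesiredWithinRange"}]}}
capped = {"spec": {"maxReplicas": 10},
          "status": {"currentReplicas": 10,
                     "conditions": [{"type": "ScalingLimited",
                                     "status": "True",
                                     "reason": "TooManyReplicas"}]}}
print(hpa_maxed_out(at_max_but_content))
print(hpa_maxed_out(capped))
```

The first object models a workload that needs exactly maxReplicas. The second models one that the autoscaler would scale further if it could.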