Configuration reference

Complete schema for config.yaml (mounted from a ConfigMap). Every key is listed with its YAML type, the default applied by applyEnvDefaults, the legacy v1 environment-variable fallback (used only when the YAML key is unset/zero), the validation rule enforced by Validate() at load, and a description.

Load order: the YAML file is parsed, then env-var fallbacks are layered for unset keys, then Validate() runs. A config path that cannot be read is a hard error - the controller does not boot on env defaults alone.

Env fallbacks fire only on the zero value

Fallbacks apply when the YAML field equals its zero value ("", 0, false). A legitimately-zero numeric value cannot be expressed for keys that have a non-zero default - e.g. setting behavior.muteSeconds: 0 triggers the MUTE_SECONDS lookup and then fails validation (muteSeconds (...) must exceed the informer resync period (300s)).
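As a hedged illustration of the rule above, the fragment below marks three values that the loader would treat as unset. The outcomes in the comments are a reading of the defaults and fallbacks listed in the behavior table, not captured loader output.

```yaml
behavior:
  # 0 is the zero value, so this counts as unset: IGNORE_RESTART_COUNT is
  # consulted, and the documented default (30) applies if that is unset too.
  ignoreRestartCount: 0

  # false is also a zero value, so IGNORE_RESTARTS_WITH_EXIT_CODE_ZERO="true"
  # in the environment would still switch this on.
  ignoreRestartsWithExitCodeZero: false

  # The default is already 0, so an explicit 0 and an omitted key behave
  # alike (unless STARTUP_GRACE_SECONDS is set).
  startupGraceSeconds: 0
```

The practical consequence is that a stray legacy env var can still take effect for any key left at its zero value in YAML.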

Top-level

| Path | Type | Default | Env fallback | Validation | Description |
|---|---|---|---|---|---|
| cluster | string | "" | CLUSTER_NAME | - | Cluster name rendered into every alert. |
| metricsAddr | string | :9090 | METRICS_ADDR | - | Listen address for /metrics, /healthz, /readyz, and (when co-located) the data plane. |
| apiAddr | string | "" | ALERTKUBE_API_ADDR | - | Optional SEPARATE listen address for the sensitive data plane (/api/*, console, receiver). Empty co-locates everything on metricsAddr; set it to firewall the data port independently of /metrics + probes. |
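A minimal sketch of splitting the two listeners, assuming port 9443 is free (the port number is a placeholder, not a value taken from this reference):

```yaml
metricsAddr: ":9090"   # /metrics, /healthz, /readyz
apiAddr: ":9443"       # hypothetical port for /api/*, console, receiver
```

With the data plane on its own port, a NetworkPolicy can admit scrapers and probes on 9090 while restricting 9443 to trusted callers.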

Runtime tuning & scaling (environment only)

These are set via environment variables (not the YAML config):

| Env var | Default | Description |
|---|---|---|
| ALERTKUBE_DISPATCH_WORKERS | 16 | Delivery worker-pool size (async fan-out decoupled from the informer thread). |
| ALERTKUBE_DISPATCH_QUEUE | 2048 | Delivery queue capacity before enqueue applies backpressure. |
| ALERTKUBE_ENABLE_PPROF | false | Serve /debug/pprof (read-token gated, fail-closed). |
| ALERTKUBE_SHARD_TOTAL | 1 | Number of shards for horizontal scaling (>1 enables sharding). |
| ALERTKUBE_SHARD_INDEX | 0 | This replica's shard, 0..TOTAL-1 (must be unique/stable per replica; see HA & sharding). |
| ALERTKUBE_CLIENT_QPS / ALERTKUBE_CLIENT_BURST | 50 / 100 | Kubernetes REST client throttle. |
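For illustration only, these variables could be set in the controller's container spec as below. The values are arbitrary examples, and how each replica obtains a distinct shard index is covered in HA & sharding rather than here.

```yaml
# Fragment of a container spec (not a complete manifest).
env:
  - name: ALERTKUBE_DISPATCH_WORKERS
    value: "32"      # env values must be strings in a pod spec
  - name: ALERTKUBE_DISPATCH_QUEUE
    value: "4096"
  - name: ALERTKUBE_SHARD_TOTAL
    value: "2"
  - name: ALERTKUBE_SHARD_INDEX
    value: "0"       # the second replica would use "1"
```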

filters

Namespace and pod-name include/exclude filters. Values are passed to the filter set (comma-separated literals and/or regex, depending on the matcher).

| Path | Type | Default | Env fallback | Validation | Description |
|---|---|---|---|---|---|
| filters.watchedNamespaces | string | "" | WATCHED_NAMESPACES | - | Only namespaces matching are watched (empty = all). |
| filters.ignoredNamespaces | string | "" | IGNORED_NAMESPACES | - | Namespaces matching are excluded. |
| filters.watchedPodNamePrefixes | string | "" | WATCHED_POD_NAME_PREFIXES | - | Only pods whose name matches are watched (empty = all). |
| filters.ignoredPodNamePrefixes | string | "" | IGNORED_POD_NAME_PREFIXES | - | Pods whose name matches are excluded. |

behavior

| Path | Type | Default | Env fallback | Validation | Description |
|---|---|---|---|---|---|
| behavior.muteSeconds | int | 600 | MUTE_SECONDS | must be > 300 | Dedupe mute window: a repeated fingerprint is suppressed for this many seconds. Must exceed the 300s informer resync period. |
| behavior.ignoreRestartCount | int | 30 | IGNORE_RESTART_COUNT | must be >= 0 | Stop per-restart ContainerRestart alerts once a pod's total restart count exceeds this (CrashLoopBackOff detection still fires). |
| behavior.ignoreRestartsWithExitCodeZero | bool | false | IGNORE_RESTARTS_WITH_EXIT_CODE_ZERO == "true" | - | Skip ContainerRestart alerts whose previous termination exit code was 0. |
| behavior.resolveTTLSeconds | int | 600 | RESOLVE_TTL_SECONDS | must be > 300 | A fingerprint that stops firing for this long emits a synthetic resolved alert. Must exceed the 300s informer resync period. |
| behavior.startupGraceSeconds | int | 0 | STARTUP_GRACE_SECONDS | must be >= 0 | Suppress alerts fired during the first N seconds after start (mutes informer initial-sync re-fires of standing conditions). 0 disables. |
| behavior.pvcPendingSeconds | int | 300 | PVC_PENDING_SECONDS | must be > 0 | How long a PVC may stay Pending before a PVCPending alert fires. |
| behavior.disableLogCollection | bool | false | - | - | Stop fetching previous-container logs for alert enrichment (redaction is pattern-based and best-effort). |
| behavior.disableAnnotationSilences | bool | false | - | - | Ignore the alert-silence-until annotation so workload authors cannot self-silence. |

Helm default differs from the binary default for startupGraceSeconds

The Go default in applyEnvDefaults is 0 (disabled). The Helm chart values.yaml ships behavior.startupGraceSeconds: 30, so a Helm install gets a 30-second grace window unless overridden.

channels

Default Slack channel names per severity tier.

| Path | Type | Default | Env fallback | Validation | Description |
|---|---|---|---|---|---|
| channels.critical | string | alerts-critical | SLACK_CHANNEL_CRITICAL | - | Channel for critical alerts. |
| channels.warning | string | alerts-warning | SLACK_CHANNEL_WARNING, then SLACK_CHANNEL | - | Channel for warning alerts. |
| channels.info | string | alerts-info | SLACK_CHANNEL_INFO | - | Channel for info alerts. |

channels.warning has a two-stage env fallback

When unset, channels.warning first reads SLACK_CHANNEL_WARNING; if that is empty it falls back to the legacy single-channel SLACK_CHANNEL, and only then to the literal alerts-warning.

routing

List of routing rules. First-match semantics are applied per alert; each rule maps a match map to a list of sinks.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| routing[].match | map[string]string | - | - | Field-equality match. Keys namespace and reason accept an anchored regex; all other keys (severity, kind, name, node, label keys) are exact equality. |
| routing[].sinks | []string | - | non-empty; every entry must be a known sink | Sinks that receive the matched alert. |

Known sink names: slack, pagerduty, teams, webhook, stdout, discord, telegram, opsgenie.
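Because the first matching rule wins, rule order matters. The sketch below uses a hypothetical payments-* namespace family to show a specific rule placed ahead of a general one:

```yaml
routing:
  # Specific rule first: critical alerts from payments-* namespaces page.
  - match: {severity: critical, namespace: "payments-.*"}
    sinks: [pagerduty, slack]
  # General rule second: every other critical alert goes to Slack only.
  - match: {severity: critical}
    sinks: [slack]
```

If the two rules were swapped, the general rule would match every critical alert first and the payments rule would never apply. Since the namespace regex is anchored, payments-.* matches payments-api but not eu-payments-api.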

severityOverrides

Remap an alert's severity before dedupe and routing. First match wins.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| severityOverrides[].match | map[string]string | - | must be non-empty | Match map; same semantics as routing (namespace/reason anchored regex, others exact). |
| severityOverrides[].severity | string | - | must be critical, warning, or info | Severity assigned on match. |

sinkRates

Per-sink token-bucket rate-limit overrides. Keyed by sink name.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| sinkRates.<sink>.perSecond | float | 1 per second (unlisted sinks) | must be > 0 | Sustained send rate. |
| sinkRates.<sink>.burst | int | 5 (unlisted sinks) | must be >= 1 | Token-bucket burst size. |

The map key (<sink>) must be a known sink name. Unlisted sinks keep the conservative default of 1 msg/sec with burst 5 (Slack's published webhook limit).

inhibitions

Suppress dependent (target) alerts while a source alert is active.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| inhibitions[].source | map[string]string | - | - | Match map for the alert that triggers suppression. |
| inhibitions[].target | map[string]string | - | - | Match map for alerts to suppress while a source is active. |
| inhibitions[].equal | []string | - | - | Label/field names that must be equal between source and target (e.g. node). |
| inhibitions[].duration | string (Go duration) | 10m if empty or unparseable | if set, must parse as a Go duration | How long the inhibition holds after the source fires. |

An empty or unparseable duration falls back to 10m at runtime

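Validate() rejects a non-empty duration that fails time.ParseDuration, so an unparseable value in config.yaml is caught at load and never reaches the runtime fallback. An empty string passes validation, and DurationParsed() returns 10m for it.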
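The fragment below sketches which duration strings are accepted. The match values are illustrative; the accepted and rejected forms follow the documented behavior of Go's time.ParseDuration, which has no unit larger than hours.

```yaml
inhibitions:
  - source: {kind: Node, reason: NodeNotReady}
    target: {kind: Pod}
    equal: [node]
    duration: 1h30m     # valid: units can be combined
  - source: {reason: NodeNotReady}
    target: {reason: ContainerRestart}
    equal: [node]
    duration: ""        # passes validation; DurationParsed() returns 10m
  # Rejected at load, because time.ParseDuration cannot parse them:
  #   duration: 1d          (no day unit; write 24h)
  #   duration: 10 minutes  (spelled-out units are not accepted)
```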

silences

Time-bounded matchers that suppress alerts until a timestamp.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| silences[].matchers | map[string]string | - | - | Match map (same field semantics as routing). |
| silences[].until | string (RFC3339) | - | must parse as RFC3339 | Silence expiry timestamp. |
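A short sketch of timestamp forms for until, assuming the check is done with Go's time.RFC3339 layout (the reference only says the value must parse as RFC3339). The staging-.* matcher is a made-up example.

```yaml
silences:
  - matchers: {namespace: kube-system}
    until: "2026-06-15T00:00:00Z"        # UTC
  - matchers: {namespace: "staging-.*", severity: warning}
    until: "2026-06-15T09:00:00+02:00"   # explicit offset
  # Under that assumption these would fail the check:
  #   until: "2026-06-15"            (date only)
  #   until: "2026-06-15 00:00:00"   (space separator, no zone)
```

Quoting the timestamp keeps it a plain string, so a YAML parser does not reinterpret it before validation.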

escalations

Re-dispatch a still-unresolved matching alert to extra sinks after a delay. Each rule fires at most once per alert lifetime.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| escalations[].match | map[string]string | - | - | Match map; same semantics as routing rules. |
| escalations[].afterMinutes | int | - | must be > 0 | Minutes the alert must remain unresolved before escalating. |
| escalations[].sinks | []string | - | non-empty; every entry must be a known sink | Additional sinks to re-dispatch to. |

receiver

Alertmanager webhook receiver on POST /api/v1/alerts (served on metricsAddr, or on apiAddr when that is set).

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| receiver.enabled | bool | false | - | Enable the Alertmanager webhook receiver. Optional bearer auth via the ALERTKUBE_RECEIVER_TOKEN env var. |
| receiver.allowAnonymous | bool | false | - | Allow requests without a bearer token. Only safe when the port is locked down by NetworkPolicy. |
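A sketch of wiring both sides together, under stated assumptions: the service host alertkube.monitoring.svc, the receiver name, and the token placeholder are all hypothetical, and port 9090 presumes the receiver is co-located on the default metricsAddr. The Alertmanager half uses its standard webhook_configs and http_config.authorization settings.

```yaml
# config.yaml
receiver:
  enabled: true
  allowAnonymous: false   # requests without a bearer token are not allowed
---
# alertmanager.yml (fragment); host and token are placeholders
receivers:
  - name: alertkube
    webhook_configs:
      - url: http://alertkube.monitoring.svc:9090/api/v1/alerts
        http_config:
          authorization:
            type: Bearer
            credentials: "REPLACE_WITH_RECEIVER_TOKEN"
```

The credentials value would need to match whatever is set in ALERTKUBE_RECEIVER_TOKEN on the controller.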

grouping

Storm folding: the first alert of a group dispatches immediately; later same-group alerts within the window collapse into one summary. Stateful incident sinks (pagerduty, opsgenie) still receive every resolve and never receive summaries.

| Path | Type | Default | Validation | Description |
|---|---|---|---|---|
| grouping.enabled | bool | false | - | Enable storm folding. |
| grouping.windowSeconds | int | 30 | when enabled, must be > 0 | Collapse window length. |
| grouping.by | []string | [kind, namespace, reason, severity] | when enabled, no entry may be empty | Fields forming the group identity. |

grouping.windowSeconds defaults to 30 even when grouping is off

applyEnvDefaults sets windowSeconds to 30 whenever it is 0, regardless of enabled. Validation of windowSeconds/by only runs when grouping.enabled is true.
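As an illustrative configuration with grouping switched on (the window and field choice are arbitrary examples):

```yaml
grouping:
  enabled: true
  windowSeconds: 60          # must be > 0 once enabled
  by: [namespace, reason]    # coarser than the default four fields
```

With these settings, a burst of alerts sharing one namespace and reason would dispatch the first alert immediately and fold the rest arriving within 60 seconds into a single summary.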

persistence

Snapshot active-alert and mute state to a ConfigMap so a restart does not lose pending resolves or re-page muted standing conditions. Requires get/create/update on the named ConfigMap.

| Path | Type | Default | Env fallback | Validation | Description |
|---|---|---|---|---|---|
| persistence.enabled | bool | false | - | - | Enable state snapshotting. |
| persistence.configMapName | string | alertkube-state | - | - | Name of the state ConfigMap. |
| persistence.namespace | string | "" | POD_NAMESPACE | when enabled, must be non-empty | Namespace of the state ConfigMap. |

persistence.enabled requires a resolvable namespace

If persistence.enabled is true and persistence.namespace is empty after the POD_NAMESPACE fallback, load fails. The Helm chart sets POD_NAMESPACE via the Downward API. (The chart also defaults persistence.enabled: true, whereas the binary default is false.)
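For installs that do not use the chart, the two prerequisites could be supplied roughly as follows. This is a sketch: the Role name and the monitoring namespace are placeholders, and the chart's own templates may differ.

```yaml
# Container spec fragment: expose the pod's namespace as POD_NAMESPACE.
env:
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
---
# Role granting the access the snapshot needs (names are placeholders).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: alertkube-state
  namespace: monitoring
rules:
  - apiGroups: [""]
    resources: [configmaps]
    resourceNames: [alertkube-state]
    verbs: [get, update]
  - apiGroups: [""]
    resources: [configmaps]
    verbs: [create]   # create cannot be limited by resourceNames
```

The create verb sits in its own rule because Kubernetes RBAC cannot restrict create requests by resource name.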

Annotated example

cluster: prod-us-east-1
metricsAddr: ":9090"

filters:
  watchedNamespaces: "^(prod|staging)-.*"   # regex; only these namespaces
  ignoredPodNamePrefixes: "debug-,test-"     # comma-separated prefixes

behavior:
  muteSeconds: 600                  # dedupe mute window (>300)
  ignoreRestartCount: 30            # stop per-restart alerts past this count
  ignoreRestartsWithExitCodeZero: false
  resolveTTLSeconds: 600            # synthetic resolve after this idle period (>300)
  startupGraceSeconds: 30           # mute initial-sync re-fires; 0 disables
  pvcPendingSeconds: 300            # PVC Pending tolerance before alerting (>0)
  disableLogCollection: false       # skip previous-container log enrichment
  disableAnnotationSilences: false  # ignore alert-silence-until annotations

persistence:
  enabled: true
  configMapName: alertkube-state    # namespace defaults to POD_NAMESPACE

channels:
  critical: alerts-critical
  warning:  alerts-warning
  info:     alerts-info

routing:
  - match: {severity: critical}
    sinks: [slack, pagerduty]
  - match: {severity: warning, namespace: prod-.*}   # namespace is an anchored regex
    sinks: [slack]
  - match: {severity: info}
    sinks: [slack]

severityOverrides:
  - match: {kind: Pod, reason: ImagePullBackOff, namespace: dev-.*}
    severity: info

sinkRates:
  pagerduty:
    perSecond: 10
    burst: 20

grouping:
  enabled: false
  windowSeconds: 30
  by: [kind, namespace, reason, severity]

escalations:
  - match: {severity: critical}
    afterMinutes: 15
    sinks: [pagerduty]

receiver:
  enabled: false                    # bearer auth via ALERTKUBE_RECEIVER_TOKEN

inhibitions:
  - source: {kind: Node, reason: NodeNotReady}
    target: {kind: Pod}
    equal: [node]
    duration: 10m                   # Go duration; empty -> 10m, unparseable fails at load

silences:
  - matchers: {namespace: kube-system}
    until: "2026-06-15T00:00:00Z"   # RFC3339