Configuration reference

Complete schema for config.yaml (mounted from a ConfigMap). Every key is listed with its YAML type, the default applied by applyEnvDefaults, the legacy v1 environment-variable fallback (used only when the YAML key is unset/zero), the validation rule enforced by Validate() at load, and a description.

Load order: the YAML file is parsed, then env-var fallbacks are layered for unset keys, then Validate() runs. A config path that cannot be read is a hard error - the controller does not boot on env defaults alone.

Env fallbacks fire only on the zero value

Fallbacks apply when the YAML field equals its zero value ("", 0, false). A legitimately-zero numeric value cannot be expressed for keys that have a non-zero default - e.g. setting behavior.muteSeconds: 0 triggers the MUTE_SECONDS lookup and then fails validation (muteSeconds (...) must exceed the informer resync period (300s)).
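The zero-value rule can be illustrated with a small Go sketch. It is not the controller's actual applyEnvDefaults code: the helper name intOrEnv is hypothetical, and treating an unparseable environment value as "use the default" is an assumption. The sketch only shows why an explicit 0 in YAML is indistinguishable from an omitted key; what validation does afterwards is outside its scope.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// intOrEnv is a hypothetical helper, not the controller's actual code.
// A non-zero YAML value wins; otherwise the environment variable is tried;
// otherwise the built-in default is returned. Falling back to the default
// when the variable does not parse as an integer is an assumption.
func intOrEnv(yamlValue int, envName string, def int) int {
	if yamlValue != 0 {
		return yamlValue
	}
	if raw, ok := os.LookupEnv(envName); ok {
		if n, err := strconv.Atoi(raw); err == nil {
			return n
		}
	}
	return def
}

func main() {
	os.Setenv("MUTE_SECONDS", "900")
	fmt.Println(intOrEnv(450, "MUTE_SECONDS", 600)) // non-zero YAML value is kept
	fmt.Println(intOrEnv(0, "MUTE_SECONDS", 600))   // 0 looks unset, so the env value is used
}
```

Because the check is `yamlValue != 0`, there is no way for the caller to say "I really mean zero" for a key that has a fallback.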

Top-level

Path	Type	Default	Env fallback	Validation	Description
`cluster`	string	`""`	`CLUSTER_NAME`	-	Cluster name rendered into every alert.
`metricsAddr`	string	`:9090`	`METRICS_ADDR`	-	Listen address for `/metrics`, `/healthz`, `/readyz`, and (when co-located) the data plane.
`apiAddr`	string	`""`	`ALERTKUBE_API_ADDR`	-	Optional SEPARATE listen address for the sensitive data plane (`/api/*`, console, receiver). Empty co-locates everything on `metricsAddr`; set it to firewall the data port independently of `/metrics` + probes.

Runtime tuning & scaling (environment only)

These are set via environment variables (not the YAML config):

Env var	Default	Description
`ALERTKUBE_DISPATCH_WORKERS`	`16`	Delivery worker-pool size (async fan-out decoupled from the informer thread).
`ALERTKUBE_DISPATCH_QUEUE`	`2048`	Delivery queue capacity before enqueue applies backpressure.
`ALERTKUBE_ENABLE_PPROF`	`false`	Serve `/debug/pprof` (read-token gated, fail-closed).
`ALERTKUBE_SHARD_TOTAL`	`1`	Number of shards for horizontal scaling (`>1` enables sharding).
`ALERTKUBE_SHARD_INDEX`	`0`	This replica's shard, `0..TOTAL-1` (must be unique/stable per replica — see HA & sharding).
`ALERTKUBE_CLIENT_QPS` / `ALERTKUBE_CLIENT_BURST`	`50` / `100`	Kubernetes REST client throttle.
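As a rough illustration of the shard settings, the Go sketch below checks the documented index range and shows one possible way a replica could decide whether it owns an object. The range check follows the table; the ownership function (an FNV-1a hash of a key, modulo the shard total) is purely an assumption, since this page does not describe how work is assigned to shards. Both function names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// validateShard checks the documented range: index must lie in 0..total-1.
func validateShard(index, total int) error {
	if total < 1 {
		return fmt.Errorf("shard total must be >= 1, got %d", total)
	}
	if index < 0 || index >= total {
		return fmt.Errorf("shard index %d out of range 0..%d", index, total-1)
	}
	return nil
}

// ownsKey is an ASSUMED assignment scheme (FNV-1a hash modulo total).
// The real controller may partition work differently.
func ownsKey(key string, index, total int) bool {
	if total <= 1 {
		return true // sharding disabled: the single replica owns everything
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()%uint32(total)) == index
}

func main() {
	fmt.Println(validateShard(3, 3)) // an index equal to the total is out of range
	for i := 0; i < 3; i++ {
		fmt.Println(i, ownsKey("prod/api-7d9f", i, 3))
	}
}
```

Whatever the real scheme is, the property it needs is the one this sketch has: for a given key, exactly one index in 0..TOTAL-1 claims it, which is why each replica's index must be unique and stable.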

`filters`

Namespace and pod-name include/exclude filters. Values are passed to the filter set (comma-separated literals and/or regex, depending on the matcher).

Path	Type	Default	Env fallback	Validation	Description
`filters.watchedNamespaces`	string	`""`	`WATCHED_NAMESPACES`	-	Only namespaces matching are watched (empty = all).
`filters.ignoredNamespaces`	string	`""`	`IGNORED_NAMESPACES`	-	Namespaces matching are excluded.
`filters.watchedPodNamePrefixes`	string	`""`	`WATCHED_POD_NAME_PREFIXES`	-	Only pods whose name matches are watched (empty = all).
`filters.ignoredPodNamePrefixes`	string	`""`	`IGNORED_POD_NAME_PREFIXES`	-	Pods whose name matches are excluded.

`behavior`

Path	Type	Default	Env fallback	Validation	Description
`behavior.muteSeconds`	int	`600`	`MUTE_SECONDS`	must be `> 300`	Dedupe mute window: a repeated fingerprint is suppressed for this many seconds. Must exceed the 300s informer resync period.
`behavior.ignoreRestartCount`	int	`30`	`IGNORE_RESTART_COUNT`	must be `>= 0`	Stop per-restart `ContainerRestart` alerts once a pod's total restart count exceeds this (CrashLoopBackOff detection still fires).
`behavior.ignoreRestartsWithExitCodeZero`	bool	`false`	`IGNORE_RESTARTS_WITH_EXIT_CODE_ZERO == "true"`	-	Skip `ContainerRestart` alerts whose previous termination exit code was 0.
`behavior.resolveTTLSeconds`	int	`600`	`RESOLVE_TTL_SECONDS`	must be `> 300`	A fingerprint that stops firing for this long emits a synthetic resolved alert. Must exceed the 300s informer resync period.
`behavior.startupGraceSeconds`	int	`0`	`STARTUP_GRACE_SECONDS`	must be `>= 0`	Suppress alerts fired during the first N seconds after start (mutes informer initial-sync re-fires of standing conditions). `0` disables.
`behavior.pvcPendingSeconds`	int	`300`	`PVC_PENDING_SECONDS`	must be `> 0`	How long a PVC may stay `Pending` before a `PVCPending` alert fires.
`behavior.disableLogCollection`	bool	`false`	-	-	Stop fetching previous-container logs for alert enrichment (redaction is pattern-based and best-effort).
`behavior.disableAnnotationSilences`	bool	`false`	-	-	Ignore the `alert-silence-until` annotation so workload authors cannot self-silence.
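The numeric rules in the Validation column can be read as the following checks. This is a sketch, not the controller's Validate() implementation: the struct, its Go field names, the function name, and the message wording are all hypothetical, and only the thresholds come from the table.

```go
package main

import "fmt"

// resyncSeconds is the informer resync period quoted in the table above.
const resyncSeconds = 300

// Behavior holds the numeric behavior keys; the Go field names are hypothetical.
type Behavior struct {
	MuteSeconds         int
	IgnoreRestartCount  int
	ResolveTTLSeconds   int
	StartupGraceSeconds int
	PVCPendingSeconds   int
}

// problems returns one message per violated rule from the Validation column.
func problems(b Behavior) []string {
	var out []string
	if b.MuteSeconds <= resyncSeconds {
		out = append(out, fmt.Sprintf("muteSeconds (%d) must exceed %ds", b.MuteSeconds, resyncSeconds))
	}
	if b.ResolveTTLSeconds <= resyncSeconds {
		out = append(out, fmt.Sprintf("resolveTTLSeconds (%d) must exceed %ds", b.ResolveTTLSeconds, resyncSeconds))
	}
	if b.IgnoreRestartCount < 0 {
		out = append(out, "ignoreRestartCount must be >= 0")
	}
	if b.StartupGraceSeconds < 0 {
		out = append(out, "startupGraceSeconds must be >= 0")
	}
	if b.PVCPendingSeconds <= 0 {
		out = append(out, "pvcPendingSeconds must be > 0")
	}
	return out
}

func main() {
	ok := Behavior{MuteSeconds: 600, IgnoreRestartCount: 30, ResolveTTLSeconds: 600, PVCPendingSeconds: 300}
	fmt.Println(len(problems(ok)))

	bad := ok
	bad.MuteSeconds = 300 // equal to the resync period, so not strictly greater
	for _, p := range problems(bad) {
		fmt.Println(p)
	}
}
```

Note that both window settings use a strict comparison: a value of exactly 300 is rejected.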

Helm default differs from the binary default for startupGraceSeconds

The Go default in applyEnvDefaults is 0 (disabled). The Helm chart values.yaml ships behavior.startupGraceSeconds: 30, so a Helm install gets a 30-second grace window unless overridden.

`channels`

Default Slack channel names per severity tier.

Path	Type	Default	Env fallback	Validation	Description
`channels.critical`	string	`alerts-critical`	`SLACK_CHANNEL_CRITICAL`	-	Channel for `critical` alerts.
`channels.warning`	string	`alerts-warning`	`SLACK_CHANNEL_WARNING`, then `SLACK_CHANNEL`	-	Channel for `warning` alerts.
`channels.info`	string	`alerts-info`	`SLACK_CHANNEL_INFO`	-	Channel for `info` alerts.

channels.warning has a two-stage env fallback

When unset, channels.warning first reads SLACK_CHANNEL_WARNING; if that is empty it falls back to the legacy single-channel SLACK_CHANNEL, and only then to the literal alerts-warning.
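The lookup chain for channels.warning amounts to "first non-empty value wins". A minimal Go sketch, with hypothetical function names and an injected getenv function so the order is easy to see:

```go
package main

import (
	"fmt"
	"os"
)

// firstNonEmpty returns the first non-empty argument, or "" if all are empty.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// warningChannel is a hypothetical helper that applies the documented order:
// YAML value, SLACK_CHANNEL_WARNING, legacy SLACK_CHANNEL, then the literal.
func warningChannel(yamlValue string, getenv func(string) string) string {
	return firstNonEmpty(
		yamlValue,
		getenv("SLACK_CHANNEL_WARNING"),
		getenv("SLACK_CHANNEL"),
		"alerts-warning",
	)
}

func main() {
	fmt.Println(warningChannel("", os.Getenv))
}
```

Only the warning tier consults the legacy SLACK_CHANNEL variable; per the table, the critical and info tiers each have a single fallback.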

`routing`

List of routing rules. Rules are evaluated in order for each alert and the first match wins; each rule maps a match map to a list of sinks.

Path	Type	Default	Validation	Description
`routing[].match`	map[string]string	-	-	Field-equality match. Keys `namespace` and `reason` accept an anchored regex; all other keys (`severity`, `kind`, `name`, `node`, label keys) are exact equality.
`routing[].sinks`	[]string	-	non-empty; every entry must be a known sink	Sinks that receive the matched alert.

Known sink names: slack, pagerduty, teams, webhook, stdout, discord, telegram, opsgenie.
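First-match routing with the mixed regex/exact semantics can be sketched in Go as follows. The types and function names are hypothetical. Three details are assumptions rather than documented behavior: anchoring is implemented here by wrapping the pattern in `^(?:...)$`, an invalid pattern simply fails to match, and an alert that matches no rule gets no sinks.

```go
package main

import (
	"fmt"
	"regexp"
)

// Rule mirrors one routing entry; the Go names are hypothetical.
type Rule struct {
	Match map[string]string
	Sinks []string
}

// regexKeys lists the match keys the reference describes as anchored regexes.
var regexKeys = map[string]bool{"namespace": true, "reason": true}

// matches reports whether every key in match is satisfied by the alert.
// Patterns are compiled on each call for brevity; real code would cache them.
func matches(match, alert map[string]string) bool {
	for key, want := range match {
		got := alert[key]
		if regexKeys[key] {
			re, err := regexp.Compile("^(?:" + want + ")$")
			if err != nil || !re.MatchString(got) {
				return false
			}
			continue
		}
		if got != want {
			return false
		}
	}
	return true
}

// route returns the sinks of the first matching rule, or nil if none match.
func route(rules []Rule, alert map[string]string) []string {
	for _, r := range rules {
		if matches(r.Match, alert) {
			return r.Sinks
		}
	}
	return nil
}

func main() {
	rules := []Rule{
		{Match: map[string]string{"severity": "critical"}, Sinks: []string{"slack", "pagerduty"}},
		{Match: map[string]string{"severity": "warning", "namespace": "prod-.*"}, Sinks: []string{"slack"}},
	}
	fmt.Println(route(rules, map[string]string{"severity": "warning", "namespace": "prod-api"}))
	fmt.Println(route(rules, map[string]string{"severity": "warning", "namespace": "nonprod-api"}))
}
```

The second call shows why anchoring matters: `prod-.*` does not match `nonprod-api`, because the pattern must cover the whole namespace rather than a substring.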

`severityOverrides`

Remap an alert's severity before dedupe and routing. First match wins.

Path	Type	Default	Validation	Description
`severityOverrides[].match`	map[string]string	-	must be non-empty	Match map; same semantics as `routing` (namespace/reason anchored regex, others exact).
`severityOverrides[].severity`	string	-	must be `critical`, `warning`, or `info`	Severity assigned on match.

`sinkRates`

Per-sink token-bucket rate-limit overrides. Keyed by sink name.

Path	Type	Default	Validation	Description
`sinkRates.<sink>.perSecond`	float	`1` per second (unlisted sinks)	must be `> 0`	Sustained send rate.
`sinkRates.<sink>.burst`	int	`5` (unlisted sinks)	must be `>= 1`	Token-bucket burst size.

The map key (<sink>) must be a known sink name. Unlisted sinks keep the conservative default of 1 msg/sec with burst 5 (Slack's published webhook limit).
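For readers unfamiliar with token buckets, the Go sketch below shows how perSecond and burst interact. It is a generic textbook token bucket written against the standard library only, not the limiter the controller actually uses; all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is a minimal token bucket: it starts full, refills at perSecond,
// and never holds more than burst tokens.
type bucket struct {
	perSecond float64
	burst     float64
	tokens    float64
	last      time.Time
}

func newBucket(perSecond float64, burst int, now time.Time) *bucket {
	return &bucket{perSecond: perSecond, burst: float64(burst), tokens: float64(burst), last: now}
}

// allow refills according to elapsed time, then spends one token if available.
func (b *bucket) allow(now time.Time) bool {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * b.perSecond
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// simulate fires `attempts` sends at one instant, then one more a second later.
func simulate(perSecond float64, burst, attempts int) (immediate int, afterOneSecond bool) {
	start := time.Unix(0, 0)
	b := newBucket(perSecond, burst, start)
	for i := 0; i < attempts; i++ {
		if b.allow(start) {
			immediate++
		}
	}
	return immediate, b.allow(start.Add(time.Second))
}

func main() {
	// Default limits for unlisted sinks: 1 per second, burst 5.
	fmt.Println(simulate(1, 5, 8))
}
```

With the defaults, a storm of eight simultaneous alerts lets five through immediately and makes the rest wait for refills at one per second; raising burst absorbs larger spikes, while raising perSecond shortens the recovery time.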

`inhibitions`

Suppress dependent (target) alerts while a source alert is active.

Path	Type	Default	Validation	Description
`inhibitions[].source`	map[string]string	-	-	Match map for the alert that triggers suppression.
`inhibitions[].target`	map[string]string	-	-	Match map for alerts to suppress while a source is active.
`inhibitions[].equal`	[]string	-	-	Label/field names that must be equal between source and target (e.g. `node`).
`inhibitions[].duration`	string (Go duration)	`10m` if empty or unparseable	if set, must parse as a Go duration	How long the inhibition holds after the source fires.

An empty or unparseable duration falls back to 10m at runtime

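Validate() rejects a non-empty duration that fails time.ParseDuration, so an unparseable value is normally caught at load. An empty string passes validation, and DurationParsed() returns 10m for it; the same 10m fallback covers any unparseable value that reaches runtime.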
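The split between load-time validation and the runtime fallback can be sketched in Go as below. The function names are hypothetical stand-ins for the Validate() rule and DurationParsed(); only the behavior described above (empty passes validation, empty or unparseable yields 10m) is taken from this page.

```go
package main

import (
	"fmt"
	"time"
)

// defaultInhibitDuration is the documented fallback.
const defaultInhibitDuration = 10 * time.Minute

// validateDuration accepts an empty string and any valid Go duration.
func validateDuration(raw string) error {
	if raw == "" {
		return nil
	}
	if _, err := time.ParseDuration(raw); err != nil {
		return fmt.Errorf("inhibition duration %q: %w", raw, err)
	}
	return nil
}

// durationParsed returns the parsed duration, or 10m when the string is
// empty or does not parse (time.ParseDuration returns an error for "").
func durationParsed(raw string) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return defaultInhibitDuration
	}
	return d
}

func main() {
	for _, raw := range []string{"", "90s", "ten minutes"} {
		fmt.Printf("%q valid=%v effective=%s\n", raw, validateDuration(raw) == nil, durationParsed(raw))
	}
}
```

Go durations take unit suffixes such as `s`, `m`, and `h`; a bare number like `600` is not a valid duration and would be rejected at load.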

`silences`

Time-bounded matchers that suppress alerts until a timestamp.

Path	Type	Default	Validation	Description
`silences[].matchers`	map[string]string	-	-	Match map (same field semantics as routing).
`silences[].until`	string (RFC3339)	-	must parse as RFC3339	Silence expiry timestamp.
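A short Go sketch of the `until` check follows. The function name is hypothetical, both timestamps are passed as RFC3339 strings to keep the example self-contained, and treating the silence as expired at exactly the `until` instant is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// silenceActive reports whether a silence with the given `until` timestamp is
// still in force at `now`. Both arguments must be RFC3339 strings.
func silenceActive(until, now string) (bool, error) {
	end, err := time.Parse(time.RFC3339, until)
	if err != nil {
		return false, fmt.Errorf("silence until %q: %w", until, err)
	}
	at, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return false, fmt.Errorf("now %q: %w", now, err)
	}
	return at.Before(end), nil
}

func main() {
	active, err := silenceActive("2026-06-15T00:00:00Z", "2026-06-14T23:59:59Z")
	fmt.Println(active, err)

	// A date without a time and zone is not RFC3339 and is rejected.
	_, err = silenceActive("2026-06-15", "2026-06-14T23:59:59Z")
	fmt.Println(err != nil)
}
```

RFC3339 requires the full date, time, and a zone (`Z` or an offset such as `+02:00`), so a date-only value fails the validation rule in the table.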

`escalations`

Re-dispatch a still-unresolved matching alert to extra sinks after a delay. Each rule fires at most once per alert lifetime.

Path	Type	Default	Validation	Description
`escalations[].match`	map[string]string	-	-	Match map; same semantics as routing rules.
`escalations[].afterMinutes`	int	-	must be `> 0`	Minutes the alert must remain unresolved before escalating.
`escalations[].sinks`	[]string	-	non-empty; every entry must be a known sink	Additional sinks to re-dispatch to.
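The "at most once per alert lifetime" rule can be sketched in Go with a small tracker. All names are hypothetical, the alert is identified by an opaque fingerprint string, and two details are assumptions: escalation fires once the unresolved time reaches afterMinutes (rather than strictly exceeding it), and resolving the alert ends its lifetime so that a later re-fire may escalate again.

```go
package main

import "fmt"

// Escalation mirrors one escalation rule; the Go names are hypothetical.
type Escalation struct {
	AfterMinutes int
	Sinks        []string
}

// tracker remembers which (alert, rule) pairs have already escalated.
type tracker struct {
	fired map[string]bool
}

func newTracker() *tracker { return &tracker{fired: map[string]bool{}} }

// due returns the extra sinks to notify, or nil if the rule has already
// fired for this alert or the alert has not been unresolved long enough.
func (t *tracker) due(fingerprint string, ruleIndex int, rule Escalation, unresolvedMinutes int) []string {
	key := fmt.Sprintf("%s#%d", fingerprint, ruleIndex)
	if t.fired[key] || unresolvedMinutes < rule.AfterMinutes {
		return nil
	}
	t.fired[key] = true
	return rule.Sinks
}

// resolve ends the alert's lifetime so a later re-fire can escalate again.
func (t *tracker) resolve(fingerprint string, ruleCount int) {
	for i := 0; i < ruleCount; i++ {
		delete(t.fired, fmt.Sprintf("%s#%d", fingerprint, i))
	}
}

func main() {
	rule := Escalation{AfterMinutes: 15, Sinks: []string{"pagerduty"}}
	t := newTracker()
	fmt.Println(t.due("fp-1", 0, rule, 10)) // too early
	fmt.Println(t.due("fp-1", 0, rule, 15)) // escalates
	fmt.Println(t.due("fp-1", 0, rule, 20)) // already escalated
}
```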

`receiver`

Alertmanager webhook receiver on POST /api/v1/alerts. It is served on metricsAddr by default; when apiAddr is set, it moves to apiAddr along with the rest of the data plane.

Path	Type	Default	Validation	Description
`receiver.enabled`	bool	`false`	-	Enable the Alertmanager webhook receiver. Optional bearer auth via the `ALERTKUBE_RECEIVER_TOKEN` env var.
`receiver.allowAnonymous`	bool	`false`	-	Allow requests without a bearer token. Only safe when the port is locked down by NetworkPolicy.
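One plausible reading of how the token and allowAnonymous interact is sketched below in Go. This is an assumption, not the receiver's documented decision logic: here a request that presents no usable bearer token (or arrives when no token is configured) is accepted only if allowAnonymous is true, and a presented token must match exactly. The function name is hypothetical.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// authorized is a hypothetical policy function. token is the value of
// ALERTKUBE_RECEIVER_TOKEN (empty if unset); authHeader is the request's
// Authorization header.
func authorized(authHeader, token string, allowAnonymous bool) bool {
	const prefix = "Bearer "
	if token == "" || !strings.HasPrefix(authHeader, prefix) {
		// No configured token, or no bearer credential presented:
		// the request is anonymous (assumed handling).
		return allowAnonymous
	}
	presented := strings.TrimPrefix(authHeader, prefix)
	// Constant-time comparison avoids leaking the token through timing.
	return subtle.ConstantTimeCompare([]byte(presented), []byte(token)) == 1
}

func main() {
	fmt.Println(authorized("Bearer s3cret", "s3cret", false))
	fmt.Println(authorized("", "s3cret", false))
	fmt.Println(authorized("", "", true))
}
```

Under this reading, the safe configuration is a token plus allowAnonymous: false; enabling allowAnonymous shifts all protection to the network layer, which matches the NetworkPolicy caveat in the table.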

`grouping`

Storm folding: the first alert of a group dispatches immediately; later same-group alerts within the window collapse into one summary. Stateful incident sinks (pagerduty, opsgenie) still receive every resolve and never receive summaries.

Path	Type	Default	Validation	Description
`grouping.enabled`	bool	`false`	-	Enable storm folding.
`grouping.windowSeconds`	int	`30`	when `enabled`, must be `> 0`	Collapse window length.
`grouping.by`	[]string	`[kind, namespace, reason, severity]`	when `enabled`, no entry may be empty	Fields forming the group identity.

grouping.windowSeconds defaults to 30 even when grouping is off

applyEnvDefaults sets windowSeconds to 30 whenever it is 0, regardless of enabled. Validation of windowSeconds/by only runs when grouping.enabled is true.
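The folding decision described at the top of this section can be sketched in Go as follows. It covers only two parts: building the group identity from the `by` fields, and deciding whether an alert dispatches immediately or is held for the summary. Emitting the summary when the window closes and the special handling for pagerduty and opsgenie are left out. All names are hypothetical, and opening a fresh window on the first alert after the previous one expires is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// folder tracks one open window per group; timestamps are Unix seconds.
type folder struct {
	by            []string
	windowSeconds int64
	opened        map[string]int64 // group key -> second the window opened
	folded        map[string]int   // group key -> alerts held for the summary
}

func newFolder(by []string, windowSeconds int64) *folder {
	return &folder{by: by, windowSeconds: windowSeconds, opened: map[string]int64{}, folded: map[string]int{}}
}

// key joins the values of the `by` fields into the group identity.
func (f *folder) key(alert map[string]string) string {
	parts := make([]string, len(f.by))
	for i, field := range f.by {
		parts[i] = alert[field]
	}
	return strings.Join(parts, "\x00")
}

// offer returns true when the alert should dispatch immediately (it opens a
// new window) and false when it is held for the group's summary.
func (f *folder) offer(alert map[string]string, nowUnix int64) bool {
	k := f.key(alert)
	start, open := f.opened[k]
	if !open || nowUnix-start >= f.windowSeconds {
		f.opened[k] = nowUnix
		f.folded[k] = 0
		return true
	}
	f.folded[k]++
	return false
}

func main() {
	f := newFolder([]string{"kind", "namespace", "reason", "severity"}, 30)
	a := map[string]string{"kind": "Pod", "namespace": "prod-api", "reason": "OOMKilled", "severity": "critical", "name": "api-1"}
	fmt.Println(f.offer(a, 100)) // first alert of the group
	a["name"] = "api-2"          // `name` is not in `by`, so the group is unchanged
	fmt.Println(f.offer(a, 110)) // inside the 30s window
}
```

Fields omitted from `by`, such as the pod name here, do not split groups, which is what lets a storm across many pods collapse into one summary.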

`persistence`

Snapshot active-alert and mute state to a ConfigMap so a restart does not lose pending resolves or re-page muted standing conditions. Requires get/create/update on the named ConfigMap.

Path	Type	Default	Env fallback	Validation	Description
`persistence.enabled`	bool	`false`	-	-	Enable state snapshotting.
`persistence.configMapName`	string	`alertkube-state`	-	-	Name of the state ConfigMap.
`persistence.namespace`	string	`""`	`POD_NAMESPACE`	when `enabled`, must be non-empty	Namespace of the state ConfigMap.

persistence.enabled requires a resolvable namespace

If persistence.enabled is true and persistence.namespace is empty after the POD_NAMESPACE fallback, load fails. The Helm chart sets POD_NAMESPACE via the Downward API. (The chart also defaults persistence.enabled: true, whereas the binary default is false.)

Annotated example

cluster: prod-us-east-1
metricsAddr: ":9090"

filters:
  watchedNamespaces: "^(prod|staging)-.*"   # regex; only these namespaces
  ignoredPodNamePrefixes: "debug-,test-"     # comma-separated prefixes

behavior:
  muteSeconds: 600                  # dedupe mute window (>300)
  ignoreRestartCount: 30            # stop per-restart alerts past this count
  ignoreRestartsWithExitCodeZero: false
  resolveTTLSeconds: 600            # synthetic resolve after this idle period (>300)
  startupGraceSeconds: 30           # mute initial-sync re-fires; 0 disables
  pvcPendingSeconds: 300            # PVC Pending tolerance before alerting (>0)
  disableLogCollection: false       # skip previous-container log enrichment
  disableAnnotationSilences: false  # ignore alert-silence-until annotations

persistence:
  enabled: true
  configMapName: alertkube-state    # namespace defaults to POD_NAMESPACE

channels:
  critical: alerts-critical
  warning:  alerts-warning
  info:     alerts-info

routing:
  - match: {severity: critical}
    sinks: [slack, pagerduty]
  - match: {severity: warning, namespace: prod-.*}   # namespace is an anchored regex
    sinks: [slack]
  - match: {severity: info}
    sinks: [slack]

severityOverrides:
  - match: {kind: Pod, reason: ImagePullBackOff, namespace: dev-.*}
    severity: info

sinkRates:
  pagerduty:
    perSecond: 10
    burst: 20

grouping:
  enabled: false
  windowSeconds: 30
  by: [kind, namespace, reason, severity]

escalations:
  - match: {severity: critical}
    afterMinutes: 15
    sinks: [pagerduty]

receiver:
  enabled: false                    # bearer auth via ALERTKUBE_RECEIVER_TOKEN

inhibitions:
  - source: {kind: Node, reason: NodeNotReady}
    target: {kind: Pod}
    equal: [node]
    duration: 10m                   # Go duration; empty/invalid -> 10m

silences:
  - matchers: {namespace: kube-system}
    until: "2026-06-15T00:00:00Z"   # RFC3339
