Metrics reference

This page lists every Prometheus metric exported by alertkube. All metrics are registered at startup and served at /metrics on the metrics address (metricsAddr, default :9090).

Metrics

| Metric | Type | Labels | Meaning |
|---|---|---|---|
| alertkube_alerts_total | counter | kind, severity, reason | Alerts emitted, by resource kind, severity, and reason. |
| alertkube_alerts_suppressed_total | counter | reason | Alerts suppressed, labelled by the suppression reason (dedupe mute, inhibition, silence, etc.). |
| alertkube_sink_send_seconds | histogram | sink, result | Sink send latency, partitioned by sink name and outcome (result). |
| alertkube_sink_errors_total | counter | sink | Sink send errors, by sink name. |
| alertkube_active_alerts | gauge | - | Count of currently active (unresolved) alerts. |
| alertkube_dispatch_inflight | gauge | sink | Sink sends currently in flight, including time queued on the rate limiter. A value pinned high for a sink indicates a storm is queueing and rate-limit drops are imminent. |
| alertkube_escalations_total | counter | - | Alerts re-dispatched by escalation rules. |
| alertkube_enrichment_saturated_total | counter | - | Pod alerts emitted without enrichment (previous container logs) because the bounded enrichment worker pool was full. A rising value during storms indicates the pool size should be increased. |
| alertkube_received_alerts_total | counter | status | Alerts accepted by the Alertmanager webhook receiver, by status. |
| alertkube_sink_breaker_open | gauge | sink | 1 while a sink's circuit breaker is open (delivery short-circuited after sustained failures), 0 otherwise. Stuck at 1 means that sink's endpoint is down. |
| alertkube_sink_noop_total | counter | sink | Sends that no-oped because the sink's credential was not configured. A routed sink that no-ops silently drops the alert, so alert on this if non-zero. |
| alertkube_alerts_dropped_total | counter | - | Alerts whose every routed sink failed delivery (dedupe rolled back for retry). |
| alertkube_dispatch_queue_depth | gauge | - | Alerts buffered in the async dispatch worker-pool queue. Trending toward capacity means workers are not draining fast enough (slow sinks / rate limits). |
| alertkube_dispatch_queue_full_total | counter | - | Enqueue attempts that blocked because the dispatch queue was full (backpressure). |
| alertkube_dispatch_resolve_retries_total | counter | - | Resolves re-queued after a failed delivery (a lost resolve would dangle a stateful incident). |
| alertkube_dispatch_dropped_total | counter | - | Alerts dropped because they were enqueued after dispatcher shutdown (shutdown-drain race only). |
| alertkube_outbox_pending | gauge | - | Undelivered deliveries tracked in the durable outbox (persisted + replayed on restart). Stuck high means delivery is falling behind. |
| alertkube_dead_letter_total | counter | - | Deliveries permanently abandoned with no retry path (exhausted resolve, or a failed fire-once event/summary/escalation). Inspect GET /api/deadletter. |
| alertkube_cloud_poll_errors_total | counter | source | Failed cloud-provider API calls, by source (e.g. aws-eks). |
| alertkube_cloud_poll_truncated_total | counter | source | Cloud polls that hit a pagination cap and dropped remaining items (e.g. CloudTrail's per-event page limit). |
| alertkube_state_snapshot_bytes | gauge | - | Size of the last (compressed) state snapshot serialized for persistence. Watch against the ConfigMap object limit. |
| alertkube_state_save_skipped_total | counter | - | State saves skipped because the compressed snapshot exceeded the size guard. Non-zero means persisted state is going stale. |
| alertkube_runtime_mutations_total | counter | action | Control-plane writes via the console API (silence create/delete, channel test), by action. |
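
Several of the metrics above describe conditions worth paging on (a no-oping sink, an open circuit breaker, stale persisted state, dead-lettered deliveries). The following is a minimal sketch of a Prometheus rule file covering them. The group name, alert names, time windows, `for` durations, and severity labels are illustrative assumptions, not values shipped with alertkube; only the metric names and their `sink` label come from the table.

```yaml
# Hypothetical self-monitoring rules; tune windows and severities to your environment.
groups:
  - name: alertkube-self-monitoring        # assumed group name
    rules:
      # A routed sink without a configured credential silently drops alerts.
      - alert: AlertkubeSinkNoop
        expr: increase(alertkube_sink_noop_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Sink {{ $labels.sink }} is no-oping; routed alerts are not delivered"

      # Breaker stuck open suggests the sink's endpoint is down.
      - alert: AlertkubeSinkBreakerOpen
        expr: alertkube_sink_breaker_open == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker for sink {{ $labels.sink }} has been open for 10 minutes"

      # Skipped saves mean the persisted state is going stale.
      - alert: AlertkubeStateSaveSkipped
        expr: increase(alertkube_state_save_skipped_total[30m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "State snapshot exceeded the size guard and was not saved"

      # Dead-lettered deliveries have no retry path left.
      - alert: AlertkubeDeadLetter
        expr: increase(alertkube_dead_letter_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Deliveries were dead-lettered; inspect GET /api/deadletter"
```

Using `increase(...) > 0` rather than a bare `> 0` on the counters is a design choice in this sketch: it lets the alert clear once the condition stops recurring instead of firing until the process restarts.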

alertkube_sink_send_seconds is a histogram

It exposes the standard Prometheus histogram series: alertkube_sink_send_seconds_bucket, alertkube_sink_send_seconds_sum, and alertkube_sink_send_seconds_count, each carrying the sink and result labels.
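
As an illustration of how these three series are typically consumed, the sketch below defines two recording rules: a 95th-percentile latency computed from the `_bucket` series and a mean latency computed from `_sum` and `_count`. The group name, the recorded series names, the 5m window, and the 0.95 quantile are assumptions chosen for the example.

```yaml
# Hypothetical recording rules built on the histogram series.
groups:
  - name: alertkube-sink-latency           # assumed group name
    rules:
      # p95 send latency per sink and result; `le` must be kept for histogram_quantile.
      - record: alertkube:sink_send_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, sink, result) (rate(alertkube_sink_send_seconds_bucket[5m]))
          )
      # Mean send latency per sink and result: rate of total seconds over rate of sends.
      - record: alertkube:sink_send_seconds:mean_5m
        expr: |
          sum by (sink, result) (rate(alertkube_sink_send_seconds_sum[5m]))
          /
          sum by (sink, result) (rate(alertkube_sink_send_seconds_count[5m]))
```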

Label values

| Label | Values |
|---|---|
| kind | Pod, Node, Deployment, PersistentVolumeClaim, Job, DaemonSet, StatefulSet, CronJob, HorizontalPodAutoscaler, External (receiver-ingested). |
| severity | critical, warning, info. |
| reason | The watcher reason string (see Watcher conditions). |
| sink | slack, pagerduty, teams, webhook, stdout, discord, telegram, opsgenie. |

HTTP endpoints

Served on metricsAddr. Server timeouts: 5s read-header, 10s read, 10s write, 60s idle.

| Path | Method | Description |
|---|---|---|
| /metrics | GET | Prometheus exposition of all alertkube_* metrics. |
| /healthz | GET | Liveness. Returns 200 normally; a leader whose sweep heartbeat has gone stale (e.g. a store-lock deadlock) returns 503 so the kubelet restarts the wedged pod. Followers and the initial-sync window stay 200. |
| /readyz | GET | Readiness. Returns 503 until informer caches have synced (MarkReady), so the kubelet does not mark the pod Ready while the controller is blind. On leader-election followers, readiness flips back to not-ready while the lease is not held. |
| /api/alerts | GET | JSON of active alerts plus recent history. Returns 503 until the handler is installed (after the controller and its store exist). |
| /api/deadletter | GET | JSON of recently dead-lettered (permanently abandoned) deliveries. Token-gated (read token); returns 503 until installed. |
| /api/v1/alerts | POST | Alertmanager webhook receiver (when receiver.enabled). Runs payloads through the same dedupe/grouping/routing/sink pipeline. Optional bearer auth via ALERTKUBE_RECEIVER_TOKEN. Returns 503 until the handler is installed. |
| /debug/pprof/ | GET | Go profiling, opt-in via ALERTKUBE_ENABLE_PPROF and gated by the read token (fail-closed without one). Returns 503 when disabled. |
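
The /healthz and /readyz semantics above map onto Kubernetes liveness and readiness probes. The fragment below is a sketch of a container spec wiring them up on the default metrics port. The port name and the probe timings are illustrative assumptions, and a packaged deployment (for example the Helm chart) may already configure probes for you.

```yaml
# Hypothetical container spec fragment; timings are example values only.
ports:
  - name: metrics              # assumed port name
    containerPort: 9090        # default metricsAddr
livenessProbe:
  httpGet:
    path: /healthz             # 503 from a wedged leader triggers a restart
    port: metrics
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz              # 503 until informer caches have synced
    port: metrics
  periodSeconds: 5
  failureThreshold: 3
```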

/api/alerts and /api/v1/alerts return 503 before the controller starts

The HTTP server boots in main() before the controller (and its alert store) exists; on leader-election followers the controller never starts at all. Until each handler is installed, its route returns 503.

Splitting the sensitive data plane onto its own port

Set apiAddr (env ALERTKUBE_API_ADDR) to serve /api/*, the console, and the receiver on a listener separate from /metrics and the probes. The metrics/probe port can then stay open for scraping while the data port is firewalled with a NetworkPolicy. When apiAddr is empty (the default), everything is co-located on metricsAddr.
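
A NetworkPolicy for this split might look like the sketch below. It assumes apiAddr is set to :8080, that the alertkube pods carry the label app.kubernetes.io/name: alertkube in a namespace named alertkube, and that the only legitimate API clients (for example Alertmanager) run in a namespace named monitoring. All of these names and the port number are hypothetical and need to be adapted.

```yaml
# Hypothetical policy: metrics port open to all, API port limited to one namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alertkube-api-restrict         # assumed name
  namespace: alertkube                 # assumed namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: alertkube   # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    # /metrics and the probes: no `from` clause, so any source may connect.
    - ports:
        - protocol: TCP
          port: 9090
    # /api/*, console, and receiver: only pods in the monitoring namespace.
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # assumed client namespace
      ports:
        - protocol: TCP
          port: 8080                   # assumed apiAddr port
```

Because selecting a pod with an Ingress policy denies all ingress that is not explicitly allowed, the first rule is what keeps the metrics port reachable for scraping.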

Grafana dashboard

An importable dashboard built on these metrics ships in the repository at docs/grafana-dashboard.json.

ServiceMonitor

Prometheus Operator scraping is available via the Helm chart:

metrics:
  enabled: true
  port: 9090
  serviceMonitor:
    enabled: true
    interval: 30s
    labels: {}