Metrics reference
Every Prometheus metric exported by alertkube. All metrics are registered at
startup and served on the metrics address (metricsAddr, default :9090) at
/metrics.
Metrics
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| alertkube_alerts_total | counter | kind, severity, reason | Alerts emitted, by resource kind, severity, and reason. |
| alertkube_alerts_suppressed_total | counter | reason | Alerts suppressed, labelled by the suppression reason (dedupe mute, inhibition, silence, etc.). |
| alertkube_sink_send_seconds | histogram | sink, result | Sink send latency, partitioned by sink name and outcome (result). |
| alertkube_sink_errors_total | counter | sink | Sink send errors, by sink name. |
| alertkube_active_alerts | gauge | - | Count of currently active (unresolved) alerts. |
| alertkube_dispatch_inflight | gauge | sink | Sink sends currently in flight, including time queued on the rate limiter. A value pinned high for a sink indicates a storm is queueing and rate-limit drops are imminent. |
| alertkube_escalations_total | counter | - | Alerts re-dispatched by escalation rules. |
| alertkube_enrichment_saturated_total | counter | - | Pod alerts emitted without enrichment (previous container logs) because the bounded enrichment worker pool was full. A rising value during storms indicates the pool size should be increased. |
| alertkube_received_alerts_total | counter | status | Alerts accepted by the Alertmanager webhook receiver, by status. |
| alertkube_sink_breaker_open | gauge | sink | 1 while a sink's circuit breaker is open (delivery short-circuited after sustained failures), 0 otherwise. Stuck at 1 means that sink's endpoint is down. |
| alertkube_sink_noop_total | counter | sink | Sends that no-oped because the sink's credential was not configured. A routed sink that no-ops silently drops the alert; alert on this if non-zero. |
| alertkube_alerts_dropped_total | counter | - | Alerts whose every routed sink failed delivery (dedupe rolled back for retry). |
| alertkube_dispatch_queue_depth | gauge | - | Alerts buffered in the async dispatch worker-pool queue. Trending toward capacity means workers are not draining fast enough (slow sinks / rate limits). |
| alertkube_dispatch_queue_full_total | counter | - | Enqueue attempts that blocked because the dispatch queue was full (backpressure). |
| alertkube_dispatch_resolve_retries_total | counter | - | Resolves re-queued after a failed delivery (a lost resolve would dangle a stateful incident). |
| alertkube_dispatch_dropped_total | counter | - | Alerts dropped because they were enqueued after dispatcher shutdown (shutdown-drain race only). |
| alertkube_outbox_pending | gauge | - | Undelivered deliveries tracked in the durable outbox (persisted + replayed on restart). Stuck high means delivery is falling behind. |
| alertkube_dead_letter_total | counter | - | Deliveries permanently abandoned with no retry path (exhausted resolve, or a failed fire-once event/summary/escalation). Inspect GET /api/deadletter. |
| alertkube_cloud_poll_errors_total | counter | source | Failed cloud-provider API calls, by source (e.g. aws-eks). |
| alertkube_cloud_poll_truncated_total | counter | source | Cloud polls that hit a pagination cap and dropped remaining items (e.g. CloudTrail's per-event page limit). |
| alertkube_state_snapshot_bytes | gauge | - | Size of the last (compressed) state snapshot serialized for persistence. Watch against the ConfigMap object limit. |
| alertkube_state_save_skipped_total | counter | - | State saves skipped because the compressed snapshot exceeded the size guard. Non-zero means persisted state is going stale. |
| alertkube_runtime_mutations_total | counter | action | Control-plane writes via the console API (silence create/delete, channel test), by action. |
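
Several rows above come with explicit guidance (a non-zero alertkube_sink_noop_total, a breaker gauge stuck at 1, a non-zero alertkube_state_save_skipped_total). The following is a minimal Python sketch of checking a scraped /metrics body against that guidance. The helper names (parse_samples, findings) and the sample values are illustrative assumptions, not part of alertkube, and the parser handles only simple exposition lines.

```python
import re

# One sample line: metric name, optional {labels}, value (any timestamp is ignored).
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples; skips comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        samples.append((name, dict(_LABEL.findall(raw_labels or "")), float(value)))
    return samples


def findings(samples):
    """Return human-readable warnings based on the guidance in the metrics table."""
    out = []
    for name, labels, value in samples:
        sink = labels.get("sink", "")
        if name == "alertkube_sink_noop_total" and value > 0:
            out.append(f"sink {sink}: {value:g} sends no-oped (credential not configured)")
        elif name == "alertkube_sink_breaker_open" and value == 1:
            out.append(f"sink {sink}: circuit breaker open")
        elif name == "alertkube_state_save_skipped_total" and value > 0:
            out.append(f"{value:g} state saves skipped; persisted state may be stale")
    return out


# Invented sample scrape, for illustration only.
SAMPLE = """\
# TYPE alertkube_sink_noop_total counter
alertkube_sink_noop_total{sink="slack"} 3
alertkube_sink_breaker_open{sink="pagerduty"} 1
alertkube_sink_breaker_open{sink="slack"} 0
alertkube_state_save_skipped_total 0
alertkube_active_alerts 7
"""

for warning in findings(parse_samples(SAMPLE)):
    print(warning)
```

In a real deployment these conditions would more naturally live in Prometheus alerting rules; the script form is only meant to make the thresholds concrete.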
alertkube_sink_send_seconds is a histogram
It exposes the standard Prometheus histogram series:
alertkube_sink_send_seconds_bucket,
alertkube_sink_send_seconds_sum, and
alertkube_sink_send_seconds_count, each carrying the sink and result
labels.
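
As a rough illustration of how those three series combine, the sketch below computes a mean latency from _sum and _count and estimates a quantile from cumulative _bucket counts, using the same linear interpolation that PromQL's histogram_quantile applies. The bucket boundaries and counts are invented for the example; alertkube's real bucket layout is not stated here.

```python
import math


def mean_latency(total_seconds, count):
    """Mean send latency: the _sum series divided by the _count series."""
    return total_seconds / count if count else math.nan


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one): report the last known bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical values for one (sink, result) label pair.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print("mean:", mean_latency(23.0, 100))
print("p95 estimate:", estimate_quantile(0.95, buckets))
```

Because the estimate interpolates within a bucket, its precision is bounded by the bucket widths.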
Label values
| Label | Values |
|---|---|
| kind | Pod, Node, Deployment, PersistentVolumeClaim, Job, DaemonSet, StatefulSet, CronJob, HorizontalPodAutoscaler, External (receiver-ingested). |
| severity | critical, warning, info. |
| reason | The watcher reason string (see Watcher conditions). |
| sink | slack, pagerduty, teams, webhook, stdout, discord, telegram, opsgenie. |
HTTP endpoints
Served on metricsAddr. Server timeouts: 5s read-header, 10s read, 10s write,
60s idle.
| Path | Method | Description |
|---|---|---|
| /metrics | GET | Prometheus exposition of all alertkube_* metrics. |
| /healthz | GET | Liveness. 200 normally; a leader whose sweep heartbeat has gone stale (e.g. a store-lock deadlock) returns 503 so the kubelet restarts the wedged pod. Followers and the initial-sync window stay 200. |
| /readyz | GET | Readiness. Returns 503 until informer caches have synced (MarkReady); used so the kubelet does not mark the pod Ready while the controller is blind. On leader-election followers, flipped back to not-ready when the lease is not held. |
| /api/alerts | GET | JSON of active alerts plus recent history. Returns 503 until the handler is installed (after the controller and its store exist). |
| /api/deadletter | GET | JSON of recently dead-lettered deliveries (permanently abandoned). Token-gated (read token); returns 503 until installed. |
| /api/v1/alerts | POST | Alertmanager webhook receiver (when receiver.enabled). Runs payloads through the same dedupe/grouping/routing/sink pipeline. Optional bearer auth via ALERTKUBE_RECEIVER_TOKEN. Returns 503 until the handler is installed. |
| /debug/pprof/ | GET | Go profiling, opt-in via ALERTKUBE_ENABLE_PPROF and gated by the read token (fail-closed without one). 503 when disabled. |
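
To make the /api/v1/alerts row concrete, here is a hedged Python sketch that builds (but does not send) a POST in the shape Alertmanager uses for webhook notifications. The host, the alert labels, and the token value are placeholders, and sending the token as an `Authorization: Bearer` header is an assumption based on the usual meaning of bearer auth rather than something this page specifies.

```python
import json
import urllib.request


def build_receiver_request(url, payload, token=None):
    """Build a POST request for the webhook receiver; add bearer auth if a token is given."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumed header form for the optional ALERTKUBE_RECEIVER_TOKEN check.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Minimal payload following the Alertmanager webhook format (version "4").
# The alert name and labels are invented for illustration.
payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "ExampleHighLatency", "severity": "warning"},
            "annotations": {"summary": "example alert"},
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ],
}

request = build_receiver_request(
    "http://localhost:9090/api/v1/alerts",  # assumes the default metricsAddr port
    payload,
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; a 503 response at that point is consistent with the handler not being installed yet.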
/api/alerts and /api/v1/alerts return 503 before the controller starts
The HTTP server boots in main() before the controller (and its alert
store) exists; on leader-election followers the controller never starts at
all. Until each handler is installed, its route returns 503.
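
A client that starts alongside alertkube can treat that 503 as "not installed yet" and retry a bounded number of times. The sketch below assumes a hypothetical helper named wait_until_installed. The retry count and delay are arbitrary, and the bound matters because a follower's handlers may never be installed.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, including error statuses such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def wait_until_installed(url, fetch=http_status, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll url until it stops returning 503; give up after `attempts` tries."""
    for attempt in range(attempts):
        status = fetch(url)
        if status != 503:
            return status
        if attempt < attempts - 1:
            sleep(delay)
    return 503


# Offline demonstration with a fake fetcher: two 503s, then the handler is up.
responses = iter([503, 503, 200])
result = wait_until_installed(
    "http://localhost:9090/api/alerts",  # assumes the default metricsAddr port
    fetch=lambda _url: next(responses),
    sleep=lambda _seconds: None,
)
print("final status:", result)
```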
Splitting the sensitive data plane onto its own port
Set apiAddr (env ALERTKUBE_API_ADDR) to serve /api/*, the console, and
the receiver on a separate listener from /metrics + the probes, so the
metrics/probe port can stay open for scraping while the data port is
firewalled with a NetworkPolicy. Empty (default) co-locates everything on
metricsAddr.
Grafana dashboard
An importable dashboard built on these metrics ships in the repository at
docs/grafana-dashboard.json.
ServiceMonitor
Prometheus Operator scraping is available via the Helm chart: