Troubleshoot with Metrics¶

Use /metrics and the Grafana dashboard to answer four questions: did alerts fire, were they suppressed, did sinks receive them, and is dispatch backlogged?

For a complete metric reference, see Metrics reference.

Essential Metrics¶

| Metric | Query | Meaning | What to do if high |
| --- | --- | --- | --- |
| Alert volume | `alertkube_alerts_total` | Alerts emitted, by kind/severity/reason | Expected; use alerting rules to page if sustained |
| Suppression | `alertkube_alerts_suppressed_total` | Alerts dropped, by reason (dedupe/inhibited/silenced/grouped) | Expected; too high = tune mute window or grouping |
| Active alerts | `alertkube_active_alerts` | Count of currently firing, unresolved alerts | High during incidents; should drop after fixes |
| Dispatch in-flight | `alertkube_dispatch_inflight` | Sink sends currently queued on the rate limiter | Pinned high = storm is queueing; see rate-limiting section |
| Sink send latency | `alertkube_sink_send_seconds` | Time from dispatch to sink response, by sink and result | Expected; latency spikes = sink is slow |
| Sink errors | `alertkube_sink_errors_total` | Failed sends, by sink name | Rising = sink is down or credentials are stale |
| Escalations | `alertkube_escalations_total` | Alerts re-dispatched by escalation rules | Expected if escalations are configured |
| Enrichment saturation | `alertkube_enrichment_saturated_total` | Pod alerts shipped without logs because the enrichment pool was full | Rising = storm pressure; use grouping, raise `muteSeconds`, or disable log collection |
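
To turn these counters into a quick answer, it can help to sum every series of one metric from a `/metrics` dump. The helper below is a minimal sketch and not part of alertkube: `metric_sum` is a hypothetical name, and it assumes the standard Prometheus text format with the sample value as the last field (no trailing timestamps).

```bash
# Sum every series of one metric from Prometheus text exposition on stdin.
# Usage (assumes the port-forward used elsewhere on this page):
#   curl -s http://localhost:9090/metrics | metric_sum alertkube_alerts_total
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP / TYPE comment lines
    {
      # The series name is everything before the first "{" or space.
      series = $0
      sub(/[{ ].*/, "", series)
      if (series == name) total += $NF # value is assumed to be the last field
    }
    END { printf "%d\n", total }
  '
}
```

Comparing the sum for `alertkube_alerts_total` with the sums for `alertkube_alerts_suppressed_total` and `alertkube_sink_errors_total` gives a first hint about which of the four questions to dig into.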

Common Scenarios¶

Nothing is alerting¶

Symptom: alertkube_alerts_total is zero or stuck, even when you trigger an alert condition.

Check:

Is the controller running?

kubectl get pods -l app.kubernetes.io/name=alertkube

If not, check the logs for YAML syntax errors, invalid regexes, unknown sinks, or validation failures:

kubectl logs -l app.kubernetes.io/name=alertkube --tail=100

Is the config mounted?

kubectl exec -it deploy/alertkube -- cat /config/config.yaml

Are namespace filters excluding your test namespace?

kubectl get pods -l app.kubernetes.io/name=alertkube -o jsonpath='{.items[0].spec.containers[0].env[?(@.name=="WATCH_NAMESPACE")].value}'

Does the metrics endpoint expose alertkube_*?

kubectl port-forward svc/alertkube 9090:9090 &
sleep 2  # give the port-forward a moment to bind before querying it
curl -s http://localhost:9090/metrics | grep alertkube_

No alertkube_* metrics usually means the HTTP server did not start.

Alerts are firing but not reaching Slack/PagerDuty¶

Symptom: alertkube_alerts_total is rising, but nothing appears in the sink.

Check:

Is the sink being called?

```bash
curl http://localhost:9090/metrics | grep alertkube_sink_send_seconds
```

Missing sink metrics usually mean no routing rule matched. Series with result="error" mean dispatch is failing.

Verify sink credentials:

```bash
kubectl get secret alertkube -o jsonpath='{.data.slackWebhookUrl}' | base64 -d
```

Check that the credential is set, valid, and reachable from the pod.

Check logs:

kubectl logs -l app.kubernetes.io/name=alertkube --tail=200 | grep -i error

Check routing:

kubectl get cm alertkube-config -o yaml | grep -A 20 routing:

reason and namespace matchers are anchored regexes, so a pattern must match the entire value, not just a substring of it.
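
Anchoring is an easy way for a routing rule to silently match nothing. The sketch below imitates a full-string match with `grep -x`. It assumes alertkube treats each pattern as if it were wrapped in `^(...)$`, which this page does not spell out, and it uses POSIX extended regexes rather than whatever dialect alertkube uses. `anchored_match` and the namespace names are made up for illustration.

```bash
# Succeed only if PATTERN ($1) matches the whole VALUE ($2).
# grep -x requires the match to cover the entire line.
anchored_match() {
  printf '%s\n' "$2" | grep -Eqx -- "$1"
}

anchored_match 'prod' 'prod-eu'   || echo "prod does not match prod-eu"
anchored_match 'prod.*' 'prod-eu' && echo "prod.* matches prod-eu"
```

If a matcher is meant to cover a family of namespaces or reasons, widen it explicitly (for example with a trailing `.*`) rather than relying on substring matching.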

Alerts are being suppressed unexpectedly¶

Symptom: You trigger an alert condition, but alertkube_alerts_total does not increase, or it increases but the sink does not receive it.

Check suppression counters:

```bash
curl http://localhost:9090/metrics | grep alertkube_alerts_suppressed_total
```

The `reason` label tells you why:
- **`dedupe` / `muted`** - same fingerprint fired recently (within `muteSeconds`). Wait, trigger from a fresh pod, or lower `muteSeconds` for testing.
- **`silenced`** - a `silences:` config or `alert-silence-until` annotation matched.
- **`inhibited`** - an inhibition rule suppressed it (e.g., pods on a down node).
- **`grouped`** - it was the 2nd or later alert in a group within `windowSeconds`.

Then check the matching mechanism:

```bash
kubectl get cm alertkube-config -o yaml | grep muteSeconds
```

- **`dedupe` / `muted`** - wait, use a fresh object, or lower `muteSeconds` for testing.
- **`silenced`** - inspect config `silences:` and `alert-silence-until` annotations.
- **`inhibited`** - inspect source alerts and inhibition rules.
- **`grouped`** - check the grouping window and group fields.
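
To see which mechanism dominates, the suppression counter can be broken down by its `reason` label. This is a rough sketch that parses the Prometheus text format with `awk`; `suppressed_by_reason` is a hypothetical helper name, and it assumes the value is the last field on each line.

```bash
# Print "<reason> <count>" for alertkube_alerts_suppressed_total, largest first.
# Usage: curl -s http://localhost:9090/metrics | suppressed_by_reason
suppressed_by_reason() {
  awk '
    /^alertkube_alerts_suppressed_total[{]/ {
      if (match($0, /reason="[^"]*"/)) {
        # Drop the leading reason=" (8 characters) and the closing quote.
        reason = substr($0, RSTART + 8, RLENGTH - 9)
        count[reason] += $NF
      }
    }
    END { for (r in count) printf "%s %d\n", r, count[r] }
  ' | sort -k2,2nr
}
```

The reason at the top of the output is the one to investigate first using the list above.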

Sinks are getting rate-limited¶

Symptom: alertkube_dispatch_inflight stays pinned high (e.g., at 20+ for minutes) while alertkube_sink_errors_total is rising.

Identify the sink:

```bash
curl http://localhost:9090/metrics | grep alertkube_dispatch_inflight | sort -t'{' -k2
```

Raise that sink's rate or enable grouping:

```yaml
sinkRates:
  pagerduty:
    perSecond: 20     # was 1
    burst: 50         # was 5
```
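
Before changing rates, a back-of-the-envelope drain time can show whether the new values are enough. The sketch below assumes a standard token bucket that starts full and receives no new alerts while draining; this page does not confirm that alertkube's limiter works exactly this way. `drain_seconds` is a hypothetical helper, and the storm size of 500 is an invented figure.

```bash
# Estimate the seconds needed to release QUEUED sends through a token bucket.
# The first BURST sends leave immediately; the rest drain at PER_SECOND.
# The body runs in a subshell so its variables stay local.
drain_seconds() (
  queued=$1 per_second=$2 burst=$3
  remaining=$((queued - burst))
  if [ "$remaining" -le 0 ]; then
    echo 0
  else
    # Integer ceiling of remaining / per_second.
    echo $(( (remaining + per_second - 1) / per_second ))
  fi
)

drain_seconds 500 1 5     # old values: prints 495
drain_seconds 500 20 50   # new values: prints 23
```

If the estimate is still measured in minutes after raising the rate, grouping is likely the better lever, because it reduces the number of sends instead of speeding them up.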

Pod enrichment is being skipped¶

Symptom: alertkube_enrichment_saturated_total is rising, and pod alerts lack the Container logs block.

The enrichment worker pool is full. The alert still sends; only log/event enrichment is skipped. Reduce pressure with grouping, a longer muteSeconds, or behavior.disableLogCollection: true.

Receiver is rejecting Alertmanager webhooks¶

Symptom: Alertmanager sends webhooks to /api/v1/alerts, but they are rejected with 401 Unauthorized or 503 Service Unavailable.

Check:

Is the receiver enabled?

kubectl get cm alertkube-config -o yaml | grep -A 2 receiver:

If a token is configured, Alertmanager must send Authorization: Bearer <token>.

curl -X POST http://localhost:9090/api/v1/alerts \
  -H "Authorization: Bearer $(kubectl get secret alertkube -o jsonpath='{.data.receiverToken}' | base64 -d)" \
  -H "Content-Type: application/json" \
  -d '{"alerts": []}'

If you get 503, the controller has not installed the handler yet; check pod logs and informer sync.

High churn on resolved alerts¶

Symptom: alertkube_alerts_total with severity=resolved is rising rapidly, or PagerDuty incidents are closing and re-opening repeatedly.

resolveTTLSeconds is probably too short for the workload. It must be greater than 300 seconds and should usually be close to or above muteSeconds.

```bash
kubectl get cm alertkube-config -o yaml | grep resolveTTLSeconds
```
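
The two constraints above can be checked mechanically once you have read both values from the config. This is only a sketch: `check_resolve_ttl` is a hypothetical helper, and treating any value below `muteSeconds` as a warning is a simplification of "close to or above".

```bash
# Compare resolveTTLSeconds ($1) against the documented floor and muteSeconds ($2).
check_resolve_ttl() (
  ttl=$1 mute=$2
  if [ "$ttl" -le 300 ]; then
    echo "invalid: resolveTTLSeconds=$ttl must be greater than 300"
  elif [ "$ttl" -lt "$mute" ]; then
    echo "warn: resolveTTLSeconds=$ttl is below muteSeconds=$mute"
  else
    echo "ok"
  fi
)

check_resolve_ttl 120 600   # prints the "invalid" message
check_resolve_ttl 400 600   # prints the "warn" message
check_resolve_ttl 900 600   # prints "ok"
```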

Health checks¶

Is the controller ready to serve traffic?¶

curl -s http://localhost:9090/readyz
# Returns 200 OK once informer caches sync
# Returns 503 Service Unavailable while syncing or in follower mode (leader-election)
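
When scripting around a rollout, it can be handy to poll `/readyz` until it returns 200. The loop below is a sketch under the assumptions on this page (port-forward on `localhost:9090`, 503 while syncing or in follower mode). `wait_until_ready`, `probe_readyz`, and the `READYZ_URL` variable are hypothetical names, and the probe is passed in as a command so it can be swapped out.

```bash
# Print the HTTP status code returned by /readyz.
probe_readyz() {
  curl -s -o /dev/null -w '%{http_code}' "${READYZ_URL:-http://localhost:9090/readyz}"
}

# wait_until_ready [probe_command] [attempts] [delay_seconds]
# Runs in a subshell, so "exit" only ends the function.
wait_until_ready() (
  probe=${1:-probe_readyz} attempts=${2:-30} delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    code=$("$probe")
    if [ "$code" = "200" ]; then
      echo "ready after $i attempt(s)"
      exit 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s), last status: $code" >&2
  exit 1
)
```

Call `wait_until_ready` with no arguments to use the defaults. A replica in follower mode keeps answering 503, so a timeout there does not by itself indicate a fault.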

Is the controller alive?¶

curl -s http://localhost:9090/healthz
# Always returns 200 OK once the HTTP server starts

Are the API endpoints accessible?¶

# /api/alerts (read-only alerts introspection)
curl -s http://localhost:9090/api/alerts | jq .

# /api/v1/alerts (Alertmanager webhook receiver)
curl -X POST http://localhost:9090/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '{"alerts": []}'
