Observability Stack
Three signals (metrics, logs, traces-via-stdout), one Grafana, one notification target.
apps + cluster components
│ /metrics stdout (JSON) alerts
▼ ▼ ▼
ServiceMonitors ──► Prometheus Alloy ──► Loki Alertmanager ──► ntfy.sh
│ │ (DaemonSet) │
└────────────────► Grafana ◄──────────────────────┘
(dashboards as ConfigMaps)Everything lives in the monitoring namespace under PodSecurityStandards
enforce: privileged — node-exporter needs hostPath access that
baseline doesn’t allow.
kube-prometheus-stack
Standard chart, version-pinned in
infrastructure/monitoring/kube-prometheus-stack/values.yaml. Notable
overrides:
serviceMonitorSelectorNilUsesHelmValues: false— Prometheus picks up ServiceMonitors from any namespace, not just chart-labeled ones.scrapeInterval: 30s— half the chart default; for a homelab the extra resolution doesn’t justify doubling on-disk volume.retention: 15d,retentionSize: 20GB— bounded by the 30 GiB PVC.defaultRules.disabled—etcdMembersDownandetcdInsufficientMembersoff because they have no meaning on a single-node cluster.grafana.initChownData.enabled: false— the chart’s init container CrashLoopBackOffs after Grafana renders any CSV/PDF/PNG (subdirs at mode 0700 +drop ALL, add [CHOWN]= noCAP_DAC_OVERRIDE= chown can’t recurse).fsGroup: 472already handles ownership at mount.
Loki SingleBinary + Alloy
Loki runs in SingleBinary mode (one StatefulSet pod, embedded compactor + ingester + querier). Alloy is a DaemonSet — one pod per node — that follows every container log file via the Kubernetes API and ships to Loki’s push endpoint.
A specific Loki 3.x gotcha: when retention_enabled: true, the chart
requires compactor.delete_request_store: filesystem or the compactor
crashloops with “compactor.delete-request-store should be configured when
retention is enabled.”
Alertmanager → ntfy.sh
Native webhook_configs straight to ntfy.sh, no sidecar translator. The
ntfy topic URL is SOPS-encrypted in
infrastructure/monitoring/alertmanager-ntfy/url-secret.sops.yaml with
two keys:
url-default— forseverity=warning(4h repeat)url-critical— forseverity=critical(1h repeat)
Priority, title, and tags are encoded as URL query parameters on the
ntfy URL itself, not HTTP headers. The Prometheus Operator’s config
validator rejects webhook_configs.http_config.headers (“field headers
not found in type config.plain”) and silently falls back to the null
receiver — every alert is dropped. Encoding via query params is the
documented ntfy.sh way around it.
ServiceMonitor selector gotcha
The Prometheus Operator’s ServiceMonitor.spec.selector.matchLabels
matches against the target Service’s labels, not its name. Helm
charts (notably argo-cd 9.5) sometimes give metrics Services labels that
don’t match their names:
| Service name | app.kubernetes.io/name label |
|---|---|
argocd-server-metrics | argocd-server-metrics ✓ matches |
argocd-repo-server-metrics | argocd-repo-server-metrics (NOT argocd-repo-server) |
argocd-application-controller-metrics | argocd-metrics (NOT the name) |
A guessed selector lands the target in droppedTargets with no logged
error. The validation script (next section) catches this.
Validation script
scripts/validate-observability.sh
checks pods, scrape targets, ServiceMonitors, PVC usage, RAM/CPU budget,
and (with --send-test-alert) end-to-end ntfy delivery. It uses the
Prometheus pod as a probe relay so it works against distroless Loki and
Alloy images that have no wget of their own.
./scripts/validate-observability.sh
# 28 passed, 0 failed → Observability stack healthy