Observability Stack

Observability Stack

Three signals (metrics, logs, traces-via-stdout), one Grafana, one notification target.

apps + cluster components
     │  /metrics            stdout (JSON)              alerts
     ▼                      ▼                         ▼
ServiceMonitors ──► Prometheus    Alloy ──► Loki  Alertmanager ──► ntfy.sh
     │                      │      (DaemonSet)        │
     └────────────────► Grafana ◄──────────────────────┘
                   (dashboards as ConfigMaps)

Everything lives in the monitoring namespace under PodSecurityStandards enforce: privileged — node-exporter needs hostPath access that baseline doesn’t allow.

kube-prometheus-stack

Standard chart, version-pinned in infrastructure/monitoring/kube-prometheus-stack/values.yaml. Notable overrides:

  • serviceMonitorSelectorNilUsesHelmValues: false — Prometheus picks up ServiceMonitors from any namespace, not just chart-labeled ones.
  • scrapeInterval: 30s — half the chart default; for a homelab the extra resolution doesn’t justify doubling on-disk volume.
  • retention: 15d, retentionSize: 20GB — bounded by the 30 GiB PVC.
  • defaultRules.disabledetcdMembersDown and etcdInsufficientMembers off because they have no meaning on a single-node cluster.
  • grafana.initChownData.enabled: false — the chart’s init container CrashLoopBackOffs after Grafana renders any CSV/PDF/PNG (subdirs at mode 0700 + drop ALL, add [CHOWN] = no CAP_DAC_OVERRIDE = chown can’t recurse). fsGroup: 472 already handles ownership at mount.

Loki SingleBinary + Alloy

Loki runs in SingleBinary mode (one StatefulSet pod, embedded compactor + ingester + querier). Alloy is a DaemonSet — one pod per node — that follows every container log file via the Kubernetes API and ships to Loki’s push endpoint.

A specific Loki 3.x gotcha: when retention_enabled: true, the chart requires compactor.delete_request_store: filesystem or the compactor crashloops with “compactor.delete-request-store should be configured when retention is enabled.”

Alertmanager → ntfy.sh

Native webhook_configs straight to ntfy.sh, no sidecar translator. The ntfy topic URL is SOPS-encrypted in infrastructure/monitoring/alertmanager-ntfy/url-secret.sops.yaml with two keys:

  • url-default — for severity=warning (4h repeat)
  • url-critical — for severity=critical (1h repeat)

Priority, title, and tags are encoded as URL query parameters on the ntfy URL itself, not HTTP headers. The Prometheus Operator’s config validator rejects webhook_configs.http_config.headers (“field headers not found in type config.plain”) and silently falls back to the null receiver — every alert is dropped. Encoding via query params is the documented ntfy.sh way around it.

ServiceMonitor selector gotcha

The Prometheus Operator’s ServiceMonitor.spec.selector.matchLabels matches against the target Service’s labels, not its name. Helm charts (notably argo-cd 9.5) sometimes give metrics Services labels that don’t match their names:

Service nameapp.kubernetes.io/name label
argocd-server-metricsargocd-server-metrics ✓ matches
argocd-repo-server-metricsargocd-repo-server-metrics (NOT argocd-repo-server)
argocd-application-controller-metricsargocd-metrics (NOT the name)

A guessed selector lands the target in droppedTargets with no logged error. The validation script (next section) catches this.

Validation script

scripts/validate-observability.sh checks pods, scrape targets, ServiceMonitors, PVC usage, RAM/CPU budget, and (with --send-test-alert) end-to-end ntfy delivery. It uses the Prometheus pod as a probe relay so it works against distroless Loki and Alloy images that have no wget of their own.

./scripts/validate-observability.sh
# 28 passed, 0 failed → Observability stack healthy