Day-2 Operations

The runbook lives in docs/operations-runbook.md; this page is the executive summary plus the patterns you’ll recognize when something breaks.

Cadence

When	What
Daily	Nothing manual. Alerts page; the absence of an alert is the green light.
Weekly	Glance at the Grafana homelab-overview dashboard and run `kubectl top pods -n monitoring`.
Monthly	PVC usage check on Prometheus + Loki. Cert health. Host disk hygiene (containerd image cache).
Quarterly	Bump Helm chart versions (kube-prometheus-stack, Loki, ingress-nginx). Review noisy alerts.
Annually	Renew kubeadm certs (a timer renews them automatically once fewer than 30 days remain, but a manual sanity check is still worthwhile).
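
For the monthly host disk check, a small filter over `df` output can make it obvious which mounts need attention. This is a minimal sketch, not part of the runbook: the function name `over_threshold` and the 80% default are assumptions chosen for illustration.

```shell
# Print "<mount> <use%>" for every filesystem at or above a usage threshold.
# Reads POSIX-format `df -P` output on stdin; the 80% default is an assumption.
over_threshold() {
  limit="${1:-80}"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the df header row
      use = $5
      sub(/%/, "", use)           # "91%" -> "91"
      if (use + 0 >= limit) print $6 " " $5
    }'
}

# Usage on the host:
#   df -P | over_threshold 80
```

If the containerd image cache turns out to be what is filling the disk, `crictl rmi --prune` removes images that no container is using.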

Troubleshooting cheat sheet

Symptom	Likely cause	First check
`kubectl` returns “credentials” error	`admin.conf` cert expired	`kubeadm certs check-expiration`
ntfy spam on boot	Chronic alert backlog catching up	Alertmanager `/api/v2/alerts`
Grafana CrashLoopBackOff after reboot	initChownData container hit `EACCES` on 0700 dirs	`kubectl logs … -c init-chown-data`
Loki query empty	Alloy not running, or Loki not ready	`loki_get /ready` via Prometheus probe
Alert fires but no ntfy	`http_config.headers` regression in Alertmanager	Check receiver uses `url_file:` only
`KubeJobFailed` in kube-system	etcd-defrag CronJob ran and failed, but no etcd-defrag pod is left to inspect	`kubectl describe cronjob etcd-defrag`
kube-proxy in CrashLoopBackOff	`fs.inotify.max_user_instances` too low	`sysctl fs.inotify.max_user_instances`
ServiceMonitor created but no scrape	Selector was written against the Service name instead of its labels	Inspect `droppedTargets` in Prometheus
Pod stuck `Pending` with no events	`LimitRange` or `ResourceQuota` violation	`kubectl describe pod`
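
For the ServiceMonitor row, one quick way to confirm a label-versus-name mix-up is to ask the API server which Services actually carry the labels the selector expects. The sketch below assumes the selector is passed in `key=value` form; the function name, namespace, and label in the usage line are hypothetical.

```shell
# List Services in a namespace that carry the given label selector.
# An empty result means a ServiceMonitor using that selector has nothing to scrape.
check_selector() {
  ns="$1"
  selector="$2"
  matches="$(kubectl -n "$ns" get svc -l "$selector" -o name)"
  if [ -z "$matches" ]; then
    echo "no Service in $ns has labels $selector"
    return 1
  fi
  echo "$matches"
}

# Usage (hypothetical namespace and label):
#   check_selector monitoring app.kubernetes.io/name=my-app
```

If this comes back empty while `kubectl -n <namespace> get svc` shows the Service by name, the selector is most likely keyed on the name rather than on a label.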

The “should I touch this?” decision tree

Is it noisy + has no underlying problem?
    └─► Disable the rule (defaultRules.disabled).
Is it noisy + has an underlying problem?
    └─► Fix the underlying problem.
        (Don't silence indefinitely — silences expire.)
Is it firing + you care?
    └─► Use Alertmanager silence with a deadline,
        not Prometheus rule disable.
Is it firing + you don't care?
    └─► You will care later. Either tune the threshold
        or route it to a low-priority receiver.
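
A silence with a deadline can be created through the same Alertmanager v2 API used elsewhere on this page. The sketch below only builds the JSON body for `POST /api/v2/silences`; the function name, the `ops` fallback author, and the example alert are assumptions, and it relies on a `date` that accepts `-d @epoch` (GNU or BusyBox).

```shell
# Build a silence body that expires after a fixed number of hours.
# Values are inserted verbatim, so they must not contain double quotes.
silence_payload() {
  alertname="$1"
  hours="$2"
  comment="$3"
  now="$(date -u +%s)"
  end=$((now + hours * 3600))
  starts_at="$(date -u -d "@$now" +%Y-%m-%dT%H:%M:%SZ)"
  ends_at="$(date -u -d "@$end" +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],"startsAt":"%s","endsAt":"%s","createdBy":"%s","comment":"%s"}\n' \
    "$alertname" "$starts_at" "$ends_at" "${USER:-ops}" "$comment"
}

# Usage, assuming Alertmanager is reachable on localhost:9093 (e.g. via port-forward):
#   silence_payload KubeJobFailed 2 'investigating etcd-defrag' |
#     curl -s -X POST -H 'Content-Type: application/json' -d @- \
#       http://localhost:9093/api/v2/silences
```

Because the end time is computed up front, the silence lapses on its own and the alert resurfaces if the underlying problem is still there.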

Validation script

Health-check the whole observability stack end-to-end:

./scripts/validate-observability.sh
# Optionally, also fire + auto-resolve a test alert through ntfy:
./scripts/validate-observability.sh --send-test-alert

The script exits non-zero on any failure, which makes it suitable for cron or a Forgejo scheduled action.
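
When the script runs unattended, it helps to surface the exit code in a single line that cron mail or a CI log will show. A minimal sketch of such a wrapper follows; `run_check` is a hypothetical helper, not something the repository is known to ship.

```shell
# Run a check command; on failure, print a one-line summary to stderr and
# propagate the exit code so cron or a CI job registers the failure.
run_check() {
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "check failed (exit $status): $*" >&2
  fi
  return "$status"
}

# Usage from the repository root:
#   run_check ./scripts/validate-observability.sh
```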

kubectl shortcuts worth memorizing

# What's not Running?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
 
# What's the unsilenced firing list?
kubectl -n monitoring exec alertmanager-… -c alertmanager -- \
  wget -qO- 'http://localhost:9093/api/v2/alerts?active=true&silenced=false&inhibited=false'
 
# What's Argo not happy about?
kubectl -n argocd get applications -o json | \
  jq '.items[] | select(.status.sync.status != "Synced" or .status.health.status != "Healthy") | .metadata.name'

When in doubt, look at memory

There’s a ~/.claude/projects/-home-mike-projects-sky-palette/memory/ directory with feedback files like:

feedback_kubeadm_localhost_metrics_bind.md
feedback_grafana_initchowndata_eacces.md
feedback_prometheus_servicemonitor_label_vs_name.md
feedback_ubuntu_desktop_timesyncd_silent.md

Each file captures a single past incident as a recipe. Most of the entries in the troubleshooting table above started life there.
