Day-2 Operations
The runbook lives in docs/operations-runbook.md; this page is the executive summary plus the patterns you’ll recognize when something breaks.
Cadence
| When | What |
|---|---|
| Daily | Nothing manual. Alerts page; the absence of an alert is the green light. |
| Weekly | Glance at the Grafana homelab-overview dashboard. Run kubectl top pods -n monitoring. |
| Monthly | PVC usage check on Prometheus + Loki. Cert health. Host disk hygiene (containerd image cache). |
| Quarterly | Bump Helm chart versions (kube-prometheus-stack, Loki, ingress-nginx). Review noisy alerts. |
| Annual | Renew kubeadm certs (a timer renews them automatically once fewer than 30 days remain, but a manual sanity check is worthwhile). |
Troubleshooting cheat sheet
| Symptom | Likely cause | First check |
|---|---|---|
| kubectl returns “credentials” error | admin.conf cert expired | kubeadm certs check-expiration |
| ntfy spam on boot | Chronic alert backlog catching up | Alertmanager /api/v2/alerts |
| Grafana CrashLoopBackOff after reboot | initChownData container hit EACCES on 0700 dirs | kubectl logs … -c init-chown-data |
| Loki query empty | Alloy not running, or Loki not ready | loki_get /ready via Prometheus probe |
| Alert fires but no ntfy | http_config.headers regression in Alertmanager | Check receiver uses url_file: only |
| KubeJobFailed in kube-system | etcd-defrag CronJob ran but no etcd-defrag pod left to inspect | kubectl describe cronjob etcd-defrag |
| kube-proxy in CrashLoopBackOff | fs.inotify.max_user_instances too low | sysctl fs.inotify.max_user_instances |
| ServiceMonitor created but no scrape | Selector matches Service name not labels | Inspect droppedTargets in Prometheus |
| Pod stuck Pending with no events | LimitRange or ResourceQuota violation | kubectl describe pod |
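For the kube-proxy row, the first check can be turned into a pass/fail test. The sketch below is only an illustration: `inotify_instances_ok` is a hypothetical helper name, the minimum is supplied by the caller (this page does not state what counts as "too low"), and the procfs path is the standard Linux location that `sysctl fs.inotify.max_user_instances` reads from.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the inotify instance limit is at least $1.
# $2 optionally overrides the file to read (defaults to the Linux procfs path).
inotify_instances_ok() {
    min=$1
    file=${2:-/proc/sys/fs/inotify/max_user_instances}
    # Return 2 if the value cannot be read at all.
    current=$(cat "$file") || return 2
    echo "fs.inotify.max_user_instances=$current (want >= $min)"
    # Exit status of the function is the result of this comparison.
    [ "$current" -ge "$min" ]
}
```

Usage might look like `inotify_instances_ok 512 || echo "raise the limit"`, where 512 is an assumed threshold rather than a value taken from the runbook.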
The “should I touch this?” decision tree
Is it noisy + has no underlying problem?
  └─► Disable the rule (defaultRules.disabled).
Is it noisy + has an underlying problem?
  └─► Fix the underlying problem.
      (Don't silence indefinitely; silences expire.)
Is it firing + you care?
  └─► Use Alertmanager silence with a deadline,
      not Prometheus rule disable.
Is it firing + you don't care?
  └─► You will care later. Either tune the threshold
      or route it to a low-priority receiver.
Validation script
Health-check the whole observability stack end-to-end:
./scripts/validate-observability.sh
# Optionally, also fire + auto-resolve a test alert through ntfy:
./scripts/validate-observability.sh --send-test-alert
Returns non-zero on any failure. Suitable for cron / a Forgejo scheduled action.
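Because the script exits non-zero on any failure, a scheduled job only has to act on the exit status. A minimal sketch of such a wrapper follows; `run_check` and its log format are hypothetical and not part of the repository. Only the script path comes from this page.

```shell
#!/bin/sh
# Hypothetical wrapper: run a health-check command and report on its exit status.
# run_check is an illustrative name, not something shipped in scripts/.
run_check() {
    # "$@" is the command to run, e.g. ./scripts/validate-observability.sh
    if "$@"; then
        echo "OK: $*"
        return 0
    else
        status=$?   # exit status of the failed command
        echo "FAIL (exit $status): $*" >&2
        return "$status"
    fi
}

# Example (commented out so that sourcing this file has no side effects):
# run_check ./scripts/validate-observability.sh || exit 1
```

The wrapper passes the original exit code through, so cron mail or a scheduled action's failure state still reflects what the validation script reported.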
kubectl shortcuts worth memorizing
# What's not Running?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# What's the unsilenced firing list?
kubectl -n monitoring exec alertmanager-… -c alertmanager -- \
wget -qO- 'http://localhost:9093/api/v2/alerts?active=true&silenced=false&inhibited=false'
# What's Argo not happy about?
kubectl -n argocd get applications -o json | \
jq '.items[] | select(.status.sync.status != "Synced" or .status.health.status != "Healthy") | .metadata.name'
When in doubt, look at memory
There’s a ~/.claude/projects/-home-mike-projects-sky-palette/memory/
directory with feedback files like:
feedback_kubeadm_localhost_metrics_bind.md
feedback_grafana_initchowndata_eacces.md
feedback_prometheus_servicemonitor_label_vs_name.md
feedback_ubuntu_desktop_timesyncd_silent.md
Each entry captures a single previous incident as a recipe. Most of the entries in the troubleshooting table above started life there.
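Since each incident is one file, a keyword search over file names and bodies is usually enough to find the matching recipe. The helper below is a sketch under assumptions: `find_feedback` is a hypothetical name, and it assumes every entry follows the `feedback_*.md` naming seen in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: list feedback files whose name or body mentions a keyword.
# find_feedback is an illustrative name; the directory is passed in explicitly.
find_feedback() {
    dir=$1
    keyword=$2
    for f in "$dir"/feedback_*.md; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # -q: quiet, -i: case-insensitive, -F: fixed string rather than a regex
        if basename "$f" | grep -qiF -- "$keyword" || grep -qiF -- "$keyword" "$f"; then
            echo "$f"
        fi
    done
}
```

For example, `find_feedback ~/.claude/projects/-home-mike-projects-sky-palette/memory grafana` would print the path of any entry that mentions Grafana in its name or text.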