Day-2 Operations

The runbook lives in docs/operations-runbook.md; this page is the executive summary plus the patterns you’ll recognize when something breaks.

Cadence

| When | What |
| --- | --- |
| Daily | Nothing manual. Alerts page; the absence of an alert is the green light. |
| Weekly | Glance at the Grafana homelab-overview dashboard. kubectl top pods -n monitoring. |
| Monthly | PVC usage check on Prometheus + Loki. Cert health. Host disk hygiene (containerd image cache). |
| Quarterly | Bump Helm chart versions (kube-prometheus-stack, Loki, ingress-nginx). Review noisy alerts. |
| Annual | Renew kubeadm certs (auto-done by timer < 30d, but worth a manual sanity check). |
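
The monthly PVC usage check can be partly scripted. The sketch below is one possible approach, not part of the runbook: a hypothetical helper, flag_full_volumes, reads `df -P` output on stdin and prints every mount point at or above a threshold. The 80% default is an assumption; pick whatever margin suits your volumes.

```shell
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <capacity>" for every filesystem whose usage is at or above
# the given percentage. The 80% default is an assumption, not a runbook value.
# Mount points containing spaces are not handled.
flag_full_volumes() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    NR > 1 {                 # skip the df header row
      use = $5               # POSIX df: column 5 is Capacity, e.g. "83%"
      sub(/%$/, "", use)     # strip the trailing percent sign
      if (use + 0 >= t) print $6 " " $5
    }'
}
```

Usage would look something like `kubectl -n monitoring exec <pod> -- df -P | flag_full_volumes 80`, where `<pod>` stands for the Prometheus or Loki pod name in your cluster.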

Troubleshooting cheat sheet

| Symptom | Likely cause | First check |
| --- | --- | --- |
| kubectl returns “credentials” error | admin.conf cert expired | kubeadm certs check-expiration |
| ntfy spam on boot | Chronic alert backlog catching up | Alertmanager /api/v2/alerts |
| Grafana CrashLoopBackOff after reboot | initChownData container hit EACCES on 0700 dirs | kubectl logs … -c init-chown-data |
| Loki query empty | Alloy not running, or Loki not ready | loki_get /ready via Prometheus probe |
| Alert fires but no ntfy | http_config.headers regression in Alertmanager | Check receiver uses url_file: only |
| KubeJobFailed in kube-system | etcd-defrag CronJob ran but no etcd-defrag pod left to inspect | kubectl describe cronjob etcd-defrag |
| kube-proxy in CrashLoopBackOff | fs.inotify.max_user_instances too low | sysctl fs.inotify.max_user_instances |
| ServiceMonitor created but no scrape | Selector written against the Service name, not its labels | Inspect droppedTargets in Prometheus |
| Pod stuck Pending with no events | LimitRange or ResourceQuota violation | kubectl describe pod |
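
As an illustration of turning one row into a repeatable check, here is a minimal sketch for the kube-proxy / inotify row. The function name check_inotify_instances and the 512 minimum are assumptions chosen for the example; the right floor depends on how many watchers the host runs.

```shell
# Hypothetical helper: compare fs.inotify.max_user_instances with a minimum.
# Arg 1: minimum acceptable value (512 is an assumed default, not a
#        documented requirement).
# Arg 2: current value; when omitted it is read from /proc on the local host.
check_inotify_instances() {
  local minimum="${1:-512}"
  local current="${2:-$(cat /proc/sys/fs/inotify/max_user_instances)}"
  if [ "$current" -lt "$minimum" ]; then
    echo "LOW: fs.inotify.max_user_instances=$current (want >= $minimum)"
    return 1
  fi
  echo "OK: fs.inotify.max_user_instances=$current"
}
```

Run on the node itself, `check_inotify_instances 512` exits non-zero when the limit is below the chosen floor, so it can sit alongside other host checks.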

The “should I touch this?” decision tree

Is it noisy + has no underlying problem?
    └─► Disable the rule (defaultRules.disabled).
Is it noisy + has an underlying problem?
    └─► Fix the underlying problem.
        (Don't silence indefinitely — silences expire.)
Is it firing + you care?
    └─► Use Alertmanager silence with a deadline,
        not Prometheus rule disable.
Is it firing + you don't care?
    └─► You will care later. Either tune the threshold
        or route it to a low-priority receiver.
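
For the "silence with a deadline" branch, a thin wrapper around amtool keeps the expiry and the reason mandatory. This is a sketch under assumptions: silence_with_deadline is a hypothetical name, amtool is assumed to be installed where you run it, and the http://localhost:9093 default assumes a port-forward or in-pod execution.

```shell
# Hypothetical wrapper: create an Alertmanager silence that always carries a
# duration and a comment, so nothing is silenced without an expiry or reason.
# Assumes amtool is on PATH; ALERTMANAGER_URL defaults to an assumed local
# address.
silence_with_deadline() {
  if [ "$#" -lt 3 ]; then
    echo "usage: silence_with_deadline <alertname> <duration> <comment> [author]" >&2
    return 2
  fi
  local alertname="$1" duration="$2" comment="$3"
  local author="${4:-${USER:-unknown}}"
  amtool silence add "alertname=${alertname}" \
    --alertmanager.url="${ALERTMANAGER_URL:-http://localhost:9093}" \
    --duration="$duration" \
    --author="$author" \
    --comment="$comment"
}
```

For example, `silence_with_deadline KubeJobFailed 2h "etcd-defrag fix in progress"` silences that alert for two hours and records why.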

Validation script

Health-check the whole observability stack end-to-end:

./scripts/validate-observability.sh
# Optionally, also fire + auto-resolve a test alert through ntfy:
./scripts/validate-observability.sh --send-test-alert

The script exits non-zero on any failure, which makes it suitable for cron or a Forgejo scheduled action.
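
Because the exit code is the only contract, a scheduler needs nothing more than a wrapper that runs the script and surfaces the failure. The sketch below shows that shape; run_and_report is a hypothetical helper, and where the message should go (stderr, mail, ntfy) is left open.

```shell
# Hypothetical wrapper: run a command, report on stderr if it fails, and
# propagate the original exit code so the scheduler also sees the failure.
run_and_report() {
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "FAILED (exit $status): $*" >&2
  fi
  return "$status"
}
```

A cron entry or scheduled job would then call `run_and_report ./scripts/validate-observability.sh` from the repository root.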

kubectl shortcuts worth memorizing

# What's not Running?
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
 
# What's the unsilenced firing list?
kubectl -n monitoring exec alertmanager-… -c alertmanager -- \
  wget -qO- 'http://localhost:9093/api/v2/alerts?active=true&silenced=false&inhibited=false'
 
# What's Argo not happy about?
kubectl -n argocd get applications -o json | \
  jq '.items[] | select(.status.sync.status != "Synced" or .status.health.status != "Healthy") | .metadata.name'

When in doubt, look at memory

There’s a ~/.claude/projects/-home-mike-projects-sky-palette/memory/ directory with feedback files like:

  • feedback_kubeadm_localhost_metrics_bind.md
  • feedback_grafana_initchowndata_eacces.md
  • feedback_prometheus_servicemonitor_label_vs_name.md
  • feedback_ubuntu_desktop_timesyncd_silent.md

Each file captures one past incident as a recipe. Most of the entries in the troubleshooting table above started life there.
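
When a symptom looks familiar, searching those files by keyword is usually faster than browsing them. A minimal sketch, assuming a hypothetical helper named search_memory and plain case-insensitive grep:

```shell
# Hypothetical helper: list the feedback files that mention a keyword
# (case-insensitive). The default directory is the one named above; pass a
# different directory as the second argument if yours lives elsewhere.
search_memory() {
  local keyword="$1"
  local dir="${2:-$HOME/.claude/projects/-home-mike-projects-sky-palette/memory}"
  grep -ril -- "$keyword" "$dir"
}
```

For example, `search_memory eacces` prints the paths of the files that mention EACCES, and it exits non-zero when nothing matches.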