Backup & Disaster Recovery
A single-node cluster gets its resilience from tested restore drills, not redundant infrastructure. Three pieces:
- Velero for all Kubernetes objects + PVC contents.
- etcd-defrag weekly CronJob to keep etcd’s database compact.
- Forgejo backups for the Git source of truth.
Velero
Velero is deployed via Helm in the velero namespace. It backs up to an
off-host S3-compatible store (the homelab MinIO + a third-party S3 bucket
for true off-site).
# Manual backup
velero backup create manual-$(date +%F) --wait
# List + inspect
velero backup get
velero backup describe <name>
Schedules live in the chart values; daily backups are retained for 30 days, weekly backups for 90. A periodic restore drill runs as a CronJob: it restores the most recent backup into a throwaway namespace and verifies that the manifests reapply cleanly. If the drill fails, an alert fires.
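The restore drill could look roughly like the following. This is a minimal sketch under stated assumptions, not the CronJob's actual script: the namespace names, the `drill-` restore name, and the use of the Restore object's phase as a stand-in for the "manifests reapply cleanly" check are all hypothetical. It assumes Velero runs in the velero namespace, as described above.

```shell
#!/bin/sh
# Restore-drill sketch. Assumptions: Velero lives in the "velero" namespace;
# the source and scratch namespace names are supplied by the caller.

# Print the name of the newest Backup object (sorted by creation time).
latest_backup() {
  kubectl -n velero get backups.velero.io \
    --sort-by=.metadata.creationTimestamp \
    --no-headers -o custom-columns=NAME:.metadata.name | tail -n 1
}

# Restore one namespace from the newest backup into a scratch namespace,
# then fail unless the Restore object reports phase "Completed".
restore_drill() {
  src_ns="$1"
  scratch_ns="$2"
  backup="$(latest_backup)"
  if [ -z "$backup" ]; then
    echo "restore drill: no backups found" >&2
    return 1
  fi
  restore="drill-$(date +%Y%m%d%H%M%S)"
  velero restore create "$restore" \
    --from-backup "$backup" \
    --include-namespaces "$src_ns" \
    --namespace-mappings "$src_ns:$scratch_ns" \
    --wait || return 1
  phase="$(kubectl -n velero get restores.velero.io "$restore" \
    -o jsonpath='{.status.phase}')"
  [ "$phase" = "Completed" ]
}

# Example (hypothetical namespaces):
#   restore_drill my-app restore-drill && kubectl delete namespace restore-drill
```

A non-zero exit from `restore_drill` fails the Job, which is one way to let the existing alerting pick up a broken drill.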
Velero gotchas (learned the hard way)
- Chart + appVersion + plugin image versions must align. A mismatched plugin (e.g., velero-plugin-for-aws:v1.10 against velero:v1.13) silently produces empty backups.
- upgradeCRDs: true on chart bumps. Without it, new fields in Backup/Restore CRDs are dropped on apply.
- Exclude nginx runtime emptyDirs from filesystem backups. The tmp/ dirs in ingress-nginx-controller change every second and saturate the backup throughput.
- Distroless kubectl image needs glibc. The default kubectl image in some chart versions uses Chainguard Wolfi (musl); some Velero hooks need glibc. Override to alpine/k8s or chainguard/kubectl:latest-glibc.
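Because a version mismatch fails silently, it may be worth checking the deployed images against explicit pins after every chart bump. The sketch below assumes the Deployment is named velero (which depends on the Helm release name) and that plugins run as init containers of that Deployment; the pins passed in are whatever the plugin's compatibility table prescribes for the Velero version in use.

```shell
#!/bin/sh
# Image-pin check sketch. Assumption: the Velero Deployment is named
# "velero" in the "velero" namespace.

# Print every init-container and container image of the Deployment,
# one image reference per line.
velero_images() {
  kubectl -n velero get deployment velero \
    -o jsonpath='{range .spec.template.spec.initContainers[*]}{.image}{"\n"}{end}{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'
}

# Succeed only if every expected image reference is deployed verbatim.
assert_images() {
  deployed="$(velero_images)"
  for want in "$@"; do
    if ! printf '%s\n' "$deployed" | grep -qxF -- "$want"; then
      echo "missing or mismatched image: $want" >&2
      return 1
    fi
  done
}

# Example (substitute the pins from the chart values):
#   assert_images "velero/velero:<tag>" "velero/velero-plugin-for-aws:<tag>"
```

The match is exact, so a registry prefix or digest suffix in the deployed reference has to appear in the pin as well.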
etcd-defrag CronJob
Etcd’s database fragments over time; sustained writes inflate disk usage even when the logical content is unchanged. A CronJob runs weekly:
schedule: "0 5 * * 0" # Sundays 05:00 UTC
command:
- etcd-defrag
- --endpoints=https://127.0.0.1:2379
- --cacert=/etc/kubernetes/pki/etcd/ca.crt
- --cert=/etc/kubernetes/pki/etcd/server.crt
- --key=/etc/kubernetes/pki/etcd/server.key
- --cluster
The CronJob mounts the kubeadm PKI directory read-only via hostPath and tolerates the
control-plane NoSchedule taint. Failed runs trigger the KubeJobFailed alert
unless the failed Job is cleaned up.
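To see how much space a defrag run would reclaim, the on-disk size can be compared with the in-use size reported by etcd. The sketch below is an illustration, not part of the CronJob: it assumes etcdctl and jq are available on the control-plane host, and the JSON field names (`dbSize`, `dbSizeInUse`) should be verified against the etcd version in use.

```shell
#!/bin/sh
# Percentage of the etcd database file that defragmentation could reclaim.
# Arguments: total on-disk size and in-use size, both in bytes.
reclaimable_pct() {
  db_size="$1"
  in_use="$2"
  if [ "$db_size" -le 0 ]; then
    echo 0
    return 0
  fi
  echo $(( (db_size - in_use) * 100 / db_size ))
}

# Query the local etcd member and print the reclaimable percentage.
# Assumes etcdctl (v3 API) and jq are installed; the certificate paths are
# the kubeadm paths used by the CronJob above.
etcd_fragmentation() {
  status="$(etcdctl endpoint status --write-out=json \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key)" || return 1
  size="$(printf '%s' "$status" | jq '.[0].Status.dbSize')"
  used="$(printf '%s' "$status" | jq '.[0].Status.dbSizeInUse')"
  reclaimable_pct "$size" "$used"
}
```

Running `etcd_fragmentation` before and after the weekly job gives a quick confirmation that the defrag actually ran.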
Forgejo backups
forgejo-backup.timer (systemd user unit on the host) runs daily at
03:00 and dumps data/forgejo/ + a Postgres backup of forgejo-db to a
tarball outside the Forgejo container’s mount tree. Retention: last 14
daily + last 8 weekly.
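The retention rule can be sketched as a small pruning helper. The archive naming scheme (`daily-YYYY-MM-DD.tar.gz`, `weekly-YYYY-MM-DD.tar.gz`) and the backup directory in the example are hypothetical; the actual backup script may name and prune its tarballs differently.

```shell
#!/bin/sh
# Keep the newest N archives with a given prefix and delete the rest.
# Assumes names like daily-2024-05-01.tar.gz, so a reverse lexical sort
# puts the newest archive first. The naming scheme is hypothetical.
prune_keep() {
  dir="$1"
  prefix="$2"
  keep="$3"
  ls -1 "$dir" | grep "^${prefix}-" | sort -r | tail -n "+$((keep + 1))" |
    while IFS= read -r name; do
      rm -f -- "$dir/$name"
    done
}

# Example: apply the retention policy from the text.
#   prune_keep /srv/backups/forgejo daily 14
#   prune_keep /srv/backups/forgejo weekly 8
```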
The backup script excludes data/forgejo/{queues,indexers,tmp} — those
regenerate on Forgejo restart and inflate the tarball.
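The exclusions map directly onto tar's `--exclude` option. The sketch below covers only the data/forgejo/ part of the backup; the Postgres dump is omitted because how forgejo-db is reached is not described here. The function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Archive data/forgejo/ while skipping the regenerable directories.
# Arguments: the directory that contains data/, and the output tarball path.
forgejo_tarball() {
  root="$1"
  out="$2"
  tar -C "$root" -czf "$out" \
    --exclude='data/forgejo/queues' \
    --exclude='data/forgejo/indexers' \
    --exclude='data/forgejo/tmp' \
    data/forgejo
}

# Example (hypothetical paths):
#   forgejo_tarball /srv/forgejo /srv/backups/forgejo/daily-$(date +%F).tar.gz
```

Using `-C` keeps the archive paths relative, so a restore can unpack into any parent directory.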
What recovery looks like
For a “the SSD died” scenario:
- Provision a new Ubuntu host, copy the kubeadm bootstrap script.
- kubeadm init with the same pod CIDR.
- Apply the bootstrap Secret (ArgoCD + KSOPS age key) and the ArgoCD Helm release.
- ArgoCD self-applies, then pulls every Application; the cluster reconstructs itself from Git.
- velero restore the most recent backup.
- Bring Forgejo’s docker-compose stack back up, restore the Postgres dump + data/forgejo/.
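The command-line portion of these steps can be sketched as a dry-run-by-default script. Everything specific in it is an assumption: the bootstrap-secret.yaml file name, the `argo` Helm repo alias and `argocd` release name, and the argument values. It does not wait for ArgoCD to finish syncing (which has to happen before Velero is available for the restore), and it leaves out restoring the Postgres dump and data/forgejo/.

```shell
#!/bin/sh
# Recovery runbook sketch. With DRY_RUN=1 (the default) each command is
# printed instead of executed. All file names, the Helm repo alias, and the
# release name are hypothetical placeholders.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_cluster() {
  pod_cidr="$1"      # must match the pod CIDR of the lost cluster
  backup_name="$2"   # Velero backup to restore
  run kubeadm init --pod-network-cidr="$pod_cidr" || return 1
  run kubectl apply -f bootstrap-secret.yaml || return 1
  run helm upgrade --install argocd argo/argo-cd \
    --namespace argocd --create-namespace || return 1
  # In a real run, wait here until ArgoCD has synced Velero itself.
  run velero restore create --from-backup "$backup_name" --wait || return 1
  # Run from the directory that holds Forgejo's compose file.
  run docker compose up -d
}

# Example: print the plan without touching anything.
#   recover_cluster <pod-cidr> <backup-name>
```

Printing the plan first makes it easy to review the exact commands during a drill before setting `DRY_RUN=0`.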
The drill takes ~45 minutes on tested hardware. The first 30 are mostly waiting for ArgoCD’s sync loops; the last 15 are Forgejo + Postgres restore.