Backup & Disaster Recovery

A single-node cluster gets its resilience from tested restore drills, not redundant infrastructure. Three pieces:

  1. Velero for all Kubernetes objects + PVC contents.
  2. A weekly etcd-defrag CronJob to keep etcd’s database compact.
  3. Forgejo backups for the Git source of truth.

Velero

Velero is deployed via Helm in the velero namespace. It backs up to off-host S3-compatible storage: the homelab MinIO plus a third-party S3 bucket for true off-site copies.

# Manual backup
velero backup create manual-$(date +%F) --wait
 
# List + inspect
velero backup get
velero backup describe <name>

Schedules live in the chart values: daily backups are retained for 30 days, weekly backups for 90. A periodic restore drill runs as a CronJob. It restores the most recent backup into a throwaway namespace and verifies that the manifests reapply cleanly. If the drill fails, an alert fires.
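
The drill described above could be sketched roughly as follows. This is a minimal sketch, not the actual CronJob: the throwaway namespace name (restore-drill), the restore name prefix (drill-), and the idea of mapping a single source namespace are all assumptions made for illustration. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Restore-drill sketch. Assumptions (not from the real setup): the throwaway
# namespace name, the "drill-" restore name prefix, one source namespace.
set -eu

DRILL_NS="${DRILL_NS:-restore-drill}"   # hypothetical throwaway namespace

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Name of the newest Backup object in the velero namespace.
# Note: this does not filter on phase, so a failed backup could be picked.
latest_backup() {
  kubectl -n velero get backups.velero.io \
    --sort-by=.metadata.creationTimestamp \
    -o jsonpath='{.items[-1:].metadata.name}'
}

# Restore one namespace from a backup into the throwaway namespace,
# show the result for inspection, then delete the throwaway namespace.
drill_restore() {
  backup=$1
  src_ns=$2
  run velero restore create "drill-$backup" \
    --from-backup "$backup" \
    --include-namespaces "$src_ns" \
    --namespace-mappings "$src_ns:$DRILL_NS" \
    --wait
  run velero restore describe "drill-$backup"
  run kubectl delete namespace "$DRILL_NS"
}
```

A real run would look like DRY_RUN=0 drill_restore "$(latest_backup)" <namespace>. Turning the describe output into a pass/fail signal for alerting is left out here.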

Velero gotchas (learned the hard way)

  • Chart + appVersion + plugin image versions must align. A mismatched plugin (e.g., velero-plugin-for-aws:v1.10 against velero:v1.13) silently produces empty backups.
  • Set upgradeCRDs: true on chart bumps. Without it, new fields in the Backup/Restore CRDs are dropped on apply.
  • Exclude nginx runtime emptyDirs from filesystem backups. The tmp/ dirs in ingress-nginx-controller change every second and saturate backup throughput.
  • Distroless kubectl image needs glibc. The default kubectl image in some chart versions is a musl-based Chainguard image, while some Velero hooks need glibc. Override to alpine/k8s or chainguard/kubectl:latest-glibc.
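
For the version-alignment gotcha, one way to see what is actually running is to list the images on the Velero Deployment and compare the tags by eye against Velero's published compatibility matrix. The sketch below assumes the Deployment is named velero and that plugins are injected as init containers, as the Helm chart normally does; image_tag and velero_images are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: list the images of the Velero Deployment with their tags.
# Assumptions: Deployment name "velero", plugins run as init containers.
set -eu

# Strip everything up to the last colon: "velero/velero:v1.13.0" -> "v1.13.0".
# Does not handle digest references (image@sha256:...) or untagged images.
image_tag() {
  printf '%s\n' "${1##*:}"
}

# Print "tag<TAB>image" for every init container (plugins) and container.
velero_images() {
  kubectl -n velero get deployment velero \
    -o jsonpath='{range .spec.template.spec.initContainers[*]}{.image}{"\n"}{end}{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}' |
  while IFS= read -r image; do
    printf '%s\t%s\n' "$(image_tag "$image")" "$image"
  done
}
```

Calling velero_images against the cluster prints one line per image; the script does not know which combinations are compatible, so the comparison itself stays manual.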

etcd-defrag CronJob

Etcd’s database fragments over time; sustained writes inflate disk usage even when the logical content is unchanged. A CronJob runs weekly:

schedule: "0 5 * * 0"   # Sundays 05:00 UTC
command:
  - etcd-defrag
  - --endpoints=https://127.0.0.1:2379
  - --cacert=/etc/kubernetes/pki/etcd/ca.crt
  - --cert=/etc/kubernetes/pki/etcd/server.crt
  - --key=/etc/kubernetes/pki/etcd/server.key
  - --cluster

The CronJob mounts the kubeadm PKI directory read-only via hostPath and tolerates the control-plane NoSchedule taint. A failed run triggers the KubeJobFailed alert, which keeps firing until the failed Job is cleaned up.
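
To check that a defrag run had an effect, the database size can be compared before and after a manually triggered run. This is a sketch under assumptions: the CronJob name (etcd-defrag) and its namespace (kube-system) are guesses, and etcdctl is assumed to be available on the host with read access to the PKI files. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: trigger the defrag CronJob by hand and inspect etcd's DB size.
# Assumptions: CronJob name and namespace below; etcdctl installed on the host.
set -eu

DEFRAG_NS="${DEFRAG_NS:-kube-system}"            # hypothetical namespace
DEFRAG_CRONJOB="${DEFRAG_CRONJOB:-etcd-defrag}"  # hypothetical CronJob name

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Show endpoint status (including DB SIZE) using the same endpoint and
# certificates as the CronJob.
etcd_db_status() {
  run etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint status --write-out=table
}

# Create a one-off Job from the CronJob's template.
defrag_now() {
  run kubectl -n "$DEFRAG_NS" create job "etcd-defrag-manual-$(date +%s)" \
    --from="cronjob/$DEFRAG_CRONJOB"
}
```

A typical sequence would be etcd_db_status, defrag_now, wait for the Job to finish, then etcd_db_status again.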

Forgejo backups

forgejo-backup.timer (a systemd user unit on the host) runs daily at 03:00. It archives data/forgejo/ plus a Postgres dump of forgejo-db into a tarball stored outside the Forgejo container’s mount tree. Retention: the last 14 daily and the last 8 weekly backups.
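
The retention rule can be implemented with a small "keep the newest N" helper. The sketch below assumes tarballs are named with an ISO date after a fixed prefix (for example forgejo-daily-2024-01-31.tar.gz), so that name order equals age order; that naming scheme and the function name are hypothetical, not taken from the real backup script.

```shell
#!/bin/sh
# Sketch: keep only the newest KEEP files matching DIR/PREFIX*.
# Assumption: file names embed an ISO date, so name order equals age order.
set -eu

prune_keep_newest() {
  dir=$1
  prefix=$2
  keep=$3
  set -- "$dir"/"$prefix"*      # glob expands in ascending name order
  [ -e "$1" ] || return 0       # no matches: the pattern stays literal
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"               # oldest remaining file
    shift
  done
}
```

With that naming scheme, the policy above would be prune_keep_newest <backup-dir> forgejo-daily- 14 followed by prune_keep_newest <backup-dir> forgejo-weekly- 8.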

The backup script excludes data/forgejo/{queues,indexers,tmp}: those directories regenerate on Forgejo restart and would only inflate the tarball.
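
The exclusion can be expressed directly with tar's --exclude option. This is a sketch of the archive step only, with a hypothetical function name; the Postgres dump and the actual script layout are not covered here.

```shell
#!/bin/sh
# Sketch: archive data/forgejo/ while skipping the regenerable directories.
set -eu

forgejo_tar() {
  root=$1   # directory that contains data/forgejo/
  out=$2    # tarball path; must not live inside data/forgejo/
  tar -C "$root" \
    --exclude='data/forgejo/queues' \
    --exclude='data/forgejo/indexers' \
    --exclude='data/forgejo/tmp' \
    -czf "$out" data/forgejo
}
```

Listing the result with tar -tzf <tarball> is a quick way to confirm that none of the three directories made it into the archive.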

What recovery looks like

For a “the SSD died” scenario:

  1. Provision a new Ubuntu host and copy over the kubeadm bootstrap script.
  2. Run kubeadm init with the same pod CIDR.
  3. Apply the bootstrap Secret (ArgoCD + KSOPS age key) and the ArgoCD Helm release.
  4. ArgoCD self-applies, then pulls every Application; the cluster reconstructs itself from Git.
  5. Restore the most recent backup with velero restore create --from-backup <name>.
  6. Bring Forgejo’s docker-compose stack back up and restore the Postgres dump + data/forgejo/.
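
The command-line parts of these steps can be laid out as a dry-run script. This is a sketch under several assumptions, none of which come from the real bootstrap script: the bootstrap Secret file name, the argocd namespace and release name, the upstream argo-helm chart repository, and the compose file path. POD_CIDR must be supplied by the caller, because the original value is not stated here. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Recovery sketch for the "SSD died" case. File names, namespace, and chart
# source below are assumptions for illustration.
set -eu

BOOTSTRAP_SECRET="${BOOTSTRAP_SECRET:-bootstrap-secret.yaml}"  # hypothetical
FORGEJO_COMPOSE="${FORGEJO_COMPOSE:-docker-compose.yml}"       # hypothetical

# Print the command when DRY_RUN=1 (the default), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover() {
  backup=$1
  # Step 2: same pod CIDR as the old cluster.
  run sudo kubeadm init \
    --pod-network-cidr="${POD_CIDR:?set to the original pod CIDR}"
  # Step 3: assumes the bootstrap Secret targets the argocd namespace.
  run kubectl create namespace argocd
  run kubectl apply -f "$BOOTSTRAP_SECRET"
  run helm repo add argo https://argoproj.github.io/argo-helm
  run helm upgrade --install argocd argo/argo-cd --namespace argocd
  # Step 4 is waiting: ArgoCD syncs every Application, including Velero.
  # Step 5: only possible once Velero is back and sees the backup location.
  run velero restore create --from-backup "$backup" --wait
  # Step 6: the Postgres and data/forgejo/ restore is manual and not shown.
  run docker compose -f "$FORGEJO_COMPOSE" up -d
}
```

Running POD_CIDR=<cidr> recover <backup-name> with the default DRY_RUN=1 prints the sequence for review without touching the host.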

The drill takes ~45 minutes on tested hardware. The first 30 are mostly waiting for ArgoCD’s sync loops; the last 15 are Forgejo + Postgres restore.