Velero · Cluster Backup and Restore

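Velero is this cluster's backup and restore component. It runs in the velero namespace, and the README places it in the monitoring deployment batch. It consists of a single-replica server Deployment (which orchestrates backups and restores and runs the Schedules) and a node-agent DaemonSet covering every node (which uses kopia for file-level volume data upload). Together they back up the Kubernetes resources and volume data of a set of namespaces, one namespace at a time on a schedule, to the velero-backups bucket in the in-cluster MinIO.

Object storage is the in-cluster MinIO (S3-compatible, provider: aws), not real AWS. All credentials are pulled from Vault through ExternalSecret; nothing is stored in plaintext in git.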

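Deployment layout

This directory is not a kustomize root (it has no kustomization.yaml); image versions are written directly in each container's image field. The velero namespace is labeled pod-security.kubernetes.io/enforce: privileged, because node-agent needs hostPath access to the host's pod directory.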
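
As a minimal sketch, the namespace described above could be declared as follows. Only the name and the Pod Security label come from this page; any other labels or annotations in the real manifest are not shown here.

```yaml
# Sketch only: name and enforce label are taken from this page.
apiVersion: v1
kind: Namespace
metadata:
  name: velero
  labels:
    # privileged is required because node-agent mounts a hostPath volume
    pod-security.kubernetes.io/enforce: privileged
```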

Name	Description
Namespace	velero, with a ResourceQuota (at most 15 pods / requests 2 CPU·3Gi / limits 6 CPU·8Gi) and both container-level and Pod-level LimitRange.
server	Deployment, single replica, image velero/velero:v1.18.1, started with server --uploader-type=kopia. An initContainer, velero/velero-plugin-for-aws:v1.14.1, installs the AWS object storage plugin into the shared plugins volume. requests 150m·128Mi / limits 500m·512Mi. Metrics port :8085.
node-agent	DaemonSet, one pod per node, same image velero/velero:v1.18.1, started with node-agent server, runAsUser: 0. It mounts the host's /var/lib/kubelet/pods at /host_pods with HostToContainer mount propagation and handles file-level volume data upload. requests 100m·128Mi / limits 500m·512Mi. Update strategy maxSurge: 0 / maxUnavailable: 1.
Scheduling	The server uses nodeSelector: workload=infra, which places it on worker3 (the README records that infra components have used workload: infra uniformly since 2026-06, PR #888). node-agent is a DaemonSet with no nodeSelector, so it covers all nodes.
RBAC	ServiceAccount velero is bound to the cluster-admin ClusterRole (backing up and restoring resources in arbitrary namespaces requires cluster-level permissions).
DNS	Both server and node-agent set dnsConfig explicitly, with the nameserver pointing to 10.43.0.10 (the cluster's CoreDNS).
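
The node-agent settings listed above can be sketched as a DaemonSet fragment. This is an illustrative reconstruction, not the manifest from the repository: the pod label name: node-agent, the volume name host-pods, the serviceAccountName, and the placement of runAsUser at container level are assumptions, and env vars and credential mounts are omitted.

```yaml
# Sketch only: values marked "assumed" do not come from this page.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
  namespace: velero
spec:
  selector:
    matchLabels:
      name: node-agent              # assumed label
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1             # replace one node's agent at a time
  template:
    metadata:
      labels:
        name: node-agent            # assumed label
    spec:
      serviceAccountName: velero    # assumed; the page only names the ServiceAccount
      dnsConfig:
        nameservers:
          - 10.43.0.10              # cluster CoreDNS
      containers:
        - name: node-agent
          image: velero/velero:v1.18.1
          command: ["/velero"]
          args: ["node-agent", "server"]
          securityContext:
            runAsUser: 0            # assumed to be set at container level
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits: { cpu: 500m, memory: 512Mi }
          volumeMounts:
            - name: host-pods       # assumed volume name
              mountPath: /host_pods
              mountPropagation: HostToContainer
      volumes:
        - name: host-pods
          hostPath:
            path: /var/lib/kubelet/pods
```

HostToContainer propagation lets mounts that the kubelet creates on the host after the pod starts become visible under /host_pods, which is what allows the agent to read volumes of pods scheduled later.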

Configuration and dependencies

Velero does not depend on a standalone ConfigMap; its behavior is driven by CRs (BackupStorageLocation/Schedule/VolumeSnapshotLocation) and environment variables.

Credentials (ExternalSecret): cloud-credentials pulls AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/cloud from Vault velero/cloud-credentials (cloud is the credentials file mounted at /credentials/cloud, which AWS_SHARED_CREDENTIALS_FILE and similar environment variables point to); velero-repo-credentials pulls repository-password (the kopia repository password) from Vault velero/repo-credentials. Both use the ClusterSecretStore vault-backend with a 1h refresh interval.
Object storage dependency (MinIO): every BackupStorageLocation uses provider: aws and s3ForcePathStyle: "true", endpoint http://minio.minio.svc.cluster.local:9000, publicUrl: https://minio-api.yldm.tech, region us-east-1, and the shared bucket velero-backups, with one prefix per namespace.
Volume snapshots (VolumeSnapshotLocation): two are defined, default (provider: csi, CSI snapshots) and restic (provider: velero.io/restic, as a fallback).
Key env: the server sets AWS_REGION/AWS_DEFAULT_REGION=us-east-1, VELERO_SCRATCH_DIR=/scratch, LD_LIBRARY_PATH=/plugins, and AWS_SDK_GO_LOG_LEVEL=debug, and also points GOOGLE_APPLICATION_CREDENTIALS/AZURE_CREDENTIALS_FILE/ALIBABA_CLOUD_CREDENTIALS_FILE at the same /credentials/cloud file.
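
To make the credential and storage wiring above concrete, here is a hedged sketch of one ExternalSecret and one BackupStorageLocation. Several details are assumptions rather than facts from this page: the external-secrets.io/v1beta1 API version, the Vault property names under remoteRef, and the BackupStorageLocation name and prefix value (production is used purely as an example). The endpoint described above is written as s3Url, which is the config key the Velero AWS plugin uses for a custom S3 endpoint.

```yaml
# Sketch only: apiVersion, remoteRef properties, and the BSL name/prefix are assumed.
apiVersion: external-secrets.io/v1beta1   # assumed API version
kind: ExternalSecret
metadata:
  name: cloud-credentials
  namespace: velero
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: cloud-credentials
  data:
    - secretKey: AWS_ACCESS_KEY_ID
      remoteRef:
        key: velero/cloud-credentials
        property: AWS_ACCESS_KEY_ID       # assumed property name
    - secretKey: AWS_SECRET_ACCESS_KEY
      remoteRef:
        key: velero/cloud-credentials
        property: AWS_SECRET_ACCESS_KEY   # assumed property name
    - secretKey: cloud                    # file later mounted at /credentials/cloud
      remoteRef:
        key: velero/cloud-credentials
        property: cloud                   # assumed property name
---
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: production                        # assumed name, one BSL per namespace
  namespace: velero
spec:
  provider: aws                           # S3-compatible MinIO, not real AWS
  objectStorage:
    bucket: velero-backups
    prefix: production                    # assumed prefix value
  config:
    region: us-east-1
    s3ForcePathStyle: "true"
    s3Url: http://minio.minio.svc.cluster.local:9000
    publicUrl: https://minio-api.yldm.tech
```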

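Backup storage and schedules

Backups map one-to-one to namespaces: every protected namespace has its own BackupStorageLocation (same bucket, different prefix) and its own Schedule. There are 18 namespaces in total; the schedules start at 01:00 and are staggered at half-hour intervals so that they do not run at the same time. Every Schedule sets includeClusterResources: false, so only namespaced resources are backed up.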

Namespace	Backup time (cron)	Retention (ttl)	Volume snapshots
production	`0 1 * * *`	720h (30 days)	Yes
vault	`30 1 * * *`	720h	Yes
grafana	`0 2 * * *`	168h (7 days)	Yes
prometheus	`30 2 * * *`	168h	Yes
loki	`0 3 * * *`	168h	Yes
argocd	`30 3 * * *`	720h	Yes
external-secrets	`0 4 * * *`	720h	Yes
minio	`30 4 * * *`	720h	Yes
velero	`0 5 * * *`	720h	Yes
cert-manager	`0 6 * * *`	720h	No
dex	`30 6 * * *`	720h	No
cloudflare-tunnel	`0 7 * * *`	168h	No
external-dns	`30 7 * * *`	168h	No
kubernetes-dashboard	`0 8 * * *`	168h	No
kyverno	`30 8 * * *`	720h	No
metallb-system	`0 9 * * *`	720h	No
nfs-provisioner	`30 9 * * *`	720h	No

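Note: the "Volume snapshots" column corresponds to the Schedule's snapshotVolumes field. It is enabled for namespaces with persistent data (the first 9) and disabled for pure configuration and control-plane components. For operational details, see docs/backup-and-restore-runbook.md in the repository.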
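
One row of the table can be sketched as a Schedule resource. The example below uses the production row; the Schedule name and the storageLocation name are assumptions, since this page does not give the actual resource names.

```yaml
# Sketch only: metadata.name and storageLocation are assumed names.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production                 # assumed name
  namespace: velero
spec:
  schedule: "0 1 * * *"            # daily at 01:00
  template:
    includedNamespaces:
      - production
    includeClusterResources: false # namespaced resources only
    snapshotVolumes: true          # "Yes" in the volume snapshots column
    storageLocation: production    # assumed BackupStorageLocation name
    ttl: 720h                      # 30 days
```

For a namespace in the lower half of the table, the same shape would apply with snapshotVolumes: false and the row's own cron expression and ttl.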

Access and monitoring

Velero has no external Ingress; operations go through the velero CLI or kubectl. It is an internal infrastructure component.

Metrics Service: velero-metrics (ClusterIP), selector: deploy=velero, exposes :8085. The server pod also carries the prometheus.io/scrape annotation, pointing to :8085/metrics. The directory contains no ServiceMonitor/PrometheusRule/HPA/VPA/PDB.
Network policy: the namespace denies ingress by default and allows traffic as needed: allow-minio-access (egress to the minio namespace on :9000/:9001), allow-prometheus-scraping (ingress from the prometheus namespace on :8085), allow-kube-api-access, allow-dns-egress, allow-external-egress, and allow-same-namespace.
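
The metrics Service and the scraping policy described above might look roughly like this. The port name, the policy's podSelector, and the use of the kubernetes.io/metadata.name label to match the prometheus namespace are assumptions; the real policy may select the namespace through a different label.

```yaml
# Sketch only: port name, podSelector, and namespace label are assumed.
apiVersion: v1
kind: Service
metadata:
  name: velero-metrics
  namespace: velero
spec:
  type: ClusterIP
  selector:
    deploy: velero
  ports:
    - name: metrics                # assumed port name
      port: 8085
      targetPort: 8085
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scraping
  namespace: velero
spec:
  podSelector:
    matchLabels:
      deploy: velero               # assumed; limits the rule to the server pod
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: prometheus   # assumed selector
      ports:
        - protocol: TCP
          port: 8085
```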
