infra: weekly docker + gitea-archive prune cron on the CI droplet

- 2026-06-10 incident: 48G disk hit 100% — pipelines died w. ENOSPC in
  the UT/IT dependency install + the Woodpecker repo UI 500'd (SQLite
  couldn't write; the repos LIST still rendered, pure read). Culprits:
  ~15G stale CI images/build cache + 19G of gitea repo-archive cache
  (crawlers on the public instance generate zip/tar snapshots faster
  than gitea's @midnight archive_cleanup reclaims them)
- cicd/docker-prune.sh → /etc/cron.weekly/docker-prune: image/builder/container
  prune of anything older than 168h (one week), plus a repo-archive cache
  wipe. Also installed live via ssh, since running the full playbook
  clobbers the droplet's nginx config (known template bug)
- gitea container restarted to unwedge the post-receive queues the
  disk-full era jammed (pushes landed in the bare repo but the UI +
  Woodpecker webhook never heard about them)
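
The shell sketch below could help confirm the two culprits on the droplet before and after a prune. It is a minimal sketch that assumes POSIX `df -P`/`du -sk` and, optionally, the docker CLI. The helper names and the `ARCHIVE_DIR` override are hypothetical. The default path is the one the prune script wipes.

```sh
#!/bin/sh
# Hypothetical helpers for sizing up the incident's two culprits.

# Percent of a filesystem in use, as a bare integer. `df -P` (POSIX output)
# keeps each filesystem on one line, so Use% is always field 5 of line 2.
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Size of a directory in KiB via POSIX `du -sk`; prints 0 if it is absent.
dir_kib() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{ print $1 }'
    else
        echo 0
    fi
}

# Print the root filesystem fill level, the repo-archive cache size and,
# when the docker CLI is installed, docker's own image/build-cache summary.
# ARCHIVE_DIR is a hypothetical override for the cache path.
culprit_report() {
    archive_dir=${ARCHIVE_DIR:-/opt/cicd/data/gitea/gitea/repo-archive}
    echo "root fs used: $(disk_used_pct /)%"
    echo "repo-archive cache: $(dir_kib "$archive_dir") KiB"
    if command -v docker >/dev/null 2>&1; then
        docker system df || echo "docker: daemon not reachable"
    fi
}
```

Running `culprit_report` (likely as root, for the docker socket and the Gitea data dir) before and after `/etc/cron.weekly/docker-prune` shows roughly what the prune reclaimed.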

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Disco DeDisco
2026-06-10 13:12:42 -04:00
parent 8130122b1b
commit 1536feaf21
2 changed files with 31 additions and 0 deletions

@@ -101,6 +101,16 @@
    creates: /etc/letsencrypt/live/gitea.earthmanrpg.me/fullchain.pem
  become: true

# Weekly docker + gitea-archive-cache prune: the 2026-06-10 disk-full
# incident (48G/48G: pipelines ENOSPC'd, Woodpecker repo UI 500'd) was
# ~15G of stale CI images + 19G of crawler-generated repo archives.
- name: Install weekly docker/archive prune cron
  ansible.builtin.copy:
    src: cicd/docker-prune.sh
    dest: /etc/cron.weekly/docker-prune
    mode: "0755"
  become: true

- name: Run docker compose -f /opt/cicd/docker-compose.yaml up -d
  ansible.builtin.command:
    cmd: docker compose -f /opt/cicd/docker-compose.yaml up -d
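
The copy task installs the script as `docker-prune`, without the source file's `.sh` suffix. This probably matters if the droplet runs Debian or Ubuntu, which is an assumption. On those systems cron runs `/etc/cron.weekly` through `run-parts`, which by default skips any file whose name contains characters other than ASCII letters, digits, underscores and hyphens. A file named `docker-prune.sh` would therefore never fire. The sketch below reproduces that default name check; the function name is hypothetical.

```sh
#!/bin/sh
# Hypothetical re-implementation of run-parts' default filename rule:
# only ASCII letters, digits, '_' and '-' are allowed, and the name must
# be non-empty. "docker-prune" passes; "docker-prune.sh" is skipped.
runparts_name_ok() {
    case $1 in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

On the droplet itself, `run-parts --test /etc/cron.weekly` lists the scripts that would run without executing them. That gives a quick check that the install was picked up.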

@@ -0,0 +1,21 @@
#!/bin/sh
# Weekly CI-droplet hygiene — installed to /etc/cron.weekly/docker-prune by
# cicd-playbook.yaml.
#
# Every Woodpecker pipeline leaves freshly built/pulled image layers behind
# (~1GB+/month); left unchecked, they filled the 48G disk on 2026-06-10:
# pipelines died with "no space left on device" and the Woodpecker repo UI
# 500'd (SQLite could no longer write). until=168h keeps anything newer than
# a week; for images and containers "newer" means created, not last used, so
# an older image goes once no container references it. That is harmless here:
# the python-tdd-ci image re-pulls from the LOCAL Gitea registry in seconds
# if it gets pruned between pipelines. Containers are pruned first so images
# held only by old stopped containers are free for the image prune.
docker container prune -f --filter "until=168h"
docker image prune -af --filter "until=168h"
docker builder prune -af --filter "until=168h"
# Gitea's repo-archive cache (zip/tar snapshots, regenerated on demand) grew
# to 19G in the same incident — crawlers hitting the public instance's
# download links generate archives faster than gitea's own @midnight
# archive_cleanup cron reclaims them. It is pure cache; wipe it weekly.
rm -rf /opt/cicd/data/gitea/gitea/repo-archive/* 2>/dev/null
exit 0
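
The cron output or a log could show how much the archive wipe reclaimed. For that, the `rm -rf` line could be replaced with a function along the following lines. This is a sketch, not the committed script. `wipe_archive_cache` and the `ARCHIVE_DIR` override are hypothetical. It assumes a `find` with `-mindepth` and `-delete` (GNU or BusyBox), which empties the directory, dotfiles included, while keeping the directory itself.

```sh
#!/bin/sh
# Hypothetical variant of the weekly repo-archive wipe that logs freed space.
wipe_archive_cache() {
    dir=${ARCHIVE_DIR:-/opt/cicd/data/gitea/gitea/repo-archive}
    # Nothing to do if the cache directory does not exist (yet).
    [ -d "$dir" ] || return 0
    before=$(du -sk "$dir" | awk '{ print $1 }')
    # Delete the directory's contents but keep the directory itself, so
    # Gitea's configured archive storage path stays in place.
    find "$dir" -mindepth 1 -delete
    after=$(du -sk "$dir" | awk '{ print $1 }')
    echo "repo-archive: freed $((before - after)) KiB"
}
```

The script's trailing `exit 0` would still hide a failed `find` from run-parts, just as `2>/dev/null` plus `exit 0` does for the original `rm -rf`.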