gocached: bound WAL size and expose db/wal size gauges #35

Merged: bradfitz merged 1 commit into main from bradfitz/wal, Jun 1, 2026

bradfitz (Owner) commented Jun 1, 2026

One of our cigocached servers' WAL had grown to 46 GB over ~87 days of
uptime, with walFindFrame pegging CPU and stalling go-cacher clients on
the CI runners. A CI vet job that normally runs in ~100s on a laptop
was sitting at 30+ minutes.

SQLite's autocheckpoint runs only PASSIVE checkpoints, which reuse WAL
space in place but never shrink the file on disk.
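
For a sense of scale, the sketch below estimates how many frames a WAL of a given size holds. The 32-byte WAL header and 24-byte frame header come from the SQLite file format; the 4096-byte page size and reading "46 GB" as 46e9 bytes are assumptions, not something this PR states. The helper name walFrameCount is made up for illustration.

```go
package main

// Sizes from the SQLite WAL file format.
const (
	walHeaderBytes   = 32 // fixed header at the start of the -wal file
	frameHeaderBytes = 24 // header that precedes each page image in the WAL
)

// walFrameCount estimates how many complete frames fit in a WAL file of
// walBytes, given the database page size. It returns 0 for an empty or
// header-only WAL, or for a non-positive page size.
func walFrameCount(walBytes, pageSize int64) int64 {
	if pageSize <= 0 || walBytes <= walHeaderBytes {
		return 0
	}
	return (walBytes - walHeaderBytes) / (frameHeaderBytes + pageSize)
}
```

Under those assumptions, walFrameCount(46_000_000_000, 4096) comes out at roughly 11 million frames, which is the index that readers have to search through while the WAL stays that large.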

Fix:

  • Set journal_size_limit=1 GiB per connection so even passive
    checkpoints will truncate down to that bound when they can.
  • Run wal_checkpoint(TRUNCATE) once at startup (this is what shrinks
    the existing 46 GB WAL on the running server after deploy) and
    periodically (every minute) from a background goroutine, logging
    any partial checkpoints that hint at a long-running reader.
  • Run a final TRUNCATE checkpoint on Close.
  • Bump modernc.org/sqlite v1.45.0 -> v1.51.0 for general fixes
    accumulated since February.
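
The steps above could look roughly like the following. This is a minimal sketch using only database/sql, not the code in this PR: the names journalSizeLimitSQL, checkpointResult, truncateCheckpoint and checkpointLoop are hypothetical, and how the journal_size_limit pragma gets applied to every pooled connection depends on the driver (a DSN option or a connection hook), which is not shown here.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"strconv"
	"time"
)

// journalSizeLimitBytes is the 1 GiB bound named in the PR description.
const journalSizeLimitBytes = 1 << 30

// journalSizeLimitSQL returns the pragma that has to run on each connection.
func journalSizeLimitSQL(limit int64) string {
	return "PRAGMA journal_size_limit = " + strconv.FormatInt(limit, 10)
}

// checkpointResult mirrors the three columns returned by PRAGMA wal_checkpoint.
type checkpointResult struct {
	Busy         int // 1 if the checkpoint was blocked from completing
	LogFrames    int // frames in the WAL
	Checkpointed int // frames moved back into the main database file
}

// partial reports whether the checkpoint left frames behind, which can hint
// at a long-running reader. A fully successful TRUNCATE reports 0, 0, 0.
func (r checkpointResult) partial() bool {
	return r.Busy != 0 || r.Checkpointed < r.LogFrames
}

// truncateCheckpoint runs one TRUNCATE checkpoint and returns its result row.
func truncateCheckpoint(ctx context.Context, db *sql.DB) (checkpointResult, error) {
	var r checkpointResult
	err := db.QueryRowContext(ctx, "PRAGMA wal_checkpoint(TRUNCATE)").
		Scan(&r.Busy, &r.LogFrames, &r.Checkpointed)
	return r, err
}

// checkpointLoop checkpoints every interval until ctx is cancelled,
// logging errors and partial checkpoints.
func checkpointLoop(ctx context.Context, db *sql.DB, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			r, err := truncateCheckpoint(ctx, db)
			if err != nil {
				log.Printf("wal checkpoint: %v", err)
				continue
			}
			if r.partial() {
				log.Printf("partial wal checkpoint: busy=%d log=%d checkpointed=%d",
					r.Busy, r.LogFrames, r.Checkpointed)
			}
		}
	}
}
```

In this sketch a caller would run truncateCheckpoint once at startup, start checkpointLoop(ctx, db, time.Minute) in a goroutine, and call truncateCheckpoint once more before closing the database.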

Also add two new gauges for ongoing visibility:

    gocached_sqlite_data_bytes  size of the main .db file
    gocached_sqlite_wal_bytes   size of the .db-wal file

Updates tailscale/corp#42670

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>

bradfitz requested a review from tomhjp June 1, 2026 22:10

bradfitz (Owner, Author) commented Jun 1, 2026

cc @tomhjp for post-submit review

bradfitz merged commit 097527a into main Jun 1, 2026
3 checks passed

tomhjp (Collaborator) commented Jun 2, 2026

Thanks for documenting everything inline, I learnt a lot reviewing this. LGTM
