gocached: bound WAL size and expose db/wal size gauges #35
Merged
One of our cigocached servers' WAL had grown to 46 GB over ~87 days of
uptime, with walFindFrame pegging CPU and stalling go-cacher clients on
the CI runners. A CI vet job that normally runs in ~100s on a laptop
was sitting at 30+ minutes.
SQLite's autocheckpoint runs only PASSIVE checkpoints, which reuse WAL
space in place but never shrink the file on disk.
Fix:
- Set journal_size_limit=1 GiB per connection so even passive
  checkpoints will truncate down to that bound when they can.
- Run wal_checkpoint(TRUNCATE) once at startup (this is what shrinks
  the existing 46 GB WAL on the running server after deploy) and
  periodically (every minute) from a background goroutine, logging
  any partial checkpoints that hint at a long-running reader.
- Run a final TRUNCATE checkpoint on Close.
- Bump modernc.org/sqlite v1.45.0 -> v1.51.0 for general fixes
  accumulated since February.
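
For readers who have not seen the diff, the first three items above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: the names `setJournalSizeLimit`, `truncateWAL`, `checkpointLoop`, and `checkpointResult` are hypothetical, and it assumes a `database/sql` handle opened elsewhere with a SQLite driver. The three integers scanned are the (busy, log, checkpointed) row that SQLite documents for `PRAGMA wal_checkpoint`.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"time"
)

// journalSizeLimit is the 1 GiB bound named in the description.
const journalSizeLimit = 1 << 30

// checkpointResult mirrors the single row returned by PRAGMA wal_checkpoint.
type checkpointResult struct {
	Busy         int // non-zero if the checkpoint was blocked from completing
	LogFrames    int // frames in the WAL
	Checkpointed int // frames copied back into the main database file
}

// partial reports whether the checkpoint left WAL frames behind,
// which can hint at a long-running reader.
func (r checkpointResult) partial() bool {
	return r.Busy != 0 || r.Checkpointed < r.LogFrames
}

// setJournalSizeLimit applies the bound to one connection (hypothetical helper).
func setJournalSizeLimit(ctx context.Context, conn *sql.Conn) error {
	_, err := conn.ExecContext(ctx, fmt.Sprintf("PRAGMA journal_size_limit = %d", journalSizeLimit))
	return err
}

// truncateWAL runs one TRUNCATE checkpoint and returns SQLite's result row.
func truncateWAL(ctx context.Context, db *sql.DB) (checkpointResult, error) {
	var r checkpointResult
	err := db.QueryRowContext(ctx, "PRAGMA wal_checkpoint(TRUNCATE)").
		Scan(&r.Busy, &r.LogFrames, &r.Checkpointed)
	return r, err
}

// checkpointLoop checkpoints every interval until ctx is cancelled,
// logging errors and partial checkpoints.
func checkpointLoop(ctx context.Context, db *sql.DB, every time.Duration) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			r, err := truncateWAL(ctx, db)
			if err != nil {
				log.Printf("wal checkpoint: %v", err)
			} else if r.partial() {
				log.Printf("partial wal checkpoint: %+v", r)
			}
		}
	}
}

func main() {
	// No driver is imported here, so only the pure helper is exercised.
	fmt.Println(checkpointResult{Busy: 1, LogFrames: 10, Checkpointed: 4}.partial())
	fmt.Println(checkpointResult{}.partial())
}
```

Under these assumptions a server would call `truncateWAL` once at startup, start `go checkpointLoop(ctx, db, time.Minute)`, and call `truncateWAL` one last time before closing the database.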
Also add two new gauges for ongoing visibility:

    gocached_sqlite_data_bytes  size of the main .db file
    gocached_sqlite_wal_bytes   size of the .db-wal file
Updates tailscale/corp#42670
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
cc @tomhjp for post-submit review
Thanks for documenting everything inline, I learnt a lot reviewing this. LGTM