gocached: do per-shard usage stats and incremental cleanup#36
Conversation
tomhjp
left a comment
There was a problem hiding this comment.
I have to admit I got a bit fatigued by the time I got to the tests, so I haven't done much review there. Overall though I like the shape of the changes to stats tracking + eviction
| // runShardStatsLoop keeps usage stats fresh by rescanning the | ||
| // oldest-scanned shard whenever it crosses shardStalenessTarget. On a quiet | ||
| // cache the goroutine sleeps for long stretches instead of spinning every | ||
| // few seconds; on a busy cache the loop is effectively rate-limited by | ||
| // shardStatsMinInterval since each scan resets the oldest age to zero. |
There was a problem hiding this comment.
Is this comment accurate? I don't see any reason why the cache being busy would make us scan more.
There was a problem hiding this comment.
yeah, not accurate. rewording.
|
|
||
| // shardStatsMu guards shardStats and serializes recomputeAggregateLocked. | ||
| shardStatsMu sync.Mutex | ||
| shardStats map[string]*shardSnapshot // keyed by hex prefix; e.g. "00".."ff" |
There was a problem hiding this comment.
nit: It would be nice to have a tiny type for shard prefix
| type shardDelta struct { | ||
| count atomic.Int64 | ||
| bytes atomic.Int64 | ||
| } |
There was a problem hiding this comment.
It seems like an easy win for consistency to make the fields plain ints and then atomically load a whole shardDelta value. Otherwise the values can drift arbitrarily far apart from each other over time.
There was a problem hiding this comment.
I just put a little mutex around these and made them plain ints
| if preCount != 0 { | ||
| srv.shardDeltas[idx].count.Add(-preCount) | ||
| } | ||
| if preBytes != 0 { | ||
| srv.shardDeltas[idx].bytes.Add(-preBytes) | ||
| } |
There was a problem hiding this comment.
Wouldn't it be more correct to decrement by the value in stats? This seems approximately correct but could drift over time.
There was a problem hiding this comment.
done. changed to decrement.
| // move on. No delta, no Blob row, no file. | ||
| continue | ||
| } | ||
| // Each Action contributes (1, storedSize) to the Blobs⋈Actions |
There was a problem hiding this comment.
Oh is the symbol for inner join, or left join? Let's be more explicit to prevent confusion
The usageStats() LEFT JOIN every 5 minutes didn't scale. At 272M
Actions it took ~7 minutes per call, pinning a reader snapshot ~60% of
the time. The WAL grew to 52 GB because SQLite's PASSIVE
autocheckpoint could never advance past the snapshot; walFindFrame
pegged CPU and go-cacher clients stalled. Startup also blocked on
usageStats + cleanOldObjects for 10+ minutes. cleanCandidates was the
same shape: GROUP BY scaling with total rows.
This redoes the usage stats + LRU cleanup pipeline.
Shard the histogram, persist it, refresh in the background:
* New BlobShardStats(Prefix, ScannedAt, StatsJSON) table.
* 256 shards keyed on a SHA256 hex prefix (--shard-prefix-len=1..4).
* runShardStatsLoop picks the oldest shard, sleeps until it crosses
shardStalenessTarget (10m), then rescans.
* Per-shard scan is one SUM(CASE WHEN…) query scoped by
SHA256 >= ? AND SHA256 < ?. No row streaming.
* On restart, loadShardStats seeds lastUsage from the persisted rows.
Server starts up to usable right away without a blocking step.
Action-LRU cleanup driven by idx_actions_access:
* evictOldestActions walks the access-time index for the N oldest
stale Actions, deletes them, orphan-deletes Blobs.
* INDEXED BY locks the plan; TestEvictionQueryPlan{,_atScale} asserts
via EXPLAIN QUERY PLAN.
* Action-LRU instead of Blob-LRU: a stale Action is evicted even if
its Blob has fresher siblings. Equivalent for the common 1:1 case;
stricter LRU on shared content.
* One short Tx per 200-row batch on a 1s tick. No multi-minute reader
snapshots.
Dead-reckon PUTs and evictions between scans:
* Per-shard mutex-guarded counters track (count, bytes) added or
removed since the shard was last persisted.
* cleanupTick includes the delta in the size pressure check so a
burst of PUTs between scans triggers cleanup immediately.
Observability: /usage shows cohort table, shard freshness histogram,
per-shard scan duration p25/p50/p90; new gauges
gocached_shard_stats_{unscanned,oldest_age_seconds}, pending blob
count/bytes; new histogram gocached_shard_scan_duration_seconds
replayed from persisted data at startup; new counter
gocached_evicted_actions.
Schema migration is additive (schemaVersion stays 4).
Perf test (TestPerfQueries) always runs but defaults to 100 rows so
normal `go test` is sub-second. Operators set --performance-test-rows
larger for real scale; seed reused across runs via
--performance-test-dir. Sample, cumulative across rows:
rows DB scanShard evict legacy GROUP BY ratio usageStats
1K 396K 188µs 5.2ms 648µs 0.1x 321ns
10K 3.4M 248µs 5.5ms 6.4ms 1.2x 227ns
100K 35M 1.8ms 6.1ms 73ms 12x 328ns
1M 349M 22ms 7.5ms 763ms 102x 231ns
2M 698M 51ms 8.2ms 1.6s 196x 261ns
5M 1.8G 162ms 8.7ms 4.1s 475x 224ns
10M 3.5G 349ms 9.0ms 8.3s 922x 249ns
usageStats and evictOldestActions are flat in N. scanShard scales with
per-shard rows. The "legacy GROUP BY" column is the cleanCandidates
query gocached used before this branch (since deleted); it scales
linearly with total rows, the behavior that was wedging prod at
250M.
Updates tailscale/corp#42670
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
This fixes performance problems we hit in with gocached's stats + LRU cleanup queries.
Three problems previously:
usageStats+cleanOldObjectsfor over 10 min before theserver began accepting requests.
usageStatspinned a reader snapshot ~60% of the time, so SQLite's PASSIVEautocheckpoint could never advance. The WAL grew to 52 GB.
walFindFramepegged CPU and
go-cacherclients stalled.cleanCandidatesGROUP BYscaled with TOTAL rows; size-pressure cleanupwas effectively unbounded work, and the same snapshot-pinning issue.
Instead, share the keyspace (by default: into 256 shards, by two hex digit prefix) and compute usage stats by shard. And store those stats (+ computation time) in the DB itself, so the server starts up quickly after a restart, without blocking.
Second: change the blob cleanup query to use an index, and switch from LRU by blobs to LRU by actions, which can delete blobs when the refcount drops to zero.
Then add more metrics.
No schema version bump; just one new table that's created if it doesn't already exist.
Updates tailscale/corp#42670