112 changes: 112 additions & 0 deletions docs/design/ADVISOR_AUDIT.md
@@ -0,0 +1,112 @@
# Design: Advisor decision and audit trail

Issue: #153. Decisions: ADR-0013 (advisor default posture is shadow/off).
Related: ADVISOR_SAFETY.md (#91, the safety mechanism this records),
ADVISOR.md (#126, the controller that emits events), ADVISOR_PROMOTION.md
(#154, the gate whose verdicts are logged), OBSERVABILITY.md (#86/#152, the
INFO/metrics surfaces), CONFIG.md (#85, the versioned snapshot store).

## Goal and scope

The advisor retunes deterministic knobs (active eviction policy, sampled count,
LFU log-factor and decay, ghost size, slab/encoding/compression thresholds), so
an operator must be able to answer "what did it change, why, and did it help?"
after the fact. This spec owns the durable, tamper-evident decision/audit log and
its INFO + `/metrics` projection. It is the diagnostic backbone for the #91
rollback and the record of every #154 promotion verdict. In scope: the event
schema, durability and tamper-evidence, retention, the queryable surface, and
shadow-mode emission. Out of scope: the safety mechanism itself (#91), the
promotion decision (#154), and the metric-registry transport (#86/#152).

## Design

### What an event records

- One append-only record per advisor action and per safety event. A knob-change
record carries: monotonic snapshot version (from #91/#85), wall and logical
time, knob id, from-value, to-value, the triggering expert or objective delta
(which bandit/regret expert won and by how much [cacheus-experts]
[lecar-regret-minimization-smallcache]), the replay evidence that it beat the
static baseline (the #154 margin), and the seed. Safety records cover rollback
and kill-switch trips with cause (which metric regressed, by how much, over
which window). The objective the delta is measured against is hit ratio scored
off the hot path, never a per-request shadow simulation
[hit-ratio-can-hurt-throughput].
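The field list above can be read as a record schema. The following is a minimal sketch of that reading, not the project's actual type: every field name, the two class names, and the JSON canonical encoding are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class KnobChangeRecord:
    """One advisor action. All field names are hypothetical."""
    snapshot_version: int   # monotonic version from the snapshot store
    wall_time_ns: int
    logical_time: int
    knob_id: str
    from_value: float
    to_value: float
    trigger_expert: str     # which expert won
    objective_delta: float  # by how much it won
    replay_margin: float    # replay evidence against the static baseline
    seed: int


@dataclass(frozen=True)
class SafetyRecord:
    """One rollback or kill-switch trip, with its cause."""
    snapshot_version: int
    wall_time_ns: int
    logical_time: int
    kind: str               # "rollback" or "kill_switch"
    regressed_metric: str
    regression_amount: float
    window_seconds: float


def encode(record) -> bytes:
    """Canonical bytes: sorted keys and no whitespace, so equal records encode equally."""
    body = {"type": type(record).__name__, **asdict(record)}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


example = KnobChangeRecord(
    snapshot_version=42, wall_time_ns=1_700_000_000_000_000_000, logical_time=9001,
    knob_id="lfu_log_factor", from_value=10.0, to_value=12.0,
    trigger_expert="lfu", objective_delta=0.004, replay_margin=0.006, seed=7,
)
```

Freezing the dataclasses mirrors the append-only intent: a record is built once and then only encoded, never edited.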

### Tamper-evidence and durability

- The log is a hash-chained append-only journal: each record commits the prior
record's digest, so any edit or deletion in the middle breaks the chain and is
detectable on read. It is written through the same fail-closed io_uring write
path the persistence umbrella defines (PERSISTENCE.md, #58), not a side file, so
a crash cannot silently lose the tail. The chain is verified at boot, and a break
is surfaced as a distinct INFO field and metric rather than causing a panic.
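The chaining rule can be sketched in a few lines. This is an illustration under stated assumptions, not the journal's on-disk format: SHA-256, the JSON encoding, the all-zero genesis value, and the function names are all hypothetical choices.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed anchor for the first record


def record_digest(prev_digest: str, payload: dict) -> str:
    """Digest that commits both the payload and the prior record's digest."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + body).encode("utf-8")).hexdigest()


def append(journal: list, payload: dict) -> None:
    """Append a record chained to the current tail."""
    prev = journal[-1]["digest"] if journal else GENESIS
    journal.append({"prev": prev, "payload": payload,
                    "digest": record_digest(prev, payload)})


def first_break(journal: list):
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for index, entry in enumerate(journal):
        expected = record_digest(prev, entry["payload"])
        if entry["prev"] != prev or entry["digest"] != expected:
            return index
        prev = entry["digest"]
    return None
```

Returning an index instead of raising matches the stated behavior of surfacing a break as a field and metric. A chain on its own does not reveal a dropped tail, which is why the text relies on the fail-closed write path for that case.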

### Surfaced via INFO and /metrics

- Current advisor state lives in the native `# IronCache` INFO section (#152): the
posture (off/shadow/active per ADR-0013), the live snapshot version, the active
expert, the count of changes/rollbacks/kill-switch trips, and the last verdict.
The same counters are Prometheus series in the versioned registry (#152) under a
bounded label set (knob id from a fixed allow-list, no free-form cardinality).
The decision log is not a high-cardinality metric: `/metrics` exposes aggregate
counters and gauges, while the per-record detail is read through the query
surface, keeping the scrape cheap (the OBSERVABILITY.md cardinality rule).
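A bounded label set can be enforced at the point where a counter is incremented. The sketch below assumes a hypothetical metric name (`ironcache_advisor_knob_changes_total`), a hypothetical allow-list, and one possible policy for unknown knob ids (collapse them into a single `other` label); the spec itself only states that the allow-list is fixed.

```python
from collections import Counter

# Hypothetical allow-list; the real one would come from the knob registry.
KNOB_ALLOW_LIST = frozenset({
    "eviction_policy", "sampled_count", "lfu_log_factor", "lfu_decay", "ghost_size",
})
OTHER = "other"  # assumed catch-all so unknown ids cannot create new series


class AdvisorCounters:
    def __init__(self) -> None:
        self.changes = Counter()

    def record_change(self, knob_id: str) -> None:
        """Count a knob change under an allow-listed label only."""
        label = knob_id if knob_id in KNOB_ALLOW_LIST else OTHER
        self.changes[label] += 1

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        name = "ironcache_advisor_knob_changes_total"
        lines = [f"# TYPE {name} counter"]
        for label in sorted(self.changes):
            lines.append(f'{name}{{knob="{label}"}} {self.changes[label]}')
        return "\n".join(lines) + "\n"
```

With this shape the number of series is at most the allow-list size plus one, regardless of what knob ids a workload produces.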

### Queryable surface

- A read-only admin verb returns recent records filtered by knob, version range,
or event type (rollback/kill-switch/promotion), bounded in count like SLOWLOG.
Records are immutable; there is no mutating verb on the journal. The query path
is gated by the same auth posture as other introspection (MONITOR/metrics auth
decision, SECRETS.md #145), and any secret-bearing field is redacted there too.
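The filter semantics can be sketched independently of the verb's final name, which is still open. Everything named here is an assumption: the `query` function, its parameters, the `MAX_COUNT` cap, the newest-first ordering, and the `auth_token` field used to illustrate redaction.

```python
MAX_COUNT = 128                              # hypothetical hard cap on one reply
SECRET_FIELDS = frozenset({"auth_token"})    # hypothetical secret-bearing field


def query(journal, *, knob_id=None, min_version=None, max_version=None,
          event_type=None, count=10):
    """Return up to `count` matching records, newest first, as redacted copies."""
    limit = max(0, min(count, MAX_COUNT))
    results = []
    for record in reversed(journal):
        if len(results) >= limit:
            break
        if knob_id is not None and record.get("knob_id") != knob_id:
            continue
        version = record["snapshot_version"]
        if min_version is not None and version < min_version:
            continue
        if max_version is not None and version > max_version:
            continue
        if event_type is not None and record["event_type"] != event_type:
            continue
        # Build a new dict so the caller never holds a reference into the journal.
        results.append({key: ("<redacted>" if key in SECRET_FIELDS else value)
                        for key, value in record.items()})
    return results
```

Returning copies keeps the read path from becoming an accidental mutating path, and clamping `count` keeps a single call bounded even when the caller asks for everything.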

### Emitted even in shadow mode

- In shadow mode the advisor mutates nothing live (ADR-0013) yet records every
recommendation it would have applied, with the same schema and the would-be
from/to and replay evidence. This is the evidence the #90 headroom study and the
#154 gate consume to decide whether active tuning is ever justified
[wtinylfu-caffeine-sketch]: shadow logging is the safe first rung of the
off -> shadow -> active ladder, producing an auditable trail before any knob
moves.
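The posture ladder can be illustrated with a toy advisor. This is a sketch, not the controller: the `Advisor` class, the `applied` flag, and the choice that the `off` posture records nothing are assumptions made for the example.

```python
from enum import Enum


class Posture(Enum):
    OFF = "off"
    SHADOW = "shadow"
    ACTIVE = "active"


class Advisor:
    """Toy advisor: logs every recommendation, mutates knobs only when active."""

    def __init__(self, posture: Posture, knobs: dict) -> None:
        self.posture = posture
        self.knobs = dict(knobs)
        self.log = []

    def recommend(self, knob_id: str, to_value, replay_margin: float) -> None:
        if self.posture is Posture.OFF:
            return  # assumption: off emits no recommendations at all
        applied = self.posture is Posture.ACTIVE
        # Same schema in shadow and active; only the `applied` flag differs.
        self.log.append({
            "knob_id": knob_id,
            "from": self.knobs[knob_id],
            "to": to_value,
            "replay_margin": replay_margin,
            "applied": applied,
        })
        if applied:
            self.knobs[knob_id] = to_value
```

In shadow posture the log grows while `knobs` stays untouched, which is the property the acceptance hooks assert against ADR-0013.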

### Retention

- Retention is bounded and configurable: the journal keeps the last N records or
all records since the current snapshot version, whichever set is larger, so the full
causal history of the live config is always present even after the ring wraps.
Rollback and kill-switch records are retained at a higher floor than routine
knob changes, because they are the post-incident record. Eviction of old records
re-anchors the hash chain with a checkpoint digest so tamper-evidence survives
truncation.
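Truncation with a checkpoint digest can be sketched as follows. The sketch assumes SHA-256 chaining over JSON payloads, monotonic snapshot versions, and hypothetical function names; it omits the higher retention floor for safety records, which would need a policy for keeping non-contiguous records.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed anchor before any truncation has happened


def digest(prev: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def build(payloads) -> list:
    """Build a chained journal from payload dicts, starting at GENESIS."""
    journal, prev = [], GENESIS
    for payload in payloads:
        d = digest(prev, payload)
        journal.append({"prev": prev, "payload": payload, "digest": d})
        prev = d
    return journal


def truncate(anchor: str, journal: list, ring_size: int, current_version: int):
    """Keep the larger of: the last `ring_size` records, or every record at or
    after `current_version`. Returns (checkpoint_digest, kept_records)."""
    ring_start = max(0, len(journal) - ring_size)
    live_start = next((i for i, e in enumerate(journal)
                       if e["payload"]["snapshot_version"] >= current_version),
                      len(journal))
    start = min(ring_start, live_start)
    # The checkpoint is the digest of the last evicted record.
    checkpoint = journal[start - 1]["digest"] if start > 0 else anchor
    return checkpoint, journal[start:]


def verify(anchor: str, journal: list) -> bool:
    """Verify the chain starting from `anchor` instead of always from GENESIS."""
    prev = anchor
    for entry in journal:
        if entry["prev"] != prev or entry["digest"] != digest(prev, entry["payload"]):
            return False
        prev = entry["digest"]
    return True
```

Because both candidate sets are suffixes of the journal, taking the smaller start index yields the larger set. The kept suffix verifies against the checkpoint but not against `GENESIS`, so the checkpoint has to be stored wherever the verifier can trust it.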

## Open questions

- The admin verb's exact name/shape (a SLOWLOG-style RESP reply vs a CONFIG-style
subcommand), settled with the #150 admin-command surface.
- Whether the journal is per-shard (matching the shared-nothing core) with a
merged read view, or a single core-0-owned log, and the seed scope this implies
(the #91 per-shard-vs-global seed open question).
- Default retention floors for routine vs safety records, and whether the chain
checkpoint digest is itself exported for external verification.

## Acceptance and test hooks

- Every applied knob change and every rollback/kill-switch trip produces exactly
one chained record carrying snapshot version, from/to, trigger, margin, seed,
and cause; a mid-journal edit is detected as a chain break on read.
- In shadow mode no knob mutates live yet the recommendation log grows with full
schema (asserted against ADR-0013 posture).
- INFO advisor fields and the `/metrics` counters agree with the journal contents
and stay within the #152 cardinality bound under an adversarial knob workload.
- A seeded replay reproduces an identical event stream (the #91 determinism
invariant projected onto the log).
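The determinism hook can be sketched as a test shape. The `replay` function below is a stand-in for the advisor, not its logic: it only shows that when every random choice flows from one seeded generator, the event stream is a pure function of the seed and the trace. One assumption worth flagging is that wall-clock time is left out of the compared events, since it would differ between runs unless the harness injects it.

```python
import random


def replay(seed: int, trace, knobs=("sampled_count", "ghost_size")) -> list:
    """Produce a toy event stream from a seeded generator and a trace."""
    rng = random.Random(seed)  # private generator; never the global one
    events = []
    for logical_time, _key in enumerate(trace):
        if rng.random() < 0.25:
            # Events carry logical time only, so two runs are directly comparable.
            events.append((logical_time, rng.choice(knobs), rng.randint(1, 16)))
    return events


def assert_deterministic(seed: int, trace) -> None:
    """Fail if two replays of the same seed and trace diverge."""
    first = replay(seed, list(trace))
    second = replay(seed, list(trace))
    assert first == second, "seeded replay diverged"
```

A real hook would compare chained journal records instead of tuples, but the comparison itself stays the same equality check.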

## References

- ADR-0013; issues #153, #91, #126, #154, #90, #85, #86, #152, #150, #145, #58,
#1; specs ADVISOR.md, ADVISOR_SAFETY.md, ADVISOR_PROMOTION.md, OBSERVABILITY.md,
CONFIG.md, SECRETS.md, PERSISTENCE.md.
- Claims: [cacheus-experts], [lecar-regret-minimization-smallcache],
[hit-ratio-can-hurt-throughput], [wtinylfu-caffeine-sketch].
112 changes: 112 additions & 0 deletions docs/design/ADVISOR_PROMOTION.md
@@ -0,0 +1,112 @@
# Design: Advisor evaluation and promotion gate

Issue: #154. Decisions: ADR-0013 (advisor default posture is shadow/off).
Related: ADVISOR_SAFETY.md (#91, live rollback, distinct from this pre-promotion
gate), ADVISOR.md (#126, the controller whose candidates are gated),
ADVISOR_AUDIT.md (#153, which records each verdict), TESTING.md/BENCHMARK.md
(#95/#96/#93, the replay harness and oracle), CONFIG.md (#85, the snapshot store).

## Goal and scope

The hard project rule is that an advisor change must beat the tuned static
baseline on replayed traces before it may act. This spec owns that promotion gate:
an offline-replay plus shadow-A/B pipeline that proves a candidate config beats
the live static baseline by a quantified, harness-tuned margin before the
controller is allowed to publish it. It turns the one-time #90 headroom study and
the #93 offline oracle into a continuous gating pipeline that makes the
"no regression below baseline" target enforceable. #153 records what happened;
this decides what is allowed. In scope: the baseline definition, the two gate
stages, the acceptance margin, and the no-regression sign-off. Out of scope: live
rollback after promotion (#91), the controller internals (#126), and the oracle
implementation (#93).

## Design

### The baseline a candidate must beat

- The gate's reference is the tuned static baseline: W-TinyLFU admission
[wtinylfu-caffeine-sketch] over the SIEVE/S3-FIFO eviction floor
[sieve-simpler-than-lru-nsdi24] [s3fifo-small-main-split], with its own knobs
tuned per trace first so the advisor competes against the best deterministic
effort, not a strawman (the #90 measurement hazard). This is the same static
baseline #91's kill-switch reverts to, so "beats baseline on replay" and
"kill-switch target" name one config.

### Stage 1: offline replay against the oracle

- A candidate config is replayed over the trace corpus in the benchmark-only
oracle harness (#93), scoring hit ratio at matched cache sizes and reporting the
gap to the Belady-MIN ceiling and the per-policy gap table [lhd-hit-density].
The candidate must land closer to the MIN ceiling than the tuned baseline does, by
at least the acceptance margin. Scoring is hit ratio off the hot path only; the gate
never runs a per-access shadow simulator on a live request, because a higher hit
ratio reached by hot-path surgery can lower throughput
[hit-ratio-can-hurt-throughput]. Learned-Belady predictors appear here only as
offline ceilings (the #13 non-goal), never as a deployable policy
[parrot-imitation-belady-icml20] [lrb-relaxed-belady-gbm].
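One way to quantify the Stage 1 comparison is the fraction of the baseline-to-MIN gap that the candidate closes. This is an interpretation offered as a sketch: the formula, the function names, and the pass-on-every-row rule are assumptions (the spec leaves every-trace versus weighted-majority open).

```python
def gap_closed(baseline_hr: float, candidate_hr: float, min_hr: float) -> float:
    """Fraction of the (MIN - baseline) hit-ratio gap closed by the candidate.

    Negative when the candidate is worse than the baseline.
    """
    gap = min_hr - baseline_hr
    if gap <= 0:
        return 0.0  # baseline already at the ceiling; nothing left to close
    return (candidate_hr - baseline_hr) / gap


def stage1_pass(rows, margin: float) -> bool:
    """rows: (baseline_hr, candidate_hr, min_hr) per trace and cache size.

    Assumed rule: the candidate must clear the margin on every row.
    """
    return all(gap_closed(b, c, m) >= margin for b, c, m in rows)
```

For example, with a baseline at 0.60, a candidate at 0.64, and MIN at 0.80, the candidate closes about a fifth of the gap. Normalizing by the gap makes margins comparable across traces whose ceilings differ.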

### Stage 2: shadow A/B against the live baseline

- A candidate that passes Stage 1 runs in shadow against live traffic (ADR-0013):
the live baseline serves requests while the candidate is scored on the same
access stream off the hot path. The gate compares candidate vs baseline hit
ratio over a window and requires the candidate to win by the margin with a
no-regression sign-off on the watched throughput-per-core signal. Only a
candidate that clears both stages becomes eligible for the controller to publish
as a new snapshot; in shadow posture it still publishes nothing live, it only
records eligibility (#153).
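The shadow arrangement can be sketched with two toy caches fed the same access stream. LRU and FIFO are stand-ins chosen only because they are short to write; they are not the W-TinyLFU or SIEVE/S3-FIFO policies the spec names, and in a real system the shadow scoring would run off the hot path instead of inline as it does here.

```python
from collections import OrderedDict, deque


class LruCache:
    """Stand-in for the live baseline policy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = OrderedDict()

    def access(self, key) -> bool:
        if key in self.items:
            self.items.move_to_end(key)   # refresh recency on a hit
            return True
        self.items[key] = None
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict least recently used
        return False


class FifoCache:
    """Stand-in for the shadow candidate policy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.order = deque()
        self.members = set()

    def access(self, key) -> bool:
        if key in self.members:
            return True
        self.order.append(key)
        self.members.add(key)
        if len(self.order) > self.capacity:
            self.members.discard(self.order.popleft())  # evict oldest insert
        return False


def shadow_window(stream, live, shadow):
    """Score both policies on one access stream; only `live` would serve traffic."""
    live_hits = shadow_hits = total = 0
    for key in stream:
        live_hits += live.access(key)
        shadow_hits += shadow.access(key)
        total += 1
    if total == 0:
        return 0.0, 0.0
    return live_hits / total, shadow_hits / total
```

The two caches keep separate state, which is one answer to the open question about isolating candidate scoring from the live path's cache state.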

### Acceptance margin and sign-off

- The margin is harness-tuned, not a slogan: a minimum marginal hit-ratio gain
over the tuned baseline at the cache-to-working-set ratios IronCache actually
runs, defended against the operational cost of an adaptive component (the #90
open question). The margin is set conservatively because the adaptive gain
concentrates on small caches and can evaporate or invert on the large,
frequency-dominated caches IronCache expects [lecar-regret-minimization-smallcache]
[cacheus-experts]; the expert pool here is the cheap O(1) controller, not a
per-request ensemble [lecar-regret-min-18x]. A candidate inside the noise band,
or that regresses throughput-per-core, is rejected, not promoted.
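The rejection rules can be written as a small verdict function. This is a sketch under assumptions: the `WindowResult` fields, the verdict strings, the order of the checks, and the idea that a candidate throughput-per-core figure is available as an input are all hypothetical, and the margin and noise band values would come from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowResult:
    """Measurements over one comparison window. Field names are hypothetical."""
    baseline_hit_ratio: float
    candidate_hit_ratio: float
    baseline_throughput_per_core: float
    candidate_throughput_per_core: float


def verdict(w: WindowResult, margin: float, noise_band: float) -> str:
    """Return "eligible" or a rejection reason; nothing is promoted by default."""
    if w.candidate_throughput_per_core < w.baseline_throughput_per_core:
        return "reject:throughput_regression"
    gain = w.candidate_hit_ratio - w.baseline_hit_ratio
    if abs(gain) <= noise_band:
        return "reject:inside_noise_band"
    if gain < margin:
        return "reject:below_margin"
    return "eligible"
```

Checking throughput first means a hit-ratio win can never mask a throughput loss, and a tie falls into the noise band and is rejected like a loss.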

### Relationship to live rollback

- The promotion gate is pre-action and offline-plus-shadow; #91 rollback is
post-action and live. A change must clear this gate to act at all; once acting,
#91's regression detector can still revert it and the kill-switch can still drop
to baseline. The two compose: the gate minimizes how often rollback fires by never
letting an unproven change act, and rollback covers the residual case where
replay and shadow did not predict the live result.

## Open questions

- The exact acceptance margin per knob class and the shadow-A/B window length,
shared with #91's threshold/window open decision and calibrated on the corpus.
- Trace-corpus weighting for the verdict (the #90 in-memory-KV weighting), and
whether Stage 1 must pass on every corpus trace or on a weighted majority.
- Whether shadow A/B is per-shard or global, and how candidate scoring is
isolated from the live serving path's cache state.
- Re-promotion cadence: how often a previously rejected candidate may be re-tried
as the workload drifts, without flapping.

## Acceptance and test hooks

- A candidate that does not beat the tuned static baseline by the margin in Stage
1 replay is never promoted; the gap-to-MIN table (#93) is recorded for the
verdict (#153).
- A candidate that passes Stage 1 but loses or only ties the shadow A/B, or
regresses throughput-per-core, is rejected with a no-regression sign-off failure
[hit-ratio-can-hurt-throughput].
- In shadow posture (ADR-0013) the gate records eligibility but the controller
publishes nothing live.
- A seeded replay of the same candidate and trace yields an identical verdict (the
#91 determinism invariant applied to the gate).

## References

- ADR-0013; issues #154, #91, #126, #153, #90, #93, #95, #96, #85, #13, #1; specs
ADVISOR.md, ADVISOR_SAFETY.md, ADVISOR_AUDIT.md, TESTING.md, BENCHMARK.md,
CONFIG.md.
- Claims: [wtinylfu-caffeine-sketch], [sieve-simpler-than-lru-nsdi24],
[s3fifo-small-main-split], [lhd-hit-density], [hit-ratio-can-hurt-throughput],
[parrot-imitation-belady-icml20], [lrb-relaxed-belady-gbm],
[lecar-regret-minimization-smallcache], [cacheus-experts], [lecar-regret-min-18x].
5 changes: 5 additions & 0 deletions docs/design/README.md
@@ -106,3 +106,8 @@ Specs added as the M1 milestone progresses.
- [ADVISOR.md](ADVISOR.md): the per-shard background advisor (LeCaR/bandit expert
weighting, bounded knobs, atomic RCU config swap, EvictionPolicy-trait binding,
shadow/off default per ADR-0013) (#126).
- [ADVISOR_AUDIT.md](ADVISOR_AUDIT.md): the durable tamper-evident advisor
decision/audit log (knob deltas, trigger, snapshot version, replay evidence,
rollback/kill events), surfaced via INFO/metrics, emitted even in shadow (#153).
- [ADVISOR_PROMOTION.md](ADVISOR_PROMOTION.md): the offline-replay + shadow-A/B
promotion gate proving a change beats the static baseline before it acts (#154).