Skip to content

Releases: broomva/bstack

v0.21.1 — Phase 4a content & media migration

25 May 22:45
4b48237

Choose a tag to compare

Phase 4a content & media migration

Three more skills migrated from standalone broomva/<name> repos into the broomva/skills Tier-2 monorepo (broomva/skills PR #4 merge 2f5aec4):

  • blog-post — full-stack blog post production (substantial: 28KB SKILL.md + examples/ + references/ + scripts/publish.sh + templates/). NEW registry entry (was previously bundled / not registered separately).
  • brand-icons — brand icon and visual identity asset generation. Registry entry updated from repo: broomva/brand-iconsrepo: broomva/skills, skillPath: skills/brand-icons/SKILL.md.
  • seo-llmeo — SEO and LLM Engine Optimization (audits, meta tags, structured data, llms.txt). Registry entry updated to monorepo path.

Each source repo carries a redirect-stub README during a 6-month deprecation window (until 2026-11-25):

  • broomva/blog-post PR #1 (merge a7d90b6)
  • broomva/brand-icons PR #1 (merge 2e20534)
  • broomva/seo-llmeo PR #1 (merge 2b635d6)

Files changed

  • references/companion-skills.yaml — 2 entries rewritten (brand-icons, seo-llmeo), 1 new entry added (blog-post)
  • references/skills-roster.md — install commands updated to monorepo paths
  • VERSION0.21.00.21.1 (additive patch — new entries + corrected install paths)
  • CHANGELOG.md — this entry

Pattern note: multi-source-repo migration

Phase 3 migrated 9 sub-skills from ONE bundled source (broomva/strategy-skills/.skills/). Phase 4a tests the multi-source pattern — 3 separate standalone source repos, each with full skill layout (SKILL.md + scripts/ + references/ + assets/). The migration command is now well-rehearsed and ready to crystallize into the bstack skill graduate CLI (Phase 6b).

broomva/blog-post's root canonical content was preserved; 24 IDE-specific dotfile-mirror dirs (.agent/, .claude/, .continue/, etc.) were excluded as deployment artifacts — downstream agents resolve to skills/blog-post/SKILL.md per the agentskills.io spec.


v0.20.0 — Cross-review CLI — restore the P20 mechanism (BRO-1227 Fix B)

22 May 23:43
ac42639

Choose a tag to compare

Cross-review CLI — restore the P20 mechanism (BRO-1227 Fix B)

The P20 (broomva/cross-review) primitive — cross-model adversarial review on substantive PRs before merge — failed reliably during the 2026-05-21 Wave 3 dispatch session: both Cato sub-agent dispatches stalled within 6-7 tool uses with path-resolution errors. The Cato agent was invoked from the miami workspace but asked to read files at ~/broomva/broomva.tech/... — the working tree was on a different branch than the PR's head SHA, so Read-tool calls drifted into "let me locate the actual repo" loops and never produced output.

This release ships Fix B: a bstack cross-review CLI that reads PR contents via gh pr diff + gh api repos/.../contents/<path>?ref=<sha>. Working-tree state is eliminated as a variable — the CLI can be invoked from any cwd; only --repo <owner/name> + PR number matter.

New files (3)

  • NEW scripts/cross-review.py — argparse CLI. Fetches PR metadata (gh pr view), the diff (gh pr diff), and post-change file contents (gh api …/contents/<path>?ref=<head_sha>); skips lock files and >2000-line adds; bundles into a structured codex prompt; invokes codex exec --sandbox read-only --model gpt-5.4 --skip-git-repo-check with a 240s default timeout; parses JSON verdict (with fallback try_parse_json extractor that recovers a balanced {…} object from prose-wrapped output); writes structured JSON to .bstack-cross-review/<pr>.json + markdown to <pr>.md. Verdict schema: verdict (pass/concerns/fail/skipped) × anti_slop_score (0-10) × criticality (high/medium/low) × findings[] × blind_spots_surfaced[] × summary. Optional --post-comment posts the markdown verdict back to the PR. Exit codes: 0 pass · 10 concerns · 20 fail · 30 skipped · 2 invocation/gh failure.
  • NEW bin/bstack-cross-review — thin shim mirroring bin/bstack-wave: dispatches to scripts/cross-review.py, forwards argv unchanged.
  • NEW tests/cross-review.test.sh — 8-test hermetic offline smoke (dispatcher routing, argparse rejection cases, module import, try_parse_json recovery from prose-wrapped JSON, exit_code_for_verdict mapping). No network calls — end-to-end validation is the per-PR --dry-run against real PRs documented in the PR body.

Changed files (3)

  • CHANGED bin/bstack — adds cross-review) dispatch entry routing to bin/bstack-cross-review. Adds usage section "Review" with a one-liner pointing at cross-review <pr-num> --repo <owner/name> (≥ 0.20.0). Adds the canonical invocation to the Examples block.
  • CHANGED SKILL.md — Quick start block lists bstack cross-review with the BRO-1227 Fix B annotation and 0.20.0 introduction marker.
  • CHANGED VERSION0.19.0 → 0.20.0.

Why Fix B over Fix A or Fix C

  • Fix A (add --cwd parameter to Cato dispatch) leaves the working-tree-state coupling intact — every future P20 invocation has to remember to set it, and a stale checkout silently degrades review quality. The failure mode comes back the next time someone uses Cato across repos.
  • Fix B (always read from git, never the working tree) eliminates the failure mode by construction. The bug surface goes away.
  • Fix C (full skill repo + agent definition + tmp-checkout pipeline) is the complete answer but requires writing the ~/.claude/skills/cross-review/ skill, deciding whether the Cato agent stays as the codex-exec frontend or gets re-architected, and managing the tmp-checkout cleanup contract. Larger blast radius — deferred to a follow-up once Fix B has soaked.

Test plan executed

bash -n bin/bstack-cross-review                                      # syntax OK
bash -n bin/bstack                                                   # syntax OK
python3 -m py_compile scripts/cross-review.py                        # OK
bin/bstack --help | grep cross-review                                # 2 lines
bin/bstack cross-review --help | head -20                            # argparse usage
bin/bstack cross-review 195 --repo broomva/broomva.tech --dry-run    # 9 files fetched, 1 lock skipped
bash tests/cross-review.test.sh                                      # 8/8 pass

What's next (not in this release)

  • Apply bstack cross-review to the 3 PRs that merged WITHOUT P20 cross-review last session — broomva.tech#195, #196, life#1427 — and post retro-verdicts as PR comments. Out of scope for this PR (no code change needed; this PR ships the tool).
  • File a follow-up for the full ~/.claude/skills/cross-review/ skill (Fix C scope) once Fix B has soaked through ≥3 P20 invocations.

Backreferences

  • BRO-1227 — P20 cross-review mechanism gap (closes via Fix B)
  • 2026-05-22 session handoff — /Users/broomva/conductor/archived-contexts/broomva/wave-3-dispatch-and-linear-updates/handoffs/2026-05-22-SESSION-HANDOFF.md §"Queued + ready to dispatch"
  • CLAUDE.md §"Cross-Review (P20)" — the discipline rule this mechanism enforces

v0.19.0 — Closure Contract — generalize 5-tuple from 4 RCS layers to N declared arcs

22 May 20:47
7347554

Choose a tag to compare

Closure Contract — generalize 5-tuple from 4 RCS layers to N declared arcs

Builds on v0.18.0 (Phase 8 federation, BRO-47) — the federation registry is the substrate that lets per-workspace arc declarations roll up via bstack status --aggregate. Together, v0.18.0 + v0.19.0 close the substrate-completion arc through the user-defined-arcs layer.

v0.14.0 + v0.16.0 already shipped a 5-tuple (plant, sensor, controller, actuator, termination) for 4 hard-coded RCS layers via assets/templates/rcs-parameters.toml.template + scripts/compute-budget-status.sh. This release lifts the same pattern from those 4 layers to N user-declared domain arcs the workspace actually runs every day (PR greenflow, bookkeeping promotion quality, deploy reliability, etc.).

The closure contract: every arc declares (id, plant_surfaces, sensor, actuator, termination, tau_a, shield_ref). The agent's reasoning is the universal Π (controller) — that's not declared, it's the default binding when actuator.kind == "agent_reasoning". Script / mcp_tool / http actuators bind specific mechanisms while keeping the agent in the supervisory role.

Companion: point-in-time → trend monitoring for composite-ω. compute-budget-status.sh --trend appends one snapshot per call to .control/audit/composite-omega-history.jsonl, then reads the last 7 days and reports slope + verdict (stable_flat | drift_up | drift_down | volatile). Doctor §21 surfaces a hard gap only on drift_down — composite stability shrinking is the signal worth interrupting on.

New files (4)

  • NEW schemas/arcs.v1.json — JSON-schema draft-07 for .control/arcs.yaml. Mirrors the style of schemas/policy.v1.json and schemas/workspaces.v1.json. Required arc fields: id (same character class as workspace registry name), plant_surfaces (free-form URIs), sensor (enum: exit_code | json_path | log_match | metric_threshold), actuator (enum: agent_reasoning | script | mcp_tool | http), termination (enum: predicate | wallclock | score_threshold | exit_zero), tau_a (number, seconds). Optional: shield_ref pointing at a policy.yaml gate.
  • NEW assets/templates/arcs.yaml.template — declarative arcs template with 2 worked examples and heavy commentary mirroring the rcs-parameters.toml.template intro cadence. Example 1: code-pr-greenflow (json_path sensor, agent_reasoning actuator, predicate termination, tau_a=1800s). Example 2: bookkeeping-promotion-quality (exit_code sensor, agent_reasoning actuator, score_threshold termination, tau_a=86400s).
  • NEW scripts/compute-arc-status.sh — per-arc verdict reader; mirrors the shape of scripts/compute-budget-status.sh exactly. Looks at .control/arcs.yaml → falls back to bundled template. For each arc: runs the sensor (bash -c for exit_code/json_path/metric_threshold; regex against log file for log_match), reads most recent termination event from .control/audit/arc-<id>.jsonl, evaluates termination predicate, emits verdict green | yellow | red | unknown. Outputs JSON (default) or --human table. Exit codes: 0 all green, 1 ≥1 red, 2 config missing, 3 python3 unavailable. Ships its own inline minimal YAML parser (modeled on scripts/workspace.py _yaml_minimal_parse) — PyYAML preferred, falls back when absent, both code paths exercised by the test suite.
  • NEW tests/arcs-validation.test.sh + NEW tests/omega-drift-trend.test.sh — hermetic bash test suites in the tests/metrics-pipeline.test.sh style. 6 + 6 tests; both GREEN under system Python (PyYAML, no tomllib path) AND homebrew Python (tomllib, no PyYAML path). Tests exercise schema rejection (schema_version: 99), template loading, override precedence, drift_down / drift_up / stable_flat verdicts on synthetic data, and idempotent history-line writes per --trend call.

Changed files (2)

  • CHANGED scripts/compute-budget-status.sh — adds --trend flag. Without --trend: existing point-in-time behavior preserved. With --trend: appends {ts, omega, per_layer} snapshot to .control/audit/composite-omega-history.jsonl, then reads last 7 days, computes least-squares slope, baseline (median of first day in window), deviation, volatility (CV), and verdict. Verdict heuristic prefers drift detection over volatility when there's a clear directional signal (slope sign matches relative-deviation sign with magnitude > 1%); volatility is the residual category. Trend block surfaces in --human as one extra line and as a top-level trend object in --json.
  • CHANGED scripts/doctor.sh — adds §20 and §21. §20 reads .control/arcs.yaml (informational when absent), reports arc count + completeness count, surfaces last-termination-event timestamp per arc; hard gap only if schema_version != 1. §21 reads composite-omega-history.jsonl, calls compute-budget-status.sh --trend --json, reports last/baseline/slope/verdict; hard gap only if verdict == drift_down.

Test plan executed

bash -n scripts/compute-arc-status.sh                            # syntax OK
bash -n scripts/compute-budget-status.sh                         # syntax OK
bash -n scripts/doctor.sh                                        # syntax OK
python3 -c "import json; json.loads(open('schemas/arcs.v1.json').read())"  # schema parses
bash scripts/compute-arc-status.sh --human                       # reads template, prints table for both arcs
bash scripts/compute-budget-status.sh --trend --human            # writes 1 history line, prints trend line
bash tests/arcs-validation.test.sh                               # 6/6 GREEN under both python envs
bash tests/omega-drift-trend.test.sh                             # 6/6 GREEN
bash scripts/doctor.sh against ~/broomva                         # §20 + §21 visible; 87/89 (2 pre-existing gaps unrelated)

Honest scope caveats

  • The minimal inline YAML parser inside compute-arc-status.sh covers exactly the shape schemas/arcs.v1.json declares. Workspaces that hand-write .control/arcs.yaml with PyYAML-only features (anchors, multi-doc, flow-style) will need PyYAML installed; otherwise stick to the block-scalar shape shown in the template.
  • arc-<id>.jsonl audit-event writers are not shipped in this PR. Termination events are read by compute-arc-status.sh when present; for now, only wallclock and exit_zero terminations evaluate without a prior recorded event. predicate and score_threshold arcs surface yellow (running) until an event lands. Follow-up: add scripts/arc-event-hook.sh so actuators can record verdicts as they close.
  • Verdict thresholds in --trend (1% relative deviation, 10% coefficient of variation) are heuristic and calibrated for the broomva workspace's λ range. Tighten / loosen via follow-up policy.yaml block after rule-of-three failure modes accumulate.
  • The "ω is shrinking" signal in §21 fires only after ≥ 2 history points span the 7-day window. Workspaces that don't periodically invoke --trend (no scheduled call from /loop or a cron) will see only stable_flat regardless of underlying drift.

Spec doc + cross-references

  • Anchored arcs: prior PR (v0.16.0) shipped the 4-layer hard-coded analogue (assets/templates/rcs-parameters.toml.template); v0.14.0 shipped the L3 enforcement; this PR generalizes both surfaces to N user-declared arcs.
  • Why not a new primitive: the Closure Contract is the generalization of the existing (X, U, h, Π, T) substrate that L0–L3 already use. It's a declarative surface lift, not a new reflex. P21 "Closure Contract" promotion candidate logged — promotion to a numbered primitive deferred until rule-of-three concrete failures are recorded (per the L3 stability budget's stability budget for governance churn). The candidate ledger lives in research/entities/pattern/bstack-engine.md per CLAUDE.md §Bstack Engine.

v0.18.0 — Phase 8 — Multi-workspace federation registry

22 May 20:39
fc978ed

Choose a tag to compare

Phase 8 — Multi-workspace federation registry

Closes the substrate-completion-arc Phase 8 backlog item: introduces an opt-in
host-level registry (~/.broomva/global/registry.yaml) that catalogues every
bstack-governed workspace on this machine, plus a bstack status --aggregate
rollup that walks the registry and surfaces cross-workspace composite-ω.
Federation is read-only aggregation — each workspace remains the source
of truth for its own state; the registry is the index, not the database.

This release lands on top of v0.16.0 (multi-layer RCS closure) and v0.17.0
(BROOMVA_ROOT convention). Together they form the substrate that PR2
(v0.19.0, BRO-48 — Closure Contract) generalizes from 4 hard-coded RCS layers
to N user-declared domain arcs.

New scripts + bin (3)

  • NEW bin/bstack-workspace — 90-line bash dispatcher delegating to
    scripts/workspace.py. Subcommands register | list | info | deregister
    with --json + --tag + per-subcommand --path/--name filters. Exit
    codes documented in the help block: 0 ok, 2 invalid args, 3 schema/parse
    error, 4 target not found, 5 name conflict at different path.
  • NEW scripts/workspace.py — 523-line Python registry manager with an
    inline minimal-YAML parser fallback (when PyYAML is absent), atomic
    writes (.tmp + replace), and schema-version checking. Honors
    BSTACK_REGISTRY env (default ~/.broomva/global/registry.yaml) and
    BSTACK_DIR for VERSION detection. SLO: register/list p50 < 100ms.
  • NEW schemas/workspaces.v1.json — JSON-schema draft-07 contract for
    the registry. schema_version: 1 is the only valid value at v0.18.0;
    field bstack_version matches ^[0-9]+\.[0-9]+\.[0-9]+(-[A-Za-z0-9.-]+)?$;
    name matches ^[A-Za-z0-9][A-Za-z0-9._-]*$ (1–64 chars).

New tests (1)

  • NEW tests/workspace.test.sh — 203-line hermetic bash suite, 10 cases:
    fresh register, refresh on same path, name conflict at different path (exit
    5), info --path reports registered: true, deregister by path, deregister
    miss (exit 4), schema_version=99 → exit 3, tag accumulation on refresh,
    invalid name (exit 2), --help block renders. Every test uses
    BSTACK_REGISTRY=$(mktemp) so the host registry is untouched. All 10 GREEN.

Changed (4)

  • CHANGED bin/bstack — dispatcher: new workspace) exec "$BIN_DIR/bstack-workspace" "$@" case + Federation: usage section
    • status --aggregate cross-reference under Observability.
  • CHANGED bin/bstack-status — adds --aggregate (alias
    --multi-workspace) flag. Reads the registry via bash bin/bstack-workspace list --json, then for each entry attempts bash <path>/bstack/scripts/compute-budget-status.sh --json
    (falls back to reading <path>/.control/audit/composite-omega.jsonl),
    emits a table: name × bstack_version × composite_ω × last_seen × verdict.
    JSON form via --json --aggregate. Read-only, no writes anywhere.
  • CHANGED scripts/doctor.sh — adds §20 (Workspace federation registry):
    informational when no registry present (federation opt-in); hard gap on
    schema mismatch (schema_version != 1); soft warning on entries with
    last_seen_at > 30 days. Reads BSTACK_REGISTRY env or default path.
  • CHANGED assets/templates/{SKILL,AGENTS,CLAUDE}.md.template — light
    surface updates so freshly-bootstrapped workspaces document the
    bstack workspace commands. Federation is not a new primitive (no
    P21); it composes existing primitives (Snapshot P15 + multi-layer ω
    from v0.16.0 §19).

Doctor section table after this release

§ Title Source Hard gap when
§14 RCS stability budget compute-lambda.sh any λᵢ ≤ 0
§15 L3 stability gate-flow wiring install-l3-stability.sh (informational only)
§16 L0 plant audit l0-tools.jsonl >10000 events runaway
§17 L1 autonomic reflex compliance l1-reflexes.jsonl compliance < 30%
§18 L2 EGRI promotion throttle l2-promotions.jsonl over τ_a₂ budget
§19 Multi-layer composite health compute-budget-status.sh any layer unstable
§20 Workspace federation registry bin/bstack-workspace list schema_version != 1

Test plan executed

  • bash -n syntax check across all new + modified scripts → clean.
  • bash bin/bstack-workspace --help → renders subcommand block.
  • BSTACK_REGISTRY=/tmp/test.yaml bash bin/bstack-workspace register --path /tmp --name test --json
    → exit 0, action: registered.
  • BSTACK_REGISTRY=/tmp/test.yaml bash bin/bstack-workspace list --json → count: 1.
  • BSTACK_REGISTRY=/tmp/test.yaml bash bin/bstack-workspace deregister --name test --json
    → exit 0, count: 0.
  • bash tests/workspace.test.sh → 10/10 GREEN.
  • bash scripts/doctor.sh against worktree → 87/90 passed (baseline; §20
    fires as informational with no registry — same total).
  • bash scripts/compute-budget-status.sh --human against ~/broomva
    → composite still stable.

Honest scope caveats

  • Federation is local-filesystem only. No network/IPC transport. A
    workspace on a remote host has to register itself locally; cross-host
    rollup is deferred.
  • The registry is read-only aggregation. bstack status --aggregate does
    not mutate any registered workspace's state. CRDT-replicated cross-workspace
    promotion (the swarm-autoresearch-loop primitive) is a future Phase 8.5
    spec, not in this release.
  • last_seen_at is updated on register (which is also "refresh"); it is
    not automatically updated by --aggregate. A future hook may bump
    last_seen_at on every successful compute-budget-status.sh invocation
    per workspace.

Spec doc + cross-references

  • Linear ticket: BRO-47
  • Prior release: v0.17.0 (BROOMVA_ROOT convention, #47, BRO-1223 follow-up)
  • Next release: v0.19.0 (Closure Contract — arcs.yaml + composite-ω drift trend,
    BRO-48) builds on this substrate.
  • Concept entity: research/entities/concept/closure-contract.md (in broomva
    workspace) — captures the 5-tuple generalization Phase 8 helps enable.

v0.16.0 — Multi-layer RCS closure — extending the control loop across L0/L1/L2

22 May 18:05
b89354c

Choose a tag to compare

Multi-layer RCS closure — extending the control loop across L0/L1/L2

Closes the gap identified in PR #45 review: v0.14.0 wired enforcement only at L3 (governance), while L0/L1/L2 had no audit, no per-layer doctor sections, no programmatic feedback into the control loop. The receipts produced by broomva/dogfood were human-read artifacts but did not feed back into the multi-layer stability budget.

This release adds per-layer sensors, audit logs, doctor sections, and a composite multi-layer health report. The dogfood receipt and every tool call become the empirical sensors that calibrate the RCS hierarchy.

New scripts (5)

  • NEW scripts/l0-tool-audit-hook.sh — Claude Code PostToolUse hook. Logs every tool call (tool name, latency_ms, is_error, file_path when applicable) as one JSONL line to .control/audit/l0-tools.jsonl. Always exits 0; never blocks.

  • NEW scripts/l1-reflex-audit-hook.sh — Claude Code Stop hook. Scans the session transcript for evidence of 21 /autonomous reflexes firing (mechanism cube, lens intake, snapshot, dep-chain, worktree decision, ticket, dogfood plan, validation, first write, empirical, PR opened, watcher, healing, cross-review, deploy verify, receipt, PR comments, auto-merge, cleanup, bridge, bookkeeping). Writes per-session compliance bitmask + anti-rationalization-line evaluation to .control/audit/l1-reflexes.jsonl.

  • NEW scripts/l2-promotion-audit-hook.sh — L2 (EGRI / Crystallize P16) candidate-promotion sensor. Called by bookkeeping.py promote step with promotion metadata. Counts promotions in last τ_a₂ window (1h default); enforces budget (5 promotions/window default). Exit 2 with warning when over budget — caller SHOULD defer remaining promotions.

  • NEW scripts/compute-budget-status.sh — Multi-layer health reader. Reads all four audit logs (.control/audit/l[0-3]-*.jsonl) + parameters.toml; computes per-layer observed metrics in each layer's τ_a window; emits composite verdict (stable / stable_warn / unstable) per layer + overall.

  • NEW scripts/install-rcs-stability.sh — Unified multi-layer installer. Delegates L3 setup to install-l3-stability.sh (preserves v0.14.0 behavior), then merges PostToolUse (L0) + Stop (L1) hook entries into .claude/settings.json via _bstack_primitive markers. Creates .control/audit/ directory. Idempotent.

New template (1)

  • NEW assets/templates/settings.json.multi-layer-hooks.snippet — PostToolUse + Stop hook entries. Composes additively with v0.14.0's settings.json.l3-stability-hook.snippet (PreToolUse) — each hook entry is uniquely identified by _bstack_primitive marker (L0-audit, L1-audit, L3-G0) so re-installation is structurally idempotent.

Doctor extensions (4 new sections)

  • CHANGED scripts/doctor.sh adds:
    • §16 L0 plant audit — tool-call count + latency mean + error count over last 10min (informational; hard gap only on >10000 events runaway).
    • §17 L1 autonomic reflex compliance — per-session mean compliance rate over last 24h + dogfood-yes count (hard gap if < 30%; soft warn 30-60%).
    • §18 L2 EGRI promotion throttle — promotions in last τ_a₂ window vs budget; hard gap when over budget.
    • §19 Multi-layer composite health — calls compute-budget-status; surfaces per-layer verdicts (L0=stable L1=stable L2=stable L3=stable form). Hard gap only if any layer "unstable".

Onboard + repair (wired to install-rcs-stability)

  • CHANGED scripts/onboard.sh calls install-rcs-stability.sh after bootstrap (replaces v0.14.0 call to install-l3-stability.sh). Falls back to L3-only installer if multi-layer one is absent.
  • CHANGED scripts/repair.sh runs install-rcs-stability.sh when doctor reports any L0/L1 audit-log gap, missing G0/G1/G2, or unstable λ. Falls back to L3-only installer.

What changes operationally

Before v0.16.0:

  • L0/L1/L2 stability was uncontrolled at the audit level.
  • The dogfood receipt was a PR artifact, not a control-loop sensor.

After v0.16.0:

  • Every tool call logs to L0 audit. Every session ends with an L1 reflex-compliance record. Every bookkeeping promotion records to L2.
  • bstack doctor §19 and bash scripts/compute-budget-status.sh --human show the multi-layer health on demand.
  • The dogfood receipt's anti-rationalization line is parsed by the L1 Stop hook and recorded as a binary signal — receipts auto-feed back into the control loop.

The 4-gate flow now per-layer

Layer Sensor Audit log Doctor section Throttle
L0 plant PostToolUse hook l0-tools.jsonl §16 informational; runaway detection
L1 autonomic Stop hook (transcript scanner) l1-reflexes.jsonl §17 hard gap if compliance < 30%
L2 EGRI bookkeeping promote step l2-promotions.jsonl §18 hard gap if over τ_a₂ budget
L3 governance PreToolUse + git pre-commit + GH Actions (v0.14.0) l3-edits.jsonl §14 + §15 hard gap if λ₃ ≤ 0

Test plan executed

  • bash -n scripts/*.sh — syntax clean on all new + modified scripts
  • compute-budget-status.sh --human against ~/broomva → reports L0=stable L1=stable L2=stable L3=stable; composite ω = 0.006398
  • L0 hook tested with synthetic JSON input → entry appended correctly to l0-tools.jsonl
  • L2 hook tested with --slug --type --score --source → entry appended; over-budget case (5 + 1 in 1h) → exit 2 with warning
  • install-rcs-stability.sh --dry-run on fresh workspace → reports all install steps without writing
  • install-rcs-stability.sh real → L3 (4 files via install-l3-stability) + L0 + L1 hooks merged into settings.json + .control/audit/ created
  • Re-run installer → all hooks skipped via _bstack_primitive markers (idempotent)
  • doctor.sh against ~/broomva → §16-§18 informational (audit logs not yet wired in broomva); §19 reports L0=stable L1=stable L2=stable L3=stable composite. 89/91 passed, 2 pre-existing gaps unrelated.

Honest scope caveats

  • L0 audit-log retention is unbounded by default; rotation policy (policy.yaml [audit_retention]) is deferred to a follow-up.
  • L2 wiring requires bookkeeping.py promote to call l2-promotion-audit-hook.sh. The hook itself ships in this PR; the bookkeeping integration is a small follow-up in the broomva/bookkeeping repo (one-line subprocess.run after the promotion write).
  • L1 transcript-scanner uses heuristic substring matches against the session log. False positives possible (e.g., "interceptor" in a non-validation context counts as r10_empirical). The heuristic is intentionally permissive at v0.16.0; tightening is a follow-up after rule-of-three failure cases accumulate in the audit log.

Spec doc + cross-references

  • Spec: ~/broomva/conductor/workspaces/broomva/doha/docs/reports/2026-05-22-multi-layer-closure-spec.html
  • Prior PR (L3 closure): #45 (v0.14.0, merged 2026-05-22)
  • Dogfood skill: github.com/broomva/dogfood v0.1.0
  • /autonomous flow record: ~/broomva/conductor/workspaces/broomva/doha/docs/reports/2026-05-22-autonomous-flow-achieved.html

v0.14.0 — L3 stability closure — compute + enforce λ in every bstack workspace

22 May 16:22
a6f3a59

Choose a tag to compare

L3 stability closure — compute + enforce λ in every bstack workspace

Closes the gap between the RCS paper (research/rcs/papers/p0-foundations/) and operational reality. Previously, λ₃ ≈ 0.006 was cited in CLAUDE.md and AGENTS.md as static text — no bstack script or hook computed or enforced it. This release wires the math + a four-gate flow (Claude Code hook → git pre-commit → CI → doctor) so every bstack-using workspace inherits computational stability checking on first onboard.

New scripts

  • NEW scripts/compute-lambda.sh — bash CLI that recomputes per-level λᵢ from a workspace's parameters.toml (looks at .control/rcs-parameters.toml, research/rcs/data/parameters.toml, or the bundled template). Implements the formula λᵢ = γᵢ − L_θᵢ·ρᵢ − L_dᵢ·ηᵢ − βᵢ·τ̄ᵢ − ln(νᵢ)/τ_aᵢ. Emits JSON or human-readable; exit 0 if all stable, 1 if any λᵢ ≤ 0, 3 on drift > 1e-4 (--strict).

  • NEW scripts/l3-rate-gate.sh — governance commit rate limiter. Reads L3-class path patterns from .control/rcs-parameters.toml [gates.l3_paths] (default: CLAUDE.md, AGENTS.md, .control/policy.yaml, .control/rcs-parameters.toml, METALAYER.md). Counts L3-class commits in the last τ_a₃ window (default 86400s = 1 day). Exits 0 within budget, 1 exceeded. Supports --staged (include uncommitted-but-staged for pre-commit use) and --warn-only.

  • NEW scripts/install-l3-stability.sh — one-shot installer that deploys the L3 gate flow into a workspace: .control/rcs-parameters.toml + .githooks/pre-commit (G1) + .github/workflows/l3-stability.yml (G2) + .claude/settings.json PreToolUse hook entry (G0). Idempotent; existing files preserved unless --force. Settings.json merge is structurally idempotent (skips if _bstack_primitive: "L3-G0" already present).

  • NEW scripts/l3-stability-pretool-hook.sh — Claude Code PreToolUse hook backend. Receives tool-call JSON on stdin; emits warning + audit-log entry when an Edit/Write/MultiEdit targets an L3 path. Defaults to approve (informational) — never blocks the agent; emits a reason string Claude Code surfaces into context.

New templates

  • NEW assets/templates/rcs-parameters.toml.template — default workspace parameters (L0–L3 calibrated from research/rcs/data/parameters.toml). Includes [derived.lambda] cached values + [gates.l3_paths] patterns. Self-documenting with the formula and customization notes for non-Life runtimes.

  • NEW assets/templates/githook-pre-commit-l3-rate.sh.template — Gate G1 git pre-commit hook. Calls l3-rate-gate.sh --staged; blocks commit if rate exceeded (override with git commit --no-verify). Chains to existing .githooks/pre-commit.local if user had a hook before bstack onboarded.

  • NEW assets/templates/gh-workflow-l3-stability.yml.template — Gate G2 GitHub Actions workflow. Triggers on PRs touching L3 paths; runs compute-lambda.sh + l3-rate-gate.sh; comments on the PR with the per-level λ + composite ω + rate verdict; status check stability-check fails if any λᵢ ≤ 0 (can be made required via branch protection).

  • NEW assets/templates/settings.json.l3-stability-hook.snippet — Gate G0 Claude Code PreToolUse hook entry. Merged into .claude/settings.json by install-l3-stability.sh. Fires on Edit/Write/MultiEdit; backend is scripts/l3-stability-pretool-hook.sh.

Doctor extensions

  • CHANGED scripts/doctor.sh adds two sections:
    • §14 RCS stability budget — calls compute-lambda.sh; reports composite ω; HARD gap if any λᵢ ≤ 0; SOFT gap on drift > 1e-4 under --strict.
    • §15 L3 stability gate-flow wiring — verifies G0 (settings.json hook), G1 (.githooks/pre-commit), G2 (.github/workflows/l3-stability.yml), and rcs-parameters.toml are present in the workspace. Each missing piece prints an [info] line with the install command. Informational only (not a hard gap).

Onboarding + repair

  • CHANGED scripts/onboard.sh calls install-l3-stability.sh after bootstrap. New workspaces get the full L3 gate flow on first install.
  • CHANGED scripts/repair.sh runs install-l3-stability.sh when doctor reports any G0/G1/G2 piece missing, parameters.toml absent, or λᵢ ≤ 0. Idempotent.

What this changes operationally

Before: λ₃ ≈ 0.006 was a citation. The agent could read it in CLAUDE.md and reason about it, but no machine path computed or enforced it.

After: every bstack-using workspace can run bash scripts/compute-lambda.sh --human and see the live λ values for its own parameters. bstack doctor recomputes on every run. bstack onboard installs the four gates. CI fails on λᵢ ≤ 0. Git pre-commit blocks excess governance churn. Claude Code agents see warnings when editing L3 files.

The 4-gate flow:

Gate Trigger Action Blocking?
G0 — Claude Code PreToolUse Edit/Write to L3 path Warning to agent + audit log entry No
G1 — git pre-commit Staged L3 path + over-rate Block commit (bypassable with --no-verify) Yes (bypassable)
G2 — GitHub Actions on PR L3 path changed Comment with λ + rate verdict; fail status if λ ≤ 0 Yes (if branch protection)
G3 — bstack doctor §14 + §15 Every SessionStart + on demand Recompute λ + verify wiring Informational

Test plan executed

  • compute-lambda.sh --human against broomva's parameters.toml → all 4 λᵢ match cached (drift ~ 0)
  • compute-lambda.sh with γ₃ perturbed from 0.01 → 0.001 → λ₃ = −0.0026; exit 1
  • l3-rate-gate.sh 24h window against ~/broomva → 0 commits, within budget, exit 0
  • l3-rate-gate.sh --window=2592000 (30-day) → 142 L3 commits, exceeded, exit 1
  • install-l3-stability.sh against fake workspace → 4 files installed; re-run skipped 3 + idempotent settings.json merge
  • l3-stability-pretool-hook.sh with {"file_path": ".../CLAUDE.md"} → emits reason warning; with {"file_path": ".../foo.ts"} → silent approve
  • doctor.sh against ~/broomva → §14 reports composite ω = 0.006398, all stable; §15 reports G0/G1/G2 missing (informational, since broomva hasn't run install-l3-stability.sh yet)

Why this isn't a new primitive

The four gates are mechanisms, not new primitives. The primitive is the existing RCS L3 stability constraint (cited in CLAUDE.md RCS Hierarchy section and Self-Evolution Protocol). This release implements that constraint in code, where v0.13.0 implemented P11 Empirical operationalization in code. No bstack primitive count change.


v0.13.0 — P11 Empirical operationalization — Dogfood Plan reflex + per-stack cookbook + doctor §13

22 May 15:19
f0ace75

Choose a tag to compare

P11 Empirical operationalization — Dogfood Plan reflex + per-stack cookbook + doctor §13

Closes P11's operationalization gap: the discipline of "validate by interacting" was well-defined, but agents lacked a concrete how keyed to the tech stack the workspace was instantiated from. This release adds:

  • NEW references/dogfood-patterns.md — per-stack cookbook with surfaces matrix (Tauri+sidecar / Next.js / Expo RN / Rust CLI / REST API / MCP server). Each pattern names the canonical arc, the skill toolkit (Interceptor mandatory for visual deploy verification; gstack, cliclick, screencapture, curl+jq compose per stack), the gotchas observed in production, and the receipt template. Anchored by the Houston dogfood-pattern.html worked example.

  • CHANGED references/primitives.md §P11 — adds reflex rule 7 (Dogfood Plan keyed to detected stack): before substantive work, the agent produces a plan (entry surface · driver · evidence · smoke · end-to-end · receipt anchor) in the response and PR body, citing the per-stack pattern from the cookbook. Companion-reference callout points to the cookbook.

  • CHANGED assets/templates/AGENTS.md.template §P11 — propagates the reflex rule 7 to every new bstack'd project AND stubs a ## Dogfood Plan (Stack: TBD) block with the row template, so the first substantive feature work has a concrete anchor to fill.

  • CHANGED SKILL.md — surfaces references/dogfood-patterns.md in the on-demand reference index alongside the canonical primitive contract.

  • CHANGED scripts/doctor.sh — adds §13 P11 Empirical dogfood-readiness. Auto-detects tech stack from repo signals (Cargo.toml + src-tauri/ → tauri-sidecar; next.config.* → nextjs; app.json + expo → expo-rn; Cargo.toml solo → rust-cli; openapi.* or REST-framework deps → rest-api; mcp.{json,yaml} → mcp-server). Verifies a Dogfood Plan anchor exists at one of three accepted locations (AGENTS.md ## Dogfood Plan, docs/dogfood-plan.md, or PR body). Informational — never blocks (rule-of-three not yet hit; promotion to blocking gate requires ≥3 documented incidents per Crystallize P16).

  • CHANGED scripts/onboard.sh — after bootstrap, auto-detects stack and substitutes the ## Dogfood Plan (Stack: TBD) placeholder with the detected stack name. Surfaces the cookbook reference in the next-step receipt. Persists detected stack in the initialization marker.

L3 stability budget — why this isn't P21

P11 already covers "validate by interacting" with full reflex rules. The gap was operationalization (the how), not coverage. Adding a 21st primitive for the cookbook would consume L3 stability budget (λ₃ ≈ 0.006) for no policy delta. Instead: sub-rule + cookbook + doctor check at L2 operationalization layer.

Promotion gating for §13

The §13 check ships as informational (warn-only). Promotion to policy.yaml blocking gate requires (a) ≥3 documented incidents where missing Dogfood Plan caused user-visible regression, (b) the unambiguous blocking criterion ("PR has no Dogfood Plan in body"), (c) failure mode named (P11 ritual without substance), (d) L3 stability budget available. Logged in research/entities/pattern/bstack-engine.md candidate ledger.

Companion artifact

Human-readable record of /autonomous flow integration at ~/broomva/docs/reports/2026-05-22-autonomous-flow-achieved.html (in workspace repo, separate PR). Per P18 Audience: HTML for human reading; this CHANGELOG + the cookbook + the primitives reference stay markdown for agent loading.


v0.10.0 — Skill-evolution benchmark substrate (BRO-1205)

20 May 20:42
a736a7b

Choose a tag to compare

Skill-evolution benchmark substrate (BRO-1205)

Closes the Empirical (P11) substrate gap. Before 0.10.0, bstack had L3 stability margins (λ₃ ≈ 0.006) and a 20-primitive composition graph but no empirical performance number — every P-primitive promotion was faith-based. 0.10.0 ships the harness that makes those claims falsifiable.

Origin: HKUDS/OpenSpace research dive (see research/entities/project/openspace.md + research/notes/2026-05-20-openspace-evolver-synthesis.md in the consuming workspace). OpenSpace's gdpval_bench/ shipped 4.2× higher earned income vs ClawWork baseline + 46% Phase 2 token reduction on GDPVal; this is the bstack-native port of that substrate, refactored to drop OpenSpace coupling and apply P20 Cross-Review discipline (judge model MUST differ from agent model).

  • NEW scripts/bench/ Python package (stdlib only — zero third-party deps):
    • orchestrator.py — two-phase loop (Phase 1 cold → snapshot skills → Phase 2 warm → compare). Subcommands: run | compare | tasks list | status. Resume support via --resume <run-id>. Budget cap via --budget-usd N (exit 4 when exceeded). All-tasks-failed → exit 6.
    • task_loader.pyTask dataclass + JSONL loader. BSTACK_BENCH_TASKS_DIR env override for tests.
    • agent_runner.pyDryRunRunner (canned, deterministic, $0 cost) + StubLiveRunner (clear NotImplementedError pointing to spec). Pluggable contract for future claude-code/codex/vanilla-anthropic runners.
    • evaluator.pyRubricMatchEvaluator (deterministic rubric checks: has_section / sentence_count_at_least / bullet_count_at_least / contains_any) + StubLLMJudgeEvaluator for the future LLM judge. 0.6 quality cliff (matches OpenSpace + ClawWork policy: quality < 0.6 → payment = 0).
    • tasks/bstack-smoke.jsonl — 3 hand-written bstack-themed tasks (Linear ticket triage, PR diff summary, primitive-symptom matching) with simple rubrics.
  • NEW bin/bstack-bench — bash dispatcher. Mirrors bstack-crystallize's shape. Robust Python interpreter discovery (PATH lookup + well-known absolute install paths for restricted-PATH environments).
  • NEW tests/bench-mvp.test.sh — 14-assertion smoke test verifying: dispatcher --help, task set discovery, exit code shape (2/3/4/5/6), JSONL result schema, comparison + REPORT.md generation, Phase 2 token ratio < 1.0 + Δquality ≥ 0 (canned dry-run deltas), compare without args picks latest, status lists runs, budget cap behavior, live-stub-runner clear migration message, skill-snapshot tarball creation.
  • CHANGED bin/bstack — dispatcher: bench subcommand wired alongside crystallize; usage text updated.
  • CHANGED SKILL.md — Quick start section lists /bstack bench triplet.

Design choices

  • Stdlib only. No anthropic, no litellm, no third-party deps. Matches crystallize.py's discipline. CI runners ship Python 3.10+; macOS dev fallback probes /opt/homebrew/bin/python3.X directly.
  • Dry-run is the default. v0.10.0 ships rubric matching + canned responses; live mode is a stub with a clear migration message ("install anthropic SDK + set ANTHROPIC_API_KEY"). This is the responsible /autonomous path — the substrate is exercisable for free; live mode opt-in flips on when SDK+key are wired in a future PR.
  • Two-phase protocol from OpenSpace, not the SQLite content-snapshot+diff lineage. Git already gives us lineage on entity pages + skills; we don't need OpenSpace's content-snapshot SQLite schema.
  • 0.6 quality cliff preserved (OpenSpace + ClawWork compatibility). Payment is $0 below cliff, full task_value_usd above.
  • Skill snapshot is synthetic in dry-run. _simulate_phase1_skill_dir() mints a tiny fake phase1-skills/ between phases so the snapshot tarball path is exercised without touching ~/.claude/skills/.
  • P20 Cross-Review forward compatibility. The evaluator docstring documents the upcoming constraint: judge model MUST differ from agent model. Enforcement lands when the LLM judge stub is replaced.
  • State location follows bstack convention. ~/.config/bstack/bench/runs/<run-id>/ (mirrors ~/.config/broomva/p7/, ~/.config/broomva/p8-janitor/). Override via BSTACK_BENCH_HOME env var (used by tests).

What this enables

  • Future PRs can wire FIX/DERIVED/RETIRE sub-modes of Crystallize (P16) — extending P16 from CAPTURED-only to all four sub-modes (the OpenSpace decomposition). Per-skill telemetry counters (total_selections / total_applied / total_completions / total_fallbacks) and metric-driven evolution triggers are the next layer; this PR is the measurement substrate they build on.
  • The live runner stub is the integration point for the Anthropic SDK / claude --print subprocess path. Once wired, the same harness runs real-LLM benchmarks against any bstack-instrumented agent.
  • The substrate composes with P9 (p9 watch long-running benches), P12 (persist iterate multi-hour campaigns), and P19 (Orchestrate cube cell selection for bench shape).

Linked artifacts (in the consuming workspace, not in this repo)

  • Spec: bstack/specs/bench-skill-evolution.md
  • Project entity: research/entities/project/openspace.md (9/9 Nous)
  • Concept entity: research/entities/concept/skill-self-evolution.md (7/9 candidate)
  • Synthesis: research/notes/2026-05-20-openspace-evolver-synthesis.md
  • Linear ticket: BRO-1205

Cross-Review (P20) round-1 fixes (applied before merge)

A fresh-context subagent under devil's-advocate brief scored the first push at 5/10 (below the ≥7/10 P20 threshold) and surfaced four correctness defects + dead code. Round 1 closes all four with adversarial tests (one per defect, written to falsify the bug existed, not to confirm the fix works):

  • Defect #1 — evaluator stub leaked raw traceback. _run_task caught NotImplementedError from runner.run but not evaluator.evaluate. --evaluator llm-judge now exits 6 with a clean stderr migration message, mirroring the runner path. Test #15.
  • Defect #2 — budget cap broken on resume. spent initialized at 0.0 in cmd_run, ignoring prior-session costs in the existing JSONL. --resume --budget-usd 0.01 could spend a fresh $0.01 each session. Fixed: when --resume, sum cost_usd from every existing phase-results row before the phase loop. Refuses to start if prior cost already exceeds budget. Test #16.
  • Defect #3 — aggregate double-counted resumed tasks. _read_existing_results returned every JSONL row; a task that failed then re-ran successfully produced two rows under the same task_id, inflating task_count and total_tokens. Fixed: last-write-wins dedup by task_id in _read_existing_results. Resume-completion contract preserved (success row, if any, is last). Test #17.
  • Defect #4 — compare emitted phantom regression on phase-1-only runs. _emit_compare accepted empty phase2 lists and reported "Phase 2 = 0 tokens, Δquality = -0.8" — noise masquerading as data. Fixed: refuse to compare with exit 7 + clear message when either phase is empty. Test #18.
  • Cleanups. Removed dead --workers argparse flag, unused asdict + EvaluationResult imports, dead tasks.set_defaults(func=cmd_tasks) (subparser required=True makes it unreachable), and the # pragma: no cover (defensive) slop tell.

Test count: 14 → 18 assertions. New exit code 7 documented in dispatcher + orchestrator headers.

v0.9.5 — Crystallization detection (Phase 7 of substrate completion)

18 May 22:34
c50e982

Choose a tag to compare

Crystallization detection (Phase 7 of substrate completion)

Closes substrate-completion gap 4.4.1 — until now, P16 (rule-of-three crystallization) ran in the user's head. Phase 7 ships machine-assist: bstack crystallize candidates scans docs/conversations/*.md for patterns that recur in ≥3 distinct sessions with explicit failure-mode and acknowledgement signals. Candidates are surfaced for human approval; the substrate never auto-promotes a primitive.

  • NEW scripts/crystallize.py — rule-of-three pattern detector (Python). Heuristics per spec §6 Phase 7:
    • Phrase recurs in ≥--min-sessions distinct conversation files (default 3)
    • ≥1 occurrence co-locates within a 200-char window of a failure-mode keyword (failed, orphaned, race, regression, shipped broken, …)
    • ≥1 occurrence co-locates with a repetition-acknowledgement keyword (again, twice, third time, recurring, had to redo, …)
    • Substring suppression: keep shorter phrase when it strictly recurs more often than a longer one (the longer is a phrasing variant; the shorter is the recurring kernel)
    • n-gram window: 2–4 tokens; 2-grams require both tokens be content (no stop-words); 3+-grams reject stop-word prefix or suffix and require ≥⌈n/2⌉ content tokens
    • Citation excerpts are scrubbed for common secret patterns (sk-…, ghp_…, xoxb-…, AWS/GCP keys, JWTs, generic password:/token: assignments) before emission — excerpts may flow into PR comments and CI artifacts
    • Failure/ack keyword lists overridable via CRYSTALLIZE_FAILURE_KEYWORDS / CRYSTALLIZE_ACK_KEYWORDS env vars
  • NEW bin/bstack-crystallize — thin bash dispatcher; delegates to scripts/crystallize.py. Defaults --conversations to $BSTACK_CONVERSATIONS$BROOMVA_WORKSPACE/docs/conversations$PWD/docs/conversations.
  • NEW bstack crystallize candidates [--json] [--conversations <dir>] [--min-sessions <N>] [--limit <N>] — surface detected candidates with citations + signal summaries.
  • NEW bstack crystallize promote <slug> [--json] — draft a primitive scaffold (auto-detected pattern + failure mode + ack signals + citations + P16 manual-gate checklist). Explicitly does not auto-merge a primitive; the scaffold is a starting point, not a decision.
  • NEW tests/canary/05-crystallize.test.sh — 14 assertions covering: fixtures present, --help lists both subcommands, --json shape, known squash-merge-race pattern surfaces at ≥3 sessions, --min-sessions=99 returns 0 (no false positives), promote scaffold contains DRAFT + auto-merge disclaimer + P16 reference, unknown subcommand exits 2, missing conversations directory exits 3, unknown promote slug exits 4.
  • NEW tests/fixtures/conversations/{positive-1..4,negative-1..2}.md — 4 fixtures sharing the squash merge race rule-of-three pattern + 2 negative fixtures (no recurrence / no failure-mode signal).
  • CHANGED bin/bstack — dispatcher: crystallize subcommand wired alongside skills; usage text updated.

Design choices

  • Detection, not promotion. Phase 7's contract is that P16 stays a deliberate human decision. The scaffold output explicitly disclaims auto-merge and surfaces the four manual P16 gates (concrete mechanism, stated invariant, stated failure mode, short name).
  • Bounded false-positive risk. Per the spec §8 risk table, the failure mode is false-positive ritual detection. Mitigation: candidates are surfaced for human review; the substring-suppression rule keeps the recurring kernel rather than every phrasing variant; the failure-mode + ack co-occurrence filter rejects phrases that recur without an actual problem signal.
  • No new dependencies. Pure Python stdlib + standard jq (already a canary-suite dependency). The detector runs on the same Python interpreter that runs scripts/wave.py and the measure-*.sh setpoint scripts.

SLO targets (introduced)

  • bstack crystallize candidates (fixtures, ~6 files): p50 < 200ms, p99 < 1s
  • bstack crystallize candidates (workspace, ~50 files): p50 < 2s, p99 < 5s
  • bstack crystallize promote <slug>: p50 < 200ms, p99 < 1s (re-runs detection then formats one candidate)

Exit codes

  • 0 success (zero or more candidates surfaced)
  • 2 invalid arguments / unknown subcommand
  • 3 conversations directory missing
  • 4 promote slug not found in current candidate set

Out of scope for v0.9.5 (deferred)

  • Setpoint-history-driven trend detection (open question §10.1)
  • Cross-workspace candidate aggregation (depends on Phase 8 federation)
  • Auto-PR drafting via gh pr create from crystallize promote (deliberate — keeps P16 a manual decision)

v0.9.0 — Vendored upgrade path + canary suite (Phase 6 of substrate completion)

18 May 21:52
fd4843b

Choose a tag to compare

Vendored upgrade path + canary suite (Phase 6 of substrate completion)

Closes two v1.0 blockers from the substrate completion spec (§4.3.2, §4.6.2). Vendored installs (npx skills add produces these — no .git) can now self-upgrade via release tarball + sha256 verification + atomic swap. The canary suite verifies the substrate's load-bearing contracts hold on a fresh install — runs on every PR.

  • NEW bstack upgrade --self for vendored installs (extends bin/bstack bstack_upgrade_vendored):
    • Downloads bstack-vX.Y.Z.tar.gz from the GitHub Release
    • Downloads matching .sha256 sidecar
    • Mandatory sha256 verification — no --skip-sha256 flag; fail-closed on mismatch
    • Atomic swap via mv current → .bak, mv new → install; rollback on swap failure
    • BSTACK_DRY_RUN=1 env override prints the plan without writing
    • Falls back to manual npx skills add guidance if tarball missing (pre-v0.9.0 releases)
    • Structured log at ~/.bstack/auto-upgrade.log
  • CHANGED .github/workflows/release.yml — new Package + publish vendored upgrade tarball step:
    • Builds bstack-vX.Y.Z.tar.gz from the in-repo skill payload (excludes .git, .github, tests, worktree dirs)
    • Uses tar --sort=name for byte-deterministic tarballs
    • Computes sha256, uploads tarball + sha256 sidecar via gh release upload --clobber
  • NEW tests/canary/01-fresh-bootstrap.test.sh — Plant Contract verification on a fresh workspace (10 assertions: bootstrap exits 0, governance files scaffold, hooks wired for SessionStart/Stop/PreToolUse, doctor produces expected summary, doctor exits 0 per HC-1)
  • NEW tests/canary/02-metrics-pipeline.test.sh — Phase 1 (v0.4.0) end-to-end: collect produces valid JSON, latest.json written, observe single-setpoint returns id-matched output
  • NEW tests/canary/03-status-surface.test.sh — Phase 2 (v0.5.0) end-to-end: 7 core sections render, --json shape valid, --setpoint deep-view, --aggregate Phase 8 placeholder
  • NEW tests/canary/04-schemas-validate.test.sh — Phase 3 (v0.6.0) contracts: 4 schemas valid draft-07, primitives.yaml validates, companion-skills.yaml validates, policy.yaml.template validates (top-level shape + flat-schema parts)
  • CHANGED .github/workflows/ci.yml — new canary job gated on lint + doctor; installs jq + jsonschema + PyYAML; runs tests/canary/*.test.sh

SLO targets (introduced)

  • bstack upgrade --self (vendored, cold network): p50 < 30s, p99 < 60s
  • canary suite (4 tests, sequential): p99 < 30s

Supply-chain safety

  • sha256 verification mandatory — no bypass flag
  • Atomic swap with .bak rollback on failure
  • Tarball excludes ephemeral state and CI tooling — only the canonical skill payload ships
  • Backup retained until swap completes successfully

Out of scope for v0.9.0 (deferred to v0.9.1)

  • Cosign signature verification — sha256 covers the principal integrity concern; cosign adds publisher identity verification
  • bstack reproduce subcommand (drift detection vs fresh-install reference)
  • Canary tests 05-08 (skills auto-install, gates audit, release pipeline E2E) — ship as Phase 4-6 deliverables stabilize