fix(sdlc): close the dead-lane-after-PR loop — merge-watcher MERGED + headless teardown + dispatch idempotency#3831
Conversation
…headless teardown + dispatch idempotency) Three compounding bugs stranded merged work and stacked zombie launchers: 1. merge-watcher reconcile skipped MERGED. reconcile_stale_pr_states only acted on CLOSED PRs, so a pr_open/merge_queue task whose PR merged outside the run_watcher cursor window never self-healed (soundcloud #3740, meeting-in-screwm #3751 stranded for days). Now MERGED auto-closes (cursor-window independent) and pr:null re-derives from branch via gh pr list --head, or blocks with a reason. 2. headless loop had no merge-aware teardown. The while-true respawn loop re-injected a generic restart prompt forever, never checking whether the task closed. Added task_is_terminal (claim cleared / note left active/ / terminal status / PR merged) so the loop exits cleanly; the lane-supervisor then brings the lane back read-only into idle-await, not a zombie respawn. 3. no dispatch idempotency. A second hapax-claude-headless for a live lane clobbered the lane-keyed stdin pipe + pidfile and stacked zombies (the reboot storm + the supervisor firing during a restart-backoff window). A lane-keyed flock makes the wrapper the single launcher per lane; the lock auto-releases on exit, so the supervisor stays the sole legitimate respawn authority. Tests: reconcile MERGED auto-close + pr:null repair (8); wrapper teardown on claim-clear / terminal-status / PR-merged + duplicate-launcher refusal (7). Task: reform-improve-deadlane-closeout-20260601 AuthorityCase: CASE-CROSS-RUNTIME-COMMS-001 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR expands task-note state reconciliation and adds merge-aware launcher termination detection. The merge watcher gains new query helpers, state-transition logic, and a reworked reconciliation function that scans all active notes, resolves PR states, and repairs null-PR entries. The headless launcher adds per-lane de-duplication via flock and merge-aware task termination to avoid infinite respawn loops. Both systems receive comprehensive test coverage. ChangesTask PR Lifecycle Reconciliation and Launcher Awareness
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
💡 Codex Review
hapax-council/scripts/cc-pr-merge-watcher.py
Lines 695 to 699 in 88062b8
When a task is closed by the new cursor-independent reconcile_stale_pr_states path (the missed-window pr_open/merge_queue case this change is meant to fix), trigger_reform_dispatch is not called because the only dispatch nudge happens before this call and only checks run_watcher's counters["closed"]. In runs where no in-cursor PRs closed but this reconciler closes a merged PR, the task is moved to closed without the event-driven dispatcher nudge, so dependent work remains idle until the periodic RTE poll instead of advancing immediately.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if pr_state == "MERGED": | ||
| if _close_merged_note( | ||
| note, text, pr_num, repo_root=repo_root, dry_run=dry_run, runner=runner | ||
| ): | ||
| counts["closed"] += 1 |
There was a problem hiding this comment.
Honor the hygiene killswitch before closing reconciled tasks
When HAPAX_CC_HYGIENE_OFF=1 is set, run_watcher skips, but main() still calls reconcile_stale_pr_states; with this new MERGED branch that means a supposedly disabled hygiene run can still invoke cc-close and move active task notes to closed/. The file header describes this killswitch as skipping the watcher entirely, so operationally disabling cc hygiene no longer prevents mutations for any active pr_open/merge_queue task whose PR has already merged.
Useful? React with 👍 / 👎.
| if ! flock -n "$LAUNCHER_LOCK_FD"; then | ||
| existing="$(cat "$LAUNCHER_PIDFILE" 2>/dev/null || echo '?')" | ||
| echo "hapax-claude-headless: refusing duplicate launcher for lane '$ROLE' (live launcher pid=$existing holds $LAUNCHER_LOCK)" >&2 | ||
| exit 16 |
There was a problem hiding this comment.
Count a locked launcher as live for supervisor respawns
When the lane supervisor fires while an incumbent wrapper is in restart backoff, scripts/hapax-lane-supervisor still sees the lane as dead because it only checks the child $lane.pid/tmux state; this new lock then makes the duplicate launcher exit 16. The supervisor backgrounds headless respawns and records the attempt anyway, so repeated backoff-window ticks can consume the burst budget and later suppress a real restart even though no replacement launcher was started.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/scripts/test_hapax_claude_headless.py (1)
12-16: ⚡ Quick winAdd explicit encoding for consistency.
The existing code at line 95 uses
encoding="utf-8"explicitly, but this helper omits it. For consistency and to avoid locale-dependent behavior:def _stub_bin(bin_dir: Path, name: str, body: str) -> None: path = bin_dir / name - path.write_text("#!/usr/bin/env bash\n" + textwrap.dedent(body)) + path.write_text("#!/usr/bin/env bash\n" + textwrap.dedent(body), encoding="utf-8") path.chmod(0o755)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/scripts/test_hapax_claude_headless.py` around lines 12 - 16, The helper function _stub_bin writes script files without an explicit encoding; update the Path.write_text call inside _stub_bin to include encoding="utf-8" so the file is written deterministically across locales—locate the _stub_bin(bin_dir: Path, name: str, body: str) function and change the path.write_text invocation to pass encoding="utf-8" along with the script content.scripts/hapax-claude-headless (1)
186-194: 💤 Low valueConsider avoiding
lsin command substitution for glob matching.The
ls ... | head -n1pattern is fragile with filenames containing newlines or special characters. A safer alternative uses bash glob expansion directly:find_active_note() { local task="$1" note="" if [[ -f "$CC_TASK_ACTIVE/$task.md" ]]; then note="$CC_TASK_ACTIVE/$task.md" else local matches=("$CC_TASK_ACTIVE/$task-"*.md) [[ -f "${matches[0]}" ]] && note="${matches[0]}" fi printf '%s' "$note" }Low risk here since task IDs are controlled, but worth aligning with idiomatic bash.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/hapax-claude-headless` around lines 186 - 194, The find_active_note function uses ls piped to head for glob matching which breaks on special/newline characters; replace that command-substitution with safe bash glob expansion: in the else branch create an array like matches=("$CC_TASK_ACTIVE/$task-"*.md) and if [[ -f "${matches[0]}" ]] set note="${matches[0]}", leaving the existing if-check for "$CC_TASK_ACTIVE/$task.md" intact so filenames with spaces/newlines are handled safely; update references to CC_TASK_ACTIVE, task, and note inside find_active_note accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@scripts/cc-pr-merge-watcher.py`:
- Around line 563-565: The regex used in the re.sub call only matches the
literal "null" and misses other YAML null representations; update the
substitution to match the full set of nullish values (use the existing
_PR_NULL_NULLISH token or expand to a non-capturing group like
(?:~|none|null|''|\"\")) so the replacement targets any nullish value; modify
the re.sub invocation that produces new_text (the current line with
re.sub(r"^pr:\s*null\s*$", f"pr: {pr_number}", text, count=1,
flags=re.MULTILINE)) to use the new pattern (e.g.
re.sub(rf"^pr:\s*(?:{_PR_NULL_NULLISH})\s*$", f"pr: {pr_number}", text, count=1,
flags=re.MULTILINE)) so note.write_text and LOG.info behavior remain unchanged.
---
Nitpick comments:
In `@scripts/hapax-claude-headless`:
- Around line 186-194: The find_active_note function uses ls piped to head for
glob matching which breaks on special/newline characters; replace that
command-substitution with safe bash glob expansion: in the else branch create an
array like matches=("$CC_TASK_ACTIVE/$task-"*.md) and if [[ -f "${matches[0]}"
]] set note="${matches[0]}", leaving the existing if-check for
"$CC_TASK_ACTIVE/$task.md" intact so filenames with spaces/newlines are handled
safely; update references to CC_TASK_ACTIVE, task, and note inside
find_active_note accordingly.
In `@tests/scripts/test_hapax_claude_headless.py`:
- Around line 12-16: The helper function _stub_bin writes script files without
an explicit encoding; update the Path.write_text call inside _stub_bin to
include encoding="utf-8" so the file is written deterministically across
locales—locate the _stub_bin(bin_dir: Path, name: str, body: str) function and
change the path.write_text invocation to pass encoding="utf-8" along with the
script content.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: de0df1e4-9354-47b1-af4f-3e7c0b5f3d68
📒 Files selected for processing (4)
scripts/cc-pr-merge-watcher.pyscripts/hapax-claude-headlesstests/scripts/test_hapax_claude_headless.pytests/test_cc_pr_merge_watcher.py
| new_text = re.sub(r"^pr:\s*null\s*$", f"pr: {pr_number}", text, count=1, flags=re.MULTILINE) | ||
| note.write_text(new_text, encoding="utf-8") | ||
| LOG.info("stale PR drain: %s -> re-derived PR #%s from branch %s", note.stem, pr_number, branch) |
There was a problem hiding this comment.
Regex only handles pr: null, not other nullish values.
The substitution pattern ^pr:\s*null\s*$ only matches the literal string "null", but _PR_NULL_NULLISH includes other YAML null representations like ~, none, and empty string. If a note has pr: ~, the PR number won't be written back correctly.
Proposed fix
- new_text = re.sub(r"^pr:\s*null\s*$", f"pr: {pr_number}", text, count=1, flags=re.MULTILINE)
+ # Handle all YAML null representations (null, ~, none, empty)
+ new_text = re.sub(r"^pr:\s*(null|~|none|)\s*$", f"pr: {pr_number}", text, count=1, flags=re.MULTILINE)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| new_text = re.sub(r"^pr:\s*null\s*$", f"pr: {pr_number}", text, count=1, flags=re.MULTILINE) | |
| note.write_text(new_text, encoding="utf-8") | |
| LOG.info("stale PR drain: %s -> re-derived PR #%s from branch %s", note.stem, pr_number, branch) | |
| # Handle all YAML null representations (null, ~, none, empty) | |
| new_text = re.sub(r"^pr:\s*(null|~|none|)\s*$", f"pr: {pr_number}", text, count=1, flags=re.MULTILINE) | |
| note.write_text(new_text, encoding="utf-8") | |
| LOG.info("stale PR drain: %s -> re-derived PR #%s from branch %s", note.stem, pr_number, branch) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@scripts/cc-pr-merge-watcher.py` around lines 563 - 565, The regex used in the
re.sub call only matches the literal "null" and misses other YAML null
representations; update the substitution to match the full set of nullish values
(use the existing _PR_NULL_NULLISH token or expand to a non-capturing group like
(?:~|none|null|''|\"\")) so the replacement targets any nullish value; modify
the re.sub invocation that produces new_text (the current line with
re.sub(r"^pr:\s*null\s*$", f"pr: {pr_number}", text, count=1,
flags=re.MULTILINE)) to use the new pattern (e.g.
re.sub(rf"^pr:\s*(?:{_PR_NULL_NULLISH})\s*$", f"pr: {pr_number}", text, count=1,
flags=re.MULTILINE)) so note.write_text and LOG.info behavior remain unchanged.
Problem
Three compounding bugs left dead-lane work stranded and stacked zombie launchers (audit wv7nc21ym, witnessed repeatedly this session):
reconcile_stale_pr_statesonly acted onCLOSEDPRs — apr_open/merge_queuetask whose PR merged outside therun_watchercursor window never self-healed (live:soundcloudfix(music): SoundCloud adapter 401-resilience — fresh client_id scrape; never empty playlist on auth failure #3740,meeting-in-screwmfeat(screwm): outbound Meet camera — live render to /dev/video50 #3751 stranded for days). Apr: null+pr_opennote matched no reconciler at all.hapax-claude-headlessran an unconditionalwhile truethat re-injected a generic restart prompt forever, never checking whether its task had closed.hapax-claude-headlessfor a live lane clobbered the lane-keyed$ROLE.stdin/$ROLE.pidand stacked zombies (the reboot storm; the lane-supervisor firing during a restart-backoff window).Fix
reconcile_stale_pr_states:MERGED->cc-close(cursor-window independent);pr: null-> re-derive the PR frombranch:viagh pr list --head, write it back and act on its state, or block with a reason. Addsclosed/repairedcounters.task_is_terminaldecides, before each respawn, whether the task is closed (claim cleared / note leftactive// terminal status) or its PR merged — the loop then exits cleanly and the cleanup trap reaps the pipe + lock. The lane-supervisor brings the lane back read-only into idle-await on its next tick, not a zombie respawn.flockmakes the wrapper the single launcher per lane; the lock auto-releases on process exit, so the supervisor stays the sole legitimate respawn authority and a dead launcher needs no stale-pid reaping.The three compose: #1 closes a merged task -> clears its claim -> #2 tears the loop down -> #3 prevents a duplicate from ever racing it.
Tests
tests/test_cc_pr_merge_watcher.py: MERGED auto-close, close-failure non-fatal, dry-run, CLOSED-still-blocks, OPEN-left-alone, pr:null repair (rederive+close / no-PR-blocks / no-branch-blocks) — 8 new.tests/scripts/test_hapax_claude_headless.py: duplicate-launcher refusal (flock), lock-acquire-when-free, teardown on claim-clear / terminal-status / PR-merged + source guards — 7 new.All green;
ruff check/format,bash -n,shellcheckclean; adjacent suites (lane-supervisor, methodology-dispatch, lane-watchdog: 58) pass.Task:
reform-improve-deadlane-closeout-20260601AuthorityCase: CASE-CROSS-RUNTIME-COMMS-001
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Tests