From 82c5985132d17ab4ae8f32f4f7b295e7918b37d2 Mon Sep 17 00:00:00 2001 From: Corey Ryan Dean Date: Tue, 9 Jun 2026 16:45:18 -0500 Subject: [PATCH] fix(runtime): make Delete Each walk-safe + erase GC refs on object delete Root-causes the long-standing downstream (rcce2) CI flake: intermittent exit-time access violations in ItemsTest / OnlinePlayerChainTest, historically mislabeled "Stack overflow!" by the pre-fix seTranslator. Two defects, both confirmed by evidence: 1. _bbObjDeleteEach captured next=obj->next before _bbObjDelete(obj), but the delete's field-release cascade can drop another node's ref count to zero and relocate it from used to free -- including the captured next (chain shape: a zombie kept alive only by its predecessor's field). The walk then terminates at the free sentinel, silently leaving the rest of the list undeleted, or wanders invalid linkage. Deterministic repro: a 1000-node chain with every other node zombified left 499 survivors. Fixed with a restart-on-delete walk (re-read used.next after every delete; step over zombies in place). 2. Delete / Delete Each never erased the pointer from the GC reference_map, so a later _bbRelease on the stale entry re-ran _bbObjDelete against a freed -- possibly recycled -- slot, releasing garbage fields (heap corruption surfacing at exit). _bbObjDelete now erases its pointer from reference_map; the GC's own delete path already erased before calling it, so that path double-erases as a no-op. Also upgrades seTranslator to append the exception code, module-relative faulting address, and AV access kind/target to panic messages -- the instrumentation that produced the evidence (pre-fix baseline: 2/100 ItemsTest, 12/100 OnlinePlayerChainTest failures, all 0xC0000005 at a stable program-relative offset). Co-Authored-By: Claude Fable 5 --- docs/notes/exit-teardown-crash.md | 87 +++++++++++++++++++++ src/blitzrc/bbruntime/basic.cpp | 28 ++++++- src/blitzrc/bbruntime_dll/bbruntime_dll.cpp | 42 ++++++++++ tests/DeleteEachChainTest.bb | 78 ++++++++++++++++++ tests/DeleteEachGCRefMapTest.bb | 65 +++++++++++++++ 5 files changed, 297 insertions(+), 3 deletions(-) create mode 100644 docs/notes/exit-teardown-crash.md create mode 100644 tests/DeleteEachChainTest.bb create mode 100644 tests/DeleteEachGCRefMapTest.bb diff --git a/docs/notes/exit-teardown-crash.md b/docs/notes/exit-teardown-crash.md new file mode 100644 index 0000000..1292001 --- /dev/null +++ b/docs/notes/exit-teardown-crash.md @@ -0,0 +1,87 @@ +# Working note — exit-time teardown crash (the rcce2 ItemsTest / chain-test CI flake) + +Branch: `fix/exit-teardown-crash` +Type: Bug / Reliability + +## Goal (restated) + +rcce2 CI has a long-standing intermittent crash: `ItemsTest.bb` dies with +"Stack overflow!" and `OnlinePlayerChainTest.bb` with "Memory access violation" +*after all tests pass*, sometimes with `Active objects: 0` already printed. +Two rcce2-side mitigations (PR #313 start-of-test pool sweeps, PR #322 trailing +`Delete Each` teardown sweeps) reduced the rate from ~30-40% to ~5-7% but the +root cause was deferred to "the BlitzCC runtime pool-walker fix" and never +diagnosed. The historical "Stack overflow!" label proves nothing — until the +seTranslator fall-through fix, *every* native fault carried that label; the +real fault class was never observed. + +Goal: identify the mechanism with evidence (fault address, deterministic +repro), fix it minimally in the runtime, and pin it with regression tests. + +## Suspected mechanisms (to confirm or kill, in order) + +1. **Dual-refcount desync (GC layer vs classic pool)** — `_bbObjDeleteEach` / + `_bbObjDelete` (`bbruntime/basic.cpp:288-323`) maintain `BBObj::ref_cnt` + and the used/free pool lists, but never touch the GC `reference_map`. + A `Delete Each T` sweep (exactly what the rcce2 band-aids added) frees / + recycles objects whose pointers remain counted in `reference_map`. When a + GC-tracked reference later releases (`_bbRelease(vPtr,"BBCustom")`, + basic.cpp:769-815), count hits 0 and `_bbObjDelete` runs **on freed or + recycled memory** — releasing garbage STR/OBJ/VEC "fields" → heap + corruption → delayed AV at exit (e.g. `basic_destroy`'s `usedStrs` walk), + counters already 0. Intermittency = whether the slot was recycled and what + now occupies it. +2. **`_bbObjDeleteEach` captured-next invalidation** — the walk captures + `next=obj->next` before `_bbObjDelete(obj)`; the delete's field-release + cascade can drop `next`'s ref_cnt to 0 and move it to the free list + (chain shape: each node holds the only ref to the next — exactly the + chain tests). The walk then continues through free-list linkage. +3. Genuine recursion depth in a release cascade (considered least likely: + `_bbObjDelete`'s field loop is iterative; only nested object graphs + recurse one level per edge). + +## Approach + +1. **Instrument first**: extend `seTranslator` (bbruntime_dll.cpp) to append + the exception code, faulting instruction address, and module-relative + offset (module name + base via `VirtualQuery`/`GetModuleFileName`) to the + panic string. Permanent diagnostic upgrade; the `-t` path prints it to + stdout where CI logs capture it. +2. **Deterministic probes** (uncommitted scratch .bb files first): (a) chain + `Delete Each` survivor/ordering probe, no-GC; (b) GC-mode probe: build + chain, `Delete Each`, force recycling via new allocations, then let + GC-tracked references release. If a probe crashes or misbehaves + deterministically, the mechanism is pinned without statistics. +3. **Statistical baseline**: loop rcce2's `test.bat ItemsTest` and + `test.bat OnlinePlayerChainTest` (>=100 runs each) against the + instrumented runtime; record failure rate + fault offsets; map offsets to + functions via the linker map / dumpbin. +4. **Fix minimally** at the confirmed site. Candidate shapes: erase the + pointer from `reference_map` inside `_bbObjDelete` (closing the desync); + restart-walk in `_bbObjDeleteEach` (mirrors rcce2's own + "Restart-on-Delete" iterator doctrine). Only what evidence supports. +5. **Pin**: BlitzForge `tests/` regression (long-chain create/Delete Each/ + exit, both GC and no-GC shapes); rerun the statistical loop (>=200 clean). +6. **Land**: BlitzForge PR → merge → rcce2 submodule-bump PR → update rcce2 + CLAUDE.md flake note. + +## Fallback floor (if root cause resists one iteration) + +The instrumentation, measured baseline, probe results, and this note merge +anyway — converting an opaque flake into an addressable bug with evidence. + +## Files touched (planned) + +- `src/blitzrc/bbruntime_dll/bbruntime_dll.cpp` — seTranslator diagnostics. +- `src/blitzrc/bbruntime/basic.cpp` — the fix (site TBD by evidence). +- `tests/` — regression test(s). +- this note. + +## Acceptance criteria + +- Panic messages include exception code + module-relative fault address. +- Root cause identified from evidence (address/probe), not the symptom label. +- Baseline failure rate measured pre-fix; >=200 consecutive clean runs of both + flaky rcce2 tests post-fix. +- BlitzForge suite green; full rcce2 compile + suite green on the bump. +- rcce2 CLAUDE.md "Known intermittent flake" updated. diff --git a/src/blitzrc/bbruntime/basic.cpp b/src/blitzrc/bbruntime/basic.cpp index 0d3268c..01f4e40 100644 --- a/src/blitzrc/bbruntime/basic.cpp +++ b/src/blitzrc/bbruntime/basic.cpp @@ -308,17 +308,39 @@ void _bbObjDelete( BBObj *obj ){ handle_map.erase( it->second ); object_map.erase( it ); } + // GC desync guard: Delete / Delete Each reach here without going through + // the GC layer, leaving the pointer counted in reference_map. A later + // _bbRelease on that stale entry would re-run _bbObjDelete against a + // freed -- possibly recycled -- slot and release garbage "fields" + // (heap corruption surfacing as the intermittent exit-time crash). + // Erasing here makes the stale release a no-op; the GC's own delete + // path erases before calling us, so this is a no-op double-erase there. + reference_map.erase( reinterpret_cast(obj) ); obj->fields=0; _bbObjRelease( obj ); --objCnt; } void _bbObjDeleteEach( BBObjType *type ){ + // Restart-on-delete walk. The field-release cascade inside _bbObjDelete + // can drop another node's ref count to zero and move it from `used` to + // `free` -- including the would-be captured `next` (chain shape: a + // zombie kept alive only by its predecessor's field). A captured-next + // walk that steps onto a node relocated to the free list terminates at + // the free sentinel, silently leaving the rest of the list undeleted + // (deterministic repro: tests/DeleteEachChainTest.bb), or wanders + // invalid linkage. Re-reading used.next after every delete only ever + // observes live list state; zombies (fields==0) are stepped over + // in-place, and each delete removes at least one fielded node from + // `used`, so the walk terminates. BBObj *obj=type->used.next; while( obj->type ){ - BBObj *next=obj->next; - if( obj->fields ) _bbObjDelete( obj ); - obj=next; + if( obj->fields ){ + _bbObjDelete( obj ); + obj=type->used.next; + }else{ + obj=obj->next; + } } } diff --git a/src/blitzrc/bbruntime_dll/bbruntime_dll.cpp b/src/blitzrc/bbruntime_dll/bbruntime_dll.cpp index d965736..65a4a87 100644 --- a/src/blitzrc/bbruntime_dll/bbruntime_dll.cpp +++ b/src/blitzrc/bbruntime_dll/bbruntime_dll.cpp @@ -52,6 +52,27 @@ static void killer(){ } #endif +// Describe a code/data address as "module+0xOFFSET" when it falls inside a +// loaded module, else as a raw pointer. Module-relative offsets stay stable +// across ASLR runs, so an intermittent fault becomes attributable to a +// specific function via the linker map even when it only fires 1-in-20 runs. +static string describeAddress( void *addr ){ + char buf[160]; + MEMORY_BASIC_INFORMATION mbi; + if( addr && VirtualQuery( addr,&mbi,sizeof(mbi) ) && mbi.AllocationBase ){ + char path[MAX_PATH]; + DWORD n=GetModuleFileName( (HMODULE)mbi.AllocationBase,path,MAX_PATH ); + if( n ){ + const char *base=path; + for( const char *p=path;*p;++p ) if( *p=='\\'||*p=='/' ) base=p+1; + sprintf( buf,"%s+0x%X",base,(unsigned int)((char*)addr-(char*)mbi.AllocationBase) ); + return buf; + } + } + sprintf( buf,"0x%p",addr ); + return buf; +} + static void _cdecl seTranslator( unsigned int u,EXCEPTION_POINTERS* pExp ){ string panicStr = "Unknown runtime exception"; @@ -70,6 +91,27 @@ static void _cdecl seTranslator( unsigned int u,EXCEPTION_POINTERS* pExp ){ break; } + // Append fault diagnostics: exception code, faulting instruction + // (module-relative), and for access violations the access kind and + // target address. Without this every intermittent native crash is an + // unattributable one-liner; with it the offset maps to a function via + // the linker map. Stack-overflow handling stays minimal on purpose -- + // describeAddress allocates stack, and the guard page is already blown. + if( pExp && pExp->ExceptionRecord && u!=EXCEPTION_STACK_OVERFLOW ){ + EXCEPTION_RECORD *er=pExp->ExceptionRecord; + char info[64]; + sprintf( info," [code 0x%08X at ",(unsigned int)er->ExceptionCode ); + panicStr+=info; + panicStr+=describeAddress( er->ExceptionAddress ); + if( u==EXCEPTION_ACCESS_VIOLATION && er->NumberParameters>=2 ){ + const char *kind=er->ExceptionInformation[0]==0 ? "reading" : + (er->ExceptionInformation[0]==1 ? "writing" : "executing"); + sprintf( info,", %s 0x%08X",kind,(unsigned int)er->ExceptionInformation[1] ); + panicStr+=info; + } + panicStr+="]"; + } + bbruntime_panic( panicStr.c_str() ); } diff --git a/tests/DeleteEachChainTest.bb b/tests/DeleteEachChainTest.bb new file mode 100644 index 0000000..625e149 --- /dev/null +++ b/tests/DeleteEachChainTest.bb @@ -0,0 +1,78 @@ +Strict + +// Regression: _bbObjDeleteEach captured-next invalidation. +// +// `Delete Each T` used to capture `next = obj->next` before deleting obj. +// The field-release cascade inside _bbObjDelete can drop another node's +// ref count to zero and move it from the type's `used` list to its `free` +// list -- including that captured `next`. The classic trigger is a chain +// where a zombie (explicitly Deleted, fields gone) is kept alive only by +// its predecessor's field: deleting the predecessor releases the zombie +// to the free list, the walk steps onto it, follows free-list linkage to +// the free sentinel (type == 0), and terminates -- silently leaving the +// rest of the list undeleted. Before the fix the zombie-chain test below +// left 499 of 1000 nodes alive. +// +// This is the runtime-side root of the long-standing rcce2 CI flake +// (ItemsTest / chain-test exit crashes): early-terminated or free-list- +// wandering sweeps. Fixed by a restart-on-delete walk in _bbObjDeleteEach. + +Type ChainNode + Field nxt.ChainNode +End Type + +Function liveChainNodes%() + Local n% = 0 + Local s.ChainNode = Null + For s = Each ChainNode + n = n + 1 + Next + Return n +End Function + +Function buildChain.ChainNode(count%) + Local head.ChainNode = New ChainNode() + Local cur.ChainNode = head + Local i% + For i = 2 To count + cur\nxt = New ChainNode() + cur = cur\nxt + Next + Return head +End Function + +// Plain forward chain: every node must be swept. +Test deleteEachFullChain() + Local head.ChainNode = buildChain(1000) + head = Null + Delete Each ChainNode + Assert(liveChainNodes() = 0) +End Test + +// Zombie-interleaved chain: Delete every other node first (each zombie's +// memory survives only through its predecessor's field), then sweep. The +// pre-fix walk terminated at the first zombie released mid-sweep. +Test deleteEachZombieChain() + Local head.ChainNode = buildChain(1000) + + Local z.ChainNode = head\nxt + While z <> Null + Local znext.ChainNode = Null + If z\nxt <> Null Then znext = z\nxt\nxt Else znext = Null + Delete z + z = znext + Wend + + head = Null + Delete Each ChainNode + Assert(liveChainNodes() = 0) +End Test + +// Sweep twice: the second sweep must see an empty list and no stale state. +Test deleteEachIdempotent() + Local head.ChainNode = buildChain(50) + head = Null + Delete Each ChainNode + Delete Each ChainNode + Assert(liveChainNodes() = 0) +End Test diff --git a/tests/DeleteEachGCRefMapTest.bb b/tests/DeleteEachGCRefMapTest.bb new file mode 100644 index 0000000..3933a3f --- /dev/null +++ b/tests/DeleteEachGCRefMapTest.bb @@ -0,0 +1,65 @@ +Strict +EnableGC + +// Regression: dual-refcount desync between the classic object pool and the +// GC reference_map. +// +// Delete / Delete Each free pool-side objects directly (_bbObjDelete) and +// used to leave the pointer still counted in the GC reference_map. A later +// _bbRelease on that stale entry hit count==0 and re-ran _bbObjDelete +// against a freed -- possibly recycled -- slot, releasing garbage "fields" +// (heap corruption surfacing as the intermittent exit-time access +// violation in downstream rcce2 CI: ItemsTest / OnlinePlayerChainTest). +// _bbObjDelete now erases the pointer from reference_map, so RefCount on a +// pool-deleted object must read 0 and the eventual scope-end release must +// be a no-op. + +Type RefMapNode + Field nxt.RefMapNode + Field tag$ +End Type + +Test gcRefMapClearedByDeleteEach() + Local n.RefMapNode = New RefMapNode() + n\tag = "tracked" + Assert(RefCount(n) >= 1) + Delete Each RefMapNode + // Pre-fix this read the stale reference_map count (>= 1). + Assert(RefCount(n) = 0) +End Test + +Test gcRefMapClearedByExplicitDelete() + Local n.RefMapNode = New RefMapNode() + Assert(RefCount(n) >= 1) + Delete n + Assert(RefCount(n) = 0) +End Test + +// Sweep + recycle + release: the historical crash shape. With the erase in +// place the stale releases are no-ops and the recycled occupants survive. +Test gcStaleReleaseAfterRecycleIsBenign() + Local head.RefMapNode = New RefMapNode() + Local cur.RefMapNode = head + Local i% + For i = 1 To 200 + cur\nxt = New RefMapNode() + cur\nxt\tag = "n" + i + cur = cur\nxt + Next + cur = Null + Delete Each RefMapNode + + // Recycle the freed slots with new, differently-tagged occupants. + Local keep.RefMapNode = New RefMapNode() + keep\tag = "occupant" + Local j% + For j = 1 To 200 + Local filler.RefMapNode = New RefMapNode() + filler\tag = "f" + j + filler = Null + Next + + // Stale release of the swept chain head: must not touch the occupants. + head = Null + Assert(keep\tag = "occupant") +End Test