Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions docs/notes/exit-teardown-crash.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Working note — exit-time teardown crash (the rcce2 ItemsTest / chain-test CI flake)

Branch: `fix/exit-teardown-crash`
Type: Bug / Reliability

## Goal (restated)

rcce2 CI has a long-standing intermittent crash: `ItemsTest.bb` dies with
"Stack overflow!" and `OnlinePlayerChainTest.bb` with "Memory access violation"
*after all tests pass*, sometimes with `Active objects: 0` already printed.
Two rcce2-side mitigations (PR #313 start-of-test pool sweeps, PR #322 trailing
`Delete Each` teardown sweeps) reduced the rate from ~30-40% to ~5-7% but the
root cause was deferred to "the BlitzCC runtime pool-walker fix" and never
diagnosed. The historical "Stack overflow!" label proves nothing — until the
seTranslator fall-through fix, *every* native fault carried that label; the
real fault class was never observed.

Goal: identify the mechanism with evidence (fault address, deterministic
repro), fix it minimally in the runtime, and pin it with regression tests.

## Suspected mechanisms (to confirm or kill, in order)

1. **Dual-refcount desync (GC layer vs classic pool)** — `_bbObjDeleteEach` /
`_bbObjDelete` (`bbruntime/basic.cpp:288-323`) maintain `BBObj::ref_cnt`
and the used/free pool lists, but never touch the GC `reference_map`.
A `Delete Each T` sweep (exactly what the rcce2 band-aids added) frees /
recycles objects whose pointers remain counted in `reference_map`. When a
GC-tracked reference later releases (`_bbRelease(vPtr,"BBCustom")`,
basic.cpp:769-815), count hits 0 and `_bbObjDelete` runs **on freed or
recycled memory** — releasing garbage STR/OBJ/VEC "fields" → heap
corruption → delayed AV at exit (e.g. `basic_destroy`'s `usedStrs` walk),
counters already 0. Intermittency = whether the slot was recycled and what
now occupies it.
2. **`_bbObjDeleteEach` captured-next invalidation** — the walk captures
`next=obj->next` before `_bbObjDelete(obj)`; the delete's field-release
cascade can drop `next`'s ref_cnt to 0 and move it to the free list
(chain shape: each node holds the only ref to the next — exactly the
chain tests). The walk then continues through free-list linkage.
3. Genuine recursion depth in a release cascade (considered least likely:
`_bbObjDelete`'s field loop is iterative; only nested object graphs
recurse one level per edge).

## Approach

1. **Instrument first**: extend `seTranslator` (bbruntime_dll.cpp) to append
the exception code, faulting instruction address, and module-relative
offset (module name + base via `VirtualQuery`/`GetModuleFileName`) to the
panic string. Permanent diagnostic upgrade; the `-t` path prints it to
stdout where CI logs capture it.
2. **Deterministic probes** (uncommitted scratch .bb files first): (a) chain
`Delete Each` survivor/ordering probe, no-GC; (b) GC-mode probe: build
chain, `Delete Each`, force recycling via new allocations, then let
GC-tracked references release. If a probe crashes or misbehaves
deterministically, the mechanism is pinned without statistics.
3. **Statistical baseline**: loop rcce2's `test.bat ItemsTest` and
`test.bat OnlinePlayerChainTest` (>=100 runs each) against the
instrumented runtime; record failure rate + fault offsets; map offsets to
functions via the linker map / dumpbin.
4. **Fix minimally** at the confirmed site. Candidate shapes: erase the
pointer from `reference_map` inside `_bbObjDelete` (closing the desync);
restart-walk in `_bbObjDeleteEach` (mirrors rcce2's own
"Restart-on-Delete" iterator doctrine). Only what evidence supports.
5. **Pin**: BlitzForge `tests/` regression (long-chain create/Delete Each/
exit, both GC and no-GC shapes); rerun the statistical loop (>=200 clean).
6. **Land**: BlitzForge PR → merge → rcce2 submodule-bump PR → update rcce2
CLAUDE.md flake note.

## Fallback floor (if root cause resists one iteration)

The instrumentation, measured baseline, probe results, and this note merge
anyway — converting an opaque flake into an addressable bug with evidence.

## Files touched (planned)

- `src/blitzrc/bbruntime_dll/bbruntime_dll.cpp` — seTranslator diagnostics.
- `src/blitzrc/bbruntime/basic.cpp` — the fix (site TBD by evidence).
- `tests/` — regression test(s).
- this note.

## Acceptance criteria

- Panic messages include exception code + module-relative fault address.
- Root cause identified from evidence (address/probe), not the symptom label.
- Baseline failure rate measured pre-fix; >=200 consecutive clean runs of both
flaky rcce2 tests post-fix.
- BlitzForge suite green; full rcce2 compile + suite green on the bump.
- rcce2 CLAUDE.md "Known intermittent flake" updated.
28 changes: 25 additions & 3 deletions src/blitzrc/bbruntime/basic.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -308,17 +308,39 @@ void _bbObjDelete( BBObj *obj ){
handle_map.erase( it->second );
object_map.erase( it );
}
// GC desync guard: Delete / Delete Each reach here without going through
// the GC layer, leaving the pointer counted in reference_map. A later
// _bbRelease on that stale entry would re-run _bbObjDelete against a
// freed -- possibly recycled -- slot and release garbage "fields"
// (heap corruption surfacing as the intermittent exit-time crash).
// Erasing here makes the stale release a no-op; the GC's own delete
// path erases before calling us, so this is a no-op double-erase there.
reference_map.erase( reinterpret_cast<int>(obj) );
obj->fields=0;
_bbObjRelease( obj );
--objCnt;
}

void _bbObjDeleteEach( BBObjType *type ){
// Restart-on-delete walk. The field-release cascade inside _bbObjDelete
// can drop another node's ref count to zero and move it from `used` to
// `free` -- including the would-be captured `next` (chain shape: a
// zombie kept alive only by its predecessor's field). A captured-next
// walk that steps onto a node relocated to the free list terminates at
// the free sentinel, silently leaving the rest of the list undeleted
// (deterministic repro: tests/DeleteEachChainTest.bb), or wanders
// invalid linkage. Re-reading used.next after every delete only ever
// observes live list state; zombies (fields==0) are stepped over
// in-place, and each delete removes at least one fielded node from
// `used`, so the walk terminates.
BBObj *obj=type->used.next;
while( obj->type ){
BBObj *next=obj->next;
if( obj->fields ) _bbObjDelete( obj );
obj=next;
if( obj->fields ){
_bbObjDelete( obj );
obj=type->used.next;
}else{
obj=obj->next;
}
}
}

Expand Down
42 changes: 42 additions & 0 deletions src/blitzrc/bbruntime_dll/bbruntime_dll.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,27 @@ static void killer(){
}
#endif

// Describe a code/data address as "module+0xOFFSET" when it falls inside a
// loaded module, else as a raw pointer. Module-relative offsets stay stable
// across ASLR runs, so an intermittent fault becomes attributable to a
// specific function via the linker map even when it only fires 1-in-20 runs.
static string describeAddress( void *addr ){
char buf[160];
MEMORY_BASIC_INFORMATION mbi;
if( addr && VirtualQuery( addr,&mbi,sizeof(mbi) ) && mbi.AllocationBase ){
char path[MAX_PATH];
DWORD n=GetModuleFileName( (HMODULE)mbi.AllocationBase,path,MAX_PATH );
if( n ){
const char *base=path;
for( const char *p=path;*p;++p ) if( *p=='\\'||*p=='/' ) base=p+1;
sprintf( buf,"%s+0x%X",base,(unsigned int)((char*)addr-(char*)mbi.AllocationBase) );
return buf;
}
}
sprintf( buf,"0x%p",addr );
return buf;
}

static void _cdecl seTranslator( unsigned int u,EXCEPTION_POINTERS* pExp ){

string panicStr = "Unknown runtime exception";
Expand All @@ -70,6 +91,27 @@ static void _cdecl seTranslator( unsigned int u,EXCEPTION_POINTERS* pExp ){
break;
}

// Append fault diagnostics: exception code, faulting instruction
// (module-relative), and for access violations the access kind and
// target address. Without this every intermittent native crash is an
// unattributable one-liner; with it the offset maps to a function via
// the linker map. Stack-overflow handling stays minimal on purpose --
// describeAddress allocates stack, and the guard page is already blown.
if( pExp && pExp->ExceptionRecord && u!=EXCEPTION_STACK_OVERFLOW ){
EXCEPTION_RECORD *er=pExp->ExceptionRecord;
char info[64];
sprintf( info," [code 0x%08X at ",(unsigned int)er->ExceptionCode );
panicStr+=info;
panicStr+=describeAddress( er->ExceptionAddress );
if( u==EXCEPTION_ACCESS_VIOLATION && er->NumberParameters>=2 ){
const char *kind=er->ExceptionInformation[0]==0 ? "reading" :
(er->ExceptionInformation[0]==1 ? "writing" : "executing");
sprintf( info,", %s 0x%08X",kind,(unsigned int)er->ExceptionInformation[1] );
panicStr+=info;
}
panicStr+="]";
}

bbruntime_panic( panicStr.c_str() );
}

Expand Down
78 changes: 78 additions & 0 deletions tests/DeleteEachChainTest.bb
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
Strict

// Regression: _bbObjDeleteEach captured-next invalidation.
//
// `Delete Each T` used to capture `next = obj->next` before deleting obj.
// The field-release cascade inside _bbObjDelete can drop another node's
// ref count to zero and move it from the type's `used` list to its `free`
// list -- including that captured `next`. The classic trigger is a chain
// where a zombie (explicitly Deleted, fields gone) is kept alive only by
// its predecessor's field: deleting the predecessor releases the zombie
// to the free list, the walk steps onto it, follows free-list linkage to
// the free sentinel (type == 0), and terminates -- silently leaving the
// rest of the list undeleted. Before the fix the zombie-chain test below
// left 499 of 1000 nodes alive.
//
// This is the runtime-side root of the long-standing rcce2 CI flake
// (ItemsTest / chain-test exit crashes): early-terminated or free-list-
// wandering sweeps. Fixed by a restart-on-delete walk in _bbObjDeleteEach.

Type ChainNode
Field nxt.ChainNode
End Type

Function liveChainNodes%()
Local n% = 0
Local s.ChainNode = Null
For s = Each ChainNode
n = n + 1
Next
Return n
End Function

Function buildChain.ChainNode(count%)
Local head.ChainNode = New ChainNode()
Local cur.ChainNode = head
Local i%
For i = 2 To count
cur\nxt = New ChainNode()
cur = cur\nxt
Next
Return head
End Function

// Plain forward chain: every node must be swept.
Test deleteEachFullChain()
Local head.ChainNode = buildChain(1000)
head = Null
Delete Each ChainNode
Assert(liveChainNodes() = 0)
End Test

// Zombie-interleaved chain: Delete every other node first (each zombie's
// memory survives only through its predecessor's field), then sweep. The
// pre-fix walk terminated at the first zombie released mid-sweep.
Test deleteEachZombieChain()
Local head.ChainNode = buildChain(1000)

Local z.ChainNode = head\nxt
While z <> Null
Local znext.ChainNode = Null
If z\nxt <> Null Then znext = z\nxt\nxt Else znext = Null
Delete z
z = znext
Wend

head = Null
Delete Each ChainNode
Assert(liveChainNodes() = 0)
End Test

// Sweep twice: the second sweep must see an empty list and no stale state.
Test deleteEachIdempotent()
Local head.ChainNode = buildChain(50)
head = Null
Delete Each ChainNode
Delete Each ChainNode
Assert(liveChainNodes() = 0)
End Test
65 changes: 65 additions & 0 deletions tests/DeleteEachGCRefMapTest.bb
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
Strict
EnableGC

// Regression: dual-refcount desync between the classic object pool and the
// GC reference_map.
//
// Delete / Delete Each free pool-side objects directly (_bbObjDelete) and
// used to leave the pointer still counted in the GC reference_map. A later
// _bbRelease on that stale entry hit count==0 and re-ran _bbObjDelete
// against a freed -- possibly recycled -- slot, releasing garbage "fields"
// (heap corruption surfacing as the intermittent exit-time access
// violation in downstream rcce2 CI: ItemsTest / OnlinePlayerChainTest).
// _bbObjDelete now erases the pointer from reference_map, so RefCount on a
// pool-deleted object must read 0 and the eventual scope-end release must
// be a no-op.

Type RefMapNode
Field nxt.RefMapNode
Field tag$
End Type

Test gcRefMapClearedByDeleteEach()
Local n.RefMapNode = New RefMapNode()
n\tag = "tracked"
Assert(RefCount(n) >= 1)
Delete Each RefMapNode
// Pre-fix this read the stale reference_map count (>= 1).
Assert(RefCount(n) = 0)
End Test

Test gcRefMapClearedByExplicitDelete()
Local n.RefMapNode = New RefMapNode()
Assert(RefCount(n) >= 1)
Delete n
Assert(RefCount(n) = 0)
End Test

// Sweep + recycle + release: the historical crash shape. With the erase in
// place the stale releases are no-ops and the recycled occupants survive.
Test gcStaleReleaseAfterRecycleIsBenign()
Local head.RefMapNode = New RefMapNode()
Local cur.RefMapNode = head
Local i%
For i = 1 To 200
cur\nxt = New RefMapNode()
cur\nxt\tag = "n" + i
cur = cur\nxt
Next
cur = Null
Delete Each RefMapNode

// Recycle the freed slots with new, differently-tagged occupants.
Local keep.RefMapNode = New RefMapNode()
keep\tag = "occupant"
Local j%
For j = 1 To 200
Local filler.RefMapNode = New RefMapNode()
filler\tag = "f" + j
filler = Null
Next

// Stale release of the swept chain head: must not touch the occupants.
head = Null
Assert(keep\tag = "occupant")
End Test
Loading