Problem
WordIndex.removeFile tombstones the removed file's slot (id_to_path.items[doc_id] = "", src/index.zig:138), but getOrCreateDocId (src/index.zig:42-48) always appends a fresh slot:
    fn getOrCreateDocId(self: *WordIndex, path: []const u8) !u32 {
        if (self.path_to_id.get(path)) |id| return id;
        const id: u32 = @intCast(self.id_to_path.items.len); // never reuses tombstones
        try self.id_to_path.append(self.allocator, path);
        ...
    }
TrigramIndex solves the same problem with a free_ids reuse pool; WordIndex has no equivalent. In a long-lived daemon (the post-#592 world) every deleted file permanently leaks one id_to_path slot, so branch switches and large refactors ratchet the table up forever. Small per-slot cost, unbounded in principle, and inconsistent with its sibling index.
Failing Test
src/test_index.zig (fails on current release tip: expected 1, found 2):
    test "issue-XX: word index reuses doc_id slots freed by removeFile" {
        var wi = WordIndex.init(testing.allocator);
        defer wi.deinit();
        try wi.indexFile("a.zig", "const alpha = 1;\n");
        wi.removeFile("a.zig");
        try wi.indexFile("b.zig", "const beta = 2;\n");
        try testing.expectEqual(@as(usize, 1), wi.id_to_path.items.len);
    }
Expected
A doc_id freed by removeFile is available for the next getOrCreateDocId, keeping id_to_path bounded by the live file count — mirroring TrigramIndex.free_ids.
Fix
Add a free_ids: std.ArrayList(u32) pool: removeFile pushes the tombstoned id; getOrCreateDocId pops from the pool before falling back to append.
Caution, ABA with persistence: postings written by persistWordIndexToDisk reference doc_ids, so reuse is only safe because a full persist rewrites postings and the id table together from live in-memory state. Verify that the bulk-load path (#583) drops removed files' postings before their ids are recycled, and that doc_lengths for a reused id is always re-set (removeFile removes the entry via fetchRemove, src/index.zig:140). A stale on-disk posting resolving to a recycled id would silently attribute hits to the wrong file, so the failing test should grow a persist/reload round-trip assertion when this is picked up.
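A minimal sketch of the pool described above, reduced to the id bookkeeping only. DocIdTable, acquire and release are hypothetical names standing in for WordIndex, getOrCreateDocId and the relevant part of removeFile; path_to_id, postings and doc_lengths are left out. It assumes Zig 0.14 or later (std.ArrayListUnmanaged with .empty, and pop returning an optional); the real change would use whatever list type src/index.zig already uses. release swallows an allocation failure because the failing test calls removeFile without try, which suggests it cannot return an error; that is an assumption, not something read from the source.

```zig
const std = @import("std");

/// Hypothetical stand-in for WordIndex's doc_id bookkeeping.
const DocIdTable = struct {
    allocator: std.mem.Allocator,
    id_to_path: std.ArrayListUnmanaged([]const u8) = .empty,
    free_ids: std.ArrayListUnmanaged(u32) = .empty,

    fn deinit(self: *DocIdTable) void {
        self.id_to_path.deinit(self.allocator);
        self.free_ids.deinit(self.allocator);
    }

    /// Reuse a tombstoned slot when one is available, otherwise append.
    fn acquire(self: *DocIdTable, path: []const u8) !u32 {
        if (self.free_ids.pop()) |id| {
            self.id_to_path.items[id] = path;
            return id;
        }
        const id: u32 = @intCast(self.id_to_path.items.len);
        try self.id_to_path.append(self.allocator, path);
        return id;
    }

    /// Tombstone the slot and remember it for reuse.
    /// If the push fails, the slot is simply not recycled,
    /// which matches today's behaviour for every removal.
    fn release(self: *DocIdTable, id: u32) void {
        self.id_to_path.items[id] = "";
        self.free_ids.append(self.allocator, id) catch {};
    }
};
```

With this shape, an acquire/release/acquire sequence returns the same id twice and id_to_path.items.len stays at 1, which is the property the failing test checks. The sketch does not address the persistence concern: it only shows the in-memory reuse.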