index: WordIndex never reuses doc_id slots — id_to_path grows unbounded in long-lived daemons #606

@justrach

Problem

WordIndex.removeFile tombstones the removed file's slot (id_to_path.items[doc_id] = "", src/index.zig:138), but getOrCreateDocId (src/index.zig:42-48) always appends a fresh slot:

fn getOrCreateDocId(self: *WordIndex, path: []const u8) !u32 {
    if (self.path_to_id.get(path)) |id| return id;
    const id: u32 = @intCast(self.id_to_path.items.len);   // never reuses tombstones
    try self.id_to_path.append(self.allocator, path);
    ...
}

TrigramIndex solves the same problem with a free_ids reuse pool; WordIndex has no equivalent. In a long-lived daemon (the post-#592 world) every deleted file permanently leaks one id_to_path slot, so branch switches and large refactors ratchet the table up forever. The per-slot cost is small, but the growth is unbounded in principle and inconsistent with the sibling index.

Failing Test

src/test_index.zig (fails on current release tip: expected 1, found 2):

test "issue-XX: word index reuses doc_id slots freed by removeFile" {
    var wi = WordIndex.init(testing.allocator);
    defer wi.deinit();

    try wi.indexFile("a.zig", "const alpha = 1;\n");
    wi.removeFile("a.zig");
    try wi.indexFile("b.zig", "const beta = 2;\n");

    try testing.expectEqual(@as(usize, 1), wi.id_to_path.items.len);
}

Expected

A doc_id freed by removeFile is available for the next getOrCreateDocId, keeping id_to_path bounded by the live file count — mirroring TrigramIndex.free_ids.

Fix

Add a free_ids: std.ArrayList(u32) pool: removeFile pushes the tombstoned id, and getOrCreateDocId pops from the pool before appending a new slot.

Caution, ABA with persistence: postings written by persistWordIndexToDisk reference doc_ids, so reuse is only safe because a full persist rewrites the postings and the id table together from live in-memory state. Two things need verifying: that the bulk-load path (#583) drops a removed file's postings before its id is recycled, and that doc_lengths is always re-set for a reused id (removeFile already clears the old entry with fetchRemove, src/index.zig:140). A stale on-disk posting that resolves to a recycled id would silently attribute hits to the wrong file, so the failing test should grow a persist/reload round-trip assertion when this is picked up.
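A minimal sketch of the pool, for illustration only. It assumes Zig 0.14 or later (where pop() returns an optional and containers have .empty initializers). DocIdTable is a hypothetical stand-in for the id-tracking part of WordIndex, not the real struct: it leaves out postings, doc_lengths, and path ownership, and the error-handling choices (reserving map capacity up front, silently leaking a slot if the pool push fails) are assumptions rather than anything taken from src/index.zig.

```zig
const std = @import("std");

/// Hypothetical stand-in for the id bookkeeping inside WordIndex.
const DocIdTable = struct {
    allocator: std.mem.Allocator,
    id_to_path: std.ArrayListUnmanaged([]const u8) = .empty,
    path_to_id: std.StringHashMapUnmanaged(u32) = .empty,
    free_ids: std.ArrayListUnmanaged(u32) = .empty, // tombstoned ids awaiting reuse

    fn deinit(self: *DocIdTable) void {
        self.id_to_path.deinit(self.allocator);
        self.path_to_id.deinit(self.allocator);
        self.free_ids.deinit(self.allocator);
    }

    fn getOrCreateDocId(self: *DocIdTable, path: []const u8) !u32 {
        if (self.path_to_id.get(path)) |id| return id;
        // Reserve map space first so a failed allocation cannot strand a popped id.
        try self.path_to_id.ensureUnusedCapacity(self.allocator, 1);
        const id: u32 = if (self.free_ids.pop()) |reused| blk: {
            self.id_to_path.items[reused] = path; // overwrite the tombstone
            break :blk reused;
        } else blk: {
            const fresh: u32 = @intCast(self.id_to_path.items.len);
            try self.id_to_path.append(self.allocator, path);
            break :blk fresh;
        };
        self.path_to_id.putAssumeCapacity(path, id);
        return id;
    }

    fn removeFile(self: *DocIdTable, path: []const u8) void {
        const kv = self.path_to_id.fetchRemove(path) orelse return;
        self.id_to_path.items[kv.value] = ""; // tombstone, as in the issue
        // Best effort: if the push fails, the slot stays leaked (today's behaviour).
        self.free_ids.append(self.allocator, kv.value) catch {};
    }
};

test "sketch: freed doc_id is reused by the next new path" {
    var t = DocIdTable{ .allocator = std.testing.allocator };
    defer t.deinit();

    const a = try t.getOrCreateDocId("a.zig");
    t.removeFile("a.zig");
    const b = try t.getOrCreateDocId("b.zig");

    try std.testing.expectEqual(a, b);
    try std.testing.expectEqual(@as(usize, 1), t.id_to_path.items.len);
    // Anything still holding `a` now resolves to b.zig: the ABA hazard noted above.
    try std.testing.expectEqualStrings("b.zig", t.id_to_path.items[a]);
}
```

The last assertion is the reason for the persistence caution: once an id is recycled, any posting that still carries the old id resolves to the new file without error, so stale postings have to be gone before the id returns to the pool.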
