Non-deterministic community assignments across identical-corpus runs (0.9.6) #1667

Description

@krishnateja7

Second of the two 0.9.6 follow-ups (see the zero-nodes-in-batch issue filed alongside; the two may share a parallel-processing root cause).

Symptom

Two consecutive graphify update . runs on an unchanged corpus produce graph.json files that differ only in community assignments:

  • node-id set: identical
  • link multiset (all fields): identical
  • community: differs on 196 of 11,819 nodes

So the semantic graph is deterministic, but the artifact is not byte-reproducible.

It appears intermittent: one other consecutive pair of runs produced byte-identical output. That pattern is consistent with an unseeded RNG or parallel-ordering dependence in the community-detection step (Louvain/Leiden-style methods are tie-break sensitive).
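One way to narrow this down is to check whether the two runs found the same grouping under different community IDs, or genuinely different groupings. The helper below is a minimal sketch of that check. It assumes each run has already been reduced to a node-id to community mapping; the function name is hypothetical and not part of graphify.

```python
def same_partition(labels_a, labels_b):
    """Return True if two node -> community maps group nodes identically,
    ignoring which ID each community happens to carry."""
    if labels_a.keys() != labels_b.keys():
        return False
    a_to_b, b_to_a = {}, {}
    for node, comm_a in labels_a.items():
        comm_b = labels_b[node]
        # Every community in A must map to exactly one community in B,
        # and vice versa; otherwise the groupings themselves differ.
        if a_to_b.setdefault(comm_a, comm_b) != comm_b:
            return False
        if b_to_a.setdefault(comm_b, comm_a) != comm_a:
            return False
    return True
```

If this returns True for a differing pair, the nondeterminism is limited to ID numbering; if it returns False, node membership itself changed between runs.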

We never observed this on 0.9.5. There we used file-hash equality between two consecutive runs as an acceptance test for our own append pipeline, and it held.
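For reference, the acceptance test amounts to something like the sketch below. The function names are illustrative, and the choice of SHA-256 is an assumption; any stable digest would serve.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large graph.json files are not loaded whole."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_are_byte_identical(first, second):
    """True when two output files have the same content digest."""
    return sha256_of(first) == sha256_of(second)
```

In use, each graph.json is copied aside after its run and the two copies are compared.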

Ask

A fixed seed and/or deterministic tie-breaking in community detection, so identical input produces byte-identical graph.json. Reproducible output is useful for caching, CI artifact diffing, and downstream tools that append to the graph (our case).
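To illustrate what is being asked for, here is a toy sketch, not graphify's code. It uses plain label propagation rather than Louvain or Leiden, purely because it is short. The two ingredients that matter are a seeded RNG applied to a sorted node list, and a tie-break that does not depend on iteration order. All names are hypothetical.

```python
import random


def label_propagation(adjacency, seed=0, max_rounds=20):
    """Toy community detection with reproducible output.

    adjacency: dict mapping each node to an iterable of its neighbours.
    """
    rng = random.Random(seed)            # private, seeded RNG
    nodes = sorted(adjacency)            # fixed starting order
    labels = {node: node for node in nodes}
    for _ in range(max_rounds):
        order = nodes[:]
        rng.shuffle(order)               # same seed -> same visiting order
        changed = False
        for node in order:
            counts = {}
            for neighbour in adjacency[node]:
                label = labels[neighbour]
                counts[label] = counts.get(label, 0) + 1
            if not counts:
                continue
            best = max(counts.values())
            # Deterministic tie-break: smallest label among the most frequent.
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels
```

With both ingredients in place, two calls on the same input and seed return the same mapping, regardless of the order in which nodes or neighbours were inserted.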

Environment

  • graphifyy 0.9.6 (uv tool install), macOS (Darwin 25.5)
  • Rails repo: ~11,800 nodes / ~19,800 links
  • Diff method: parse both files, compare node-id sets, per-node field diff, link multiset comparison
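The diff method can be sketched roughly as follows. This assumes graph.json is a node-link document with top-level "nodes" and "links" arrays and an "id" field on each node; that layout is inferred from the fields named above and may not match the real schema exactly. The helper names are hypothetical.

```python
import json
from collections import Counter

_MISSING = object()  # distinguishes an absent field from an explicit null


def load_graph(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def freeze(value):
    """Convert nested JSON into a hashable form that ignores key order."""
    if isinstance(value, dict):
        return tuple(sorted((k, freeze(v)) for k, v in value.items()))
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


def diff_graphs(a, b):
    nodes_a = {n["id"]: n for n in a["nodes"]}
    nodes_b = {n["id"]: n for n in b["nodes"]}
    # Per-node field diff over the nodes present in both files.
    field_diffs = Counter()
    for node_id in nodes_a.keys() & nodes_b.keys():
        na, nb = nodes_a[node_id], nodes_b[node_id]
        for field in na.keys() | nb.keys():
            if na.get(field, _MISSING) != nb.get(field, _MISSING):
                field_diffs[field] += 1
    # Links are compared as multisets, so order and duplicates are handled.
    links_a = Counter(freeze(link) for link in a["links"])
    links_b = Counter(freeze(link) for link in b["links"])
    return {
        "same_node_ids": nodes_a.keys() == nodes_b.keys(),
        "same_links": links_a == links_b,
        "field_diffs": dict(field_diffs),
    }
```

Calling diff_graphs(load_graph(first), load_graph(second)) on a differing pair would be expected to report only "community" under field_diffs if the symptom matches the one described here.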
