Graphify-Labs · TPAteeq · Jul 4, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,7 @@ Full release notes with details on each version: [GitHub Releases](https://githu
 
 ## Unreleased
 
+- Fix: an edited Office source now re-enters `--update`, and unchanged PDFs/Office files are no longer re-parsed on every run (#1649 / #1656, thanks @Ns2384-star). A modified `.docx`/`.xlsx` reused a byte-identical, deterministically-named sidecar whose content was never refreshed, so `graphify update` never re-extracted it (#1649); meanwhile every incremental run re-unzipped each Office file and re-parsed each PDF just to recount words (#1656). The `.docx`/`.xlsx` sidecar header now records a raw-byte `source-md5` fingerprint of the source — reading the bytes doesn't unzip the OOXML container, so it's cheap. A matching fingerprint returns the existing sidecar without re-parsing or rewriting it (preserving the #1226 unchanged-mtime no-churn guarantee), while a differing or legacy-missing fingerprint re-parses and rewrites so the new sidecar mtime/hash flows the edit through `detect_incremental`. Per-file word counts are also cached in the manifest and reused for any file whose mtime is unchanged, so an unchanged PDF/Office file is parsed once rather than on every run.
 - Fix: a malformed semantic chunk no longer crashes `extract` and discards every successful chunk (#1631, thanks @ssazy). When an LLM returned a well-formed object whose `edges` (or `nodes`/`hyperedges`) array carried a stray non-dict entry — a nested list where an edge object belongs — the AST+semantic merge and the semantic-cache write both called `.get()` per entry and raised `AttributeError: 'list' object has no attribute 'get'`. On a 34-chunk run where 33 succeeded, that meant no `graph.json` was written and the cache write failed too, so a re-run re-extracted everything. `_parse_llm_json` now sanitizes each fragment at the single parse chokepoint (keeping only dict entries and coercing a non-list value to `[]`), so the cache writer, the adaptive-retry merge, and the CLI merge are all protected in one place.
 - Fix: an unresolved bare npm import no longer aliases onto an unrelated same-named local file (#1638, thanks @EveX1). `import colors from "tailwindcss/colors"` in a `.tsx` file emitted an `imports_from` edge to the bare id `colors`, and build.py's pre-migration alias index (which registers every local file's bare stem) then remapped it onto an unrelated `backend/utils/colors.py` — a confident (`EXTRACTED`) cross-language phantom edge, and one per `.tsx` file sharing the import. In a real monorepo eight unrelated `.tsx` files all landed on a single Python module. Common package subpaths (`colors`, `utils`, `types`, `config`, `client`) collide this way constantly. The external-import fallback now namespaces its target with the `ref` prefix (the same J-4 convention used for tsconfig `extends`/`$ref` externals), so it can never collapse to a local file/symbol id; the ref-namespaced target has no node, so build drops it as an external reference — the correct outcome for a third-party import.
 - Fix: `graph.json` node/edge ordering is now stable run-to-run for document/semantic corpora (#1632, thanks @umeshpsatwe). With a parallel LLM backend, `extract_corpus_parallel` merged chunk results in completion order, so which network call happened to return first reordered the nodes and edges even when the model returned identical content — churning `graph.json` between otherwise-identical runs. Chunks are now merged in deterministic submission order after the pool drains (matching the serial path); the progress callback still fires in completion order so long local runs aren't silent. Note: the semantic content the LLM extracts is itself nondeterministic run-to-run — this fix removes the pipeline's own ordering churn, not the model's variance.

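The fingerprint check described in the first changelog entry can be sketched in isolation. This is a minimal illustration, not the project's code: `md5_of`, `read_fingerprint`, `refresh_sidecar`, and `demo` are hypothetical names, the parser is a stub, and the sketch leaves out two behaviours of the real function (reusing an existing sidecar when the source cannot be fingerprinted, and returning `None` for empty text).

```python
import hashlib
import re
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Same end-anchored pattern as the patch: only a trailing
# "| source-md5: <hex> -->" is accepted as the fingerprint.
FP_RE = re.compile(r"\| source-md5:\s*([0-9a-f]+)\s*-->\s*$")


def md5_of(path: Path) -> str:
    # Hypothetical stand-in for the project's _md5_file helper: hashes the raw
    # bytes, so an OOXML container is never unzipped.
    return hashlib.md5(path.read_bytes()).hexdigest()


def read_fingerprint(sidecar: Path) -> Optional[str]:
    # Only the first line (the header comment) is read.
    try:
        with sidecar.open("r", encoding="utf-8") as fh:
            match = FP_RE.search(fh.readline())
    except OSError:
        return None
    return match.group(1) if match else None


def refresh_sidecar(source: Path, sidecar: Path, parse: Callable[[Path], str]) -> bool:
    """Return True when the sidecar was (re)written, False when it was reused."""
    fp = md5_of(source)
    if sidecar.exists() and read_fingerprint(sidecar) == fp:
        return False  # unchanged source: no parse, no rewrite, mtime untouched
    header = f"<!-- converted from {source.name} | source-md5: {fp} -->"
    sidecar.write_text(f"{header}\n\n{parse(source)}", encoding="utf-8")
    return True


def demo() -> list:
    parsed = []  # records every time the (expensive) parser actually runs

    def fake_parse(p: Path) -> str:
        parsed.append(p.read_bytes())
        return "body"

    with tempfile.TemporaryDirectory() as tmp:
        src, out = Path(tmp) / "report.docx", Path(tmp) / "report.md"
        src.write_bytes(b"version 1")
        first = refresh_sidecar(src, out, fake_parse)    # new sidecar
        second = refresh_sidecar(src, out, fake_parse)   # fingerprint matches
        src.write_bytes(b"version 2")
        third = refresh_sidecar(src, out, fake_parse)    # source edited
    return [first, second, third, len(parsed)]
```

In this sketch the second call neither parses nor rewrites, so the sidecar's mtime stays put. Because the pattern is anchored at the end of the header line, a source filename that itself contains `source-md5: ...` cannot be mistaken for the fingerprint.
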
diff --git a/graphify/detect.py b/graphify/detect.py
@@ -601,21 +601,38 @@ def _edge(src: str, tgt: str, relation: str) -> None:
     return {"nodes": nodes, "edges": edges}
 
 
+# Sidecar header records the source's raw-byte fingerprint so a later run can
+# tell whether the Office file changed without re-parsing it (#1649, #1656).
+# Anchor to the ` | source-md5: <fp> -->` delimiter/terminator at the END of the
+# header line so a source *filename* that happens to contain a "source-md5: ..."
+# substring can't be captured as the fingerprint — otherwise the real fingerprint
+# would never match and the file would re-parse+rewrite (and re-queue) every run.
+_SIDECAR_SOURCE_FP_RE = re.compile(r"\| source-md5:\s*([0-9a-f]+)\s*-->\s*$")
+
+
+def _read_sidecar_source_fingerprint(out_path: Path) -> str | None:
+    """Read the source fingerprint stored in an existing sidecar's header.
+
+    Returns ``None`` for legacy sidecars written before the fingerprint was
+    added, so they are treated as needing a refresh on the next run.
+    """
+    try:
+        with out_path.open("r", encoding="utf-8") as fh:
+            first_line = fh.readline()
+    except OSError:
+        return None
+    m = _SIDECAR_SOURCE_FP_RE.search(first_line)
+    return m.group(1) if m else None
+
+
 def convert_office_file(path: Path, out_dir: Path) -> Path | None:
     """Convert a .docx or .xlsx to a markdown sidecar in out_dir.
 
     Returns the path of the converted .md file, or None if conversion failed
     or the required library is not installed.
     """
     ext = path.suffix.lower()
-    if ext == ".docx":
-        text = docx_to_markdown(path)
-    elif ext == ".xlsx":
-        text = xlsx_to_markdown(path)
-    else:
-        return None
-
-    if not text.strip():
+    if ext not in (".docx", ".xlsx"):
         return None
 
     out_dir.mkdir(parents=True, exist_ok=True)
@@ -630,13 +647,31 @@ def convert_office_file(path: Path, out_dir: Path) -> Path | None:
     normalized_path = unicodedata.normalize("NFC", str(path.resolve()))
     name_hash = hashlib.sha256(normalized_path.encode()).hexdigest()[:8]
     out_path = out_dir / f"{path.stem}_{name_hash}.md"
-    # Once the hash is stable the sidecar name is deterministic; skip re-writing
-    # an existing sidecar so an unchanged source never churns its mtime (which
-    # would still flag it as changed in detect_incremental).
+
+    # Fingerprint the SOURCE by its raw bytes (md5). Reading the bytes does NOT
+    # unzip/parse the OOXML container, so this is cheap — that is the whole
+    # point. If a sidecar already exists and records the same fingerprint the
+    # source is unchanged: return it WITHOUT parsing (avoids the per-run
+    # re-parse of #1656) and WITHOUT rewriting (so an unchanged source never
+    # churns its mtime, which detect_incremental would otherwise flag as
+    # changed; see #1226). If the fingerprint differs the source was edited; if
+    # it is missing (a legacy sidecar) we cannot tell. Either way, re-parse and
+    # rewrite so the change reaches detect_incremental via the new mtime/md5 (#1649).
+    source_fp = _md5_file(path)
     if out_path.exists():
-        return out_path
+        if not source_fp or _read_sidecar_source_fingerprint(out_path) == source_fp:
+            return out_path
+
+    if ext == ".docx":
+        text = docx_to_markdown(path)
+    else:
+        text = xlsx_to_markdown(path)
+
+    if not text.strip():
+        return None
+
     out_path.write_text(
-        f"<!-- converted from {path.name} -->\n\n{text}",
+        f"<!-- converted from {path.name} | source-md5: {source_fp} -->\n\n{text}",
         encoding="utf-8",
     )
     return out_path
@@ -1015,7 +1050,14 @@ def _resolves_under_root(path: Path, root: Path) -> bool:
     return True
 
 
-def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace: bool | None = None, extra_excludes: list[str] | None = None) -> dict:
+def detect(
+    root: Path,
+    *,
+    follow_symlinks: bool | None = None,
+    google_workspace: bool | None = None,
+    extra_excludes: list[str] | None = None,
+    prior_word_counts: dict[str, tuple[float, int]] | None = None,
+) -> dict:
     root = root.resolve()
     if follow_symlinks is None:
         follow_symlinks = False
@@ -1029,6 +1071,21 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
     }
     total_words = 0
 
+    # Counting words means parsing each PDF (extract_pdf_text) or reading each
+    # sidecar; redoing that for unchanged files on every incremental run is pure
+    # waste (#1656). When the caller supplies the previous run's per-file counts,
+    # reuse the cached count for any file whose mtime is unchanged.
+    def _count_words(fp: Path) -> int:
+        if prior_word_counts is not None:
+            cached = prior_word_counts.get(str(fp))
+            if cached is not None:
+                try:
+                    if fp.stat().st_mtime == cached[0]:
+                        return cached[1]
+                except OSError:
+                    pass
+        return count_words(fp)
+
     skipped_sensitive: list[str] = []
     ignore_patterns = _load_graphifyignore(root)
     ignore_cache: dict[Path, bool] = {}  # shared across all _is_ignored calls in this scan
@@ -1133,7 +1190,7 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
                     if _is_ignored(md_path, root, ignore_patterns, _cache=ignore_cache):
                         continue
                     files[ftype].append(str(md_path))
-                    total_words += count_words(md_path)
+                    total_words += _count_words(md_path)
                 else:
                     skipped_sensitive.append(str(p) + " [Google Workspace export produced no readable text]")
                 continue
@@ -1144,14 +1201,14 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
                     if _is_ignored(md_path, root, ignore_patterns, _cache=ignore_cache):
                         continue
                     files[ftype].append(str(md_path))
-                    total_words += count_words(md_path)
+                    total_words += _count_words(md_path)
                 else:
                     # Conversion failed (library not installed) - skip with note
                     skipped_sensitive.append(str(p) + " [office conversion failed - pip install graphifyy[office]]")
                 continue
             files[ftype].append(str(p))
             if ftype != FileType.VIDEO:
-                total_words += count_words(p)
+                total_words += _count_words(p)
 
     for ftype in files:
         files[ftype].sort()
@@ -1323,6 +1380,9 @@ def _normalise_entry(entry):
             continue
 
     all_files = [f for file_list in files.values() for f in file_list]
+    # Video files are never word-counted (mirrors detect(), and avoids reading
+    # large media in as text).
+    video_set = set(files.get("video", []))
     with ThreadPoolExecutor() as pool:
         raw = pool.map(_stat_and_hash, all_files)
     hashed: dict[str, tuple[float, str]] = {
@@ -1334,6 +1394,14 @@ def _normalise_entry(entry):
             continue  # file deleted between detect() and manifest write
         mtime, h = hashed[f]
         prev = _normalise_entry(existing.get(f, {})) or {}
+        # Compare the file's content hash (h) against the prior hash of the SAME
+        # kind so an unchanged file is recognised regardless of which pipeline
+        # wrote the manifest. ast/both key off the prior ast_hash (unchanged
+        # behaviour); a semantic-only manifest never populates ast_hash, so key
+        # off the prior semantic_hash there — otherwise content_unchanged would
+        # always be False and the word_count cache below could never be reused.
+        prior_hash = prev.get("semantic_hash", "") if kind == "semantic" else prev.get("ast_hash", "")
+        content_unchanged = h == prior_hash
         entry: dict = {"mtime": mtime}
         if kind in ("ast", "both"):
             entry["ast_hash"] = h
@@ -1343,7 +1411,17 @@ def _normalise_entry(entry):
             entry["semantic_hash"] = h
         else:
             # Preserve semantic_hash only when content is unchanged
-            entry["semantic_hash"] = prev.get("semantic_hash", "") if h == prev.get("ast_hash", "") else ""
+            entry["semantic_hash"] = prev.get("semantic_hash", "") if content_unchanged else ""
+        # Cache the word count so detect() can reuse it for unchanged files
+        # rather than re-parsing (esp. PDFs) on every incremental run (#1656).
+        # Reuse the previous count whenever the content hash is unchanged; only
+        # (re)parse a genuinely new or changed file.
+        if f in video_set:
+            entry["word_count"] = 0
+        elif content_unchanged and isinstance(prev.get("word_count"), int):
+            entry["word_count"] = prev["word_count"]
+        else:
+            entry["word_count"] = count_words(Path(f))
         manifest[f] = entry
     if root is not None:
         # Persist in portable form: forward-slash relative paths. Keys outside
@@ -1386,11 +1464,27 @@ def detect_incremental(
     runs. ``None`` (default) does not follow symlinked directories; callers must
     opt in explicitly, and resolved targets outside the scan root are skipped.
     """
-    full = detect(root, follow_symlinks=follow_symlinks, google_workspace=google_workspace, extra_excludes=extra_excludes)
     # Pass ``root`` so a manifest written with relative keys (post-#777) is
     # re-anchored to the absolute form the rest of this function compares
-    # against. Legacy absolute-keyed manifests pass through unchanged.
+    # against. Legacy absolute-keyed manifests pass through unchanged. Loaded
+    # before detect() so the cached per-file word counts can be handed down and
+    # reused for unchanged files (avoids re-parsing every PDF each run — #1656).
     manifest = load_manifest(manifest_path, root=root)
+    prior_word_counts: dict[str, tuple[float, int]] = {}
+    for key, entry in manifest.items():
+        if isinstance(entry, dict):
+            wc = entry.get("word_count")
+            mt = entry.get("mtime")
+            if isinstance(wc, int) and isinstance(mt, (int, float)):
+                prior_word_counts[key] = (mt, wc)
+
+    full = detect(
+        root,
+        follow_symlinks=follow_symlinks,
+        google_workspace=google_workspace,
+        extra_excludes=extra_excludes,
+        prior_word_counts=prior_word_counts,
+    )
 
     if not manifest:
         # No previous run - treat everything as new
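The mtime-keyed word-count reuse that `detect()` gains in this patch can be tried on its own with the sketch below. It assumes a cache of the same shape as `prior_word_counts` (path string mapped to `(mtime, word_count)`). The names `make_counter`, `slow_count`, and `demo` are hypothetical, and the whitespace-splitting counter only stands in for the project's `count_words`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple

Cache = Dict[str, Tuple[float, int]]  # path -> (mtime, word_count)


def make_counter(prior: Optional[Cache], count_words: Callable[[Path], int]):
    """Wrap count_words so files with an unchanged mtime reuse the cached count."""

    def _count(fp: Path) -> int:
        if prior is not None:
            cached = prior.get(str(fp))
            if cached is not None:
                try:
                    if fp.stat().st_mtime == cached[0]:
                        return cached[1]  # cache hit: the file is not read
                except OSError:
                    pass  # file vanished or is unreadable: fall through
        return count_words(fp)

    return _count


def demo() -> list:
    parsed = []  # names of files that were actually re-read

    def slow_count(fp: Path) -> int:
        # Placeholder for an expensive parse (for example, PDF text extraction).
        parsed.append(fp.name)
        return len(fp.read_text(encoding="utf-8").split())

    with tempfile.TemporaryDirectory() as tmp:
        f = Path(tmp) / "notes.txt"
        f.write_text("one two three", encoding="utf-8")
        mtime = f.stat().st_mtime
        counter = make_counter({str(f): (mtime, 3)}, slow_count)

        hit = counter(f)                    # mtime matches: cached value
        parsed_after_hit = list(parsed)

        f.write_text("one two three four", encoding="utf-8")
        os.utime(f, (mtime + 10, mtime + 10))  # force a visibly different mtime
        miss = counter(f)                   # mtime differs: re-count

    return [hit, parsed_after_hit, miss, parsed]
```

The exact float comparison on mtime mirrors the patch: any difference at all counts as a change, so a stale count is returned only if a file is modified without its mtime moving.
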