1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -4,6 +4,7 @@ Full release notes with details on each version: [GitHub Releases](https://githu

## Unreleased

- Fix: an edited Office source now re-enters `--update`, and unchanged PDFs/Office files are no longer re-parsed on every run (#1649 / #1656, thanks @Ns2384-star). A modified `.docx`/`.xlsx` reused a byte-identical, deterministically-named sidecar whose content was never refreshed, so `graphify update` never re-extracted it (#1649); meanwhile every incremental run re-unzipped each Office file and re-parsed each PDF just to recount words (#1656). The `.docx`/`.xlsx` sidecar header now records a raw-byte `source-md5` fingerprint of the source — reading the bytes doesn't unzip the OOXML container, so it's cheap. A matching fingerprint returns the existing sidecar without re-parsing or rewriting it (preserving the #1226 unchanged-mtime no-churn guarantee), while a differing or legacy-missing fingerprint re-parses and rewrites so the new sidecar mtime/hash flows the edit through `detect_incremental`. Per-file word counts are also cached in the manifest and reused for any file whose mtime is unchanged, so an unchanged PDF/Office file is parsed once rather than on every run.
- Fix: a malformed semantic chunk no longer crashes `extract` and discards every successful chunk (#1631, thanks @ssazy). When an LLM returned a well-formed object whose `edges` (or `nodes`/`hyperedges`) array carried a stray non-dict entry — a nested list where an edge object belongs — the AST+semantic merge and the semantic-cache write both called `.get()` per entry and raised `AttributeError: 'list' object has no attribute 'get'`. On a 34-chunk run where 33 succeeded, that meant no `graph.json` was written and the cache write failed too, so a re-run re-extracted everything. `_parse_llm_json` now sanitizes each fragment at the single parse chokepoint (keeping only dict entries and coercing a non-list value to `[]`), so the cache writer, the adaptive-retry merge, and the CLI merge are all protected in one place.
- Fix: an unresolved bare npm import no longer aliases onto an unrelated same-named local file (#1638, thanks @EveX1). `import colors from "tailwindcss/colors"` in a `.tsx` file emitted an `imports_from` edge to the bare id `colors`, and build.py's pre-migration alias index (which registers every local file's bare stem) then remapped it onto an unrelated `backend/utils/colors.py` — a confident (`EXTRACTED`) cross-language phantom edge, and one per `.tsx` file sharing the import. In a real monorepo eight unrelated `.tsx` files all landed on a single Python module. Common package subpaths (`colors`, `utils`, `types`, `config`, `client`) collide this way constantly. The external-import fallback now namespaces its target with the `ref` prefix (the same J-4 convention used for tsconfig `extends`/`$ref` externals), so it can never collapse to a local file/symbol id; the ref-namespaced target has no node, so build drops it as an external reference — the correct outcome for a third-party import.
- Fix: `graph.json` node/edge ordering is now stable run-to-run for document/semantic corpora (#1632, thanks @umeshpsatwe). With a parallel LLM backend, `extract_corpus_parallel` merged chunk results in completion order, so which network call happened to return first reordered the nodes and edges even when the model returned identical content — churning `graph.json` between otherwise-identical runs. Chunks are now merged in deterministic submission order after the pool drains (matching the serial path); the progress callback still fires in completion order so long local runs aren't silent. Note: the semantic content the LLM extracts is itself nondeterministic run-to-run — this fix removes the pipeline's own ordering churn, not the model's variance.
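The ordering fix in the last entry can be illustrated with a minimal sketch. It assumes a thread pool and a hypothetical `extract_chunk` stand-in for the LLM call; neither name comes from the project, and this is not the actual `extract_corpus_parallel` implementation. Results are collected as they complete (so progress can still be reported), but merged by submission index only after the pool drains.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_chunk(chunk_id: int) -> dict:
    """Hypothetical stand-in for one LLM call; latency varies per call."""
    time.sleep(random.uniform(0.0, 0.02))
    return {"nodes": [f"n{chunk_id}"], "edges": [f"e{chunk_id}"]}


def merge_in_submission_order(chunk_ids, on_progress=None) -> dict:
    """Run chunks in parallel, then merge results by submission index."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Remember each future's position in the submission sequence.
        futures = {
            pool.submit(extract_chunk, cid): idx
            for idx, cid in enumerate(chunk_ids)
        }
        for fut in as_completed(futures):
            idx = futures[fut]
            results[idx] = fut.result()
            if on_progress is not None:
                on_progress(idx)  # fires in completion order

    # The pool has drained; merge in deterministic submission order.
    merged = {"nodes": [], "edges": []}
    for idx in range(len(chunk_ids)):
        merged["nodes"].extend(results[idx]["nodes"])
        merged["edges"].extend(results[idx]["edges"])
    return merged
```

Whichever call returns first, `merge_in_submission_order([3, 1, 2])` lists the nodes for chunk 3, then 1, then 2, so the merged output no longer depends on network timing.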
134 changes: 114 additions & 20 deletions graphify/detect.py
@@ -601,21 +601,38 @@ def _edge(src: str, tgt: str, relation: str) -> None:
return {"nodes": nodes, "edges": edges}


# Sidecar header records the source's raw-byte fingerprint so a later run can
# tell whether the Office file changed without re-parsing it (#1649, #1656).
# Anchor to the ` | source-md5: <fp> -->` delimiter/terminator at the END of the
# header line so a source *filename* that happens to contain a "source-md5: ..."
# substring can't be captured as the fingerprint — otherwise the real fingerprint
# would never match and the file would re-parse+rewrite (and re-queue) every run.
_SIDECAR_SOURCE_FP_RE = re.compile(r"\| source-md5:\s*([0-9a-f]+)\s*-->\s*$")


def _read_sidecar_source_fingerprint(out_path: Path) -> str | None:
"""Read the source fingerprint stored in an existing sidecar's header.

Returns ``None`` for legacy sidecars written before the fingerprint was
added, so they are treated as needing a refresh on the next run.
"""
try:
with out_path.open("r", encoding="utf-8") as fh:
first_line = fh.readline()
except OSError:
return None
m = _SIDECAR_SOURCE_FP_RE.search(first_line)
return m.group(1) if m else None


def convert_office_file(path: Path, out_dir: Path) -> Path | None:
"""Convert a .docx or .xlsx to a markdown sidecar in out_dir.

Returns the path of the converted .md file, or None if conversion failed
or the required library is not installed.
"""
ext = path.suffix.lower()
if ext == ".docx":
text = docx_to_markdown(path)
elif ext == ".xlsx":
text = xlsx_to_markdown(path)
else:
return None

if not text.strip():
if ext not in (".docx", ".xlsx"):
return None

out_dir.mkdir(parents=True, exist_ok=True)
@@ -630,13 +647,31 @@ def convert_office_file(path: Path, out_dir: Path) -> Path | None:
normalized_path = unicodedata.normalize("NFC", str(path.resolve()))
name_hash = hashlib.sha256(normalized_path.encode()).hexdigest()[:8]
out_path = out_dir / f"{path.stem}_{name_hash}.md"
# Once the hash is stable the sidecar name is deterministic; skip re-writing
# an existing sidecar so an unchanged source never churns its mtime (which
# would still flag it as changed in detect_incremental).

# Fingerprint the SOURCE by its raw bytes (md5). Reading the bytes does NOT
# unzip/parse the OOXML container, so this is cheap — that is the whole
# point. If a sidecar already exists and records the same fingerprint the
# source is unchanged: return it WITHOUT parsing (avoids the per-run
# re-parse of #1656) and WITHOUT rewriting (so an unchanged source never
# churns its mtime, which detect_incremental would otherwise flag as
# changed — #1226). If the fingerprint differs (or is missing, e.g. a legacy
# sidecar) the source was edited: re-parse and rewrite so the change flows
# through detect_incremental via the sidecar's new mtime/md5 (#1649).
source_fp = _md5_file(path)
if out_path.exists():
return out_path
if not source_fp or _read_sidecar_source_fingerprint(out_path) == source_fp:
return out_path

if ext == ".docx":
text = docx_to_markdown(path)
else:
text = xlsx_to_markdown(path)

if not text.strip():
return None

out_path.write_text(
f"<!-- converted from {path.name} -->\n\n{text}",
f"<!-- converted from {path.name} | source-md5: {source_fp} -->\n\n{text}",
encoding="utf-8",
)
return out_path
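The fingerprint check in this hunk can be tried in isolation. The sketch below is a simplified reading of the change, not the project's code: `read_fingerprint`, `refresh_sidecar`, and the `render` callback are hypothetical names, the md5 is computed with `hashlib` directly because `_md5_file` is not shown in the diff, and the "fingerprint unavailable" branch is omitted. The regular expression is copied from the diff.

```python
import hashlib
import re
from pathlib import Path
from typing import Callable, Optional

# Same pattern as in the diff: anchored to the end of the header line.
SOURCE_FP_RE = re.compile(r"\| source-md5:\s*([0-9a-f]+)\s*-->\s*$")


def read_fingerprint(sidecar: Path) -> Optional[str]:
    """Return the fingerprint in the sidecar header, or None if absent."""
    try:
        with sidecar.open("r", encoding="utf-8") as fh:
            first_line = fh.readline()
    except OSError:
        return None
    m = SOURCE_FP_RE.search(first_line)
    return m.group(1) if m else None


def refresh_sidecar(
    source: Path, sidecar: Path, render: Callable[[Path], str]
) -> bool:
    """Return True if the sidecar was (re)written, False if it was reused."""
    # Hashing raw bytes does not unzip or parse the container.
    source_fp = hashlib.md5(source.read_bytes()).hexdigest()
    if sidecar.exists() and read_fingerprint(sidecar) == source_fp:
        return False  # unchanged source: no parse, no rewrite, mtime kept
    text = render(source)  # the expensive step runs only on a mismatch
    sidecar.write_text(
        f"<!-- converted from {source.name} | source-md5: {source_fp} -->"
        f"\n\n{text}",
        encoding="utf-8",
    )
    return True
```

A legacy header such as `<!-- converted from report.docx -->` carries no fingerprint, so `read_fingerprint` returns `None` and the sidecar is rewritten once. Because the pattern is anchored to the trailing `-->`, a filename that itself contains `| source-md5: deadbeef` is skipped and only the final field is captured.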
@@ -1015,7 +1050,14 @@ def _resolves_under_root(path: Path, root: Path) -> bool:
return True


def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace: bool | None = None, extra_excludes: list[str] | None = None) -> dict:
def detect(
root: Path,
*,
follow_symlinks: bool | None = None,
google_workspace: bool | None = None,
extra_excludes: list[str] | None = None,
prior_word_counts: dict[str, tuple[float, int]] | None = None,
) -> dict:
root = root.resolve()
if follow_symlinks is None:
follow_symlinks = False
@@ -1029,6 +1071,21 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
}
total_words = 0

# Word counting parses PDFs (extract_pdf_text) and reads sidecars on every
# file; re-doing it for unchanged files on each incremental run is pure waste
# (#1656). When the caller supplies the previous run's per-file counts, reuse
# the cached count for any file whose mtime is unchanged instead of parsing.
def _count_words(fp: Path) -> int:
if prior_word_counts is not None:
cached = prior_word_counts.get(str(fp))
if cached is not None:
try:
if fp.stat().st_mtime == cached[0]:
return cached[1]
except OSError:
pass
return count_words(fp)

skipped_sensitive: list[str] = []
ignore_patterns = _load_graphifyignore(root)
ignore_cache: dict[Path, bool] = {} # shared across all _is_ignored calls in this scan
@@ -1133,7 +1190,7 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
if _is_ignored(md_path, root, ignore_patterns, _cache=ignore_cache):
continue
files[ftype].append(str(md_path))
total_words += count_words(md_path)
total_words += _count_words(md_path)
else:
skipped_sensitive.append(str(p) + " [Google Workspace export produced no readable text]")
continue
@@ -1144,14 +1201,14 @@ def detect(root: Path, *, follow_symlinks: bool | None = None, google_workspace:
if _is_ignored(md_path, root, ignore_patterns, _cache=ignore_cache):
continue
files[ftype].append(str(md_path))
total_words += count_words(md_path)
total_words += _count_words(md_path)
else:
# Conversion failed (library not installed) - skip with note
skipped_sensitive.append(str(p) + " [office conversion failed - pip install graphifyy[office]]")
continue
files[ftype].append(str(p))
if ftype != FileType.VIDEO:
total_words += count_words(p)
total_words += _count_words(p)

for ftype in files:
files[ftype].sort()
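The `_count_words` wrapper swapped in across these hunks reuses a cached count only when the file's current mtime equals the recorded one. A standalone sketch of the same lookup follows; `make_cached_counter` is a hypothetical factory used here so the example runs outside `detect()`, and the expensive counter is passed in rather than assumed.

```python
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


def make_cached_counter(
    prior: Optional[Dict[str, Tuple[float, int]]],
    count_words: Callable[[Path], int],
) -> Callable[[Path], int]:
    """Wrap count_words so unchanged files reuse a cached (mtime, count)."""

    def _count(fp: Path) -> int:
        if prior is not None:
            cached = prior.get(str(fp))
            if cached is not None:
                try:
                    # Exact comparison, as in the diff: any mtime change
                    # invalidates the cached count.
                    if fp.stat().st_mtime == cached[0]:
                        return cached[1]
                except OSError:
                    pass  # file vanished or is unreadable: fall through
        return count_words(fp)

    return _count
```

With `prior=None` (a first run) every file goes through the real counter; with a populated cache, only files whose mtime moved are parsed again.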
@@ -1323,6 +1380,9 @@ def _normalise_entry(entry):
continue

all_files = [f for file_list in files.values() for f in file_list]
# Video files are never word-counted (mirrors detect(), and avoids reading
# large media in as text).
video_set = set(files.get("video", []))
with ThreadPoolExecutor() as pool:
raw = pool.map(_stat_and_hash, all_files)
hashed: dict[str, tuple[float, str]] = {
@@ -1334,6 +1394,14 @@ def _normalise_entry(entry):
continue # file deleted between detect() and manifest write
mtime, h = hashed[f]
prev = _normalise_entry(existing.get(f, {})) or {}
# Compare the file's content hash (h) against the prior hash of the SAME
# kind so an unchanged file is recognised regardless of which pipeline
# wrote the manifest. ast/both key off the prior ast_hash (unchanged
# behaviour); a semantic-only manifest never populates ast_hash, so key
# off the prior semantic_hash there — otherwise content_unchanged would
# always be False and the word_count cache below could never be reused.
prior_hash = prev.get("semantic_hash", "") if kind == "semantic" else prev.get("ast_hash", "")
content_unchanged = h == prior_hash
entry: dict = {"mtime": mtime}
if kind in ("ast", "both"):
entry["ast_hash"] = h
@@ -1343,7 +1411,17 @@ def _normalise_entry(entry):
entry["semantic_hash"] = h
else:
# Preserve semantic_hash only when content is unchanged
entry["semantic_hash"] = prev.get("semantic_hash", "") if h == prev.get("ast_hash", "") else ""
entry["semantic_hash"] = prev.get("semantic_hash", "") if content_unchanged else ""
# Cache the word count so detect() can reuse it for unchanged files
# rather than re-parsing (esp. PDFs) on every incremental run (#1656).
# Reuse the previous count whenever the content hash is unchanged; only
# (re)parse a genuinely new or changed file.
if f in video_set:
entry["word_count"] = 0
elif content_unchanged and isinstance(prev.get("word_count"), int):
entry["word_count"] = prev["word_count"]
else:
entry["word_count"] = count_words(Path(f))
manifest[f] = entry
if root is not None:
# Persist in portable form: forward-slash relative paths. Keys outside
@@ -1386,11 +1464,27 @@ def detect_incremental(
runs. ``None`` (default) does not follow symlinked directories; callers must
opt in explicitly, and resolved targets outside the scan root are skipped.
"""
full = detect(root, follow_symlinks=follow_symlinks, google_workspace=google_workspace, extra_excludes=extra_excludes)
# Pass ``root`` so a manifest written with relative keys (post-#777) is
# re-anchored to the absolute form the rest of this function compares
# against. Legacy absolute-keyed manifests pass through unchanged.
# against. Legacy absolute-keyed manifests pass through unchanged. Loaded
# before detect() so the cached per-file word counts can be handed down and
# reused for unchanged files (avoids re-parsing every PDF each run — #1656).
manifest = load_manifest(manifest_path, root=root)
prior_word_counts: dict[str, tuple[float, int]] = {}
for key, entry in manifest.items():
if isinstance(entry, dict):
wc = entry.get("word_count")
mt = entry.get("mtime")
if isinstance(wc, int) and isinstance(mt, (int, float)):
prior_word_counts[key] = (mt, wc)

full = detect(
root,
follow_symlinks=follow_symlinks,
google_workspace=google_workspace,
extra_excludes=extra_excludes,
prior_word_counts=prior_word_counts,
)

if not manifest:
# No previous run - treat everything as new
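The manifest-to-cache step added in this hunk filters entries defensively, so a legacy or partially written manifest simply yields a smaller cache. A minimal sketch of that filter, with `prior_counts_from_manifest` as a hypothetical helper name and an invented manifest used purely for illustration:

```python
from typing import Dict, Tuple


def prior_counts_from_manifest(manifest: dict) -> Dict[str, Tuple[float, int]]:
    """Keep only entries that carry both a numeric mtime and an int count."""
    prior: Dict[str, Tuple[float, int]] = {}
    for key, entry in manifest.items():
        if not isinstance(entry, dict):
            continue  # e.g. a legacy entry that is not a dict
        wc = entry.get("word_count")
        mt = entry.get("mtime")
        if isinstance(wc, int) and isinstance(mt, (int, float)):
            prior[key] = (mt, wc)
    return prior


# Illustrative manifest: only the first entry is usable as a cache hit.
example = {
    "/repo/a.pdf": {"mtime": 1700000000.5, "word_count": 1200},
    "/repo/b.pdf": {"mtime": 1700000001.0},           # no cached count yet
    "/repo/c.md": "legacy-entry",                      # not a dict
    "/repo/d.docx": {"mtime": "n/a", "word_count": 3}, # malformed mtime
}
usable = prior_counts_from_manifest(example)
```

Entries that fail the checks are not errors; those files are counted the slow way on this run and, per the manifest change above, gain a `word_count` the next time the manifest is saved.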