fix(parse-knowledge-base): extract CommonMark [](page.md) links in Karpathy code path (#361) #396
Open
tirth8205 wants to merge 1 commit into
…rpathy code path (Egonex-AI#361)

The deterministic parser only extracted links via `[[wikilink]]` syntax. A Karpathy-pattern wiki (has index.md + multiple cross-linked .md files + schema) that uses CommonMark `[label](page.md)` links, which are common on GitHub/GitLab where `[[wikilinks]]` aren't rendered, was detected as karpathy but produced zero deterministic edges, leaving the graph to be inferred entirely from prose by the LLM phase.

Inside the existing Karpathy code path, also extract `[label](page.md)` links and resolve them by normalised relative path. Both `parse_index` and the per-article extraction loop now scan both link styles, so category membership and inter-article edges are recovered for mixed and pure CommonMark Karpathy wikis. Pure-wikilink wikis remain byte-for-byte equivalent (no regression).

Resolution handles `pages/x.md`, `./pages/x.md`, and `/pages/x.md` identically; query/fragment suffixes are stripped; image links, external URLs, and fenced code blocks are filtered.

Distinct from Egonex-AI#342 (still wikilink-only) and Egonex-AI#312 (separate doctrine format gated on `index.md` being absent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #361.
`/understand-knowledge`'s deterministic parser extracted links only via `[[wikilink]]` syntax. A Karpathy-pattern wiki (has `index.md` + multiple cross-linked `.md` files + schema) that uses CommonMark `[label](page.md)` links, which are common on GitHub/GitLab where `[[wikilinks]]` aren't rendered, was detected as `karpathy` but produced zero deterministic edges. The graph then had to be inferred entirely from prose by the LLM phase, producing a noisy, unreliable result instead of the real link structure.

This PR extracts `[label](page.md)` links inside the existing Karpathy code path, alongside the existing `[[ ]]` handling. Both `parse_index` (for `## Section` → article category membership) and the per-article extraction loop now scan both link styles, so category membership and inter-article edges are recovered for Karpathy wikis that use CommonMark links, pure or mixed.

Gap vs existing issues
- `[[wikilink]]` path resolution (title-case `Index.md`, repo-root prefixes): still wikilink-only, doesn't extract `[](page.md)`.
- `[](path.md)` parsing, but as a separate `doctrine` format detected only when `index.md` is absent, with explicit "Karpathy detection takes precedence." A wiki that has `index.md` and uses markdown links is therefore still routed to the Karpathy path and still yields 0 edges.

This PR covers the case neither addresses: Karpathy-detected + CommonMark links.
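To make the gap concrete, here is a minimal sketch of why a wikilink-only scan finds nothing on a pure CommonMark page. The two regexes and the `looks_internal_md` helper are hypothetical stand-ins written for this illustration; they are not copied from `parse-knowledge-base.py`, and the PR's actual patterns may differ.

```python
import re

# Hypothetical patterns for illustration only; not the PR's actual regexes.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")
# CommonMark-style [label](target); the lookbehind skips ![image](...) links.
MD_LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")


def looks_internal_md(target: str) -> bool:
    """Assumed filter: keep only relative/absolute .md paths."""
    if target.startswith("#"):  # bare anchor
        return False
    if re.match(r"[a-zA-Z][a-zA-Z0-9+.-]*:", target):  # http:, mailto:, ...
        return False
    path = re.split(r"[?#]", target, maxsplit=1)[0]
    return path.lower().endswith(".md")


page = (
    "See [Transformers](pages/transformers.md) and "
    "[Tokenizers](./pages/tokenizers.md#bpe).\n"
    "![diagram](img/attn.png)\n"
    "[paper](https://example.org/paper.md)\n"
)

# A wikilink-only scan sees no links at all on this page.
wikilinks = WIKILINK_RE.findall(page)

# Scanning for CommonMark links recovers the two internal targets.
md_targets = [
    target
    for _label, target in MD_LINK_RE.findall(page)
    if looks_internal_md(target)
]
```

On this page `wikilinks` is empty, while `md_targets` holds the two internal `.md` targets; the image and the external URL are dropped.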
What changed
In `understand-anything-plugin/skills/understand-knowledge/parse-knowledge-base.py`:
- `MD_LINK_RE`: CommonMark `[label](target)` regex with image-link (`!`) negative lookbehind.
- `is_internal_md_target`: filters out external URLs (`http://`, `mailto:`, …), bare anchors (`#section`), and non-`.md` assets.
- `extract_md_links`: strips fenced code blocks, then collects internal-md targets.
- `build_path_to_stem_map`: case-insensitive map from relative `posix-path.md` → article stem, used for path-based resolution.
- `_normalise_md_target`: normalises `pages/x.md`, `./pages/x.md`, and `/pages/x.md` to the same key; strips `#fragment` and `?query`; rejects paths that escape `wiki_root` via `..`.
- `resolve_md_link`: resolves a CommonMark target to `article:<stem>` against `path_map`, gated by the known-article-IDs set just like `resolve_wikilink`.
- `parse_index`: under each `## Section`, also collects CommonMark links into a new `md_links: list[str]` parallel to `articles: list[str]` (existing key, unchanged shape).
- `parse_wiki`: builds `md_category_lookup` (keyed by resolved `article:<id>`), threads md-link extraction through the article loop, emits both `related` and `categorized_under` edges for md-link targets, and updates stats with a new `mdLinks` counter.

Backward compatibility
- `[[wikilink]]` wikis produce identical edges, stats (modulo the new `mdLinks: 0` key), and node shapes (verified by `ParseWikiPureWikilinkRegressionTests`).
- `extract_wikilinks` is untouched.
- `cat["articles"]` remains `list[str]`; `cat["md_links"]` is additive.
- `knowledgeMeta.mdLinks` is only emitted when the article has md-links (so existing pure-wikilink manifests stay byte-for-byte equivalent in that field too).
- `md-links` segment when non-zero.

Test plan
Added `tests/skill/understand-knowledge/test_parse_knowledge_base.py`: 38 unit tests across 7 classes:
- `IsInternalMdTargetTests` (6): accepts relative, absolute, and `./` md paths, with `#anchor` or `?query`; rejects external URLs, bare anchors, non-md assets.
- `ExtractMdLinksTests` (7): basic extraction; skips image links, external URLs, fenced code blocks, anchors, non-md; preserves wikilinks untouched.
- `NormaliseMdTargetTests` (7): bare relative, `./` prefix, absolute (`/`), `../` traversal, escape rejection, query/fragment strip, lowercase normalisation.
- `ParseWikiCommonMarkOnlyTests` (6): regression for #361 (understand-knowledge: Karpathy wikis using CommonMark `[](page.md)` links yield 0 deterministic edges): a pure CommonMark Karpathy wiki produces real edges, categorisation, and category membership counts.
- `ParseWikiMixedSyntaxTests` (4): wikis using both `[[ ]]` and `[](page.md)` produce edges from both styles in both `related` and `categorized_under`.
- `ParseWikiPureWikilinkRegressionTests` (4): existing pure-wikilink wikis stay identical (no `mdLinks` key, same edge set, same category lookups).
- `ResolveMdLinkTests` (4): `resolve_md_link` returns the correct `article:<id>` for relative + absolute; `None` for unresolved or out-of-set.

End-to-end manual verification:
- `related` + 2 `categorized_under` edges (was 0 pre-fix).
- `python3 tests/skill/understand-knowledge/test_parse_knowledge_base.py -v`
- `tests/skill/understand/test_merge_batch_graphs.py` still pass.
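
For reviewers, the behaviour that `NormaliseMdTargetTests` and `ResolveMdLinkTests` exercise can be sketched roughly as below. This is a simplified stand-in, not the PR's implementation: the function names, the sample `path_map`, and the assumption that every target is resolved relative to the wiki root (rather than the linking article's directory) are all hypothetical.

```python
import posixpath
from typing import Optional


def normalise_md_target(target: str) -> Optional[str]:
    """Hypothetical sketch: reduce a link target to a lowercase lookup key."""
    # Drop "#fragment" and "?query" suffixes.
    for sep in ("#", "?"):
        target = target.split(sep, 1)[0]
    # Assumption: a leading "/" means "relative to the wiki root".
    target = target.lstrip("/")
    norm = posixpath.normpath(target)  # collapses "./" and "a/../b"
    # Reject empty targets and anything that climbs above the root.
    if norm in (".", "..") or norm.startswith("../"):
        return None
    return norm.lower()


def resolve_md_link(target, path_map, known_ids) -> Optional[str]:
    """Hypothetical sketch: map a target to article:<stem> if it is known."""
    key = normalise_md_target(target)
    if key is None:
        return None
    stem = path_map.get(key)
    if stem is None:
        return None
    article_id = f"article:{stem}"
    return article_id if article_id in known_ids else None


# Sample data invented for this example.
path_map = {"pages/x.md": "x", "pages/y.md": "y"}
known_ids = {"article:x"}

same_key = {
    normalise_md_target(t)
    for t in ("pages/x.md", "./pages/x.md", "/pages/x.md", "Pages/X.md?v=1#top")
}
resolved = resolve_md_link("./pages/x.md#intro", path_map, known_ids)
escaped = resolve_md_link("../outside.md", path_map, known_ids)
out_of_set = resolve_md_link("pages/y.md", path_map, known_ids)
```

All four spellings collapse to the single key `pages/x.md`, so `resolved` is `article:x`, while the escaping path and the article missing from `known_ids` both resolve to `None`.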