Optimization Proposal: Consolidating Tree-sitter Dependencies and Localizing Document Parsing #118

1353604736 · 2026-04-09T05:27:05Z

I've been diving deep into Graphify's architecture, particularly the Semantic Extraction pipeline in skill.md and the multi-language support in extract.py. The use of parallel subagents for vision and citation mining is an impressive way to bridge implementation and design.

Based on the current implementation, I’d like to propose two optimizations to improve maintainability and performance:

Consolidate 13+ Grammars into tree-sitter-language-pack
Currently, pyproject.toml declares 13 separate tree-sitter-* language packages.

The Issue: Maintaining version parity and installation hooks for a dozen individual dependencies adds significant overhead for contributors and users.
The Proposal: Switch to tree-sitter-language-pack.
The Benefit: It bundles the vast majority of active grammars into a single dependency. This would simplify pyproject.toml, reduce install-time friction, and instantly expand Graphify's "Code Path" to support dozens of additional languages (Zig, Elixir, Julia, etc.) that are currently missing from the explicit dependency list.
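To make the proposal concrete, here is a minimal sketch of what a single-dependency lookup could look like. The `EXT_TO_LANGUAGE` table and both function names are hypothetical and not taken from extract.py; the sketch also assumes that `tree_sitter_language_pack.get_parser` accepts these language names, which should be checked against the installed version.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension table; the real mapping in extract.py may differ.
EXT_TO_LANGUAGE = {
    ".py": "python",
    ".rs": "rust",
    ".go": "go",
    ".zig": "zig",
    ".ex": "elixir",
    ".jl": "julia",
}


def language_for_path(path: str) -> Optional[str]:
    """Return the grammar name for a file, or None if the suffix is unknown."""
    return EXT_TO_LANGUAGE.get(Path(path).suffix.lower())


def parse_source(path: str, source: bytes):
    """Parse source bytes with the bundled grammar matching the file suffix.

    Returns None for unsupported suffixes so the caller can skip the file.
    """
    language = language_for_path(path)
    if language is None:
        return None
    # Imported lazily so the lookup above works without the package installed.
    # Assumption: get_parser(name) returns a ready-to-use tree_sitter.Parser.
    from tree_sitter_language_pack import get_parser

    return get_parser(language).parse(source)
```

With this shape, supporting another language would mostly mean adding one row to the table, although any per-language node handling in the extraction code would still need to be written.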

Local "Fast-Path" for Semantic Extraction via kreuzberg
The current pipeline relies heavily on Claude/Vision subagents for Part B (PDFs, Images, and Docs).

The Issue: Sending every non-code file to an LLM for initial text/concept extraction is token-intensive and adds significant latency (~45s per batch).
The Proposal: Integrate kreuzberg as a local preprocessing layer in the semantic pipeline.
The Benefit:
- Offline OCR/Parsing: kreuzberg can handle the heavy lifting of OCR and PDF structured text extraction locally.
- Filtered Context: Instead of sending raw, noisy PDF/Docx content to subagents, we can send cleaned, structured text. This reduces the token load on Claude and allows subagents to focus on high-level Relationship Inference rather than basic Text Recognition.
- Performance: It could potentially move many "Doc/Paper" files from the slow "Semantic Path" to a much faster local processing path.
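A rough sketch of how such a preprocessing layer might route files is below. The suffix set, the `min_chars` threshold, the route labels, and all function names are hypothetical. The sketch also assumes kreuzberg exposes `extract_file_sync` returning a result with a `.content` string, which should be verified against the version in use.

```python
from pathlib import Path
from typing import Callable

# Hypothetical set of document types worth trying locally first.
LOCAL_DOC_SUFFIXES = {".pdf", ".docx", ".md", ".txt"}
# Hypothetical threshold: below this, local extraction is treated as failed.
MIN_CHARS = 200


def kreuzberg_extract(path: str) -> str:
    """Extract text locally. Assumes kreuzberg's extract_file_sync API."""
    from kreuzberg import extract_file_sync  # lazy import; optional dependency

    return extract_file_sync(path).content


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def preprocess(
    path: str,
    extract: Callable[[str], str] = kreuzberg_extract,
    min_chars: int = MIN_CHARS,
) -> dict:
    """Decide whether a subagent gets cleaned text or the raw file."""
    fallback = {"route": "subagent_raw", "text": None}
    if Path(path).suffix.lower() not in LOCAL_DOC_SUFFIXES:
        return fallback
    try:
        text = clean_text(extract(path))
    except Exception:
        # Any local failure falls back to the existing subagent path.
        return fallback
    if len(text) < min_chars:
        return fallback
    return {"route": "local_text", "text": text}
```

Falling back to the existing path on any failure or near-empty result keeps the change low-risk: scanned or image-heavy files that local extraction handles poorly would still reach the vision subagents unchanged.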

Summary:

tree-sitter-language-pack simplifies the Developer Experience (DX).
kreuzberg reduces the Operational Cost (Tokens/Time) of the semantic pipeline.

I'm curious to know if you've considered a unified language pack before, or if the current manual dependency management was a conscious choice for build size optimization.

Replies: 0 comments