perf: use igraph C backend for betweenness centrality (~100x speedup)#371
Open
s-a-c wants to merge 1 commit intosafishamsi:mainfrom
Open
perf: use igraph C backend for betweenness centrality (~100x speedup)#371s-a-c wants to merge 1 commit intosafishamsi:mainfrom
s-a-c wants to merge 1 commit intosafishamsi:mainfrom
Conversation
Replace nx.betweenness_centrality() and nx.edge_betweenness_centrality() with tiered helpers that prefer igraph's C implementation (~30-100x faster), falling back to NetworkX sampled approximation (k=500) for large graphs, and NetworkX exact for small graphs (<5K nodes). On a real-world corpus (33K files, 167K nodes, 202K edges), the full pipeline dropped from >1h43m (never finished) to 30 seconds. * Added _fast_betweenness_centrality() helper * Added _fast_edge_betweenness_centrality() helper * Added _nx_to_igraph() NX→igraph conversion utility * Replaced both call sites in suggest_questions() and _cross_community_surprises() Co-Authored-By: Oz <oz-agent@warp.dev>
There was a problem hiding this comment.
Pull request overview
This PR optimizes graph analysis runtime by replacing slow pure-Python NetworkX betweenness centrality calls with a tiered strategy that prefers igraph’s C backend and falls back to NetworkX exact/sampled computation.
Changes:
- Added helper utilities to compute node/edge betweenness centrality via igraph when available, otherwise falling back to NetworkX (sampled for large graphs).
- Tightened file-node detection to rely on
source_file/basename matching instead of “label ends with extension”. - Improved robustness around
_src/_tgtedge metadata and stabilizedgraph_diffedge identity for undirected graphs.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+73
to
+81
| scale = (n - 1) * (n - 2) if n > 2 else 1 | ||
| if not G.is_directed(): | ||
| scale /= 2 | ||
| result = {} | ||
| for idx, eb in enumerate(raw): | ||
| edge = ig_graph.es[idx] | ||
| u = node_list[edge.source] | ||
| v = node_list[edge.target] | ||
| result[(u, v)] = eb / scale if scale else 0.0 |
Comment on lines
+42
to
+47
| # igraph returns un-normalized values; normalize like NetworkX: | ||
| # BC_norm = BC_raw / ((n-1)*(n-2)) for undirected | ||
| # BC_norm = BC_raw / ((n-1)*(n-2)/2) — but NX uses (n-1)(n-2) for both | ||
| scale = (n - 1) * (n - 2) if n > 2 else 1 | ||
| if not G.is_directed(): | ||
| scale /= 2 # igraph counts each path once; NX normalizes by (n-1)(n-2) |
Comment on lines
+56
to
+57
| k = min(_MAX_SAMPLE_K, n) | ||
| return nx.betweenness_centrality(G, k=k) |
| _DOC_EXTENSIONS = {"md", "txt", "rst"} | ||
| _PAPER_EXTENSIONS = {"pdf"} | ||
| _IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "webp", "gif", "svg"} | ||
| from graphify.detect import CODE_EXTENSIONS, DOC_EXTENSIONS, PAPER_EXTENSIONS, IMAGE_EXTENSIONS |
Comment on lines
308
to
+313
| src_id = data.get("_src", u) | ||
| if src_id not in G.nodes: | ||
| src_id = u | ||
| tgt_id = data.get("_tgt", v) | ||
| if tgt_id not in G.nodes: | ||
| tgt_id = v |
Comment on lines
+26
to
+58
| def _fast_betweenness_centrality(G: nx.Graph) -> dict[str, float]: | ||
| """Compute node betweenness centrality using the fastest available backend. | ||
|
|
||
| Priority: | ||
| 1. igraph (C implementation) — exact, ~100x faster than pure-Python NX. | ||
| 2. NetworkX with sampled approximation (k source nodes) for large graphs. | ||
| 3. NetworkX exact for small graphs. | ||
| """ | ||
| n = G.number_of_nodes() | ||
| if n == 0: | ||
| return {} | ||
|
|
||
| # --- igraph path (preferred) ------------------------------------------- | ||
| try: | ||
| ig_graph, node_list = _nx_to_igraph(G) | ||
| raw = ig_graph.betweenness(directed=G.is_directed()) | ||
| # igraph returns un-normalized values; normalize like NetworkX: | ||
| # BC_norm = BC_raw / ((n-1)*(n-2)) for undirected | ||
| # BC_norm = BC_raw / ((n-1)*(n-2)/2) — but NX uses (n-1)(n-2) for both | ||
| scale = (n - 1) * (n - 2) if n > 2 else 1 | ||
| if not G.is_directed(): | ||
| scale /= 2 # igraph counts each path once; NX normalizes by (n-1)(n-2) | ||
| return {node_list[i]: (raw[i] / scale if scale else 0.0) for i in range(n)} | ||
| except Exception: | ||
| pass # igraph not installed or conversion error — fall back | ||
|
|
||
| # --- NetworkX path (fallback) ------------------------------------------ | ||
| if n <= _SAMPLE_THRESHOLD: | ||
| return nx.betweenness_centrality(G) | ||
|
|
||
| k = min(_MAX_SAMPLE_K, n) | ||
| return nx.betweenness_centrality(G, k=k) | ||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
nx.betweenness_centrality(G)inanalyze.pyuses NetworkX's pure-Python O(V×E) implementation, which becomes a hard bottleneck on large codebases. On a 33K-file corpus (167K nodes, 202K edges),graphify updateran for 1 hour 43 minutes and never completed — stuck in the BFS inner loop of betweenness centrality.A second call site,
nx.edge_betweenness_centrality(G)in_cross_community_surprises(), has the same issue.Solution
Replace both bare NetworkX calls with tiered helper functions:
python-igraphas an optional dependency.k=500sampled source nodes for graphs >5K nodes (O(k×E) instead of O(V×E)).The helpers (
_fast_betweenness_centrality,_fast_edge_betweenness_centrality) are drop-in replacements — same return types, same normalization.Benchmark Results
Real-world validation
Changes
_nx_to_igraph()— NX→igraph graph conversion utility_fast_betweenness_centrality()— tiered node betweenness_fast_edge_betweenness_centrality()— tiered edge betweennesssuggest_questions()_cross_community_surprises()Notes
python-igraphis an optional dependency — if not installed, the helpers gracefully degrade to the NetworkX fallback pathWarp conversation
Co-Authored-By: Oz oz-agent@warp.dev