Version: 1.1.0
Date: 2026-03-13
Status: Development
Goal: Transform from "visualization tool" to "research insight platform"
Add 7 new analysis features to provide actionable research insights beyond basic visualization. All features use the existing data model and open-source libraries — no external API or ML model dependencies.
| # | Feature | New File | Benchmarked From |
|---|---|---|---|
| F1 | Community Detection & Cluster Visualization | app/services_insight.py | VOSviewer (Leiden algorithm) |
| F2 | Burst Detection (Emerging Topics) | app/services_insight.py | CiteSpace (Kleinberg's burst) |
| F3 | Collaborator Recommendation | app/services_insight.py | ResearchRabbit |
| F4 | Shortest Path Analysis | app/services_insight.py | Inciteful (Literature Connector) |
| F5 | Research Gap Detection | app/services_insight.py | Inciteful + SciSpace (Structural Holes) |
| F6 | Strategic Diagram | app/services_insight.py | SciMAT + Bibliometrix |
| F7 | Thematic Evolution | app/services_insight.py | Bibliometrix (Thematic Evolution Map) |
- All 7 features are implemented in a single new module: `app/services_insight.py`
- Each feature is a standalone function that takes a SQLAlchemy `Session` and returns structured data
- UI integration in `streamlit_app.py` via a new "Insights" tab
- No new database tables required — all features compute from existing tables
- New dependency: `community` (python-louvain) for community detection
Purpose: Automatically identify research communities/clusters within the network and color-code them for visual insight.
Algorithm: Louvain community detection (NetworkX + python-louvain)
- Chosen over Leiden for compatibility (python-louvain is pip-installable, leidenalg requires C++ build)
- Modularity-based optimization, resolution parameter exposed to user
Input:
- `db: Session` — database session
- `layer: str` — "authors", "keywords", "orgs", "nations"
- `resolution: float` — resolution parameter (default 1.0, higher = more communities)
- `year_min/year_max: int | None` — optional year filter
Processing:
- Build a NetworkX `Graph` from existing edge tables (CoauthorEdge, keyword co-occurrence, OrgEdge, NationEdge)
- Apply year filter to restrict works/edges
- Run `community.best_partition(G, resolution=resolution)` to assign each node to a community
- Compute per-community statistics: size, density, top members
Output:
{
"communities": {
0: {"nodes": ["A1", "A2", ...], "size": 5, "density": 0.8, "label": "Top keyword or author"},
1: {"nodes": ["A3", "A4", ...], "size": 3, "density": 0.6, "label": "..."},
...
},
"partition": {"A1": 0, "A2": 0, "A3": 1, ...}, # node_id -> community_id
"modularity": 0.45,
"num_communities": 4
}

UI:
- Dropdown to select resolution parameter (0.5, 1.0, 1.5, 2.0)
- Graph nodes colored by community assignment
- Summary table showing community statistics
- Expandable details per community (members, internal density, keywords)
Complexity: O(n log n) for Louvain algorithm
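The per-community statistics step can be sketched as follows. `summarize_communities`, `partition`, and `edges` are illustrative names (the real function would build both from the edge tables via the session); modularity and labeling are omitted.

```python
def summarize_communities(partition, edges):
    """Fold a node -> community_id partition into the output schema above.

    partition: dict mapping node_id to community_id (as returned by
    community.best_partition); edges: list of (u, v) node-id pairs.
    """
    communities = {}
    for node, cid in partition.items():
        communities.setdefault(cid, {"nodes": []})["nodes"].append(node)

    for info in communities.values():
        members = set(info["nodes"])
        n = len(members)
        info["size"] = n
        # Internal density: intra-community edges over possible pairs
        intra = sum(1 for u, v in edges if u in members and v in members)
        possible = n * (n - 1) / 2
        info["density"] = round(intra / possible, 2) if possible else 0.0

    return {
        "communities": communities,
        "partition": partition,
        "num_communities": len(communities),
    }

partition = {"A1": 0, "A2": 0, "A3": 0, "A4": 1, "A5": 1}
edges = [("A1", "A2"), ("A2", "A3"), ("A1", "A3"), ("A4", "A5")]
result = summarize_communities(partition, edges)
```

The same shaping works unchanged whether the partition comes from `community.best_partition` or from the NetworkX `greedy_modularity_communities` fallback noted in the risks table.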
Purpose: Identify keywords/topics that have experienced sudden growth in recent years — indicating emerging research fronts.
Algorithm: Growth rate-based burst detection
- Calculate year-over-year growth rate for each keyword
- Identify keywords with sustained high growth over a configurable window
- Rank by burst score = (recent_count - baseline_count) / baseline_count
Input:
- `db: Session`
- `window_years: int` — burst detection window (default 3 years)
- `min_papers: int` — minimum total papers for a keyword to be considered (default 3)
Processing:
- Query `WorkKeyword` joined with `Work.year` to get per-keyword per-year counts
- Calculate baseline (average count in years before window) and recent (average in window)
- Compute burst_score = (recent - baseline) / max(baseline, 1)
- Filter keywords with min_papers threshold
- Rank by burst_score descending
Output:
[
{
"keyword_id": 42,
"keyword": "Federated Learning",
"burst_score": 4.5, # 450% growth
"baseline_avg": 2.0, # avg papers/year before window
"recent_avg": 11.0, # avg papers/year in window
"trend": [0, 1, 2, 3, 5, 8, 15, 20], # yearly counts
"years": [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025],
"status": "burst" # burst | growing | stable | declining
},
...
]

UI:
- Bar chart of top 15 bursting keywords with burst scores
- Sparkline mini-charts showing per-keyword trend
- Status badges (Burst / Growing / Stable / Declining)
- Configurable window_years slider
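The baseline/recent comparison above can be sketched as a small helper. The function name, the status thresholds, and the sample counts are illustrative assumptions, not the shipped implementation.

```python
def burst_score(yearly_counts, window_years=3):
    """Score a keyword's recent-window growth against its earlier baseline.

    yearly_counts: per-year paper counts, oldest to newest.
    Returns (burst_score, baseline_avg, recent_avg, status).
    """
    baseline = yearly_counts[:-window_years]
    recent = yearly_counts[-window_years:]
    baseline_avg = sum(baseline) / max(len(baseline), 1)
    recent_avg = sum(recent) / max(len(recent), 1)
    # burst_score = (recent - baseline) / max(baseline, 1), as in Processing
    score = (recent_avg - baseline_avg) / max(baseline_avg, 1)
    # Illustrative thresholds for the four statuses in the output schema
    if score >= 2.0:
        status = "burst"
    elif score >= 0.5:
        status = "growing"
    elif score > -0.5:
        status = "stable"
    else:
        status = "declining"
    return round(score, 2), round(baseline_avg, 2), round(recent_avg, 2), status

# Yearly counts for one keyword, oldest to newest (e.g. 2018-2025)
score, base, recent, status = burst_score([0, 1, 2, 3, 5, 8, 15, 20])
```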
Purpose: Suggest potential collaborators based on research interest overlap and network proximity.
Algorithm: Multi-signal scoring
- Keyword Similarity (Jaccard): |shared_keywords| / |union_keywords| between two authors
- Network Proximity: Common neighbor count in co-authorship network
- Complementarity Bonus: Reward authors with some overlap but also unique expertise
Input:
- `db: Session`
- `author_id: int` — target author for recommendations
- `top_n: int` — number of recommendations (default 10)
Processing:
- Get target author's keyword set from WorkKeyword via WorkAuthor
- For each other author NOT already a co-author:
  a. Compute Jaccard similarity of keyword sets
  b. Count common co-authorship neighbors (friends-of-friends)
  c. Compute complementarity = |their_unique_keywords| / |total_keywords|
- Score = 0.5 * jaccard + 0.3 * normalized_common_neighbors + 0.2 * complementarity
- Rank by score descending, return top_n
Output:
[
{
"author_id": 15,
"author_name": "Dr. Jane Smith",
"score": 0.72,
"jaccard_similarity": 0.65,
"common_neighbors": 3,
"common_neighbor_names": ["Alice", "Bob", "Carol"],
"shared_keywords": ["Deep Learning", "NLP"],
"unique_keywords": ["Reinforcement Learning", "Robotics"],
"path_length": 2 # shortest path distance in co-authorship network
},
...
]

UI:
- Select target author from dropdown
- Ranked table of recommendations with similarity scores
- Visual breakdown: shared keywords, common neighbors, unique expertise
- Click to highlight recommended author in graph
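A minimal sketch of the weighted score, with keyword sets and neighbor counts as stand-ins for what the real function would query from WorkKeyword and CoauthorEdge. All names and sample data are illustrative.

```python
def collaborator_score(target_kw, cand_kw, common_neighbors, max_neighbors):
    """Combine the three signals: 0.5*jaccard + 0.3*neighbors + 0.2*complementarity."""
    union = target_kw | cand_kw
    jaccard = len(target_kw & cand_kw) / len(union) if union else 0.0
    # Normalize common-neighbor count against the max seen across candidates
    norm_neighbors = common_neighbors / max(max_neighbors, 1)
    complementarity = len(cand_kw - target_kw) / len(union) if union else 0.0
    return round(0.5 * jaccard + 0.3 * norm_neighbors + 0.2 * complementarity, 3)

target = {"deep learning", "nlp", "transformers"}
candidate = {"deep learning", "nlp", "reinforcement learning"}
score = collaborator_score(target, candidate, common_neighbors=3, max_neighbors=5)
```

Here jaccard = 2/4, normalized neighbors = 3/5, complementarity = 1/4, so candidates with some overlap plus unique expertise outrank near-duplicates of the target.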
Purpose: Find the shortest collaboration path between any two researchers — enabling networking opportunity discovery.
Algorithm: NetworkX shortest_path (BFS-based for unweighted)
Input:
- `db: Session`
- `source_id: int` — starting author
- `target_id: int` — destination author
- `layer: str` — "authors" (default), extensible to other layers
Processing:
- Build NetworkX Graph from CoauthorEdge table
- Run `nx.shortest_path(G, source, target)` to find the shortest path
- If no path exists, report disconnected
- For each intermediate node, fetch author details
- For each edge in path, fetch shared publications
Output:
{
"path_exists": True,
"path_length": 3,
"path": [
{"author_id": 1, "name": "Source Author"},
{"author_id": 5, "name": "Intermediary 1", "shared_papers_with_prev": 4},
{"author_id": 8, "name": "Intermediary 2", "shared_papers_with_prev": 2},
{"author_id": 12, "name": "Target Author", "shared_papers_with_prev": 1}
],
"all_paths": [...] # up to 5 alternative paths (if available)
}

UI:
- Two dropdown selectors for source and target authors
- Visual path display with author names and edge weights
- "No path found" message if disconnected
- Alternative paths (if any)
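On an unweighted graph the path lookup reduces to breadth-first search; a minimal sketch on a plain adjacency dict (author ids as ints), standing in for the CoauthorEdge-backed NetworkX graph that `nx.shortest_path` would traverse.

```python
from collections import deque

def shortest_path(adj, source, target):
    """Return the node list of a shortest path, or None if disconnected."""
    if source == target:
        return [source]
    prev = {source: None}  # predecessor map for path reconstruction
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in prev:
                prev[nbr] = node
                if nbr == target:
                    # Walk predecessors back to the source
                    path = [target]
                    while prev[path[-1]] is not None:
                        path.append(prev[path[-1]])
                    return path[::-1]
                queue.append(nbr)
    return None  # disconnected: UI shows "No collaboration path found"

adj = {1: [5], 5: [1, 8], 8: [5, 12], 12: [8], 99: []}
path = shortest_path(adj, 1, 12)     # [1, 5, 8, 12]
no_path = shortest_path(adj, 1, 99)  # None
```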
Purpose: Identify under-explored research areas by finding "structural holes" in the keyword co-occurrence network — pairs of active keyword clusters that rarely appear together.
Algorithm: Burt's Structural Holes + Bridge Detection
- Build keyword co-occurrence network
- Detect communities (reuse F1)
- Identify cross-community edges with low weight relative to intra-community edges
- Compute bridging score for each cross-community keyword pair
Input:
- `db: Session`
- `year_min/year_max: int | None`
- `min_keyword_count: int` — minimum papers for a keyword to be included (default 3)
- `top_n: int` — number of gaps to return (default 15)
Processing:
- Build keyword co-occurrence Graph from works (filtered by year)
- Run community detection on keyword network
- For each pair of communities (C_i, C_j):
  a. Count inter-community edges and their total weight
  b. Count intra-community edges for both
  c. Compute gap_score = (intra_density_avg - inter_density) / intra_density_avg
- For each gap, identify the most representative keywords from each community
- Rank by gap_score * community_size_product
Output:
[
{
"community_a": {"id": 0, "top_keywords": ["Deep Learning", "CNN"]},
"community_b": {"id": 2, "top_keywords": ["Medical Ethics", "Policy"]},
"gap_score": 0.85,
"inter_edges": 2,
"potential_bridges": ["AI Ethics"], # keywords that weakly connect both
"suggestion": "Deep Learning + Medical Ethics: active individually but rarely combined"
},
...
]

UI:
- Table of research gaps ranked by score
- Each gap shows the two keyword clusters and their weak connection
- "Bridge keywords" that weakly span both clusters
- Expandable suggestion text
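The gap_score formula can be sketched from raw edge counts for one community pair; the density helper and the toy cluster sizes are illustrative assumptions.

```python
def gap_score(intra_edges_a, size_a, intra_edges_b, size_b, inter_edges):
    """Score a community pair: high when both clusters are internally dense
    but the cross-cluster connection is sparse (a candidate research gap)."""
    def density(edges, n):
        possible = n * (n - 1) / 2  # possible intra-cluster pairs
        return edges / possible if possible else 0.0

    intra_avg = (density(intra_edges_a, size_a) + density(intra_edges_b, size_b)) / 2
    inter = inter_edges / (size_a * size_b)  # possible cross-cluster pairs
    return round((intra_avg - inter) / intra_avg, 2) if intra_avg else 0.0

# Two dense 5-keyword clusters joined by only 2 weak cross edges
score = gap_score(intra_edges_a=8, size_a=5, intra_edges_b=7, size_b=5, inter_edges=2)
```

A score near 1.0 flags pairs like the "Deep Learning + Medical Ethics" example above; a score near 0 means the clusters are already well bridged.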
Purpose: Map research themes on a 2D strategic diagram (centrality vs density) to classify them as Motor/Niche/Emerging/Declining.
Algorithm: Callon's centrality-density analysis (SciMAT methodology)
- Centrality (X-axis): External cohesion — how strongly a keyword cluster connects to other clusters
- Density (Y-axis): Internal cohesion — how strongly keywords within a cluster are interconnected
Input:
- `db: Session`
- `year_min/year_max: int | None`
- `min_keyword_count: int` — minimum papers for inclusion (default 3)
Processing:
- Build keyword co-occurrence network (filtered by year)
- Run community detection (reuse F1 result)
- For each community C_i:
  a. Density = sum of intra-community edge weights / (|C_i| * (|C_i| - 1) / 2)
  b. Centrality = sum of inter-community edge weights (C_i to all other clusters) / (|C_i| * |all_other_nodes|)
- Normalize both to 0-1 range
- Classify quadrant:
- Q1 (high centrality, high density) = Motor themes — well-developed and central
- Q2 (low centrality, high density) = Niche themes — well-developed but peripheral
- Q3 (low centrality, low density) = Emerging or Declining themes
- Q4 (high centrality, low density) = Basic/Transversal themes — central but underdeveloped
Output:
{
"themes": [
{
"cluster_id": 0,
"label": "Deep Learning / CNN",
"top_keywords": ["Deep Learning", "CNN", "Image Recognition"],
"centrality": 0.82,
"density": 0.75,
"quadrant": "Motor",
"size": 12, # number of keywords
"total_papers": 150
},
...
],
"median_centrality": 0.5,
"median_density": 0.5
}

UI:
- Plotly scatter plot with 4 quadrants
- X-axis: Centrality, Y-axis: Density
- Bubble size = number of papers, color = quadrant
- Quadrant labels: Motor / Niche / Emerging or Declining / Basic & Transversal
- Hover tooltip showing top keywords and paper count
- Median lines dividing the quadrants
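The quadrant-classification step can be sketched directly; it assumes centrality and density are already normalized to 0-1, with the median cutoffs as defaults. The function name is illustrative.

```python
def classify_theme(centrality, density, med_c=0.5, med_d=0.5):
    """Map a theme's (centrality, density) onto Callon's four quadrants."""
    if centrality >= med_c and density >= med_d:
        return "Motor"                   # Q1: well-developed and central
    if centrality < med_c and density >= med_d:
        return "Niche"                   # Q2: well-developed but peripheral
    if centrality < med_c and density < med_d:
        return "Emerging or Declining"   # Q3: weakly developed, peripheral
    return "Basic/Transversal"           # Q4: central but underdeveloped

quadrants = [classify_theme(0.82, 0.75), classify_theme(0.2, 0.9),
             classify_theme(0.1, 0.1), classify_theme(0.9, 0.2)]
```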
Purpose: Show how research themes evolve over time — how keyword clusters form, merge, split, or disappear across time periods.
Algorithm: Temporal keyword clustering + Sankey/alluvial flow
- Divide time range into periods
- Run community detection on each period's keyword network
- Map cluster continuity across periods by keyword overlap
Input:
- `db: Session`
- `n_periods: int` — number of time slices (default 3)
- `min_keyword_count: int` — minimum papers per keyword per period (default 2)
Processing:
- Get year range from data, divide into n_periods equal intervals
- For each period:
  a. Build keyword co-occurrence network (works within that period)
  b. Run community detection
  c. Label each community by its top-2 keywords
- For consecutive periods (P_i, P_{i+1}):
  a. For each community C in P_i and community D in P_{i+1}, compute overlap = |keywords_in_C ∩ keywords_in_D| / |keywords_in_C ∪ keywords_in_D|
  b. Create flow edges where overlap > threshold (default 0.1)
  c. Flow weight = overlap * min(size_C, size_D)
- Classify evolution events:
- Continuation: One cluster maps primarily to one cluster
- Merge: Multiple clusters map to one
- Split: One cluster maps to multiple
- Emergence: Cluster with no predecessor
- Disappearance: Cluster with no successor
Output:
{
"periods": [
{"label": "2018-2020", "start": 2018, "end": 2020},
{"label": "2021-2023", "start": 2021, "end": 2023},
{"label": "2024-2026", "start": 2024, "end": 2026}
],
"nodes": [
{"id": "P0_C0", "label": "Deep Learning / CNN", "period": 0, "size": 45},
{"id": "P0_C1", "label": "NLP / RNN", "period": 0, "size": 30},
{"id": "P1_C0", "label": "Deep Learning / Transformer", "period": 1, "size": 60},
...
],
"flows": [
{"source": "P0_C0", "target": "P1_C0", "weight": 35, "overlap": 0.65},
{"source": "P0_C1", "target": "P1_C0", "weight": 20, "overlap": 0.40},
...
],
"events": [
{"type": "merge", "description": "Deep Learning/CNN + NLP/RNN merged into Deep Learning/Transformer"}
]
}

UI:
- Plotly Sankey diagram showing flows between periods
- Each column = one time period
- Node height = cluster size (paper count)
- Flow width = keyword overlap strength
- Color coding by cluster identity
- Evolution event annotations
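The overlap/flow step between two consecutive periods can be sketched as follows; cluster ids and keyword sets are illustrative, and event classification (merge/split/emergence) is omitted.

```python
def compute_flows(period_a, period_b, threshold=0.1):
    """Build Sankey flow edges between two periods' keyword clusters.

    period_a / period_b: dict of cluster_id -> set of keywords.
    Keeps pairs whose Jaccard overlap clears the threshold.
    """
    flows = []
    for cid, kw_c in period_a.items():
        for did, kw_d in period_b.items():
            union = kw_c | kw_d
            overlap = len(kw_c & kw_d) / len(union) if union else 0.0
            if overlap > threshold:
                # Flow weight = overlap * min(size_C, size_D), as in Processing
                weight = round(overlap * min(len(kw_c), len(kw_d)), 1)
                flows.append({"source": cid, "target": did,
                              "overlap": round(overlap, 2), "weight": weight})
    return flows

p0 = {"P0_C0": {"deep learning", "cnn", "images"},
      "P0_C1": {"nlp", "rnn"}}
p1 = {"P1_C0": {"deep learning", "transformer", "nlp"}}
flows = compute_flows(p0, p1)
```

Two flows converging on the same target cluster, as here, is the pattern the event classifier would label a "merge".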
Add a 5th tab to the main interface (after Reports, before How-to):
Graph | Heatmaps | Reports | Insights | How-to
+--------------------------------------------------+
| Insights |
| |
| [Analysis Type ▼] [Run Analysis] |
| |
| ┌──── Analysis Options ───────────────────────┐ |
| │ (Dynamic controls based on selected type) │ |
| └──────────────────────────────────────────────┘ |
| |
| ┌──── Results ─────────────────────────────────┐ |
| │ (Charts, tables, visualizations) │ |
| └──────────────────────────────────────────────┘ |
+--------------------------------------------------+
- Community Detection
- Emerging Topics (Burst Detection)
- Collaborator Recommendation
- Shortest Path (Networking Path)
- Research Gap Detection
- Strategic Diagram
- Thematic Evolution
- `python-louvain>=0.16` — community detection (Louvain algorithm)
- `networkx>=3.0` — graph construction, shortest path, centrality
- `plotly>=5.15.0` — Sankey diagram, scatter plot, bar charts
- `pandas>=2.0.0` — data processing
| File | Change Type | Description |
|---|---|---|
| app/services_insight.py | NEW | All 7 insight analysis functions |
| streamlit_app.py | MODIFY | Add Insights tab, import services_insight |
| requirements.txt | MODIFY | Add python-louvain |
| VERSION | MODIFY | Update to 1.1.0 |
| CHANGELOG.md | MODIFY | Add v1.1.0 entry |
| README.md | MODIFY | Update feature list and documentation links |
Each function tested with:
- Empty database (should return empty/default results)
- Single author/keyword (edge cases)
- Normal dataset (Geoffrey Hinton demo data)
- Year filtering applied
- Full workflow: Ingest data -> Run each analysis -> Verify UI renders without error
- All analyses should complete within 5 seconds for datasets up to 1,000 works
- Community detection: O(n log n)
- Shortest path: O(V + E)
- Burst detection: O(K * Y) where K=keywords, Y=years
- Strategic diagram: O(K^2) worst case
- Thematic evolution: O(P * K^2) where P=periods
| Risk | Mitigation |
|---|---|
| python-louvain not installable | Fallback to NetworkX greedy_modularity_communities |
| Large dataset performance | Apply top-N filtering before expensive computations |
| No data for analysis | Show informative empty state messages |
| Community detection returns 1 community | Show message "Network too small or uniform for community detection" |
| No path between authors | Show "No collaboration path found" with explanation |