
Relatenta v1.1.0 — Research Insight Enhancement Development Specification

Version: 1.1.0
Date: 2026-03-13
Status: Development
Goal: Transform from "visualization tool" to "research insight platform"


1. Overview

1.1 Objective

Add seven new analysis features that provide actionable research insights beyond basic visualization. All features use the existing data model and open-source libraries; no external API or ML model dependencies are introduced.

1.2 Feature List

| # | Feature | New File | Benchmarked From |
|----|---------|----------|------------------|
| F1 | Community Detection & Cluster Visualization | app/services_insight.py | VOSviewer (Leiden algorithm) |
| F2 | Burst Detection (Emerging Topics) | app/services_insight.py | CiteSpace (Kleinberg's burst) |
| F3 | Collaborator Recommendation | app/services_insight.py | ResearchRabbit |
| F4 | Shortest Path Analysis | app/services_insight.py | Inciteful (Literature Connector) |
| F5 | Research Gap Detection | app/services_insight.py | Inciteful + SciSpace (Structural Holes) |
| F6 | Strategic Diagram | app/services_insight.py | SciMAT + Bibliometrix |
| F7 | Thematic Evolution | app/services_insight.py | Bibliometrix (Thematic Evolution Map) |

1.3 Architecture Decision

  • All 7 features are implemented in a single new module: app/services_insight.py
  • Each feature is a standalone function that takes a SQLAlchemy Session and returns structured data
  • UI integration in streamlit_app.py via a new "Insights" tab
  • No new database tables required — all features compute from existing tables
  • New dependency: community (python-louvain) for community detection
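
For illustration, a minimal sketch of that per-feature convention (the function name and parameter list mirror F1 below, but are not final):

```python
# Illustrative shape of a feature function in app/services_insight.py;
# the name and parameters mirror F1 below but are assumptions, not final.
from sqlalchemy.orm import Session

def detect_communities(
    db: Session,
    layer: str = "authors",
    resolution: float = 1.0,
    year_min: int | None = None,
    year_max: int | None = None,
) -> dict:
    """Take a Session plus feature-specific parameters; return plain
    dicts/lists that the Streamlit layer renders without further DB access."""
    ...
```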

2. Detailed Feature Specifications

F1: Community Detection & Cluster Visualization

Purpose: Automatically identify research communities/clusters within the network and color-code them for visual insight.

Algorithm: Louvain community detection (NetworkX + python-louvain)

  • Chosen over Leiden for compatibility (python-louvain is pip-installable; leidenalg requires a C++ build)
  • Modularity-based optimization, resolution parameter exposed to user

Input:

  • db: Session — database session
  • layer: str — "authors", "keywords", "orgs", "nations"
  • resolution: float — resolution parameter (default 1.0, higher = more communities)
  • year_min/year_max: int | None — optional year filter

Processing:

  1. Build NetworkX Graph from existing edge tables (CoauthorEdge, keyword co-occurrence, OrgEdge, NationEdge)
  2. Apply year filter to restrict works/edges
  3. Run community.best_partition(G, resolution=resolution) to assign each node to a community
  4. Compute per-community statistics: size, density, top members
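
A minimal sketch of steps 3-4, assuming the graph G from steps 1-2 has already been built (edge loading from the DB is omitted):

```python
# Sketch of steps 3-4; G is the already-built NetworkX graph.
import networkx as nx
import community as community_louvain  # pip package: python-louvain

def louvain_summary(G: nx.Graph, resolution: float = 1.0) -> dict:
    # Step 3: assign each node to a community
    partition = community_louvain.best_partition(G, resolution=resolution)

    # Step 4: per-community statistics
    communities: dict[int, dict] = {}
    for node, cid in partition.items():
        communities.setdefault(cid, {"nodes": []})["nodes"].append(node)
    for info in communities.values():
        sub = G.subgraph(info["nodes"])
        info["size"] = sub.number_of_nodes()
        info["density"] = round(nx.density(sub), 3)

    return {
        "communities": communities,
        "partition": partition,
        "modularity": community_louvain.modularity(partition, G),
        "num_communities": len(communities),
    }
```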

Output:

{
    "communities": {
        0: {"nodes": ["A1", "A2", ...], "size": 5, "density": 0.8, "label": "Top keyword or author"},
        1: {"nodes": ["A3", "A4", ...], "size": 3, "density": 0.6, "label": "..."},
        ...
    },
    "partition": {"A1": 0, "A2": 0, "A3": 1, ...},  # node_id -> community_id
    "modularity": 0.45,
    "num_communities": 4
}

UI:

  • Dropdown to select resolution parameter (0.5, 1.0, 1.5, 2.0)
  • Graph nodes colored by community assignment
  • Summary table showing community statistics
  • Expandable details per community (members, internal density, keywords)

Complexity: O(n log n) for the Louvain algorithm


F2: Burst Detection (Emerging Topics)

Purpose: Identify keywords/topics that have experienced sudden growth in recent years — indicating emerging research fronts.

Algorithm: Growth rate-based burst detection

  • Calculate year-over-year growth rate for each keyword
  • Identify keywords with sustained high growth over a configurable window
  • Rank by burst score = (recent_count - baseline_count) / baseline_count

Input:

  • db: Session
  • window_years: int — burst detection window (default 3 years)
  • min_papers: int — minimum total papers for a keyword to be considered (default 3)

Processing:

  1. Query WorkKeyword joined with Work.year to get per-keyword per-year counts
  2. Calculate baseline (average count in years before window) and recent (average in window)
  3. Compute burst_score = (recent - baseline) / max(baseline, 1)
  4. Filter keywords with min_papers threshold
  5. Rank by burst_score descending
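
A minimal sketch of steps 2-5, assuming step 1 has produced a counts mapping of keyword to per-year paper counts (with gap years zero-filled); all names are illustrative:

```python
# Sketch of steps 2-5; counts maps keyword -> {year: paper_count}.
def burst_scores(counts: dict[str, dict[int, int]],
                 window_years: int = 3, min_papers: int = 3) -> list[dict]:
    results = []
    for kw, per_year in counts.items():
        if sum(per_year.values()) < min_papers:
            continue  # step 4: too few papers overall
        years = sorted(per_year)
        recent = years[-window_years:]      # the burst window
        baseline = years[:-window_years]    # everything before it
        recent_avg = sum(per_year[y] for y in recent) / len(recent)
        baseline_avg = (sum(per_year[y] for y in baseline) / len(baseline)
                        if baseline else 0.0)
        score = (recent_avg - baseline_avg) / max(baseline_avg, 1)  # step 3
        results.append({"keyword": kw, "burst_score": round(score, 2),
                        "baseline_avg": baseline_avg, "recent_avg": recent_avg})
    # Step 5: rank by burst_score descending
    return sorted(results, key=lambda r: r["burst_score"], reverse=True)
```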

Output:

[
    {
        "keyword_id": 42,
        "keyword": "Federated Learning",
        "burst_score": 4.5,       # 450% growth
        "baseline_avg": 2.0,      # avg papers/year before window
        "recent_avg": 11.0,       # avg papers/year in window
        "trend": [0, 1, 2, 3, 5, 8, 15, 20],  # yearly counts
        "years": [2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025],
        "status": "burst"         # burst | growing | stable | declining
    },
    ...
]

UI:

  • Bar chart of top 15 bursting keywords with burst scores
  • Sparkline mini-charts showing per-keyword trend
  • Status badges (Burst / Growing / Stable / Declining)
  • Configurable window_years slider

F3: Collaborator Recommendation

Purpose: Suggest potential collaborators based on research interest overlap and network proximity.

Algorithm: Multi-signal scoring

  1. Keyword Similarity (Jaccard): |shared_keywords| / |union_keywords| between two authors
  2. Network Proximity: Common neighbor count in co-authorship network
  3. Complementarity Bonus: Reward authors with some overlap but also unique expertise

Input:

  • db: Session
  • author_id: int — target author for recommendations
  • top_n: int — number of recommendations (default 10)

Processing:

  1. Get target author's keyword set from WorkKeyword via WorkAuthor
  2. For each other author NOT already a co-author:
    a. Compute Jaccard similarity of keyword sets
    b. Count common co-authorship neighbors (friends-of-friends)
    c. Compute complementarity = |their_unique_keywords| / |total_keywords|
  3. Score = 0.5 * jaccard + 0.3 * normalized_common_neighbors + 0.2 * complementarity
  4. Rank by score descending, return top_n
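
A minimal sketch of step 3's composite score; the keyword sets and the common-neighbor counts are assumed to come from steps 1-2, and all names are illustrative:

```python
# Sketch of step 3; inputs come from steps 1-2 (keyword sets per author,
# common co-author neighbor counts). max_common is the largest neighbor
# count among candidates, used to normalize that signal to 0-1.
def collaboration_score(target_kw: set, cand_kw: set,
                        common_neighbors: int, max_common: int) -> float:
    union = target_kw | cand_kw
    jaccard = len(target_kw & cand_kw) / len(union) if union else 0.0
    complementarity = len(cand_kw - target_kw) / len(union) if union else 0.0
    neighbors_norm = common_neighbors / max(max_common, 1)
    return 0.5 * jaccard + 0.3 * neighbors_norm + 0.2 * complementarity
```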

Output:

[
    {
        "author_id": 15,
        "author_name": "Dr. Jane Smith",
        "score": 0.72,
        "jaccard_similarity": 0.65,
        "common_neighbors": 3,
        "common_neighbor_names": ["Alice", "Bob", "Carol"],
        "shared_keywords": ["Deep Learning", "NLP"],
        "unique_keywords": ["Reinforcement Learning", "Robotics"],
        "path_length": 2  # shortest path distance in co-authorship network
    },
    ...
]

UI:

  • Select target author from dropdown
  • Ranked table of recommendations with similarity scores
  • Visual breakdown: shared keywords, common neighbors, unique expertise
  • Click to highlight recommended author in graph

F4: Shortest Path Analysis

Purpose: Find the shortest collaboration path between any two researchers — enabling networking opportunity discovery.

Algorithm: NetworkX shortest_path (BFS-based for unweighted graphs)

Input:

  • db: Session
  • source_id: int — starting author
  • target_id: int — destination author
  • layer: str — "authors" (default), extensible to other layers

Processing:

  1. Build NetworkX Graph from CoauthorEdge table
  2. Run nx.shortest_path(G, source, target) to find shortest path
  3. If no path exists, report disconnected
  4. For each intermediate node, fetch author details
  5. For each edge in path, fetch shared publications
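
A minimal sketch of steps 2-3, plus the alternative paths for the all_paths field, assuming the graph G from step 1:

```python
# Sketch of steps 2-3; G is the co-authorship graph from step 1.
# Author details and shared publications (steps 4-5) are attached later.
from itertools import islice
import networkx as nx

def find_path(G: nx.Graph, source_id: int, target_id: int) -> dict:
    try:
        path = nx.shortest_path(G, source_id, target_id)  # BFS on unweighted G
    except (nx.NodeNotFound, nx.NetworkXNoPath):
        return {"path_exists": False}
    # Up to 5 alternative shortest paths for the "all_paths" field
    alternatives = list(islice(nx.all_shortest_paths(G, source_id, target_id), 5))
    return {"path_exists": True, "path_length": len(path) - 1,
            "path": path, "all_paths": alternatives}
```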

Output:

{
    "path_exists": True,
    "path_length": 3,
    "path": [
        {"author_id": 1, "name": "Source Author"},
        {"author_id": 5, "name": "Intermediary 1", "shared_papers_with_prev": 4},
        {"author_id": 8, "name": "Intermediary 2", "shared_papers_with_prev": 2},
        {"author_id": 12, "name": "Target Author", "shared_papers_with_prev": 1}
    ],
    "all_paths": [...]  # up to 5 alternative paths (if available)
}

UI:

  • Two dropdown selectors for source and target authors
  • Visual path display with author names and edge weights
  • "No path found" message if disconnected
  • Alternative paths (if any)

F5: Research Gap Detection (Structural Holes)

Purpose: Identify under-explored research areas by finding "structural holes" in the keyword co-occurrence network — pairs of active keyword clusters that rarely appear together.

Algorithm: Burt's Structural Holes + Bridge Detection

  1. Build keyword co-occurrence network
  2. Detect communities (reuse F1)
  3. Identify cross-community edges with low weight relative to intra-community edges
  4. Compute bridging score for each cross-community keyword pair

Input:

  • db: Session
  • year_min/year_max: int | None
  • min_keyword_count: int — minimum papers for a keyword to be included (default 3)
  • top_n: int — number of gaps to return (default 15)

Processing:

  1. Build keyword co-occurrence Graph from works (filtered by year)
  2. Run community detection on keyword network
  3. For each pair of communities (C_i, C_j):
    a. Count inter-community edges and their total weight
    b. Count intra-community edges for both
    c. Compute gap_score = (intra_density_avg - inter_density) / intra_density_avg
  4. For each gap, identify the most representative keywords from each community
  5. Rank by gap_score * community_size_product
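
A minimal sketch of the gap score in step 3 for one community pair, assuming the keyword co-occurrence graph G and both node lists come from steps 1-2:

```python
# Sketch of step 3 for one community pair; G and the node lists come
# from steps 1-2. Uses unweighted densities for simplicity.
import networkx as nx

def gap_score(G: nx.Graph, nodes_a: list, nodes_b: list) -> float:
    intra_a = nx.density(G.subgraph(nodes_a))
    intra_b = nx.density(G.subgraph(nodes_b))
    inter_edges = sum(1 for u in nodes_a for v in nodes_b if G.has_edge(u, v))
    inter_density = inter_edges / (len(nodes_a) * len(nodes_b))
    intra_avg = (intra_a + intra_b) / 2
    return (intra_avg - inter_density) / intra_avg if intra_avg else 0.0
```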

Output:

[
    {
        "community_a": {"id": 0, "top_keywords": ["Deep Learning", "CNN"]},
        "community_b": {"id": 2, "top_keywords": ["Medical Ethics", "Policy"]},
        "gap_score": 0.85,
        "inter_edges": 2,
        "potential_bridges": ["AI Ethics"],  # keywords that weakly connect both
        "suggestion": "Deep Learning + Medical Ethics: active individually but rarely combined"
    },
    ...
]

UI:

  • Table of research gaps ranked by score
  • Each gap shows the two keyword clusters and their weak connection
  • "Bridge keywords" that weakly span both clusters
  • Expandable suggestion text

F6: Strategic Diagram

Purpose: Map research themes on a 2D strategic diagram (centrality vs density) to classify them as Motor/Niche/Emerging/Declining.

Algorithm: Callon's centrality-density analysis (SciMAT methodology)

  • Centrality (X-axis): External cohesion — how strongly a keyword cluster connects to other clusters
  • Density (Y-axis): Internal cohesion — how strongly keywords within a cluster are interconnected

Input:

  • db: Session
  • year_min/year_max: int | None
  • min_keyword_count: int — minimum papers for inclusion (default 3)

Processing:

  1. Build keyword co-occurrence network (filtered by year)
  2. Run community detection (reuse F1 result)
  3. For each community C_i:
    a. Density = sum of intra-community edge weights / (|C_i| * (|C_i| - 1) / 2)
    b. Centrality = sum of inter-community edge weights (C_i to all other clusters) / (|C_i| * |all_other_nodes|)
  4. Normalize both to 0-1 range
  5. Classify quadrant:
    • Q1 (high centrality, high density) = Motor themes — well-developed and central
    • Q2 (low centrality, high density) = Niche themes — well-developed but peripheral
    • Q3 (low centrality, low density) = Emerging or Declining themes
    • Q4 (high centrality, low density) = Basic/Transversal themes — central but underdeveloped
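
A minimal sketch of step 3 for a single community, following the formulas above; G is the weighted keyword co-occurrence graph and members is one cluster's node set:

```python
# Sketch of step 3 for one community; G is the weighted keyword
# co-occurrence graph, members one cluster's node set.
import networkx as nx

def callon_metrics(G: nx.Graph, members: set) -> tuple[float, float]:
    n = len(members)
    # Edges incident to the cluster, split into intra and inter weights
    intra = sum(d.get("weight", 1) for u, v, d in G.edges(members, data=True)
                if u in members and v in members)
    inter = sum(d.get("weight", 1) for u, v, d in G.edges(members, data=True)
                if (u in members) != (v in members))
    max_intra_pairs = n * (n - 1) / 2
    outside = G.number_of_nodes() - n
    density = intra / max_intra_pairs if max_intra_pairs else 0.0
    centrality = inter / (n * outside) if outside else 0.0
    return centrality, density
```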

Output:

{
    "themes": [
        {
            "cluster_id": 0,
            "label": "Deep Learning / CNN",
            "top_keywords": ["Deep Learning", "CNN", "Image Recognition"],
            "centrality": 0.82,
            "density": 0.75,
            "quadrant": "Motor",
            "size": 12,  # number of keywords
            "total_papers": 150
        },
        ...
    ],
    "median_centrality": 0.5,
    "median_density": 0.5
}

UI:

  • Plotly scatter plot with 4 quadrants
  • X-axis: Centrality, Y-axis: Density
  • Bubble size = number of papers, color = quadrant
  • Quadrant labels: Motor / Niche / Emerging or Declining / Basic & Transversal
  • Hover tooltip showing top keywords and paper count
  • Median lines dividing the quadrants

F7: Thematic Evolution

Purpose: Show how research themes evolve over time — how keyword clusters form, merge, split, or disappear across time periods.

Algorithm: Temporal keyword clustering + Sankey/alluvial flow

  1. Divide time range into periods
  2. Run community detection on each period's keyword network
  3. Map cluster continuity across periods by keyword overlap

Input:

  • db: Session
  • n_periods: int — number of time slices (default 3)
  • min_keyword_count: int — minimum papers per keyword per period (default 2)

Processing:

  1. Get year range from data, divide into n_periods equal intervals
  2. For each period:
    a. Build keyword co-occurrence network (works within that period)
    b. Run community detection
    c. Label each community by its top-2 keywords
  3. For consecutive periods (P_i, P_{i+1}):
    a. For each pair of communities C in P_i and D in P_{i+1}, compute overlap = |keywords_in_C ∩ keywords_in_D| / |keywords_in_C ∪ keywords_in_D|
    b. Create flow edges where overlap > threshold (default 0.1)
    c. Flow weight = overlap * min(size_C, size_D)
  4. Classify evolution events:
    • Continuation: One cluster maps primarily to one cluster
    • Merge: Multiple clusters map to one
    • Split: One cluster maps to multiple
    • Emergence: Cluster with no predecessor
    • Disappearance: Cluster with no successor
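
A minimal sketch of step 3, assuming each period's clusters are available as sets of keywords keyed by node ids like "P0_C0" (ids and names are illustrative):

```python
# Sketch of step 3: flow edges between clusters of consecutive periods.
# prev/curr map a node id (e.g. "P0_C0") to that cluster's keyword set.
def evolution_flows(prev: dict[str, set], curr: dict[str, set],
                    threshold: float = 0.1) -> list[dict]:
    flows = []
    for cid, c_kw in prev.items():
        for did, d_kw in curr.items():
            union = c_kw | d_kw
            overlap = len(c_kw & d_kw) / len(union) if union else 0.0
            if overlap > threshold:  # step 3b
                flows.append({
                    "source": cid,
                    "target": did,
                    "overlap": round(overlap, 2),
                    "weight": overlap * min(len(c_kw), len(d_kw)),  # step 3c
                })
    return flows
```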

Output:

{
    "periods": [
        {"label": "2018-2020", "start": 2018, "end": 2020},
        {"label": "2021-2023", "start": 2021, "end": 2023},
        {"label": "2024-2026", "start": 2024, "end": 2026}
    ],
    "nodes": [
        {"id": "P0_C0", "label": "Deep Learning / CNN", "period": 0, "size": 45},
        {"id": "P0_C1", "label": "NLP / RNN", "period": 0, "size": 30},
        {"id": "P1_C0", "label": "Deep Learning / Transformer", "period": 1, "size": 60},
        ...
    ],
    "flows": [
        {"source": "P0_C0", "target": "P1_C0", "weight": 35, "overlap": 0.65},
        {"source": "P0_C1", "target": "P1_C0", "weight": 20, "overlap": 0.40},
        ...
    ],
    "events": [
        {"type": "merge", "description": "Deep Learning/CNN + NLP/RNN merged into Deep Learning/Transformer"}
    ]
}

UI:

  • Plotly Sankey diagram showing flows between periods
  • Each column = one time period
  • Node height = cluster size (paper count)
  • Flow width = keyword overlap strength
  • Color coding by cluster identity
  • Evolution event annotations

3. UI Integration Plan

3.1 New Tab: "Insights"

Add a 5th tab to the main interface (after Reports, before How-to):

Graph | Heatmaps | Reports | Insights | How-to

3.2 Insights Tab Layout

+--------------------------------------------------+
| Insights                                         |
|                                                  |
| [Analysis Type ▼]  [Run Analysis]                |
|                                                  |
| ┌──── Analysis Options ──────────────────────┐   |
| │ (Dynamic controls based on selected type)  │   |
| └────────────────────────────────────────────┘   |
|                                                  |
| ┌──── Results ───────────────────────────────┐   |
| │ (Charts, tables, visualizations)           │   |
| └────────────────────────────────────────────┘   |
+--------------------------------------------------+

3.3 Analysis Type Options

  1. Community Detection
  2. Emerging Topics (Burst Detection)
  3. Collaborator Recommendation
  4. Shortest Path (Networking Path)
  5. Research Gap Detection
  6. Strategic Diagram
  7. Thematic Evolution
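
A hypothetical sketch of the tab wiring and dispatch in streamlit_app.py; the labels follow the layout above, and the dispatch body is illustrative:

```python
# Hypothetical Insights tab wiring; the call into app/services_insight.py
# and the dynamic options panel are omitted here.
import streamlit as st

ANALYSIS_TYPES = [
    "Community Detection",
    "Emerging Topics (Burst Detection)",
    "Collaborator Recommendation",
    "Shortest Path (Networking Path)",
    "Research Gap Detection",
    "Strategic Diagram",
    "Thematic Evolution",
]

tabs = st.tabs(["Graph", "Heatmaps", "Reports", "Insights", "How-to"])
with tabs[3]:  # the new Insights tab
    analysis = st.selectbox("Analysis Type", ANALYSIS_TYPES)
    if st.button("Run Analysis"):
        # Render feature-specific options and dispatch to services_insight
        st.info(f"Running: {analysis}")
```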

4. Dependencies

4.1 New Package

python-louvain>=0.16    # community detection (Louvain algorithm)

4.2 Existing Packages (no changes)

  • networkx>=3.0 — graph construction, shortest path, centrality
  • plotly>=5.15.0 — Sankey diagram, scatter plot, bar charts
  • pandas>=2.0.0 — data processing

5. File Changes Summary

| File | Change Type | Description |
|------|-------------|-------------|
| app/services_insight.py | NEW | All 7 insight analysis functions |
| streamlit_app.py | MODIFY | Add Insights tab, import services_insight |
| requirements.txt | MODIFY | Add python-louvain |
| VERSION | MODIFY | Update to 1.1.0 |
| CHANGELOG.md | MODIFY | Add v1.1.0 entry |
| README.md | MODIFY | Update feature list and documentation links |

6. Testing Strategy

6.1 Unit Test Scenarios

Each function tested with:

  • Empty database (should return empty/default results)
  • Single author/keyword (edge cases)
  • Normal dataset (Geoffrey Hinton demo data)
  • Year filtering applied

6.2 Integration Test

  • Full workflow: Ingest data -> Run each analysis -> Verify UI renders without error

6.3 Performance Expectations

  • All analyses should complete within 5 seconds for datasets up to 1,000 works
  • Community detection: O(n log n)
  • Shortest path: O(V + E)
  • Burst detection: O(K * Y) where K=keywords, Y=years
  • Strategic diagram: O(K^2) worst case
  • Thematic evolution: O(P * K^2) where P=periods

7. Risk Assessment

| Risk | Mitigation |
|------|------------|
| python-louvain not installable | Fall back to NetworkX greedy_modularity_communities |
| Large dataset performance | Apply top-N filtering before expensive computations |
| No data for analysis | Show informative empty-state messages |
| Community detection returns 1 community | Show message "Network too small or uniform for community detection" |
| No path between authors | Show "No collaboration path found" with explanation |
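
As a sketch of the first mitigation, the fallback can return the same node-to-community mapping shape that python-louvain produces:

```python
# Sketch of the python-louvain fallback; greedy_modularity_communities
# ships with NetworkX itself, so no extra dependency is needed.
import networkx as nx

try:
    import community as community_louvain  # python-louvain

    def partition_graph(G: nx.Graph) -> dict:
        return community_louvain.best_partition(G)

except ImportError:
    def partition_graph(G: nx.Graph) -> dict:
        # Convert frozensets of nodes into a node -> community_id mapping
        comms = nx.community.greedy_modularity_communities(G)
        return {node: cid for cid, nodes in enumerate(comms) for node in nodes}
```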