Do influential papers in High Energy Physics Theory (HEP-Th) bridge diverse research communities, or do they reinforce disciplinary silos?
This project analyses the citation network of arXiv's HEP-Th category to test whether influential papers act as bridges between research communities or as pillars that reinforce disciplinary boundaries.
Influential papers predominantly reinforce silos rather than bridging diverse research communities.
- Network Size: 27,770 papers, 352,807 citations
- Community Structure: High modularity (0.51) with 3 dominant communities containing 89% of papers
- Cross-Community Citations:
  - Influential papers: ~15-18%
  - Non-influential papers: ~12-13%
  - Over 80% of citations stay within communities for all paper types
- Strong Disciplinary Boundaries: Three major research communities with clear intellectual separation
- Limited Bridging: Even the most influential papers derive over 80% of citations from within their own community
- Types of Influence:
  - High betweenness papers: Best bridging (18%) - structural connectors
  - High PageRank papers: Elite influence within narrow circles
  - High in-degree papers: Most cited, reinforce community boundaries
- Structural vs. Direct Bridging: Weak negative correlation between betweenness centrality and cross-community citations suggests structural bridge positions don't translate to direct cross-boundary citations
- Load Network & Basic Statistics
  - Load HEP-Th citation network as directed graph
  - Compute nodes, edges, density, degree distributions
- Identify Influential Papers (Top 10%)
  - In-degree centrality (most cited)
  - PageRank (quality-weighted citations)
  - Betweenness centrality (structural bridges)
- Community Detection
  - Modularity maximisation (greedy algorithm)
  - Assign papers to communities
  - Measure network modularity
- Measure Bridging Behaviour
  - Cross-community citation ratio = (citations to different communities) / (total citations)
  - Calculate separately for in-citations and out-citations
- Compare Groups
  - Influential vs. non-influential papers
  - Different types of influence (in-degree, PageRank, betweenness)
- Visualise Results
  - Box plots: Cross-community ratios by paper type
  - Bar charts: Mean bridging behaviour
  - Scatter plot: Betweenness vs. bridging behaviour
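
The influence cut-off and the cross-community citation ratio described in the steps above can be sketched in plain Python. This is a minimal, dependency-free illustration on a toy graph, not the script's own code: the function names `top_fraction` and `cross_community_ratio`, the toy edges and community labels, and the choice to return `None` for a paper with no citations in the given direction are all assumptions.

```python
from collections import defaultdict

def top_fraction(scores, percent):
    """Return the ids ranked in the top `percent` of `scores` (at least one id)."""
    k = max(1, round(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

def cross_community_ratio(neighbours, paper, community):
    """Share of `paper`'s linked papers that belong to a different community."""
    linked = neighbours.get(paper, [])
    if not linked:
        return None  # assumption: the ratio is undefined without citations
    crossing = sum(community[p] != community[paper] for p in linked)
    return crossing / len(linked)

# Toy directed citation edges: (citing paper, cited paper).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c"), ("d", "e"), ("e", "d")]
# Toy community labels; in the real analysis these come from community detection.
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

cited_by = defaultdict(list)  # in-citations: who cites this paper
cites = defaultdict(list)     # out-citations: what this paper cites
for src, dst in edges:
    cites[src].append(dst)
    cited_by[dst].append(src)

in_degree = {p: len(cited_by[p]) for p in community}
influential = top_fraction(in_degree, percent=20)

in_ratio_c = cross_community_ratio(cited_by, "c", community)
out_ratio_d = cross_community_ratio(cites, "d", community)
print(influential, in_ratio_c, out_ratio_d)
```

Keeping the in-citation and out-citation maps separate mirrors the distinction drawn above between who cites a paper and what a paper cites; the same function serves both directions.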
pip install -r requirements.txt

Required packages:
- networkx>=3.0
- numpy>=1.20
- matplotlib>=3.5
- scipy>=1.7
# Clone the repository
git clone https://github.com/dhruva-divate/hep-citation-analysis.git
cd hep-citation-analysis
# Install dependencies
pip install -r requirements.txt
# Run the analysis
python hep_citation_analysis.py --data cit-HepTh.txt --output-dir results

Basic usage with the defaults:
python hep_citation_analysis.py
This assumes cit-HepTh.txt is in the current directory and writes results to the current directory.
python hep_citation_analysis.py \
--data path/to/cit-HepTh.txt \
--output-dir results/ \
--top-percent 10

Arguments:
- `--data`: Path to citation network edge list file (default: cit-HepTh.txt)
- `--output-dir`: Directory for output files (default: current directory)
- `--top-percent`: Percentage of top papers to consider influential (default: 10)
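
For readers who want to see how these three options fit together, here is a minimal `argparse` sketch. It is a hypothetical reconstruction from the documented flags and defaults only: the function name `build_parser`, the help strings, and the `float` type for `--top-percent` are assumptions, and the script's real parser may differ.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the three documented options."""
    parser = argparse.ArgumentParser(description="HEP-Th citation bridging analysis")
    parser.add_argument("--data", default="cit-HepTh.txt",
                        help="path to the citation network edge list")
    parser.add_argument("--output-dir", default=".",
                        help="directory for output files")
    parser.add_argument("--top-percent", type=float, default=10.0,
                        help="share of top papers treated as influential")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--data", "path/to/cit-HepTh.txt", "--top-percent", "5"])
print(args.data, args.output_dir, args.top_percent)
```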
The script generates:
- bridging_analysis.png: Comprehensive visualisation with 4 subplots
- Console output: Detailed statistics and interpretations for each analysis step
- Dataset: SNAP: cit-HepTh
- Description: Citation network from arXiv HEP-Th (High Energy Physics - Theory)
- Format: Edge list (tab or space separated)
- Size: 27,770 papers, 352,807 citation links
- Time Period: Papers from January 1993 to April 2003
- cit-HepTh.txt: Citation network data (included in this repo)
# Source Target
9201234 9301456
9301456 9401123
...
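
Reading a file in this format takes only a few lines. The sketch below is a dependency-free illustration that parses an in-memory sample rather than the real file; the helper name `read_edge_list` and the assumption that every paper id parses as an integer are mine. With NetworkX, `nx.read_edgelist` with `create_using=nx.DiGraph` would be the likely equivalent, though the script's actual loading code is not shown here.

```python
import io
from collections import Counter

def read_edge_list(handle):
    """Parse 'source target' pairs, skipping blank lines and '#' comments."""
    edges = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]  # split() handles tabs and spaces
        edges.append((int(source), int(target)))
    return edges

sample = io.StringIO("# Source Target\n9201234\t9301456\n9301456 9401123\n")
edges = read_edge_list(sample)

nodes = {n for edge in edges for n in edge}
in_degree = Counter(target for _, target in edges)  # times each paper is cited
density = len(edges) / (len(nodes) * (len(nodes) - 1))  # directed-graph density
print(len(nodes), len(edges), density)
```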
wget http://snap.stanford.edu/data/cit-HepTh.txt.gz
gunzip cit-HepTh.txt.gz

The visualisation includes four key plots:
- Top-left: In-citations cross-community ratios (who cites this paper)
- Top-right: Out-citations cross-community ratios (what this paper cites)
- Bottom-left: Mean bridging behaviour by paper type
- Bottom-right: Betweenness centrality vs. cross-community citation ratio
| Paper Type | In-Citations Mean | In-Citations Median | Out-Citations Mean | Out-Citations Median |
|---|---|---|---|---|
| High Betweenness | 18.5% | 11.1% | 18.3% | 11.1% |
| High PageRank | 15.1% | 9.4% | 17.0% | 3.5% |
| High In-Degree | 14.7% | 9.3% | 17.3% | 10.5% |
| Non-Influential | 13.3% | 0.0% | 11.9% | 0.0% |
Key Observation: A median of 0% for non-influential papers means at least half of them have no cross-community citations at all, even though the mean is 12-13%.
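
A small worked example shows how a group can have a clearly positive mean and a zero median at the same time. The ratios below are invented for illustration and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical cross-community ratios for seven papers; four have none at all.
ratios = [0.0, 0.0, 0.0, 0.0, 0.2, 0.3, 0.5]

share_zero = sum(r == 0 for r in ratios) / len(ratios)
print(median(ratios), round(mean(ratios), 3), round(share_zero, 3))
```

Because the middle value of the sorted list is zero, the median is zero, while the few papers that do cite across communities pull the mean up.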
The code follows a sound logical progression:
- Network Loading: Proper directed graph construction from edge list
- Centrality Measures: Three complementary measures capture different types of influence
- Community Detection: Modularity maximisation on undirected graph
- Bridging Calculation: Separate in/out calculations with proper handling of edge cases
- Statistical Comparison: Comprehensive comparison across multiple paper groups
- Visualisation: Clear multi-panel figure
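
To make the modularity score used in the community detection step concrete, the sketch below computes Newman's modularity Q for a given partition of a small undirected graph. It only evaluates a partition and does not perform the greedy maximisation itself; the function name `modularity` and the toy graph are assumptions. NetworkX offers `greedy_modularity_communities` for the maximisation, which may or may not be what the script calls.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends inside a community
    degree_sum = defaultdict(int)  # total degree of the nodes in a community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge share - expected share).
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
merged = {node: "A" for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(round(q_split, 3), round(q_merged, 3))
```

Splitting the graph along the single connecting edge scores higher than lumping every node into one community, which is the sense in which a high Q indicates well-separated groups.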
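- Betweenness Centrality: Computationally expensive. Exact computation with Brandes' algorithm (the method NetworkX implements) takes O(nm) time on an unweighted graph, roughly 10^10 elementary steps for this network, so the script approximates it by sampling k=5000 source nodes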
- Memory Usage: ~1-2 GB for full network
- Runtime: ~2-5 minutes on modern hardware
- Scalability: Can handle networks with 100K+ nodes with sampling
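
The sampling approach to betweenness mentioned above can be illustrated with a small pure-Python version of Brandes' algorithm that runs from only k randomly chosen source nodes and scales the result by n/k. This is a sketch under stated assumptions, not the script's code: the function name `sampled_betweenness`, the `seed` parameter, and the unnormalised scaling are mine. In NetworkX the same idea is exposed through the `k` argument of `betweenness_centrality`.

```python
import random
from collections import deque

def sampled_betweenness(adj, k, seed=0):
    """Estimate unnormalised betweenness from k randomly chosen source nodes."""
    nodes = list(adj)
    sources = random.Random(seed).sample(nodes, k)
    score = dict.fromkeys(nodes, 0.0)
    for s in sources:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    scale = len(nodes) / k  # extrapolate from k sources to all n nodes
    return {v: value * scale for v, value in score.items()}

# Toy directed graph: a -> b -> c and a -> d. Every node must appear as a key.
adj = {"a": ["b", "d"], "b": ["c"], "c": [], "d": []}
exact = sampled_betweenness(adj, k=len(adj))  # k = n uses every source
approx = sampled_betweenness(adj, k=2, seed=1)
print(exact, approx)
```

Each source costs one breadth-first search, so the running time grows linearly in k, which is what makes sampling attractive on larger networks.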
If you use this analysis in your work, please cite:
@misc{hep_citation_analysis_2025,
author = {Dhruva Divate},
title = {HEP-Th Citation Network Analysis: Bridging vs. Siloing},
year = {2025},
publisher = {GitHub},
url = {https://github.com/dhruva-divate/hep-citation-analysis}
}

Original dataset:
@misc{snapnets,
author = {Jure Leskovec and Andrej Krevl},
title = {{SNAP Datasets}: {Stanford} Large Network Dataset Collection},
howpublished = {\url{http://snap.stanford.edu/data}},
month = jun,
year = 2014
}

MIT Licence - see LICENCE file for details
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Potential extensions of this analysis:
- Temporal Analysis: How has bridging behaviour evolved over time?
- Content Analysis: Do paper abstracts/titles reveal topical differences between communities?
- Author Networks: Do author collaboration patterns mirror citation patterns?
- Comparative Study: Compare HEP-Th with other arXiv categories
- Intervention Studies: What factors promote cross-community citation?
- Stanford Network Analysis Project (SNAP) for providing the dataset
- arXiv for making scientific papers freely available
- NetworkX development team for excellent graph analysis tools
