A data analytics platform that extracts, processes, and visualizes information about the Peruvian developer ecosystem using the GitHub API, GPT-4o-mini classification, and an interactive Streamlit dashboard.
Before starting, run this in Python:
`import antigravity`

- JavaScript dominates: With 161 repositories, JavaScript is by far the most popular language in Peru's developer ecosystem, followed by Python (87) and CSS (60).
- Information & Communication rules: 67.3% of all repositories (673 out of 1,000) fall under the Information & Communication industry (CIIU code J), reflecting Peru's strong software development culture.
- Education is second: 11.5% of repos (115) are classified under Education (P), showing significant activity in EdTech and learning platforms.
- Top developer is devaige: With an impact score of 9,169 (5,513 stars + 1,178 followers; under the `impact_score` formula the remaining 2,478 points correspond to forks, which are weighted ×2), devaige leads the Peruvian GitHub ecosystem, primarily through Android UI libraries.
- Most starred repo is financial: dcajasn/Riskfolio-Lib, a portfolio optimization library in Python, leads with 3,804 stars, showing strong quantitative finance activity from Peru.
| Metric | Value |
|---|---|
| Total developers | 921 |
| Total repositories | 1,000 |
| Total stars | 18,317 |
| Total forks | 3,665 |
| Data collected | March 2026 |
| Search locations | Peru, Lima, Arequipa, Trujillo, Cusco |
| Rate limiting strategy | Exponential backoff with tenacity |
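
The table lists exponential backoff with tenacity as the rate limiting strategy. The sketch below illustrates the underlying idea with the standard library only, not with tenacity itself; `RateLimitError`, `call_with_backoff` and the delay values are illustrative assumptions rather than the project's actual code.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, capped at max_delay,
            # plus up to 10% random jitter so parallel workers do not sync up.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

The `sleep` parameter is injectable so the retry schedule can be exercised in tests without actually waiting.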
- Overview Dashboard — Key ecosystem stats, top 10 developers by impact score, industry distribution, top repositories
- Developer Explorer — Searchable/filterable table with all metrics, CSV export
- Repository Browser — Filter by industry, language, stars; view classification confidence and reasoning
- Industry Analysis — CIIU distribution charts, top repos per industry, developer specialization
- Language Analytics — Language distribution, top developers per language, Language × Industry heatmap
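
The Language × Industry heatmap in the last item boils down to a count of repositories per (language, industry) pair. A minimal sketch of that aggregation follows; the record fields `language` and `industry_code` are assumed names, not confirmed by the project's schema.

```python
from collections import Counter


def language_industry_counts(repos):
    """Count repositories per (language, industry_code) pair."""
    return Counter(
        (repo["language"], repo["industry_code"])
        for repo in repos
        if repo.get("language")  # skip repos with no detected language
    )


sample = [
    {"language": "Python", "industry_code": "K"},
    {"language": "Python", "industry_code": "J"},
    {"language": "JavaScript", "industry_code": "J"},
    {"language": "Python", "industry_code": "J"},
    {"language": None, "industry_code": "J"},
]
counts = language_industry_counts(sample)
```

Each key of `counts` maps to one heatmap cell, so the result can be pivoted into a grid by whatever plotting library the dashboard uses.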
| Page | Screenshot |
|---|---|
| Overview | ![]() |
| Developers | ![]() |
| Repositories | ![]() |
| Industries | ![]() |
| Languages | ![]() |
- Python 3.10+
- PostgreSQL running locally
- GitHub Personal Access Token
- OpenAI API Key
```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/github-peru-analytics.git
cd github-peru-analytics

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment variables
cp .env.example .env
# Edit .env and fill in your tokens and DATABASE_URL

# 5. Run setup (validates env + creates DB tables)
python setup_project.py
```

GitHub token:

- Go to GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)
- Click "Generate new token (classic)"
- Scopes: `public_repo`, `read:user`
- Copy the token into your `.env` file as `GITHUB_TOKEN`

OpenAI API key:

- Go to platform.openai.com/api-keys
- Create a new key and add it to `.env` as `OPENAI_API_KEY`
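
Step 5 of the installation mentions that the setup script validates the environment. A minimal sketch of such a check is shown below, assuming the three variables named in this section are the required ones; `missing_env_vars` is a hypothetical helper and is not taken from `setup_project.py`.

```python
import os

# Variable names mentioned in the setup steps above.
REQUIRED_VARS = ("GITHUB_TOKEN", "OPENAI_API_KEY", "DATABASE_URL")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

A setup script could call `missing_env_vars()` and stop with a clear message if the returned list is not empty, before attempting any database work.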
```bash
# Step 1: Extract data from GitHub (1000+ repos)
python scripts/extract_data.py

# Step 2: Classify repos into CIIU industries using GPT-4o-mini
python scripts/classify_repos.py

# Step 3: Calculate user and ecosystem metrics
python scripts/calculate_metrics.py

# Step 4: Launch the dashboard
streamlit run app/main.py

# Optional: Run the AI Classification Agent demo
python scripts/run_agent.py
```

| Metric | Formula | Description |
|---|---|---|
| `total_repos` | COUNT(repos) | Number of owned public repos |
| `total_stars_received` | SUM(stars) | Total stars across all repos |
| `total_forks_received` | SUM(forks) | Total forks across all repos |
| `avg_stars_per_repo` | stars / repos | Average popularity per repo |
| `account_age_days` | today − created_at | Days since account creation |
| `repos_per_year` | repos / (age / 365) | Repository creation rate |
| `follower_ratio` | followers / following | Influence ratio |
| `h_index` | largest h such that h repos have ≥ h stars | GitHub h-index |
| `impact_score` | stars + forks×2 + followers | Composite influence score |
| `language_diversity` | COUNT(unique languages) | Technical breadth |
| `has_readme_pct` | repos_with_readme / total | Documentation quality |
| `has_license_pct` | repos_with_license / total | Professionalism indicator |
| `is_active` | today − last_push < 90 days | Active status |
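
A few of the formulas in the table can be sketched directly in Python. This is an illustrative version under stated assumptions, not the code in `scripts/calculate_metrics.py`: the function signatures are hypothetical, and the handling of zero-day accounts and of users who follow nobody is an assumed convention the table does not specify.

```python
from datetime import date


def h_index(stars):
    """Largest h such that h repos have at least h stars each."""
    ranked = sorted(stars, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def impact_score(stars, forks, followers):
    """Composite score from the table: stars + forks*2 + followers."""
    return stars + 2 * forks + followers


def repos_per_year(total_repos, created_at, today):
    """Repository creation rate; age is floored at one day to avoid dividing by zero."""
    age_days = max((today - created_at).days, 1)
    return total_repos / (age_days / 365)


def follower_ratio(followers, following):
    """followers / following, or None when the user follows nobody (assumed convention)."""
    return followers / following if following else None
```

For example, a developer whose repos have 10, 5, 3 and 1 stars has an h-index of 3, because three repos have at least three stars each but there are not four repos with at least four.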
| Metric | Value | Description |
|---|---|---|
| `total_developers` | 921 | Unique Peruvian developers |
| `total_repositories` | 1,000 | Total repos collected |
| `total_stars` | 18,317 | Sum of all stars |
| `total_forks` | 3,665 | Sum of all forks |
| `avg_repos_per_user` | 1.09 | Average repos per developer |
| `avg_account_age_days` | 2,896 | ~7.9 years average tenure |
| `active_developer_pct` | 1.41% | Active in last 90 days |
| `top_language` | JavaScript (161) | Most used language |
| `top_industry` | J (Information & Communication) | Dominant industry |
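
The derived values in this table follow from the raw totals. The short check below reproduces them, assuming 365-day years for the tenure figure; the head count of active developers is an inference from the 1.41% share, not a number reported by the project.

```python
total_developers = 921
total_repositories = 1_000
avg_account_age_days = 2_896

# 1,000 repos spread over 921 developers.
avg_repos_per_user = round(total_repositories / total_developers, 2)
# Average account age expressed in years.
avg_tenure_years = round(avg_account_age_days / 365, 1)
# The 1.41% active share converted back to a head count.
active_developers = round(total_developers * 1.41 / 100)

print(avg_repos_per_user, avg_tenure_years, active_developers)
# prints: 1.09 7.9 13
```

In other words, the 1.41% active rate corresponds to roughly 13 of the 921 developers.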
The agent autonomously classifies repositories into 21 CIIU industry categories using a multi-step reasoning process.
Architecture:
```
Repository info → Agent decides if more context is needed
        ↓
[get_readme tool]     ← if description is vague
[get_languages tool]  ← if tech stack is unclear
        ↓
classify_industry tool → Final result
```
Tools available:
| Tool | Description |
|---|---|
| `get_readme(owner, repo)` | Fetches README content (up to 3,000 chars) |
| `get_languages(owner, repo)` | Gets language breakdown in bytes |
| `classify_industry(...)` | Submits final classification with reasoning |
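
The flow above (optionally gather context, then submit a classification) can be sketched as a small loop. Everything here is a simplified stand-in: the tool bodies are stubs, the `decide` callback replaces the GPT-4o-mini call, and the step cap, the `rule_based_decide` policy and the argument shapes are assumptions made for illustration.

```python
def get_readme(owner, repo):
    """Stub for the README tool; a real tool would call the GitHub API."""
    return "Portfolio optimization library for quantitative finance."[:3000]


def get_languages(owner, repo):
    """Stub for the language tool; returns bytes of code per language."""
    return {"Python": 120_000, "C++": 8_000}


def classify_industry(industry_code, confidence, reasoning):
    """Final tool: packages the classification result."""
    return {"industry_code": industry_code, "confidence": confidence,
            "reasoning": reasoning}


def run_agent(repo, decide):
    """Gather context until decide() returns a classification.

    decide(context) stands in for the model: it returns either
    ("call", tool_name) to request more context or ("final", kwargs).
    """
    context = {"repo": repo}
    tools = {"get_readme": get_readme, "get_languages": get_languages}
    for _ in range(4):  # cap the number of steps so the loop always ends
        action, payload = decide(context)
        if action == "final":
            return classify_industry(**payload)
        context[payload] = tools[payload](repo["owner"], repo["name"])
    # Assumed behavior: no decision within the step budget falls back to J.
    return classify_industry("J", "low", "fallback: step limit reached")


def rule_based_decide(context):
    """Toy policy: read the README when the description is vague."""
    repo = context["repo"]
    if "get_readme" not in context and len(repo["description"]) < 20:
        return ("call", "get_readme")
    text = context.get("get_readme", repo["description"]).lower()
    if "finance" in text or "portfolio" in text:
        return ("final", {"industry_code": "K", "confidence": "high",
                          "reasoning": "mentions finance or portfolio"})
    return ("final", {"industry_code": "J", "confidence": "low",
                      "reasoning": "no domain signal found"})
```

Calling `run_agent({"owner": "dcajasn", "name": "Riskfolio-Lib", "description": "Lib"}, rule_based_decide)` first requests the README because the description is short, then classifies the repository as K.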
Requirements met:
- ✅ Autonomy: makes decisions without human intervention
- ✅ Tool use: uses at least 2 different tools
- ✅ Reasoning: explains every classification decision
- ✅ Error handling: falls back to J (Information & Communication) on failures
- ✅ Logging: full log in `logs/agent_classification.log`
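
The error handling and logging items can be combined in one wrapper. The sketch below is an assumption about how such a fallback might look, not the project's implementation: `classify_safely`, the logger name and the `low` confidence on the fallback are all hypothetical.

```python
import logging

logger = logging.getLogger("agent_classification")

# Assumed shape of the fallback result; only the J code comes from the list above.
FALLBACK = {"industry_code": "J", "confidence": "low",
            "reasoning": "fallback after classification error"}


def classify_safely(repo_name, classify):
    """Run classify(repo_name); on any exception, log it and return the J fallback."""
    try:
        result = classify(repo_name)
        logger.info("%s classified as %s", repo_name, result["industry_code"])
        return result
    except Exception:
        # logger.exception records the traceback alongside the message.
        logger.exception("classification failed for %s, using fallback", repo_name)
        return dict(FALLBACK)
```

To write these records to `logs/agent_classification.log`, a `logging.FileHandler` pointing at that path can be attached to the logger once the `logs/` directory exists.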
Example agent run:
```
🤖 Agent starting: dcajasn/Riskfolio-Lib
  → Tool call: get_readme({'owner': 'dcajasn', 'repo': 'Riskfolio-Lib'})
  → Tool call: classify_industry({'industry_code': 'K', 'confidence': 'high',
      'reasoning': 'Portfolio optimization library for quantitative finance...'})
  ✅ Classified as K (Financial & Insurance) [high]
```

Full agent run log: `data/metrics/agent_run_log.json`
- Location bias: GitHub users without a location set are excluded, which likely undercounts the real number of Peruvian developers significantly.
- Star bias: The top-1,000-by-stars strategy overrepresents popular or older projects and may miss newer or less-starred talent.
- Classification accuracy: Generic repositories (utilities, hello-world, course homework) are defaulted to J (Information & Communication), which inflates that category to 67.3%.
- Low active developer rate (1.41%): This is likely caused by the star-based collection strategy, which captures older repos that are no longer maintained rather than currently active projects.
- Language detection: GitHub's primary language only shows the dominant language, missing truly polyglot repositories.
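
The language detection limitation could be softened by using the per-language byte counts that the `get_languages` tool already returns. The sketch below is one possible approach, not something the project currently does; the 60% cutoff used to call a repository polyglot is an arbitrary assumption.

```python
def language_shares(byte_counts):
    """Convert a {language: bytes} mapping into {language: fraction of total}."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: size / total for lang, size in byte_counts.items()}


def is_polyglot(byte_counts, threshold=0.6):
    """True when no single language reaches `threshold` of the code (assumed cutoff)."""
    shares = language_shares(byte_counts)
    return bool(shares) and max(shares.values()) < threshold
```

With this rule, a repository split 40/35/25 across three languages counts as polyglot, while one that is 90% Python does not.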
- Author: Santiago Miguel Maldonado Vizcarra
- Course: Prompt Engineering
- Institution: Pontificia Universidad Católica del Perú
- Date: March 2026





