smaldonav29/github-peru-analytics

🇵🇪 GitHub Peru Analytics: Developer Ecosystem Dashboard

A data analytics platform that extracts, processes, and visualizes information about the Peruvian developer ecosystem using the GitHub API, GPT-4o-mini classification, and an interactive Streamlit dashboard.


🚀 Easter Egg

Before starting, run this in Python:

import antigravity



📊 Key Findings

  1. JavaScript dominates: With 161 repositories, JavaScript is by far the most popular language in Peru's developer ecosystem, followed by Python (87) and CSS (60).

  2. Information & Communication rules: 67.3% of all repositories (673 out of 1,000) fall under the Information & Communication industry (CIIU code J), reflecting Peru's strong software development culture.

  3. Education is second: 11.5% of repos (115) are classified under Education (P), showing significant activity in EdTech and learning platforms.

  4. Top developer is devaige: With an impact score of 9,169 (5,513 stars + 1,178 followers, with the remaining 2,478 points attributable to the forks×2 term of the impact_score formula), devaige leads the Peruvian GitHub ecosystem, primarily through Android UI libraries.

  5. Most starred repo is financial: dcajasn/Riskfolio-Lib — a Portfolio Optimization library in Python — leads with 3,804 stars, showing strong quantitative finance activity from Peru.


🗂️ Data Collection

| Metric | Value |
| --- | --- |
| Total developers | 921 |
| Total repositories | 1,000 |
| Total stars | 18,317 |
| Total forks | 3,665 |
| Data collected | March 2026 |
| Search locations | Peru, Lima, Arequipa, Trujillo, Cusco |
| Rate limiting strategy | Exponential backoff with tenacity |
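
The table names exponential backoff with tenacity as the rate limiting strategy. The sketch below illustrates the same idea with the standard library only, so it is a hand-rolled stand-in and not the project's tenacity configuration. `RateLimitError`, `with_backoff`, and the delay values are hypothetical.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponentially growing waits."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

With tenacity itself, comparable behavior is usually expressed through its `retry` decorator combined with `wait_exponential` and `stop_after_attempt`; the exact settings used in this project are not documented here.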

✨ Features

  • Overview Dashboard — Key ecosystem stats, top 10 developers by impact score, industry distribution, top repositories
  • Developer Explorer — Searchable/filterable table with all metrics, CSV export
  • Repository Browser — Filter by industry, language, stars; view classification confidence and reasoning
  • Industry Analysis — CIIU distribution charts, top repos per industry, developer specialization
  • Language Analytics — Language distribution, top developers per language, Language × Industry heatmap
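
The Language × Industry heatmap is, at its core, a cross-tabulation of repositories by primary language and CIIU code. A minimal sketch, assuming each repository is a dict with hypothetical `language` and `industry_code` keys (the project's actual schema is not shown in this README):

```python
from collections import Counter


def language_industry_counts(repos):
    """Count repositories per (language, industry_code) pair."""
    return Counter(
        (repo.get("language") or "Unknown", repo.get("industry_code") or "Unknown")
        for repo in repos
    )


# Made-up sample rows, for illustration only.
sample = [
    {"language": "Python", "industry_code": "K"},
    {"language": "Python", "industry_code": "J"},
    {"language": "JavaScript", "industry_code": "J"},
    {"language": None, "industry_code": "J"},
]
counts = language_industry_counts(sample)
```

Each key of `counts` corresponds to one heatmap cell; repositories without a detected language are grouped under "Unknown" here, which is an assumption of this sketch.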

Screenshots

[Screenshots of the five dashboard pages: Overview, Developers, Repositories, Industries, Languages]

⚙️ Installation

Prerequisites

  • Python 3.10+
  • PostgreSQL running locally
  • GitHub Personal Access Token
  • OpenAI API Key

Steps

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/github-peru-analytics.git
cd github-peru-analytics

# 2. Create a virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment variables
cp .env.example .env
# Edit .env and fill in your tokens and DATABASE_URL

# 5. Run setup (validates env + creates DB tables)
python setup_project.py

GitHub Token Setup

  1. Go to GitHub → Settings → Developer Settings → Personal Access Tokens → Tokens (classic)
  2. Click "Generate new token (classic)"
  3. Scopes: public_repo, read:user
  4. Copy the token into your .env file as GITHUB_TOKEN
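
Once the token is in the environment, it can be attached to GitHub REST API requests. The sketch below builds the headers and a user search URL restricted to one location. The helper names are hypothetical and the project's actual requests are not shown in this README; the `Authorization: Bearer` header, the `application/vnd.github+json` media type, and the `location:` search qualifier are standard GitHub API conventions.

```python
import os
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"


def github_headers(token=None):
    """Build request headers; the token comes from GITHUB_TOKEN unless passed in."""
    token = token or os.environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }


def user_search_url(location, page=1, per_page=100):
    """URL for GitHub's user search restricted to one location string."""
    query = urlencode({"q": f"location:{location}", "page": page, "per_page": per_page})
    return f"{GITHUB_API}/search/users?{query}"
```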

OpenAI Key Setup

  1. Go to platform.openai.com/api-keys
  2. Create a new key and add it to .env as OPENAI_API_KEY
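
Step 5 of the installation says that setup_project.py validates the environment. The following is a sketch of what such a check might look like, not the script's actual code; the three variable names come from this README, while `missing_env_vars` is a hypothetical helper.

```python
import os

REQUIRED_VARS = ("GITHUB_TOKEN", "OPENAI_API_KEY", "DATABASE_URL")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]
```

Calling `missing_env_vars()` before starting a long extraction run gives an early, readable failure instead of an authentication error midway through.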

🏃 Usage

# Step 1: Extract data from GitHub (1000+ repos)
python scripts/extract_data.py

# Step 2: Classify repos into CIIU industries using GPT-4o-mini
python scripts/classify_repos.py

# Step 3: Calculate user and ecosystem metrics
python scripts/calculate_metrics.py

# Step 4: Launch the dashboard
streamlit run app/main.py

# Optional: Run the AI Classification Agent demo
python scripts/run_agent.py
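
Steps 1 to 3 depend on each other, so they need to run in order. A small driver along these lines could chain them and stop at the first failure. The script paths are taken from the commands above; `run_pipeline` is a hypothetical helper and is not part of the repository.

```python
import subprocess
import sys

PIPELINE = [
    "scripts/extract_data.py",
    "scripts/classify_repos.py",
    "scripts/calculate_metrics.py",
]


def run_pipeline(scripts=PIPELINE, run=subprocess.run):
    """Run each script in order; check=True raises at the first failing step."""
    completed = []
    for script in scripts:
        run([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

The `run` parameter exists only so the function can be exercised with a stand-in during testing; in normal use the default `subprocess.run` is enough.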

📐 Metrics Documentation

User-Level Metrics

| Metric | Formula | Description |
| --- | --- | --- |
| total_repos | COUNT(repos) | Number of owned public repos |
| total_stars_received | SUM(stars) | Total stars across all repos |
| total_forks_received | SUM(forks) | Total forks across all repos |
| avg_stars_per_repo | total_stars_received / total_repos | Average popularity per repo |
| account_age_days | today − created_at | Days since account creation |
| repos_per_year | total_repos / (account_age_days / 365) | Repository creation rate |
| follower_ratio | followers / following | Influence ratio |
| h_index | largest h such that h repos have ≥ h stars each | GitHub h-index |
| impact_score | stars + forks×2 + followers | Composite influence score |
| language_diversity | COUNT(unique languages) | Technical breadth |
| has_readme_pct | repos_with_readme / total_repos | Documentation quality |
| has_license_pct | repos_with_license / total_repos | Professionalism indicator |
| is_active | today − last_push < 90 days | Active status |
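
Several of these formulas are easier to read as code. The sketch below implements three of them directly from the table. The function names mirror the metric names, but the handling of `following == 0` is an assumption, since the table does not define that case.

```python
def h_index(stars):
    """Largest h such that h repos have at least h stars each."""
    ranked = sorted(stars, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def impact_score(stars, forks, followers):
    """Composite score from the table: stars + forks×2 + followers."""
    return stars + forks * 2 + followers


def follower_ratio(followers, following):
    """followers / following; None when following is 0 (assumed convention)."""
    return followers / following if following else None
```

For example, a developer whose repositories have 10, 4, 3, and 1 stars has an h-index of 3: three repositories have at least 3 stars, but there are not four with at least 4.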

Ecosystem Metrics

| Metric | Value | Description |
| --- | --- | --- |
| total_developers | 921 | Unique Peruvian developers |
| total_repositories | 1,000 | Total repos collected |
| total_stars | 18,317 | Sum of all stars |
| total_forks | 3,665 | Sum of all forks |
| avg_repos_per_user | 1.09 | Average repos per developer |
| avg_account_age_days | 2,896 | ~7.9 years average tenure |
| active_developer_pct | 1.41% | Active in last 90 days |
| top_language | JavaScript (161) | Most used language |
| top_industry | J (Information & Communication) | Dominant industry |
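
The derived rows can be checked against the raw totals with a few lines of arithmetic. The implied count of active developers is back-computed from the percentage and is not a figure reported in this README.

```python
total_developers = 921
total_repositories = 1_000
avg_account_age_days = 2_896
active_developer_pct = 1.41

# Average repos per developer and average tenure in years, as in the table.
avg_repos_per_user = round(total_repositories / total_developers, 2)
avg_tenure_years = round(avg_account_age_days / 365, 1)

# Implied head count of active developers, back-computed from the percentage.
implied_active_developers = round(total_developers * active_developer_pct / 100)
```

These reproduce the 1.09 repos per developer and ~7.9 years shown above, and suggest that roughly 13 of the 921 developers pushed code in the last 90 days.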

🤖 AI Agent Documentation

Classification Agent (Option B)

The agent autonomously classifies repositories into 21 CIIU industry categories using a multi-step reasoning process.

Architecture:

Repository info → Agent decides if more context needed
                      ↓
              [get_readme tool]    ← if description is vague
              [get_languages tool] ← if tech stack unclear
                      ↓
              classify_industry tool → Final result

Tools available:

| Tool | Description |
| --- | --- |
| get_readme(owner, repo) | Fetches README content (up to 3,000 chars) |
| get_languages(owner, repo) | Gets language breakdown in bytes |
| classify_industry(...) | Submits final classification with reasoning |
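
The control flow in the architecture diagram can be sketched without any model calls. In the real agent the model decides when more context is needed; here two fixed, hypothetical rules stand in for that decision, and the three tools are passed in as plain callables so the flow can be run with stubs.

```python
def classify_with_tools(repo, get_readme, get_languages, classify_industry):
    """Gather extra context only when needed, then submit a classification."""
    context = {"description": repo.get("description") or ""}
    calls = []

    # Hypothetical vagueness rule: very short descriptions trigger a README fetch.
    if len(context["description"].split()) < 5:
        context["readme"] = get_readme(repo["owner"], repo["name"])[:3000]
        calls.append("get_readme")

    # Hypothetical rule: no primary language means the tech stack is unclear.
    if not repo.get("language"):
        context["languages"] = get_languages(repo["owner"], repo["name"])
        calls.append("get_languages")

    calls.append("classify_industry")
    return classify_industry(context), calls
```

The returned `calls` list records which tools were used, which makes it straightforward to confirm that a given repository followed the expected path through the diagram.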

Requirements met:

  • ✅ Autonomy — makes decisions without human intervention
  • ✅ Tool use — uses at least 2 different tools
  • ✅ Reasoning — explains every classification decision
  • ✅ Error handling — fallback to J on failures
  • ✅ Logging — full log in logs/agent_classification.log
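
The error handling and logging requirements can be combined in one small wrapper. This is a sketch under stated assumptions: the logger name, the "low" confidence value, and the fallback reasoning text are hypothetical, while the fallback code J comes from the list above.

```python
import logging

logger = logging.getLogger("agent_classification")

# Assumed shape of a fallback result; only the code "J" is taken from the README.
FALLBACK = {
    "industry_code": "J",
    "confidence": "low",
    "reasoning": "Fallback after classification failure",
}


def classify_or_fallback(classify, repo_name):
    """Return classify(repo_name), or the J fallback if it raises."""
    try:
        result = classify(repo_name)
        logger.info("Classified %s as %s", repo_name, result["industry_code"])
        return result
    except Exception:
        logger.exception("Classification failed for %s; falling back to J", repo_name)
        return dict(FALLBACK)
```

To send these records to a file such as logs/agent_classification.log, a `logging.FileHandler` can be attached to the logger during startup.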

Example agent run:

🤖 Agent starting: dcajasn/Riskfolio-Lib
  → Tool call: get_readme({'owner': 'dcajasn', 'repo': 'Riskfolio-Lib'})
  → Tool call: classify_industry({'industry_code': 'K', 'confidence': 'high',
      'reasoning': 'Portfolio optimization library for quantitative finance...'})
  ✅ Classified as K (Financial & Insurance) [high]

Full agent run log: data/metrics/agent_run_log.json


⚠️ Limitations

  1. Location bias: GitHub users who have not set a location are excluded, so the real number of Peruvian developers is likely much higher than the 921 collected.

  2. Star bias: The top-1,000-by-stars strategy overrepresents popular or older projects and may miss newer or less-starred talent.

  3. Classification accuracy: Generic repositories (utilities, hello-world projects, course homework) default to J (Information & Communication), which inflates that category's 67.3% share.

  4. Low active developer rate (1.41%): This is likely caused by the star-based collection strategy, which captures older repos that are no longer maintained rather than currently active projects.

  5. Language detection: GitHub's primary-language field reports only the dominant language of a repository, so polyglot repositories are counted under a single language.
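
One way to reduce the language detection limitation would be to use the per-language byte counts that GitHub's languages endpoint returns (the same data the get_languages tool reads) instead of the single primary language. A minimal sketch; the 60% threshold and the function names are hypothetical choices and not part of the project.

```python
def language_shares(byte_counts):
    """Convert a {language: bytes} mapping into {language: share of total}."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: size / total for lang, size in byte_counts.items()}


def is_polyglot(byte_counts, threshold=0.6):
    """Hypothetical rule: polyglot if no single language reaches the threshold."""
    shares = language_shares(byte_counts)
    return bool(shares) and max(shares.values()) < threshold
```

Under this rule, a repository split 50/30/20 across three languages counts as polyglot, while one that is 90% Python does not.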


📎 Video

Demo Video Link


👤 Author

Santiago Miguel Maldonado Vizcarra
Course: Prompt Engineering
Institution: Pontificia Universidad Católica del Perú
Date: March 2026
