We built this tool to help analyze and explore classroom project repositories hosted on GitHub. These are usually final projects from students, and we often want to understand what topics they worked on, which technologies they used, and how students collaborated. Instead of going through each repository manually, this tool gives us a way to collect all the relevant information and view it in a searchable, visual format.
The project stores all project metadata in PostgreSQL and searches it with PostgreSQL full-text search. An interactive network graph shows how students, libraries, and project topics are connected.
- Store all project metadata in one place
- Extract project titles, team members, libraries used, and key phrases from READMEs
- Normalize and clean text for better searching
- Use PostgreSQL full-text search to quickly find relevant projects
- Let users search by keyword, year, library, or other fields defined in `config.yaml`
- Visualize connections between students and the tools they used with a network graph
- Fully configure UI and data behavior via a single `config.yaml` file
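To make the full-text search feature concrete, here is a minimal sketch of the kind of parameterized query the keyword filter might issue. The table name `projects` and column `search_vector` come from the pipeline description below; the selected columns and the use of `plainto_tsquery`/`ts_rank` are assumptions, not the project's actual query.

```python
# Hypothetical keyword-search query builder for the `projects` table.
# Returns a parameterized SQL string plus its parameters, so the driver
# (e.g. psycopg2) handles escaping.

def build_keyword_query(keyword: str, limit: int = 50) -> tuple:
    sql = (
        "SELECT title, team_members, libraries "
        "FROM projects "
        "WHERE search_vector @@ plainto_tsquery('english', %s) "
        "ORDER BY ts_rank(search_vector, plainto_tsquery('english', %s)) DESC "
        "LIMIT %s"
    )
    return sql, (keyword, keyword, limit)

sql, params = build_keyword_query("monte carlo")
```

`plainto_tsquery` parses a free-text phrase into a tsquery, and ranking by `ts_rank` orders the best matches first; both are standard PostgreSQL.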
```
project-root/
│
├── .env
├── .gitignore
├── app.py
├── config.yaml
├── Dockerfile
├── README.md
├── requirements.txt
├── run_etl.sh
├── logs/
├── project_utils/
│   ├── cloned_repos/
│   ├── data/
│   │   ├── project_data.json
│   │   └── semesters.csv
│   ├── logs/
│   │   └── project_parser.log
│   ├── postgres_schema/
│   │   └── projects_table.sql
│   ├── db.py
│   ├── github_utils.py
│   ├── postgres_uploader.py
│   ├── readme_parser.py
│   └── starter_class.py
└── src/
    ├── dao.py
    ├── renderer.py
    └── service.py
```
1. **Clone GitHub Repos** (`scripts/generate_project_metadata.py`)
   - Reads `data/semesters.csv` from config.
   - Uses `github_utils.py` to sparse-clone only the required files (`README.md`, `.py`).
2. **Parse and Extract Metadata**
   - `readme_parser.py` extracts the fields defined under `extract_sections` in `config.yaml` (e.g., title, team_members).
   - `preprocess.py` normalizes text: lowercasing, stopword removal, punctuation stripping, whitespace cleanup.
3. **Load into PostgreSQL**
   - Schema defined in `project_utils/db_setup.sql`.
   - The ETL script writes cleaned records and arrays (libraries) into the `projects` table.
   - A `search_vector` column uses `to_tsvector()` and a GIN index for fast full-text queries.
4. **DAO Layer** (`project_utils/dao.py`)
   - Reads `filters` and `display_columns` from `config.yaml`.
   - Builds a dynamic `SELECT` clause for the requested fields.
   - Builds `WHERE` clauses from the non-empty filters (keyword, year, team_members, libraries, etc.).
   - Executes the SQL and returns a list of dicts.
5. **Streamlit Frontend** (`app.py`)
   - Loads the config via `starter_class`.
   - Renders dynamic filters (text inputs or dropdowns) based on the `filters` section.
   - Calls `DAO.search_projects(...)` with `filter_inputs`, `select_fields`, `default_limit`.
   - Converts the results to a DataFrame and passes them to `generate_styled_html()`.
   - Displays the styled HTML table and an interactive graph from `graph_utils.py`.
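The normalization performed in the parsing step can be sketched in a few lines. This is an illustrative stand-in, not the project's `preprocess.py`: the stopword list here is a tiny placeholder, and punctuation is replaced with spaces so hyphenated words split cleanly.

```python
import string

# Placeholder stopword list; the real pipeline would use a fuller set.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, collapse whitespace."""
    text = text.lower()
    # Replace each punctuation character with a space (not empty string),
    # so "Monte-Carlo" becomes "monte carlo" rather than "montecarlo".
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(normalize("A Monte-Carlo  Simulation of Queues!"))
# monte carlo simulation queues
```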
All behavior is driven by `config.yaml`, including:

- Database credentials (`postgres` section)
- Fields to display (`display_columns` section: field, label, max_width, link, styles)
- Filters (`filters` section: enabled, label, field, type, options)
- Extract sections (`extract_sections` for README parsing)
- UI text (`app.title`, `search_input_text`, `no_results_text`, etc.)
- Defaults (`default_limit`, `readme_lines_to_scan`, `default_column_width`)
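A hypothetical fragment illustrating how those sections might fit together — the exact keys and nesting in the real `config.yaml` may differ:

```yaml
# Illustrative config fragment; key names are assumptions based on the
# sections listed above, not the project's actual file.
postgres:
  host: localhost
  port: 5432
  dbname: projects_db

display_columns:
  - field: title
    label: Project Title
    max_width: 300
    link: true

filters:
  - enabled: true
    label: Keyword
    field: keyword
    type: text

app:
  title: Classroom Project Explorer
  search_input_text: Search projects...
  no_results_text: No projects found.

defaults:
  default_limit: 50
  readme_lines_to_scan: 40
```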
```mermaid
sequenceDiagram
    participant U as User
    participant A as Streamlit App (app.py)
    participant C as Config Loader
    participant D as DAO (dao.py)
    participant DB as PostgreSQL
    participant R as Renderer (renderer.py)
    U->>A: load page
    A->>C: load config.yaml
    C-->>A: return config dict
    A->>U: render filters
    U->>A: set filters and click Search
    A->>D: search_projects(filters, select_fields, limit)
    D->>DB: execute dynamic SQL
    DB-->>D: return rows
    D-->>A: return results list
    A->>R: generate_styled_html(results)
    R-->>A: HTML table + graph
    A->>U: display table & graph
```
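The "execute dynamic SQL" step in the diagram hinges on building a `WHERE` clause only from non-empty filters. A minimal sketch, assuming the filter shapes described earlier (scalar fields, array columns like `libraries`, and a full-text `keyword`); this is not the actual `dao.py` code:

```python
# Hypothetical dynamic WHERE-clause builder. Field names come from
# config.yaml (trusted), while values are passed as parameters so the
# database driver escapes them.

def build_where(filters: dict) -> tuple:
    clauses, params = [], []
    for field, value in filters.items():
        if value in (None, "", []):
            continue  # skip empty filters, as the DAO is described to do
        if field == "keyword":
            clauses.append("search_vector @@ plainto_tsquery('english', %s)")
            params.append(value)
        elif isinstance(value, list):
            # Array columns (e.g. libraries): && is PostgreSQL's overlap operator.
            clauses.append(f"{field} && %s")
            params.append(value)
        else:
            clauses.append(f"{field} = %s")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params

where, params = build_where({"keyword": "pandas", "year": 2023, "libraries": []})
```

With the sample input above, the empty `libraries` filter is skipped and only the keyword and year clauses are emitted.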
- Generate metadata and load the database:

  ```shell
  python scripts/generate_project_metadata.py --config config.yaml
  psql -U <user> -d <db> -f project_utils/db_setup.sql
  ```

- Run the Streamlit app:

  ```shell
  streamlit run app.py
  ```

- Access the app at `http://localhost:8501`.
Example search terms:

- `monte carlo` (simulation)
- `queue` (queuing models)
- `pandas` (data analysis)
- Semantic search with embedding vectors
- CSV/Excel export of results
- Trend analysis of libraries over semesters