We built this tool to help analyze and explore classroom project repositories hosted on GitHub. These are usually final projects from students, and we often want to understand what topics they worked on, which technologies they used, and how students collaborated. Instead of going through each repository manually, this tool gives us a way to collect all the relevant information and view it in a searchable, visual format.
The project stores all project metadata in PostgreSQL and searches it with PostgreSQL full-text search. An interactive network graph shows how students, libraries, and project topics are connected.
- Store all project metadata in one place
- Extract project titles, team members, libraries used, and key phrases from READMEs
- Normalize and clean text for better searching
- Use PostgreSQL full-text search to quickly find relevant projects
- Let users search by keyword, year, library, or other fields defined in `config.yaml`
- Visualize connections between students and the tools they used with a network graph
- Fully configure UI and data behavior via a single `config.yaml` file
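To make the full-text search feature concrete, here is a minimal sketch of the kind of parameterized query the keyword filter might issue. The table name `projects` and column `search_vector` come from the pipeline description below; the selected columns and the use of `plainto_tsquery`/`ts_rank` are assumptions, not the project's actual query.

```python
# Hypothetical keyword-search query builder for the `projects` table.
# Returns a parameterized SQL string plus its parameters, so the driver
# (e.g. psycopg2) handles escaping.

def build_keyword_query(keyword: str, limit: int = 50) -> tuple:
    sql = (
        "SELECT title, team_members, libraries "
        "FROM projects "
        "WHERE search_vector @@ plainto_tsquery('english', %s) "
        "ORDER BY ts_rank(search_vector, plainto_tsquery('english', %s)) DESC "
        "LIMIT %s"
    )
    return sql, (keyword, keyword, limit)

sql, params = build_keyword_query("monte carlo")
```

`plainto_tsquery` parses a free-text phrase into a tsquery, and ranking by `ts_rank` orders the best matches first; both are standard PostgreSQL.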
```
project-root/
│
├── .env
├── .gitignore
├── app.py
├── config.yaml
├── Dockerfile
├── README.md
├── requirements.txt
├── run_etl.sh
├── logs/
├── project_utils/
│   ├── cloned_repos/
│   ├── data/
│   │   ├── project_data.json
│   │   └── semesters.csv
│   ├── logs/
│   │   └── project_parser.log
│   ├── postgres_schema/
│   │   └── projects_table.sql
│   ├── db.py
│   ├── github_utils.py
│   ├── postgres_uploader.py
│   ├── readme_parser.py
│   └── starter_class.py
└── src/
    ├── dao.py
    ├── renderer.py
    └── service.py
```
1. **Clone GitHub Repos** (`scripts/generate_project_metadata.py`)
   - Reads `data/semesters.csv` from config.
   - Uses `github_utils.py` to sparse-clone only the required files (`README.md`, `.py`).
2. **Parse and Extract Metadata**
   - `readme_parser.py` extracts the fields defined under `extract_sections` in `config.yaml` (e.g., title, team_members).
   - `preprocess.py` normalizes text: lowercasing, stopword removal, punctuation stripping, whitespace cleanup.
3. **Load into PostgreSQL**
   - Schema defined in `project_utils/db_setup.sql`.
   - The ETL script writes cleaned records and arrays (libraries) into the `projects` table.
   - A `search_vector` column uses `to_tsvector()` and a GIN index for fast full-text queries.
4. **DAO Layer** (`project_utils/dao.py`)
   - Reads `filters` and `display_columns` from `config.yaml`.
   - Builds a dynamic `SELECT` clause for the requested fields.
   - Builds `WHERE` clauses from the non-empty filters (keyword, year, team_members, libraries, etc.).
   - Executes the SQL and returns a list of dicts.
5. **Streamlit Frontend** (`app.py`)
   - Loads the config via `starter_class`.
   - Renders dynamic filters (text inputs or dropdowns) based on the `filters` section.
   - Calls `DAO.search_projects(...)` with `filter_inputs`, `select_fields`, `default_limit`.
   - Converts the results to a DataFrame and passes them to `generate_styled_html()`.
   - Displays the styled HTML table and an interactive graph from `graph_utils.py`.
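The normalization performed in the parsing step can be sketched in a few lines. This is an illustrative stand-in, not the project's `preprocess.py`: the stopword list here is a tiny placeholder, and punctuation is replaced with spaces so hyphenated words split cleanly.

```python
import string

# Placeholder stopword list; the real pipeline would use a fuller set.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, collapse whitespace."""
    text = text.lower()
    # Replace each punctuation character with a space (not empty string),
    # so "Monte-Carlo" becomes "monte carlo" rather than "montecarlo".
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

print(normalize("A Monte-Carlo  Simulation of Queues!"))
# monte carlo simulation queues
```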
All behavior is driven by `config.yaml`, including:

- Database credentials (`postgres` section)
- Fields to display (`display_columns` section: field, label, max_width, link, styles)
- Filters (`filters` section: enabled, label, field, type, options)
- Extract sections (`extract_sections` for README parsing)
- UI text (`app.title`, `search_input_text`, `no_results_text`, etc.)
- Defaults (`default_limit`, `readme_lines_to_scan`, `default_column_width`)
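A hypothetical fragment illustrating how those sections might fit together — the exact keys and nesting in the real `config.yaml` may differ:

```yaml
# Illustrative config fragment; key names are assumptions based on the
# sections listed above, not the project's actual file.
postgres:
  host: localhost
  port: 5432
  dbname: projects_db

display_columns:
  - field: title
    label: Project Title
    max_width: 300
    link: true

filters:
  - enabled: true
    label: Keyword
    field: keyword
    type: text

app:
  title: Classroom Project Explorer
  search_input_text: Search projects...
  no_results_text: No projects found.

defaults:
  default_limit: 50
  readme_lines_to_scan: 40
```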
```mermaid
sequenceDiagram
    participant U as User
    participant A as Streamlit App (app.py)
    participant C as Config Loader
    participant D as DAO (dao.py)
    participant DB as PostgreSQL
    participant R as Renderer (renderer.py)
    U->>A: load page
    A->>C: load config.yaml
    C-->>A: return config dict
    A->>U: render filters
    U->>A: set filters and click Search
    A->>D: search_projects(filters, select_fields, limit)
    D->>DB: execute dynamic SQL
    DB-->>D: return rows
    D-->>A: return results list
    A->>R: generate_styled_html(results)
    R-->>A: HTML table + graph
    A->>U: display table & graph
```
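The "execute dynamic SQL" step in the diagram hinges on building a `WHERE` clause only from non-empty filters. A minimal sketch, assuming the filter shapes described earlier (scalar fields, array columns like `libraries`, and a full-text `keyword`); this is not the actual `dao.py` code:

```python
# Hypothetical dynamic WHERE-clause builder. Field names come from
# config.yaml (trusted), while values are passed as parameters so the
# database driver escapes them.

def build_where(filters: dict) -> tuple:
    clauses, params = [], []
    for field, value in filters.items():
        if value in (None, "", []):
            continue  # skip empty filters, as the DAO is described to do
        if field == "keyword":
            clauses.append("search_vector @@ plainto_tsquery('english', %s)")
            params.append(value)
        elif isinstance(value, list):
            # Array columns (e.g. libraries): && is PostgreSQL's overlap operator.
            clauses.append(f"{field} && %s")
            params.append(value)
        else:
            clauses.append(f"{field} = %s")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "TRUE"
    return where, params

where, params = build_where({"keyword": "pandas", "year": 2023, "libraries": []})
```

With the sample input above, the empty `libraries` filter is skipped and only the keyword and year clauses are emitted.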
- Generate metadata and load the database:

  ```shell
  python scripts/generate_project_metadata.py --config config.yaml
  psql -U <user> -d <db> -f project_utils/db_setup.sql
  ```

- Run the Streamlit app:

  ```shell
  streamlit run app.py
  ```

- Access the app at `http://localhost:8501`.
Example search terms:

- `monte carlo` (simulation)
- `queue` (queuing models)
- `pandas` (data analysis)
- Semantic search with embedding vectors
- CSV/Excel export of results
- Trend analysis of libraries over semesters