Team: Kevin Li 🐐, Tommy Ou, Ronnie Wang, Kev Xue
A web application for forensic exploration of what makes a song a hit, built on a join between 65 years of Billboard Hot 100 chart history (1958–2021) and Spotify audio features for 447,000+ tracks. The database does the analytical work; the app is a thin interactive surface over it.
Every page is backed by non-trivial SQL — multi-CTE chains, window functions, self-joins on a junction table, and a GiST-indexed cube operator for k-NN audio similarity.
hit-forensics/
├── README.md
│
├── db/ # Postgres artifacts
│ ├── queries.sql # All 10 production queries (Q1–Q10)
│ ├── optimization.sql # Indexes, materialized view, cube extension DDL
│ ├── timing.sql # EXPLAIN ANALYZE benchmarks (pre vs post optimization)
│ └── preprocessing.py # Data cleaning + entity resolution pipeline
│
├── server/ # Express + pg backend
│ ├── index.js # Entry point
│ ├── app.js # Express app + route mounting
│ ├── db.js # Postgres connection pool
│ ├── routes/
│ │ ├── tracks.js # Q1, Q2, Q4, Q6, Q7, Q9 routes
│ │ ├── artists.js # Q3, Q10 + artist search
│ │ └── charts.js # Q5, Q8 routes
│ └── tests/ # Vitest + Supertest route tests
│
└── client/ # React + Vite frontend
├── index.html
├── vite.config.js
└── src/
├── main.jsx # Routing + nav layout
├── api.js # Fetch wrapper for all API endpoints
├── styles.css # Editorial design system (per DESIGN.md)
└── pages/
├── HomePage.jsx
├── SearchPage.jsx
├── TrackPage.jsx # Song Spotlight (radar + trajectory + twins)
├── WorkbenchPage.jsx # Q7 audio feature filter
├── TrajectoryPage.jsx # Q8 archetype gallery
├── EraDecoderPage.jsx # Q9 decade outliers
├── ChemistryPage.jsx # Q10 collab leaderboard
├── TopChartsPage.jsx # Q5 longest-charting
└── ArtistPage.jsx # Q3 artist discography
| Layer | Technology | Notes |
|---|---|---|
| Database | PostgreSQL 17 on AWS RDS | Extensions: pg_trgm, cube |
| Backend | Node.js 20+, Express 4, pg 8 |
ES modules, three route files |
| Frontend | React 18, React Router 6, Vite 5 | Component-driven, client-side routing |
| Charts | Recharts 3 | Radar, line, bar |
| Testing | Vitest 2, Supertest, React Testing Library | DB pool and api module mocked |
| Data pipeline | Python 3, pandas | Entity resolution and dedup |
| Dataset | Source | Rows |
|---|---|---|
| Spotify Tracks (1921–2020) | Kaggle | ~587k raw |
| Billboard Hot 100 (1958–2021) | Kaggle | ~330k weekly entries |
After cleaning, deduplication, and entity resolution: ~447k canonical tracks and ~230k chart entries. See the final report for the entity-resolution methodology.
- Node 20+ and npm
- Access to the
music_dbPostgres instance on AWS RDS — credentials shared privately - Python 3 + pandas (only if you want to re-run the data pipeline)
cd server
cp .env.example .env # then fill in PGHOST / PGUSER / PGPASSWORDThe frontend doesn't need DB credentials — client/.env.example only sets
VITE_API_URL and already defaults to http://localhost:4000.
The optimized queries depend on indexes, a materialized view
(mv_charted_tracks), and a cube column (audio_cube) that aren't in the
freshly-populated schema. Apply them once:
psql -h <RDS-host> -U <user> -d music_db -f db/optimization.sqlThe script is idempotent (IF NOT EXISTS everywhere) and adds:
pg_trgmGIN indexes ontracks.track_nameandartists.artist_name(Q1)- B-tree indexes on
chart_performance(track_id, week_date)(Q4, Q8) - Multi-column index on
tracks(danceability, energy, valence, tempo)(Q7) - Partial index on
track_artists(track_id) WHERE is_primary(every route) mv_charted_tracksmaterialized view aggregatingchart_performanceper trackcubeextension +audio_cubecolumn + GiST index for Q6 k-NN search
The live RDS already has all of this applied — only re-run after a bulk reload. If
chart_performanceis reloaded, refresh the materialized view:REFRESH MATERIALIZED VIEW CONCURRENTLY mv_charted_tracks;
# Terminal 1 — backend
cd server
npm install
npm run dev # http://localhost:4000
# Terminal 2 — frontend
cd client
cp .env.example .env # if you haven't already
npm install
npm run dev # http://localhost:5173Health check: curl http://localhost:4000/api/health should return
{"ok":true}.
| Method | Path | Query | Purpose |
|---|---|---|---|
| GET | /api/health |
— | Liveness check |
| GET | /api/tracks/search |
q |
Q1 fuzzy track search |
| GET | /api/tracks/workbench |
danceability_min/max, energy_min/max, valence_min/max, tempo_min/max, limit |
Q7 multi-feature filter with NTILE decade percentiles |
| GET | /api/tracks/outliers |
decade?, limit |
Q9 decade-adjusted z-score outliers |
| GET | /api/tracks/:id |
— | Q2 track detail with primary artist |
| GET | /api/tracks/:id/chart |
— | Q4 weekly chart trajectory |
| GET | /api/tracks/:id/similar |
limit |
Q6 audio twin finder (cube <->) |
| GET | /api/artists/search |
q |
Artist name search |
| GET | /api/artists/chemistry |
limit |
Q10 collab chemistry leaderboard |
| GET | /api/artists/:id |
— | Q3 artist detail + discography |
| GET | /api/charts/top |
limit |
Q5 longest-charting tracks |
| GET | /api/charts/trajectories |
— | Q8 archetype classification + avg audio profile |
| Route | Page | Backing query |
|---|---|---|
/ |
Home | — (static feature grid) |
/search |
Track search | Q1 |
/track/:id |
Song Spotlight (radar fingerprint + chart trajectory + audio twins) | Q2 + Q4 + Q6 |
/workbench |
Audio Feature Workbench | Q7 |
/trajectories |
Trajectory Gallery (5 archetypes, overlaid radar) | Q8 |
/era-decoder |
Decade outlier table with z-score color coding | Q9 |
/chemistry |
Collab pair leaderboard | Q10 |
/top-charts |
Longest-charting tracks | Q5 |
/artist/:id |
Artist discography with chart stats | Q3 |
Both packages use Vitest. Server tests use Supertest
with the pg pool mocked — they never touch the real database. Client tests
use React Testing Library with the api module mocked.
cd server && npm test # route tests
cd client && npm test # component testsdb/timing.sql contains pre- and post-optimization versions of the four
complex queries (Q6, Q7, Q9, Q10) plus Q1 and Q8, side by side with
EXPLAIN ANALYZE. Run from psql:
psql -h <RDS-host> -U <user> -d music_db -f db/timing.sqlHeadline post-optimization speedups: ~22× on Q6 (cube + GiST replaces brute-force Euclidean) and ~31× on Q9 (materialized view replaces inline CTE). Full timings and analysis are in the final report.
The raw CSVs aren't checked in. To re-run the pipeline, drop them into a
data/ folder at the repo root:
data/
├── tracks.csv # Spotify
├── artists.csv # Spotify
└── charts.csv # Billboard
Then run from the repo root:
python db/preprocessing.pyIt writes Tracks.csv, Artists.csv, Track_Artists.csv, and
Chart_Performance.csv to the current directory, ready for \copy into
Postgres. The pipeline:
- Normalizes track and artist names — strips parentheticals/remix markers, removes accents via NFKD, drops punctuation.
- Deduplicates Spotify tracks on
(normalized_title, normalized_primary_artist), keeping the highest-popularity row. - Resolves Billboard ↔ Spotify entities via bi-directional subset match on normalized artist token sets.
- Populates
Track_Artistswithis_primary = TRUEon the lead artist for each track.