A neural network-based classification model that predicts K-Pop girl group generations based on song lyrics analysis.
This project applies machine learning techniques to K-Pop girl group lyrics to predict which generation a song belongs to. The model reaches ~90% test accuracy when classifying songs into five generations.
The generations are defined by the reference table (韩国女团世代表_一代至五代.xlsx):
| Generation | Years | Representative Artists |
|---|---|---|
| Gen 1 (一代) | 1996-2002 | S.E.S, Fin.K.L, Baby V.O.X, Jewelry |
| Gen 2 (二代) | 2003-2009 | Girls' Generation, Wonder Girls, KARA, 2NE1, f(x) |
| Gen 3 (三代) | 2010-2013 | SISTAR, Apink, EXID, Miss A, AOA |
| Gen 4 (四代) | 2014-2017 | TWICE, BLACKPINK, Red Velvet, MAMAMOO, GFriend |
| Gen 5 (五代) | 2018+ | IZ*ONE, ITZY, aespa, IVE, (G)I-DLE, NewJeans, LE SSERAFIM |
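The year ranges in the table can be turned into a small lookup. The sketch below is illustrative only: `GENERATION_RANGES` and `generation_for_year` are hypothetical names, and it assumes the Years column refers to a group's debut year, which the table does not state. The project itself may assign generations per artist from the spreadsheet instead.

```python
# Year ranges copied from the table above; Gen 5 is open-ended ("2018+").
GENERATION_RANGES = [
    ("Gen 1", 1996, 2002),
    ("Gen 2", 2003, 2009),
    ("Gen 3", 2010, 2013),
    ("Gen 4", 2014, 2017),
    ("Gen 5", 2018, None),
]


def generation_for_year(year):
    """Return the generation label for a year, or None if it predates Gen 1."""
    for label, start, end in GENERATION_RANGES:
        if year >= start and (end is None or year <= end):
            return label
    return None
```

For example, `generation_for_year(2007)` falls in the 2003-2009 range and returns `"Gen 2"`, while any year before 1996 returns `None`.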
| Item | Description |
|---|---|
| Source | Kpop-lyric-datasets |
| Original Size | 25,696 K-Pop songs from Melon Monthly Chart (2000-2023) |
| Filtered Dataset | 3,243 girl group songs |
| Cleaned Dataset | ~2,967 songs after data cleaning |
| File | girlgroup_songs.csv |
- generation: Girl group generation (一代 ~ 五代)
- artist: Artist name
- song_name: Song title
- lyrics: Full lyrics text
- year, month: Chart appearance date
- rank: Chart ranking (1-100)
- lyrics_length: Character count of lyrics
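The column descriptions above imply a few checks that can be run on each row. The sketch below uses only the standard library and two made-up rows, not real data from girlgroup_songs.csv. The helper `check_row` is hypothetical, and it assumes that `lyrics_length` equals Python's `len()` of the lyrics string. The real file has 14 columns, of which only the documented ones appear here.

```python
import csv
import io

# Two made-up rows using only the documented columns.
SAMPLE_CSV = """generation,artist,song_name,lyrics,year,month,rank,lyrics_length
二代,Example Group A,Example Song A,la la la,2009,1,5,8
五代,Example Group B,Example Song B,oh oh,2022,7,101,5
"""

VALID_GENERATIONS = {"一代", "二代", "三代", "四代", "五代"}


def check_row(row):
    """Return a list of problems found in one CSV row (empty if none)."""
    problems = []
    if row["generation"] not in VALID_GENERATIONS:
        problems.append("unknown generation")
    if not 1 <= int(row["rank"]) <= 100:
        problems.append("rank outside 1-100")
    if int(row["lyrics_length"]) != len(row["lyrics"]):
        problems.append("lyrics_length mismatch")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = [check_row(row) for row in rows]
```

Here the first row passes every check and the second is flagged because its rank of 101 falls outside the documented 1-100 range.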
| Category | Tools |
|---|---|
| Language | Python 3.8+ |
| ML Framework | scikit-learn |
| Data Processing | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Text Processing | TF-IDF Vectorization |
├── kpop_lyrics_analysis.py # Main analysis script
├── girlgroup_songs.csv # Dataset (3,243 songs)
├── model_results.png # Visualization output
├── CA6000_Report_Final.docx # Assignment report
└── README.md
pip install pandas numpy scikit-learn matplotlib seaborn
python kpop_lyrics_analysis.py
============================================================
K-Pop Girl Group Lyrics Analysis
Neural Network-based Generation Prediction Model
============================================================
SECTION 1: Data Import and Initial Inspection
Dataset Shape: (3243, 14)
...
SECTION 6: The Accuracy of the Eventual Model
========================================
OVERALL TEST ACCURACY: ~90%
========================================
Multi-Layer Perceptron (MLP) Neural Network
Input Layer (3,000 TF-IDF features)
↓
Hidden Layer 1 (256 neurons + ReLU)
↓
Hidden Layer 2 (128 neurons + ReLU)
↓
Hidden Layer 3 (64 neurons + ReLU)
↓
Output Layer (5 classes - Softmax)
- Optimizer: Adam
- Early Stopping: Enabled (10% validation hold-out)
- Max Iterations: 200
- Train/Test Split: 80/20 (stratified)
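The architecture and training settings listed above could be expressed in scikit-learn roughly as follows. This is a sketch, not the contents of kpop_lyrics_analysis.py: `build_model` and `split_data` are hypothetical names, `random_state=42` is an assumption added for reproducibility, and every `TfidfVectorizer` option other than the 3,000-feature cap is left at its default. `MLPClassifier` applies softmax to the output layer automatically for multi-class targets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline


def build_model(random_state=42):
    """TF-IDF features feeding a 256-128-64 ReLU MLP trained with Adam."""
    vectorizer = TfidfVectorizer(max_features=3000)
    mlp = MLPClassifier(
        hidden_layer_sizes=(256, 128, 64),
        activation="relu",
        solver="adam",
        early_stopping=True,       # stop when the validation score stalls
        validation_fraction=0.1,   # 10% of the training data held out
        max_iter=200,
        random_state=random_state,
    )
    return make_pipeline(vectorizer, mlp)


def split_data(lyrics, generations, random_state=42):
    """80/20 split that keeps each generation's share equal in both parts."""
    return train_test_split(
        lyrics,
        generations,
        test_size=0.2,
        stratify=generations,
        random_state=random_state,
    )
```

A typical use would be `X_train, X_test, y_train, y_test = split_data(lyrics, generations)`, followed by `build_model().fit(X_train, y_train)` and `.score(X_test, y_test)`. Stratifying matters here because the generations are unevenly represented in the dataset.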
| Metric | Value |
|---|---|
| Test Accuracy | ~90% |
| Macro F1-Score | ~0.90 |
| Best Validation Score | ~0.95 |
| Generation | Precision | Recall | F1-Score |
|---|---|---|---|
| Gen 1 | ~0.80 | ~0.70 | ~0.75 |
| Gen 2 | ~0.90 | ~0.95 | ~0.92 |
| Gen 3 | ~0.90 | ~0.85 | ~0.87 |
| Gen 4 | ~0.95 | ~0.95 | ~0.95 |
| Gen 5 | ~0.95 | ~0.95 | ~0.95 |
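The F1 column and the macro F1-score can be checked against the precision and recall columns, since F1 is the harmonic mean of the two and macro F1 is the unweighted mean across classes. The sketch below uses the rounded values from the table, so its results are approximate. The names `PER_CLASS` and `f1` are illustrative.

```python
# (precision, recall) per generation, as rounded in the table above.
PER_CLASS = {
    "Gen 1": (0.80, 0.70),
    "Gen 2": (0.90, 0.95),
    "Gen 3": (0.90, 0.85),
    "Gen 4": (0.95, 0.95),
    "Gen 5": (0.95, 0.95),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_class = {gen: f1(p, r) for gen, (p, r) in PER_CLASS.items()}

# Macro F1 weights every generation equally, regardless of sample size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
```

With these rounded inputs the per-class values reproduce the F1 column (for example, about 0.75 for Gen 1), and the macro F1 comes out at about 0.89, in line with the reported ~0.90.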
- Gen 4 and Gen 5 scored highest (precision, recall, and F1 all ~0.95) - modern groups have distinctive lyrical patterns with more English content
- Gen 2 performed strongly (F1 ~0.92) - it has the largest sample size (34%) and clear characteristics
- Gen 1 had the lowest recall (~0.70) - it has the smallest sample size (6.9%) and overlaps stylistically with Gen 2 ballads
- Overall, the results suggest the model learned generation-specific vocabulary and linguistic patterns
- Dataset: EX3exp/Kpop-lyric-datasets
- Generation reference: 韩国女团世代表_一代至五代.xlsx
- AI coding assistance: Claude (Anthropic)
This project is for educational purposes as part of CA6000 coursework.