This project classifies segmented ECG heartbeat signals from the Kaggle ECG Heartbeat Categorization Dataset. The main experiment compares classical machine learning on SciPy-derived signal features against a compact 1-D CNN trained directly on raw heartbeat segments.
The best model in this run is a random forest on handcrafted signal features:
| Model | Test accuracy | Test macro F1 | Test weighted F1 |
|---|---|---|---|
| Random Forest | 0.9710 | 0.8586 | 0.9690 |
| 1-D CNN | 0.9528 | 0.8205 | 0.9537 |
| XGBoost | 0.9387 | 0.7933 | 0.9442 |
| Linear SVM | 0.9036 | 0.6400 | 0.9087 |
| Logistic Regression | 0.7131 | 0.5262 | 0.7797 |
High accuracy alone is not enough. Class 0 makes up about 83% of the rows, so a model that always predicts class 0 would already score about 0.83 accuracy. Macro F1 and per-class recall are therefore the main metrics.
Dataset: ECG Heartbeat Categorization Dataset
The Kaggle download contains both MIT-BIH-derived and PTBDB-derived CSV files. This project uses only:
- data/raw/mitbih_train.csv
- data/raw/mitbih_test.csv
The PTBDB files are left out because they are a separate binary normal/abnormal task and should not be mixed into the MIT-BIH 5-class classifier.
The MIT-BIH files contain 188 columns per row: 187 heartbeat samples and one label column.
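As a rough illustration of that layout, the sketch below reads one of the CSV files with the standard library and separates waveform from label. It assumes the files have no header row, that the label is the last column, and that it is stored as a float such as `4.0`; none of this is stated above, and the function name `load_mitbih_csv` is hypothetical rather than part of this repository.

```python
import csv

N_SAMPLES = 187  # waveform samples per row, as stated above


def load_mitbih_csv(path):
    """Read a header-less MIT-BIH CSV into (waveforms, labels)."""
    waveforms, labels = [], []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            values = [float(cell) for cell in row]
            if len(values) != N_SAMPLES + 1:
                raise ValueError(
                    f"expected {N_SAMPLES + 1} columns, got {len(values)}"
                )
            waveforms.append(values[:N_SAMPLES])
            # Assumption: the label is the last column, stored as a float.
            labels.append(int(values[N_SAMPLES]))
    return waveforms, labels
```

Failing loudly on an unexpected column count is a cheap way to catch a wrong file (for example a PTBDB CSV) before it reaches the split stage.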
| Class | Kaggle code | Train | Validation | Test |
|---|---|---|---|---|
| 0 | N | 57,977 | 14,494 | 18,118 |
| 1 | S | 1,778 | 445 | 556 |
| 2 | V | 4,630 | 1,158 | 1,448 |
| 3 | F | 513 | 128 | 162 |
| 4 | Q | 5,145 | 1,286 | 1,608 |
Class 3 is less than 1% of the data, so it is the hardest class to evaluate reliably.
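The imbalance can be made concrete with a few lines of arithmetic on the test-set counts from the table above. This is only a sketch; the helper name `class_shares` is made up for illustration.

```python
# Test-set class counts copied from the table above.
test_counts = {0: 18118, 1: 556, 2: 1448, 3: 162, 4: 1608}


def class_shares(counts):
    """Return each class's fraction of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(test_counts)
for label, share in shares.items():
    print(f"class {label}: {share:.2%}")

# Accuracy of a trivial model that always predicts the largest class.
majority_accuracy = max(shares.values())
print(f"majority-class accuracy: {majority_accuracy:.4f}")
```

The majority-class accuracy is the floor that any reported accuracy should be compared against.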
The waveform values are already scaled to [0, 1]. Around 41% of waveform entries are zero, which is consistent with fixed-length segmented beats containing padded or low-amplitude regions.
The Kaggle test file is kept untouched as the final hold-out set. The Kaggle training file is split into train and validation using stratified sampling:
- train: 70,043 rows
- validation: 17,511 rows
- test: 21,892 rows
The train/validation split indices are saved to data/processed/ for leakage checks.
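A stratified split of this kind can be sketched in plain Python as follows. This is not necessarily how `src.data.split_data` does it: the 0.2 validation fraction is inferred from the row counts above, and the seed and function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(val_fraction * len(indices))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Per-class totals (train + validation) summed from the class table above.
class_totals = {0: 72471, 1: 2223, 2: 5788, 3: 641, 4: 6431}
labels = [label for label, n in class_totals.items() for _ in range(n)]
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting each class separately is what keeps class 3 represented in validation; a plain random split could leave it with very few rows.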
The classical models use 29 features extracted after a light Butterworth band-pass filter:
- time-domain: mean, standard deviation, extrema, quantiles, skew, kurtosis, energy, zero-crossing rate
- frequency-domain: FFT band-power ratios, spectral centroid, rolloff, dominant frequency, spectral entropy
- shape features: main peak position, peak height, prominence, width, and short-lag autocorrelation
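To give a feel for the feature families listed above, here is a minimal NumPy-only sketch that computes a few of them for a single beat. It is not the project's extractor: it skips the Butterworth filter and the shape features, the 125 Hz sampling rate is an assumption, and the function name and dictionary keys are hypothetical.

```python
import numpy as np

FS = 125.0  # Assumption: sampling rate in Hz; not stated in this README.


def basic_features(beat, fs=FS):
    """Compute a handful of time- and frequency-domain features for one beat."""
    x = np.asarray(beat, dtype=float)
    centered = x - x.mean()

    # Time domain: energy and zero-crossing rate of the mean-centered signal.
    energy = float(np.sum(x ** 2))
    sign_changes = np.count_nonzero(np.diff(np.signbit(centered)))
    zero_crossing_rate = sign_changes / (len(x) - 1)

    # Frequency domain: one-sided power spectrum with the DC bin dropped.
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power, freqs = power[1:], freqs[1:]
    total = power.sum()
    # A flat signal has no spectral power; fall back to a uniform distribution.
    prob = power / total if total > 0 else np.full_like(power, 1.0 / len(power))
    nonzero = prob[prob > 0]

    return {
        "mean": float(x.mean()),
        "std": float(x.std()),
        "energy": energy,
        "zero_crossing_rate": float(zero_crossing_rate),
        "spectral_centroid": float(np.sum(freqs * prob)),
        "dominant_freq": float(freqs[np.argmax(power)]),
        "spectral_entropy": float(-np.sum(nonzero * np.log2(nonzero))),
    }
```

Normalizing the power spectrum into a probability distribution lets the centroid and the entropy share one intermediate array.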
Feature arrays are saved as:
- data/processed/features_train.npy
- data/processed/features_val.npy
- data/processed/features_test.npy
Classical models:
- Logistic Regression
- Linear SVM
- Random Forest
- XGBoost
Deep learning baseline:
- compact 1-D CNN with class-weighted cross entropy
- best checkpoint selected by validation macro F1
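Selecting the checkpoint by validation macro F1 amounts to keeping the epoch with the highest score. The sketch below shows that rule in isolation; the function name and the per-epoch scores are hypothetical and are not results from this project.

```python
def select_best_epoch(val_macro_f1_by_epoch):
    """Return (epoch, score) with the highest validation macro F1.

    Epochs are numbered from 1. Ties keep the earliest epoch.
    """
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(val_macro_f1_by_epoch, start=1):
        if score > best_score:  # strict comparison keeps the earliest tie
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Hypothetical scores, for illustration only.
example_history = [0.70, 0.81, 0.79, 0.81]
print(select_best_epoch(example_history))
```

Selecting on macro F1 rather than accuracy keeps the checkpoint choice in line with the metric this project treats as primary.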
The CNN originally used full inverse-frequency class weights, which over-weighted the rarest class: class 3 recall was high, but false positives were too frequent. The final CNN uses square-root class weights, which gave a better balance between precision and recall.
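The difference between the two weighting schemes is easy to see from the training counts. The sketch below assumes the common "balanced" normalization `total / (n_classes * count)` and takes its square root for the softened variant; the exact normalization used by the CNN training script is not stated here, and the function names are hypothetical.

```python
import math

# Training-set class counts from the class table in this README.
train_counts = {0: 57977, 1: 1778, 2: 4630, 3: 513, 4: 5145}


def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def sqrt_weights(counts):
    """Square root of the inverse-frequency weights (softer re-weighting)."""
    inverse = inverse_frequency_weights(counts)
    return {label: math.sqrt(w) for label, w in inverse.items()}


inv = inverse_frequency_weights(train_counts)
soft = sqrt_weights(train_counts)
for label in train_counts:
    print(f"class {label}: inverse={inv[label]:.3f} sqrt={soft[label]:.3f}")
print(f"class 3 vs class 0: inverse x{inv[3] / inv[0]:.1f}, sqrt x{soft[3] / soft[0]:.1f}")
```

Whatever the normalization, full inverse-frequency weighting makes a class 3 error cost roughly 113 times a class 0 error, while the square-root variant reduces that ratio to roughly 11.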
Random Forest, per-class test metrics:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9705 | 0.9982 | 0.9842 | 18,118 |
| 1 | 0.9688 | 0.5594 | 0.7092 | 556 |
| 2 | 0.9591 | 0.8736 | 0.9143 | 1,448 |
| 3 | 0.8632 | 0.6235 | 0.7240 | 162 |
| 4 | 0.9960 | 0.9291 | 0.9614 | 1,608 |

1-D CNN, per-class test metrics:

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 | 0.9833 | 0.9656 | 0.9744 | 18,118 |
| 1 | 0.7402 | 0.5791 | 0.6498 | 556 |
| 2 | 0.7173 | 0.9620 | 0.8218 | 1,448 |
| 3 | 0.6609 | 0.7099 | 0.6845 | 162 |
| 4 | 0.9903 | 0.9540 | 0.9718 | 1,608 |
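The random forest on SciPy features is the strongest model overall in this run. The CNN has higher recall on every minority class (for example, 0.7099 vs. 0.6235 on class 3), but it loses precision on all of them (0.7173 vs. 0.9591 on class 2). Its F1 is lower on every class except class 4, so its macro F1 is lower.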
This is a useful result: the classical signal-feature baseline is not just a placeholder. On this segmented benchmark, a well-tuned feature pipeline plus random forest outperforms a simple neural baseline on test accuracy, macro F1, and weighted F1.
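The macro and weighted F1 values in the summary table can be recomputed from the per-class tables. The sketch below assumes the first per-class table belongs to the random forest and the second to the CNN, which is inferred from the averages matching the summary rows. Inputs are the rounded four-decimal values, so results agree only to about four decimals.

```python
# (F1, support) per class, copied from the per-class tables above.
random_forest = {
    0: (0.9842, 18118), 1: (0.7092, 556), 2: (0.9143, 1448),
    3: (0.7240, 162), 4: (0.9614, 1608),
}
cnn = {
    0: (0.9744, 18118), 1: (0.6498, 556), 2: (0.8218, 1448),
    3: (0.6845, 162), 4: (0.9718, 1608),
}


def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Support-weighted mean of per-class F1: large classes dominate."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


for name, rows in [("Random Forest", random_forest), ("1-D CNN", cnn)]:
    print(f"{name}: macro={macro_f1(rows):.4f} weighted={weighted_f1(rows):.4f}")
```

Class 0 holds about 83% of the support, so weighted F1 mostly tracks class 0, while macro F1 drops when class 1 or class 3 is handled poorly.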
Row-level split checks:
| Check | Value |
|---|---|
| Train/validation index overlap | 0 |
| Train/validation duplicate waveforms | 0 |
| Train/test duplicate waveforms | 0 |
| Validation/test duplicate waveforms | 0 |
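Checks like these can be sketched with set operations. The version below compares waveforms for exact equality and is only an illustration; the function names are hypothetical, and `src.models.sanity_checks` may define duplicates differently, for example with a tolerance.

```python
def index_overlap(indices_a, indices_b):
    """Number of row indices that appear in both splits."""
    return len(set(indices_a) & set(indices_b))


def count_shared_waveforms(rows_a, rows_b):
    """Number of rows in rows_b whose exact waveform also occurs in rows_a."""
    seen = {tuple(row) for row in rows_a}
    return sum(1 for row in rows_b if tuple(row) in seen)


# Tiny synthetic example: the second "validation" row duplicates a "train" row.
train_rows = [[0.0, 0.5, 1.0], [0.1, 0.2, 0.3]]
val_rows = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]]
print(index_overlap([0, 1, 2], [3, 4]))
print(count_shared_waveforms(train_rows, val_rows))
```

Exact matching can miss near-duplicates, such as the same beat with slightly different scaling, so a zero count here is a necessary check rather than proof that no leakage exists.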
Shuffled-label validation checks:
| Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Shuffled Logistic Regression | 0.8277 | 0.1811 | 0.7497 |
| Shuffled Random Forest | 0.8275 | 0.1815 | 0.7498 |
The shuffled-label accuracy is still high because class 0 is dominant. Macro F1 collapses to about 0.18, which is the relevant sign that the real models are learning label-related structure rather than only exploiting class imbalance.
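A short worked calculation shows what these numbers imply. The sketch below computes the metrics of a hypothetical model that always predicts the majority class on the validation counts from the class table. The results are very close to the shuffled-label rows above, which suggests, but does not prove, that the shuffled models collapse to predicting class 0.

```python
# Validation-set class counts from the class table in this README.
val_counts = {0: 14494, 1: 445, 2: 1158, 3: 128, 4: 1286}


def majority_baseline_metrics(counts):
    """Accuracy, macro F1, weighted F1 when every row gets the majority label."""
    total = sum(counts.values())
    majority = max(counts, key=counts.get)
    precision = counts[majority] / total  # all rows are predicted as majority
    recall = 1.0                          # every majority row is recovered
    f1_majority = 2 * precision * recall / (precision + recall)
    accuracy = precision
    macro = f1_majority / len(counts)     # all other classes have F1 = 0
    weighted = f1_majority * counts[majority] / total
    return accuracy, macro, weighted


accuracy, macro, weighted = majority_baseline_metrics(val_counts)
print(f"accuracy={accuracy:.4f} macro_f1={macro:.4f} weighted_f1={weighted:.4f}")
```

Because four of the five classes contribute an F1 of zero, macro F1 is capped at one fifth of the class 0 F1, which is why it collapses even though accuracy stays high.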
Place the Kaggle CSV files in data/raw/:
data/raw/mitbih_train.csv
data/raw/mitbih_test.csv
Then run the project stages from the repository root:
python -m src.data.validate_data
python -m src.data.split_data
python -m src.features.extract_features
python -m src.models.classical
python -m src.models.train_cnn
python -m src.models.sanity_checks
python -m src.models.final_report

The notebooks in notebooks/ are executed analysis artifacts:
- 01_eda.ipynb
- 02_preprocessing_and_features.ipynb
- 03_classical_ml_models.ipynb
- 04_1d_cnn.ipynb
- 05_final_evaluation.ipynb
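The stage commands listed earlier have to run in order, because each stage consumes the previous stage's outputs. A small driver script is one way to enforce that. The sketch below is hypothetical and not part of the repository; only the module names are taken from the commands above.

```python
import subprocess
import sys

# Module names copied from the stage commands in this README, in run order.
STAGES = [
    "src.data.validate_data",
    "src.data.split_data",
    "src.features.extract_features",
    "src.models.classical",
    "src.models.train_cnn",
    "src.models.sanity_checks",
    "src.models.final_report",
]


def stage_commands(python=sys.executable):
    """Build one `python -m <module>` command per stage."""
    return [[python, "-m", module] for module in STAGES]


def run_pipeline(runner=subprocess.run):
    """Run every stage in order, stopping at the first failure."""
    for command in stage_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(command, check=True)
```

Calling `run_pipeline()` from the repository root would run all seven stages. The `runner` parameter exists only so the loop can be exercised without launching real processes.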
ecg-heartbeat-classification/
├── data/
│   ├── raw/
│   ├── processed/
│   └── external/
├── models/
│   ├── checkpoints/
│   └── metrics/
├── notebooks/
├── reports/
│   ├── figures/
│   └── tables/
└── src/
    ├── data/
    ├── features/
    ├── models/
    └── visualization/
- The dataset is already segmented into individual beats, so this project does not solve beat detection from continuous ECG.
- The split is not patient-level. Strong beat-level results can overstate generalization if similar patients or recording conditions appear across splits.
- The Kaggle benchmark is derived from MIT-BIH and PTBDB; this project evaluates only the MIT-BIH-derived multiclass files.
- No external cohort was used.
- The class labels are grouped categories, not fine-grained arrhythmia annotations.
- These results are not clinical validation.
The strongest result is random forest on SciPy-derived signal features with test macro F1 of 0.8586. The result passes basic row-level leakage and shuffled-label checks, but minority-class recall remains the main weakness, especially for classes 1 and 3.




