The project proposes a hybrid Vision Transformer (ViT) and Bidirectional LSTM (BiLSTM) model with an attention-based fusion mechanism to accurately classify the degree of foot-ground contact during the long jump, using video captured at only 25 frames per second.
Highlights: Achieved 91.87% classification accuracy and a processing speed of 8.18 ms/frame on a resource-constrained GPU (8 GB VRAM, 321 TOPS).
- Low-Frame-Rate Analysis: Designed to work effectively with videos captured at standard frame rates, overcoming motion blur issues.
- Fine-Grained Classification: Classifies foot contact into 5 distinct labels (0: No Contact, 1-3: Progressive Ground Contact Stages, 4: Sandpit Contact), offering more detail than binary classification.
- Hybrid ViT-LSTM Architecture: Combines the spatial feature extraction power of Vision Transformers with the temporal modeling capabilities of LSTMs.
- Attention-Based Fusion: Fuses visual features (from cropped ankle images) and 2D pose data using an attention mechanism to focus on relevant information.
- Efficient Processing: Optimized for performance, achieving fast processing speeds even on hardware with limited computational resources.
- Robust Training Strategy: Employs pose normalization, data augmentation, 5-fold cross-validation, and weighted cross-entropy loss to handle data limitations and class imbalance.
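As an illustration of the class-imbalance handling mentioned above, the snippet below sketches weighted cross-entropy in PyTorch. The label counts and the inverse-frequency weighting scheme are placeholder assumptions for demonstration, not the dataset's actual distribution or the paper's exact weights.

```python
import torch
import torch.nn as nn

# Hypothetical per-label frame counts for labels 0..4 (illustrative only):
# "No Contact" frames typically dominate, so rarer contact stages get
# larger loss weights via inverse class frequency.
label_counts = torch.tensor([5000.0, 300.0, 280.0, 310.0, 900.0])
weights = label_counts.sum() / (len(label_counts) * label_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 5)             # a batch of per-frame class scores
targets = torch.randint(0, 5, (8,))    # ground-truth contact labels
loss = criterion(logits, targets)
print(loss.item())
```

With this weighting, misclassifying a rare contact-stage frame costs more than misclassifying an abundant no-contact frame, which counteracts the skewed label distribution during training.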
Distribution of 19 joint points:
Model Architecture:
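As a rough sketch of how such a hybrid could be wired up in PyTorch: per-frame visual features (standing in for ViT embeddings of the cropped ankle image) and 2D pose vectors (19 joints × (x, y) = 38 values, per the joint layout above) are fused with a learned attention weighting, then passed through a BiLSTM and a per-frame classification head. All layer sizes here are placeholder assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentionFusionNet(nn.Module):
    """Illustrative sketch of a ViT-BiLSTM hybrid with attention fusion.

    Visual features are assumed to be precomputed per frame (a pretrained
    ViT backbone would normally produce them); dimensions are placeholders.
    """
    def __init__(self, vis_dim=192, pose_dim=38, fused_dim=128, n_classes=5):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, fused_dim)    # project ViT features
        self.pose_proj = nn.Linear(pose_dim, fused_dim)  # project pose vector
        self.attn = nn.Linear(fused_dim * 2, 2)          # modality attention
        self.bilstm = nn.LSTM(fused_dim, 64, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(128, n_classes)            # 5 contact labels

    def forward(self, vis_feats, pose_feats):
        v = self.vis_proj(vis_feats)                     # (B, T, fused_dim)
        p = self.pose_proj(pose_feats)                   # (B, T, fused_dim)
        # Attention weights decide how much each modality contributes per frame.
        w = torch.softmax(self.attn(torch.cat([v, p], dim=-1)), dim=-1)
        fused = w[..., :1] * v + w[..., 1:] * p          # (B, T, fused_dim)
        seq, _ = self.bilstm(fused)                      # (B, T, 128)
        return self.head(seq)                            # per-frame logits

model = AttentionFusionNet()
logits = model(torch.randn(2, 16, 192), torch.randn(2, 16, 38))
print(logits.shape)  # torch.Size([2, 16, 5])
```

The BiLSTM lets each frame's prediction draw on both past and future frames, which helps disambiguate motion-blurred contact frames at 25 fps.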
Environment: This project was developed with Python 3.10.16 and PyTorch 2.6.0+cu118. You can choose the PyTorch build that matches your GPU driver.
Clone the repository:
git clone https://github.com/fangevo/ViT-LSTM-Foot-Contact-Detection.git
cd ViT-LSTM-Foot-Contact-Detection

Dataset (frame sequences extracted from 30 video clips): https://drive.google.com/file/d/13hf_kXzegg2eVV8V31Rg6dn1gqT6wMtb/view?usp=sharing
Put the data folder in ./
Model weights: https://drive.google.com/file/d/1fAFRAi2CZWLprRo158a0964dfXdIXCnX/view?usp=drive_link

Put the model weight file in ./weight/
Train:
python main.py --mode train

Prediction:
python main.py --mode predict

Some useful tools: The scripts in the utilis folder include a visual annotation tool, confusion-matrix computation, ankle-image cropping, and pose normalization. Using these scripts requires manually modifying the file paths inside them.
Citation:

@misc{fang:hal-05090038,
TITLE = {{Computer vision-based foot contact detection for long jump using a monocular normal-speed camera}},
AUTHOR = {Fang, Yangtao and Gan, Qi and Nguyen, Sao Mai},
URL = {https://hal.science/hal-05090038},
NOTE = {Poster},
HOWPUBLISHED = {{Journ{\'e}e commune EGC/AFIA Gestion et Analyse de donn{\'e}es Sportives (GAS'25)}},
ORGANIZATION = {{Nida Meddouri and Albrecht Zimmermann and Cl{\'e}ment Iphar and Aur{\'e}lie Leborgne and Lo{\"i}c Salmon}},
YEAR = {2025},
MONTH = May,
PDF = {https://hal.science/hal-05090038v1/file/GAS%2725_GAST_Fang_et_al.pdf},
HAL_ID = {hal-05090038},
HAL_VERSION = {v1},
}