
ViT-LSTM-Foot-Contact-Detection

Computer vision-based foot contact detection for long jump using a monocular normal-speed camera

Yangtao Fang, Qi Gan, Sao Mai Nguyen
IP Paris, Telecom Paris, ENSTA

License: MIT

The project proposes a hybrid Vision Transformer (ViT) and Bidirectional LSTM (BiLSTM) model with an attention-based fusion mechanism to accurately classify the degree of foot-ground contact during the long jump, using video captured at only 25 frames per second.

Highlights: Achieves 91.87% classification accuracy at 8.18 ms/frame processing speed on a resource-constrained GPU (8 GB VRAM, 321 TOPS).

Features

  • Low-Frame-Rate Analysis: Designed to work effectively with videos captured at standard frame rates, overcoming motion blur issues.
  • Fine-Grained Classification: Classifies foot contact into 5 distinct labels (0: No Contact, 1-3: Progressive Ground Contact Stages, 4: Sandpit Contact), offering more detail than binary classification.
  • Hybrid ViT-LSTM Architecture: Combines the spatial feature extraction power of Vision Transformers with the temporal modeling capabilities of LSTMs.
  • Attention-Based Fusion: Fuses visual features (from cropped ankle images) and 2D pose data using an attention mechanism to focus on relevant information.
  • Efficient Processing: Optimized for performance, achieving fast processing speeds even on hardware with limited computational resources.
  • Robust Training Strategy: Employs pose normalization, data augmentation, 5-fold cross-validation, and weighted cross-entropy loss to handle data limitations and class imbalance.
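To make the weighted cross-entropy loss above concrete, here is a minimal numpy sketch; the inverse-frequency weighting scheme is an assumption for illustration, and the repository may weight the five classes differently.

```python
import numpy as np

def class_weights(labels, num_classes=5):
    """Inverse-frequency class weights, normalized so they sum to num_classes.
    (Illustrative scheme; the repository may compute weights differently.)"""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0          # avoid division by zero for absent classes
    w = 1.0 / counts
    return w * num_classes / w.sum()

def weighted_cross_entropy(logits, labels, weights):
    """Mean weighted cross-entropy over a batch of logits of shape (B, N)."""
    shifted = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * per_sample).mean())
```

Rare contact stages receive larger weights, so misclassifying them costs more and the model is discouraged from collapsing onto the majority "No Contact" class.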

Method

Distribution of the 19 joint points: (figure)

Label Definition: (figure)

Model Architecture: (figure) Notation: $B$: batch size, $T$: sequence length, $C$: number of channels, $H$: frame height, $W$: frame width, $F_p$: pose feature dimension, $F_v$: ViT feature dimension, $D_h$: hidden size (per direction), $N$: number of output classes.
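The attention-based fusion step can be sketched in numpy under the dimensions defined above. Everything here is a placeholder: the projection matrices are random stand-ins for trained weights, and the specific scoring scheme (project both modalities to a common space, score each, softmax, weighted sum) is one plausible reading of the fusion mechanism, not the repository's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T = 2, 8                    # batch size, sequence length
F_p, F_v, D = 38, 768, 128     # pose dim (19 joints x 2), ViT dim, fused dim

# Per-frame features (stand-ins for real ViT / pose-extractor outputs)
pose = rng.normal(size=(B, T, F_p))
vis = rng.normal(size=(B, T, F_v))

# Project both modalities into a common D-dimensional space
W_p = rng.normal(size=(F_p, D)) * 0.05
W_v = rng.normal(size=(F_v, D)) * 0.05
h_p, h_v = pose @ W_p, vis @ W_v                             # each (B, T, D)

# Attention over the two modalities: score each, softmax, weighted sum
w_score = rng.normal(size=(D,)) * 0.05
scores = np.stack([h_p @ w_score, h_v @ w_score], axis=-1)   # (B, T, 2)
alpha = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
fused = alpha[..., :1] * h_p + alpha[..., 1:] * h_v          # (B, T, D)

# `fused` would then feed the BiLSTM (output 2 * D_h per time step)
# and a classifier over the N = 5 contact labels.
print(fused.shape)  # (2, 8, 128)
```

The softmax weights `alpha` let the model lean on pose features when the cropped ankle image is blurred, and on visual features when the pose estimate is unreliable.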

Installation

Environment: This project was developed with Python 3.10.16 and PyTorch 2.6.0+cu118. Choose the PyTorch build that matches your GPU driver.

Clone the repository:

    git clone https://github.com/fangevo/ViT-LSTM-Foot-Contact-Detection.git
    cd ViT-LSTM-Foot-Contact-Detection

Dataset (Frame sequences extracted from 30 video clips): https://drive.google.com/file/d/13hf_kXzegg2eVV8V31Rg6dn1gqT6wMtb/view?usp=sharing
Put the data folder in ./

Weight: https://drive.google.com/file/d/1fAFRAi2CZWLprRo158a0964dfXdIXCnX/view?usp=drive_link
Put the model weight file in ./weight/

Train:

    python main.py --mode train

Prediction:

    python main.py --mode predict

Some useful tools: The scripts in the utilis folder include a visual annotation tool, confusion matrix computation, ankle image cropping, and pose normalization. Using these scripts requires manually modifying the file paths.
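As an illustration of what the pose-normalization utility might do (the exact scheme used by the scripts in `utilis` is not documented here; root-centering and scaling by a reference bone length is an assumption, and the joint indices are hypothetical):

```python
import numpy as np

def normalize_pose(joints, root=0, ref_pair=(0, 1)):
    """Translate a (19, 2) joint array so the root joint sits at the origin,
    then scale so the distance between a reference joint pair equals 1.
    Joint indices are illustrative; adapt to the 19-point layout above."""
    joints = np.asarray(joints, dtype=float)
    centered = joints - joints[root]                 # remove global position
    scale = np.linalg.norm(centered[ref_pair[0]] - centered[ref_pair[1]])
    return centered / max(scale, 1e-8)               # remove global scale
```

Normalizing out the athlete's position and apparent size in the frame lets the model compare poses across camera distances and runway locations.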

Citation

@misc{fang:hal-05090038,
  TITLE = {{Computer vision-based foot contact detection for long jump using a monocular normal-speed camera}},
  AUTHOR = {Fang, Yangtao and Gan, Qi and Nguyen, Sao Mai},
  URL = {https://hal.science/hal-05090038},
  NOTE = {Poster},
  HOWPUBLISHED = {{Journ{\'e}e commune EGC/AFIA Gestion et Analyse de donn{\'e}es Sportives (GAS'25)}},
  ORGANIZATION = {{Nida Meddouri and Albrecht Zimmermann and Cl{\'e}ment Iphar and Aur{\'e}lie Leborgne and Lo{\"i}c Salmon}},
  YEAR = {2025},
  MONTH = May,
  PDF = {https://hal.science/hal-05090038v1/file/GAS%2725_GAST_Fang_et_al.pdf},
  HAL_ID = {hal-05090038},
  HAL_VERSION = {v1},
}

About

Official implementation of a hybrid ViT-BiLSTM framework for fine-grained foot contact detection in long jump athletics, specifically optimized for monocular, low-frame-rate video analysis.
