IsoCourt is a state-of-the-art action recognition and coaching platform designed to analyze badminton footage, provide play-by-play breakdowns, and deliver expert technical tips using deep learning and Large Language Models.
ResNet + LSTM
↓
Biomechanics Rule Engine
↓
Structured Insight Summary
↓
Gemini
IsoCourt operates as a continuous sliding-window pipeline. The backend steps through the video in 1.5 s segments with a 0.75 s overlap, so every frame is analyzed with temporal context and each segment receives its own coaching feedback.
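The window layout described above can be sketched as follows. This is a minimal illustration, not the backend's actual code: the function name `sliding_windows` is hypothetical, and truncating the final window at the end of the video is an assumption.

```python
def sliding_windows(duration_s, window_s=1.5, overlap_s=0.75):
    """Return (start, end) times in seconds for overlapping analysis windows.

    Assumption: the last window is truncated at the end of the video
    rather than dropped or padded.
    """
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    stride = window_s - overlap_s  # 0.75 s with the documented defaults
    windows = []
    i = 0
    while i * stride < duration_s:
        start = i * stride
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break  # the video is fully covered
        i += 1
    return windows


if __name__ == "__main__":
    for start, end in sliding_windows(3.0):
        print(f"{start:.2f}s -> {end:.2f}s")
```

With the defaults, a 3-second clip yields three windows, each starting 0.75 s after the previous one.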
Users upload their badminton footage through a sleek, responsive interface. The frontend handles the large file transfers and provides a real-time Match Timeline. Once analysis is complete, it renders a synchronized view of action labels, confidence scores, and frame-by-frame Skeleton Analysis.
The backend processes the video in three stages:
- Frame Buffering: Loads video segments into memory to ensure stability and high-speed access.
- Action Recognition: Uses the Architecture V2 CNN-LSTM model to identify the stroke type, player position, and technique from 16-frame sequences.
- Pose Estimation: Simultaneously runs MediaPipe to extract skeleton data for visual feedback.
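As an illustration of how a buffered segment might be reduced to a 16-frame sequence, the sketch below picks evenly spaced frame indices. Uniform spacing and the name `sample_frame_indices` are assumptions; the README does not say how the 16 frames are chosen.

```python
def sample_frame_indices(num_frames, num_samples=16):
    """Pick num_samples evenly spaced frame indices from a segment.

    Assumption: uniform spacing that always includes the first and last
    frame. Segments shorter than num_samples repeat some indices.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    last = num_frames - 1
    if num_samples == 1:
        return [last // 2]  # a single sample: take the middle frame
    return [round(i * last / (num_samples - 1)) for i in range(num_samples)]


if __name__ == "__main__":
    # A 1.5 s segment at 30 fps holds 45 frames.
    print(sample_frame_indices(45))
```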
The raw data from the model (e.g., "Backhand Clear, Mid-Back, Quality 4/7") is sent to Gemini 1.5 Flash. The LLM interprets these metrics to provide three personalized, technical coaching tips for every detected segment.
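A rough sketch of how the structured model output could be turned into an LLM prompt is shown below. The function name `build_coaching_prompt` and the prompt wording are hypothetical, and the call to Gemini itself is deliberately left out.

```python
def build_coaching_prompt(stroke, position, quality, max_quality=7, num_tips=3):
    """Format one segment's predictions as a coaching prompt.

    The prompt wording is illustrative only; the project's real prompt
    is not shown in this README.
    """
    summary = f"{stroke}, {position}, Quality {quality}/{max_quality}"
    return (
        "You are a professional badminton coach.\n"
        f"Detected segment: {summary}.\n"
        f"Give exactly {num_tips} concise, technical coaching tips "
        "for this segment."
    )


if __name__ == "__main__":
    print(build_coaching_prompt("Backhand Clear", "Mid-Back", 4))
```

Keeping the prompt builder separate from the API call makes it easy to unit test the text without network access.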
To use scripts/ec2/rsync_push.sh and scripts/ec2/rsync_pull.sh, create scripts/ec2.env (path relative to the repo root; the file is gitignored). Example:
EC2_HOST=ec2-user@54.202.106.114
KEY_FILE=scripts/navneeth-keys.pem

Use an absolute path for KEY_FILE if you prefer (e.g. where your .pem actually lives). Then from the repo root:
./scripts/ec2/rsync_push.sh
./scripts/ec2/rsync_pull.sh

Full flow (bootstrap, tmux, excludes): scripts/ec2/README.md. On a fresh Ubuntu GPU instance, install the NVIDIA stack and reboot once with scripts/ec2/install_nvidia_driver_reboot.sh before running bootstrap_ec2.sh.
If this repository is public or shared, treat the host and key path as sensitive and prefer placeholders in committed docs.
Full write-up of training standards, architecture comparison tables (pose vs no-pose ablations), external baselines (BST, TemPose, ST-GCN, SkateFormer), and novelty framing: docs/RESEARCH.md.
The core movement analysis is powered by a CNN-LSTM Hierarchical Model optimized for temporal badminton actions.
- Base: ResNet-50 (frozen or fine-tuned).
- Function: Extracts 2048 high-level features per frame.
- Robustness: Trained with aggressive `ColorJitter` to be invariant to court color and lighting.
- Module: Dual-layer LSTM with 512 hidden units.
- Fixed-Frame Analysis: Samples 16 frames per hitting event using a sliding window.
- Global Average Pooling: Captures the overall context of the swing.
- Global Max Pooling: Captures the "peak action" moment of the hit.
- Concatenation: Merges both into a 1024-dimension vector that feeds the classification heads.
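The pooling and concatenation steps above can be written out as follows. This assumes a unidirectional LSTM whose top-layer hidden states are $h_t \in \mathbb{R}^{512}$ over $T = 16$ sampled frames, which is consistent with the 1024-dimension result; the symbols are introduced here for illustration.

```latex
\bar{h} = \frac{1}{T}\sum_{t=1}^{T} h_t, \qquad
\hat{h}_j = \max_{1 \le t \le T} (h_t)_j \quad (j = 1, \dots, 512), \qquad
z = \bigl[\, \bar{h} \,;\, \hat{h} \,\bigr] \in \mathbb{R}^{1024}
```

Here $\bar{h}$ is the global average pool, $\hat{h}$ is the element-wise global max pool, and $z$ is the concatenated vector passed on for classification.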
A single pass provides 7 distinct layers of analysis, including:
- Stroke Type: Smash, Clear, Drop, etc.
- Court Position: Mid-Court, Left-Back, Right-Front, etc.
- Technique: Forehand vs. Backhand.
- Quality: Performance score (1-7).
- Tactical Intent: Deception, Passive, Defensive, etc.
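One way to picture the multi-output pass is a set of per-head scores decoded independently. The sketch below is an assumption about the shape of the output, not the model's actual interface: `decode_heads` is a hypothetical helper, and the label lists contain only the examples named above, not the full label sets.

```python
def decode_heads(logits_by_head, labels_by_head):
    """Pick the highest-scoring label for each output head."""
    decoded = {}
    for head, logits in logits_by_head.items():
        labels = labels_by_head[head]
        if len(logits) != len(labels):
            raise ValueError(f"{head}: {len(logits)} logits, {len(labels)} labels")
        best = max(range(len(logits)), key=logits.__getitem__)  # argmax
        decoded[head] = labels[best]
    return decoded


if __name__ == "__main__":
    # Illustrative subsets only; the real label sets are larger.
    labels = {
        "stroke_type": ["Smash", "Clear", "Drop"],
        "technique": ["Forehand", "Backhand"],
        "quality": [1, 2, 3, 4, 5, 6, 7],
    }
    logits = {
        "stroke_type": [0.2, 2.1, -0.5],
        "technique": [-1.0, 0.7],
        "quality": [0.0, 0.1, 0.3, 1.9, 0.2, 0.0, -0.4],
    }
    print(decode_heads(logits, labels))
```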
- Buffered Processing: Videos are read frame-by-frame on the server to prevent memory spikes and bypass macOS metadata race conditions.
- Sliding Window: Analysis happens in 1.5 s windows with 0.75 s overlap, so an action that straddles a window boundary is still covered by the neighboring window.
- Pose Synchronization: MediaPipe Pose estimation is run in parallel to provide visual "Skeleton Analysis" for every detected hit.
- LLM Coaching: The model's raw data is fed into Google Gemini, which acts as a virtual "Pro Coach" to provide 3 concise technical tips per segment.
| Script | Purpose | Focus | Use Case |
|---|---|---|---|
| `train_timesformer.py` | TimeSformer | Divided space–time ViT + pose token | Same split/cache as other video trainers; default `--backbone vit` uses ImageNet timm ViT; tuning ranges in pipelines/training/TIMESFORMER_HYPERPARAMS.md |
| `train_full.py` | Baseline | CNN-LSTM | Domain shift & court color invariance (Overnight) |
| `fine_tune.py` | Domain Adaptation | Heads-Only | Custom court floor & lighting (20 mins) |
Calibrated on one Apple Silicon (MPS) box with the current dataset (~318 train clips, default batch size 4). Your numbers will vary with GPU/CPU and batch size.
| Model | ~min / epoch | ~time for 60 epochs |
|---|---|---|
| `train_full.py` (ResNet + LSTM) | ~3 | ~3 h |
| `train_timesformer.py` | scale partial runs | see below |
TimeSformer: Smoke runs that train only part of an epoch (e.g. --max-train-batches ≈ 50 steps) took ~2.5–3 min on the same machine. A full epoch scales roughly linearly with the number of batches: multiply that wall time by (batches_per_epoch / 50). With ~80 train batches (batch 4, this split), expect ~4–5 min/epoch and ~4–5 h for 60 epochs, well below the "laptop day" guess you get if you assume tens of minutes per epoch. If your tqdm total per epoch is much higher (smaller batch size or different sampler length), scale up accordingly.
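The scaling rule above is simple enough to check with a few lines of arithmetic. The helper name `estimate_training_time` is hypothetical, and the estimate assumes wall time grows linearly with batch count, ignoring validation and checkpointing overhead.

```python
def estimate_training_time(partial_minutes, partial_batches,
                           batches_per_epoch, epochs):
    """Scale a partial-epoch smoke run to a full training run.

    Returns (minutes_per_epoch, total_hours), assuming wall time is
    linear in the number of batches.
    """
    if partial_batches <= 0:
        raise ValueError("partial_batches must be positive")
    minutes_per_epoch = partial_minutes * batches_per_epoch / partial_batches
    total_hours = minutes_per_epoch * epochs / 60.0
    return minutes_per_epoch, total_hours


if __name__ == "__main__":
    # Bounds from the smoke runs: 2.5 to 3 min for about 50 batches.
    for minutes in (2.5, 3.0):
        per_epoch, total = estimate_training_time(minutes, 50, 80, 60)
        print(f"{minutes} min / 50 batches: "
              f"{per_epoch:.1f} min/epoch, {total:.1f} h for 60 epochs")
```

For the two smoke-run bounds this gives 4.0 and 4.8 min/epoch, matching the ~4–5 min/epoch and ~4–5 h figures quoted above.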
Install timm (ViT TimeSformer) via pip install -r backend/requirements.txt. Use val loss and val type acc in MLflow to compare runs (train/val split stays video-level only).