Real-time sign language gesture recognition using MediaPipe hand landmarks and an LSTM neural network with live webcam prediction.
People with hearing impairments face significant communication barriers with those who don't understand sign language. Traditional solutions require human interpreters, who aren't always available. This project builds a system that recognizes hand gestures in real time through a webcam and translates them into text, making communication more accessible without requiring an intermediary.
The system operates in two phases: training (the first four steps below) and real-time prediction (the last five).
- Capture webcam video using OpenCV
- Extract hand landmarks (left + right) using MediaPipe Holistic
- Record 30 sequences of 30 frames per gesture → stored as .npy keypoint arrays
- Train an LSTM model on the temporal sequences to learn gesture patterns
- Stream the webcam feed in real time
- Extract hand keypoints per frame using MediaPipe
- Buffer last 30 frames into a sliding window
- Predict the gesture using the trained LSTM model
- Display predicted text and probability bars on screen
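The prediction steps above can be sketched as a fixed-length buffer that feeds the model once it is full. This is a minimal sketch rather than the notebook's code: `GestureWindow` is an illustrative name and `predict_fn` is a hypothetical stand-in for the trained LSTM's prediction call. The 30-frame window and the 0.8 threshold are the values this README states; treating the threshold as a strict "greater than" comparison is an assumption.

```python
from collections import deque
from typing import Callable, Optional, Sequence, Tuple

WINDOW = 30       # frames per prediction window
THRESHOLD = 0.8   # minimum confidence before a gesture is reported


class GestureWindow:
    """Keeps the most recent WINDOW frames and queries a model when full."""

    def __init__(self, predict_fn: Callable[[list], Sequence[float]],
                 actions: Sequence[str]) -> None:
        # Hypothetical callable: list of frames -> class probabilities.
        self.predict_fn = predict_fn
        self.actions = list(actions)
        # deque(maxlen=...) drops the oldest frame automatically.
        self.frames = deque(maxlen=WINDOW)

    def push(self, keypoints: Sequence[float]) -> Optional[Tuple[str, float]]:
        """Add one frame; return (label, confidence) or None."""
        self.frames.append(list(keypoints))
        if len(self.frames) < WINDOW:
            return None  # not enough history yet
        probs = list(self.predict_fn(list(self.frames)))
        best = max(range(len(probs)), key=probs.__getitem__)
        if not probs[best] > THRESHOLD:
            return None  # assumption: strict comparison against the threshold
        return self.actions[best], probs[best]
```

In a live loop, `push` would be called once per webcam frame with that frame's 126 features, and a non-`None` result would be drawn on the video feed.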
Webcam Feed
│
▼
MediaPipe Holistic
(Hand Landmark Detection)
│
├── Left Hand: 21 landmarks × 3 (x, y, z) = 63 features
└── Right Hand: 21 landmarks × 3 (x, y, z) = 63 features
│
▼
126 Features per Frame
│
▼
Sliding Window (30 frames)
│
▼
┌─────────────────────────┐
│ LSTM Network │
│ LSTM(64) → LSTM(128) │
│ LSTM(64) → Dense(64) │
│ Dense(32) → Softmax │
└─────────────────────────┘
│
▼
Predicted Gesture + Confidence
(displayed on video feed)
Input (30 timesteps × 126 features)
│
├── LSTM(64, return_sequences=True) │ 48,896 params
├── LSTM(128, return_sequences=True) │ 98,816 params
├── LSTM(64, return_sequences=False) │ 49,408 params
├── Dense(64, ReLU) │ 4,160 params
├── Dense(32, ReLU) │ 2,080 params
└── Dense(N, Softmax) │ 33·N params (99 for N = 3)
Total trainable parameters: 203,459 (for N = 3 gesture classes)
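The per-layer counts above can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and ordinary dense layers with a bias; the helper names are illustrative. Note that the listed 99 parameters for the output layer correspond to N = 3 classes.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)


def dense_params(input_dim: int, units: int) -> int:
    """One weight per input plus one bias, for every unit."""
    return units * (input_dim + 1)


def total_params(num_classes: int, features: int = 126) -> int:
    """Sum the parameters of the six layers listed above."""
    layers = [
        lstm_params(features, 64),       # 48,896
        lstm_params(64, 128),            # 98,816
        lstm_params(128, 64),            # 49,408
        dense_params(64, 64),            # 4,160
        dense_params(64, 32),            # 2,080
        dense_params(32, num_classes),   # 33 per class
    ]
    return sum(layers)


print(total_params(3))  # 203459
```

Adding a gesture class therefore adds only 33 parameters, since only the final softmax layer depends on N.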
| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Loss | Categorical Crossentropy |
| Epochs | 200 |
| Input Shape | (30, 126): 30 frames, 126 features per frame |
| Prediction Threshold | 0.8 confidence |
| Checkpoint | Best model saved via ModelCheckpoint |
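A possible Keras setup matching the table above is sketched here. It is an illustration, not the notebook's code. Several details are assumptions because this README does not state them: the LSTM layers use Keras default activations, the checkpoint monitors the training loss, and the file name `best_model.keras` and the accuracy metric are placeholders. TensorFlow is imported inside the functions only to keep the module cheap to load.

```python
INPUT_SHAPE = (30, 126)   # 30 frames, 126 features per frame
EPOCHS = 200


def build_model(num_classes: int):
    """Build and compile the LSTM stack described in this README."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64, return_sequences=False),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["categorical_accuracy"],  # assumption: metric not stated
    )
    return model


def make_checkpoint(path: str = "best_model.keras"):
    """Callback that keeps only the best model seen so far."""
    from tensorflow import keras

    # Assumption: "best" is judged by the training loss.
    return keras.callbacks.ModelCheckpoint(
        path, monitor="loss", save_best_only=True
    )
```

Training would then look like `build_model(3).fit(X_train, y_train, epochs=EPOCHS, callbacks=[make_checkpoint()])`, where `X_train` has shape `(samples, 30, 126)` and `y_train` is one-hot encoded.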
MediaPipe Holistic detects hand landmarks in each frame. Only the hand landmarks are used; pose and face landmarks are excluded for efficiency:
| Hand | Landmarks | Features (x, y, z) |
|---|---|---|
| Left Hand | 21 | 63 |
| Right Hand | 21 | 63 |
| Total | 42 | 126 per frame |
If a hand is not detected in a frame, its 63 features default to zeros, so every frame still yields 126 values and single-hand gestures can be represented.
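The zero-filling described above can be sketched as follows. The function names are illustrative, and placing the left hand before the right in the feature vector is an assumption based on the order used in this README. `results` is assumed to be the object returned by MediaPipe Holistic's `process` call, whose `left_hand_landmarks` and `right_hand_landmarks` attributes are `None` when a hand is not detected.

```python
import numpy as np

LANDMARKS_PER_HAND = 21
COORDS = 3  # x, y, z


def hand_to_vector(hand_landmarks) -> np.ndarray:
    """Flatten one hand into 63 values, or zeros when the hand is missing."""
    if hand_landmarks is None:
        return np.zeros(LANDMARKS_PER_HAND * COORDS)
    points = [[p.x, p.y, p.z] for p in hand_landmarks.landmark]
    return np.array(points).flatten()


def extract_keypoints(results) -> np.ndarray:
    """Concatenate left and right hand vectors into 126 features."""
    left = hand_to_vector(results.left_hand_landmarks)
    right = hand_to_vector(results.right_hand_landmarks)
    return np.concatenate([left, right])
```

Because the output length never changes, frames with zero, one, or two visible hands can all be stacked into the same `(30, 126)` sequence.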
├── Training.ipynb # Data collection + model training
├── Testing.ipynb # Load model + live webcam prediction
├── LICENSE # MIT License
└── README.md
The system is designed to be easily extensible. Example gesture sets used:
- Letters: a, b, c, d, e, f
- Words: food, water, help
To add new gestures, update the actions array and run the data collection cells in Training.ipynb.
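As a rough illustration of how the actions array drives the rest of the pipeline, the sketch below derives one-hot labels and dataset shapes from it. The names `label_map` and `one_hot_labels` are hypothetical, and the zero-filled `X` is placeholder data standing in for the recorded .npy sequences.

```python
import numpy as np

actions = np.array(["food", "water", "help"])  # append new gesture names here
SEQUENCES_PER_ACTION = 30
FRAMES, FEATURES = 30, 126

# Each gesture name gets an integer class index.
label_map = {name: index for index, name in enumerate(actions)}


def one_hot_labels(names) -> np.ndarray:
    """Map gesture names to one-hot rows, one column per action."""
    indices = [label_map[name] for name in names]
    return np.eye(len(actions), dtype=int)[indices]


# Placeholder data: one (30, 126) sequence per recording.
X = np.zeros((len(actions) * SEQUENCES_PER_ACTION, FRAMES, FEATURES))
y = one_hot_labels(np.repeat(actions, SEQUENCES_PER_ACTION))
```

The number of columns in `y` equals the number of actions, so extending `actions` also changes the size N of the model's softmax output, and the model must be retrained.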
- Python 3.11
- MediaPipe — hand landmark detection via Holistic model
- OpenCV — webcam capture and video display
- TensorFlow / Keras — LSTM model building and training
- NumPy — keypoint array operations
- scikit-learn — train/test split, confusion matrix, accuracy score
git clone https://github.com/sidd707/sign-language-lstm-recognition.git
cd sign-language-lstm-recognition
pip install mediapipe opencv-python tensorflow numpy scikit-learn matplotlib

jupyter notebook Training.ipynb
# Run cells sequentially; the webcam will open for data collection

jupyter notebook Testing.ipynb
# Loads trained weights and starts real-time recognition

Note: A webcam is required for both data collection and live prediction.
This project is licensed under the MIT License — see the LICENSE file for details.