Capstone II: Advanced Sequential Recognition System utilizing Long Short-Term Memory (LSTM) for high-accuracy Khmer Sign Language translation.
Phase II (Codename: Observant Hawk) marks the transition from static gesture recognition to advanced sequential recognition. Developed as the Capstone II project, this system leverages temporal deep learning to interpret the fluid, motion-based nature of Khmer Sign Language (KSL).
By analyzing sequences of movement rather than isolated frames, the system provides a more natural and accurate communication bridge.
Sequential Intelligence
Powered by an LSTM-based neural network that models motion over time using 30-frame sequences.
Dual-Hand Tracking
Uses MediaPipe Hands to capture 126 coordinate features per frame (x, y, z for each of 21 landmarks on each of two hands: 2 × 21 × 3 = 126).
Privacy-First Inference
Only coordinate tensors are transmitted to the backend. Raw video data remains on the client device.
95 Khmer Sign Classes
Supports a wide vocabulary including greetings, family-related terms, and common verbs.
Voice Synthesis
Integrates Google Text-to-Speech (gTTS) for real-time Khmer audio output.
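As a rough illustration of the Dual-Hand Tracking feature count, the sketch below flattens two hands of 21 (x, y, z) landmarks into a single 126-value vector. The function name `frame_to_features`, the left-then-right ordering, and the zero-fill for an undetected hand are assumptions made for this example, not details confirmed by the project.

```python
import numpy as np

NUM_HANDS = 2
LANDMARKS_PER_HAND = 21
COORDS_PER_LANDMARK = 3  # x, y, z
FEATURES_PER_HAND = LANDMARKS_PER_HAND * COORDS_PER_LANDMARK  # 63
FEATURES_PER_FRAME = NUM_HANDS * FEATURES_PER_HAND  # 126


def frame_to_features(left_hand, right_hand):
    """Flatten up to two hands of (21, 3) landmarks into one 126-value vector.

    Assumption: a hand that was not detected is passed as None and is
    represented by zeros, with the left hand stored before the right hand.
    """
    parts = []
    for hand in (left_hand, right_hand):
        if hand is None:
            parts.append(np.zeros(FEATURES_PER_HAND, dtype=np.float32))
            continue
        arr = np.asarray(hand, dtype=np.float32)
        if arr.shape != (LANDMARKS_PER_HAND, COORDS_PER_LANDMARK):
            raise ValueError(f"expected (21, 3) landmarks, got {arr.shape}")
        parts.append(arr.reshape(-1))
    return np.concatenate(parts)
```

Calling `frame_to_features(hand, None)` with a single detected hand still yields 126 values, so every frame has the same width regardless of how many hands are visible.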
The "Observant Hawk" architecture follows a sliding-window pipeline:
- Capture: The React frontend accesses the webcam and extracts 126 coordinates per frame using MediaPipe.
- Buffering: Accumulates a temporal sequence of 30 frames.
- Normalization: Applies zero-padding to maintain a consistent input shape of (30, 126).
- Inference: The (30, 126) array is sent via POST request to the Flask API, where the Keras (.h5) model performs prediction.
- Output: The predicted Khmer text is returned and converted to speech via gTTS.
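The Buffering and Normalization steps can be sketched as a fixed-length sliding window. This is a minimal illustration, not the project's actual code: the class name `FrameBuffer` is hypothetical, and padding at the end of the window (rather than the start) is an assumption.

```python
from collections import deque

import numpy as np

SEQUENCE_LENGTH = 30
FEATURES_PER_FRAME = 126


class FrameBuffer:
    """Keeps the most recent frames and exposes them as a (30, 126) array."""

    def __init__(self, maxlen=SEQUENCE_LENGTH):
        # A bounded deque drops the oldest frame automatically (sliding window).
        self.frames = deque(maxlen=maxlen)

    def push(self, features):
        features = np.asarray(features, dtype=np.float32)
        if features.shape != (FEATURES_PER_FRAME,):
            raise ValueError(f"expected {FEATURES_PER_FRAME} features, got {features.shape}")
        self.frames.append(features)

    def as_model_input(self):
        """Return a (30, 126) window; missing trailing frames are zeros."""
        window = np.zeros((self.frames.maxlen, FEATURES_PER_FRAME), dtype=np.float32)
        if self.frames:
            window[: len(self.frames)] = np.stack(list(self.frames))
        return window
```

With fewer than 30 frames buffered, the remaining rows stay zero, so the model always receives the same input shape. Once more than 30 frames have arrived, only the latest 30 are kept.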
- TensorFlow 2.x (Keras Sequential API)
- LSTM (Long Short-Term Memory)
- MediaPipe
- NumPy
- Flask
- gTTS (Google Text-to-Speech)
- TensorFlow Serving
- React.js
- Tailwind CSS
- Framer Motion
| Feature | Detail |
|---|---|
| Input Shape | (30, 126): (Time Steps, Features) |
| Architecture | 3 × LSTM Layers (64, 128, 64 units) |
| Regularization | Dropout (0.2) |
| Output Layer | Dense with Softmax activation |
| Dataset Size | 1,900 videos across 95 classes |
| Model Size | ~4.1 MB |
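One way the specification table could translate into Keras code is sketched below. It assumes TensorFlow 2.x is installed. The placement of a Dropout layer after each LSTM layer is an assumption, and the function name `build_model` is hypothetical; the table does not say whether the real model contains additional layers.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(num_classes=95, time_steps=30, features=126):
    """Stacked-LSTM classifier following the table: 64, 128, 64 units."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(time_steps, features)),
        # return_sequences=True passes the full sequence to the next LSTM.
        layers.LSTM(64, return_sequences=True),
        layers.Dropout(0.2),  # assumed placement
        layers.LSTM(128, return_sequences=True),
        layers.Dropout(0.2),  # assumed placement
        # The last LSTM returns only its final state for classification.
        layers.LSTM(64),
        layers.Dropout(0.2),  # assumed placement
        layers.Dense(num_classes, activation="softmax"),
    ])
```

Calling `build_model().summary()` prints the layer shapes, which is a quick way to compare a sketch like this against the shipped `.h5` file.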
- Accuracy: 80.0% (sequential validation)
- Latency: ~1.4 seconds end-to-end
- Memory Usage: ~580 MB runtime
- Environment: MacBook Pro M3 Max, Ubuntu 24.04
cd backend
python -m venv venv
# Activate environment
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
pip install -r requirements.txt
Place the model file:
/backend/model/model.h5
Run the server:
python server-v2-dynamic.py
Default port: 3000
npm install
npm run dev
Default port: 5173
Sign Previewer
Interactive visual reference for all supported signs.
Adaptive Normalization
Handles variations in signing speed.
Coordinate Streaming
Efficient JSON-based transfer ensuring low bandwidth usage and privacy.
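To illustrate Coordinate Streaming from a test script, the sketch below builds (but does not send) a JSON POST request carrying one (30, 126) sequence. Port 3000 comes from the installation notes. The `/predict` path, the `sequence` field name, and the helper `build_request` are hypothetical and would need to match the real Flask route.

```python
import json
import urllib.request

import numpy as np

# Port 3000 is the documented default; the "/predict" path is hypothetical.
API_URL = "http://localhost:3000/predict"


def build_request(sequence, url=API_URL):
    """Wrap a (30, 126) coordinate sequence in a JSON POST request."""
    sequence = np.asarray(sequence, dtype=np.float64)
    if sequence.shape != (30, 126):
        raise ValueError(f"expected shape (30, 126), got {sequence.shape}")
    # Only coordinates are serialized; no image data is included.
    body = json.dumps({"sequence": sequence.tolist()}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running backend. A payload of 30 × 126 numbers is a few tens of kilobytes of JSON at most, far smaller than the equivalent video frames.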
- Vann Vat — Team Leader & Data Engineer
- Phal Sovandy — UI/UX Designer & Content Lead
- Mony Meakputsoktheara — Machine Learning Engineer
- Chhi Hangcheav — Backend Developer
- Chim Panhaprasith — Frontend Developer
- Toek Hengsreng — Dataset & Research Analyst
Developed at the Cambodia Academy of Digital Technology (CADT)
Advancing accessibility through Computer Vision and Deep Learning