This repository provides a complete, modular pipeline for training YOLO models on the KITTI Object Detection Dataset, together with a RAG‑powered LLM query engine (Llama3 + FAISS). It includes:
- Automated dataset download
- KITTI <-> YOLO label conversion
- Train/val split
- Model training on Google Colab GPU
- Evaluation (mAP50 & mAP50-95)
- Model benchmarking across FP32, FP32-GPU, FP16, and INT8
- FAISS vector index and document store used by the RAG‑powered query application
- RAG‑Powered LLM Query Engine (Llama3 + FAISS)
- Visualization of predictions vs. ground truth
- Sample video inference results
The entire workflow is orchestrated through a Colab notebook for reproducibility and GPU acceleration.
|-- download_kittidataset.py
|-- label_convertor.py
|-- evaluate_yolo.py
|-- benchmark_model.py
|-- visualize_predictions.py
|-- generate_faiss_doc.py
|-- llmquery_app.py
|-- data/ <-- generated files & label format.txt
|-- google_collab_trigger_training.ipynb
|-- kitti.yaml
'-- README.md
- Python 3.8 or higher
Install all required packages using:
pip install -r requirements.txt
Or install manually:
- ultralytics
- opencv-python
- matplotlib
- numpy
- tqdm
- requests
- zipfile36 (if using Python < 3.10)
- PyYAML
- onnxruntime-tools
- glob2
- faiss-cpu
- streamlit
- sentence-transformers
- onnx
- onnxruntime-gpu
- pycuda
Additionally, the TensorRT and trtexec system packages are required for TensorRT benchmarking.
- GPU runtime enabled (Runtime → Change runtime type → GPU)
- Google Drive mounted for persistent storage
The pipeline automatically downloads:
- KITTI Object Detection Dataset (images + labels)
- Recommended: Google Colab GPU (T4, L4, or A100)
- Not recommended: CPU-only training (extremely slow)
- Downloads the KITTI object detection dataset (images + labels)
- Unzips and organizes the dataset
- Optionally visualizes a sample image with its KITTI label
- Ensures the dataset is ready for conversion
- Converts between the KITTI and YOLO label formats based on the convert_typ input argument
- Normalizes bounding boxes
- Creates the YOLO folder structure:
kitti_yolo/
|-- images/
|   |-- train/
|   '-- val/
'-- labels/
    |-- train/
    '-- val/
The YOLO -> KITTI converted labels are available by default in runs/detect/predict/kitti_labels.
- Performs a train/val split
- Generates the kitti.yaml configuration file
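The core of the conversion step can be sketched as below. This is a minimal illustration, not the exact code in label_convertor.py: the class mapping and the sample image size (1242x375 is a common KITTI resolution) are assumptions, and classes outside the mapping are simply skipped.

```python
# Hedged sketch of KITTI -> YOLO label conversion.
# CLASS_IDS mirrors the classes in kitti.yaml; other mappings may exist.
CLASS_IDS = {"Car": 0, "Pedestrian": 1, "Cyclist": 2}

def kitti_to_yolo(line: str, img_w: int, img_h: int):
    """Convert one KITTI label line to 'cls cx cy w h' (normalized),
    or return None for classes not in the mapping (e.g. DontCare)."""
    parts = line.split()
    cls_name = parts[0]
    if cls_name not in CLASS_IDS:
        return None
    # KITTI stores pixel corners: left, top, right, bottom (fields 4-7)
    left, top, right, bottom = map(float, parts[4:8])
    cx = (left + right) / 2.0 / img_w   # normalized box centre x
    cy = (top + bottom) / 2.0 / img_h   # normalized box centre y
    w = (right - left) / img_w          # normalized width
    h = (bottom - top) / img_h          # normalized height
    return f"{CLASS_IDS[cls_name]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example with a typical KITTI image size:
print(kitti_to_yolo(
    "Car 0.00 0 -1.58 100.0 150.0 300.0 250.0 1.5 1.6 3.7 0 1.7 30.0 -1.6",
    1242, 375))
```

The reverse (YOLO -> KITTI) direction denormalizes the center/size values back to pixel corners.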
- Loads a trained YOLO model
- Evaluates on the validation set
- Outputs mAP50, mAP50-95, precision, and recall
- Loads validation images
- Draws ground truth bounding boxes
- Runs YOLO inference and draws predicted bounding boxes
- Displays side-by-side comparison for qualitative inspection
This script benchmarks a trained YOLO model across CPU, GPU, and TensorRT backends, covering multiple precisions:
- FP32 (PyTorch, CPU)
- INT8 (ONNX Runtime PTQ, CPU)
- FP32 (PyTorch, GPU)
- FP16 (PyTorch, GPU)
- FP32 (TensorRT)
- FP16 (TensorRT)
- INT8 (TensorRT)
Note:
TensorRT accuracy is currently evaluated using the exported ONNX model as a proxy.
The benchmarking script does not yet run full TensorRT inference for accuracy.
Latency, however, is measured from the actual TensorRT engine.
For each backend and precision, the script reports:
- mAP50
- mAP50-95
- latency_ms (average inference latency per image)
| Backend | Precision | mAP50 | mAP50–95 | Latency (ms) |
|---|---|---|---|---|
| CPU | FP32 | 0.8648 | 0.5959 | 140.42 |
| CPU | INT8 | 0.8682 | 0.5976 | 717.26 |
| GPU | FP32 | 0.8648 | 0.5959 | 14.29 |
| GPU | FP16 | 0.8647 | 0.5950 | 15.16 |
| TensorRT | FP32 | 0.8682 | 0.5976 | 0.39 |
| TensorRT | FP16 | 0.8682 | 0.5976 | 0.40 |
| TensorRT | INT8 | 0.8682 | 0.5976 | 0.40 |
- TensorRT FP32/FP16/INT8 accuracy values match ONNX FP32 accuracy because the current pipeline uses ONNX as a proxy for accuracy evaluation.
- CPU INT8 (ONNX PTQ) also matches FP32 accuracy
- TensorRT delivers the highest performance, reducing latency from 14 ms (PyTorch GPU) to 0.4 ms.
- FP32, FP16, and INT8 TensorRT latencies are nearly identical for this model because:
- Small YOLO models (n/s) often show no latency difference between FP32/FP16/INT8 in TensorRT.
- Larger models (YOLOv8m/l/x) will show clear separation between precisions.
- CPU INT8 is slower than CPU FP32 because ONNX Runtime INT8 uses CPU-only kernels without hardware acceleration.
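The latency numbers above come from averaging repeated inference calls. A minimal sketch of that timing approach is shown below; the warmup/run counts are illustrative assumptions, and benchmark_model.py may differ in detail.

```python
import time

def benchmark_latency(infer_fn, batch, warmup: int = 3, runs: int = 20) -> float:
    """Return average per-call latency in milliseconds for infer_fn(batch)."""
    for _ in range(warmup):      # warmup hides one-off costs (allocation, caches)
        infer_fn(batch)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(batch)
    return (time.perf_counter() - start) / runs * 1000.0

# Usage with any callable standing in for a model:
latency_ms = benchmark_latency(lambda x: sum(x), [1, 2, 3])
```

For GPU backends, a device synchronization (e.g. torch.cuda.synchronize()) is needed before stopping the timer, otherwise asynchronous kernel launches make the measurement optimistic.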
Implement a full TensorRT inference loop:
- load engine
- allocate buffers
- run inference per batch
- apply YOLO postprocessing
- compute mAP50 / mAP50-95
This will allow reporting real TensorRT accuracy instead of ONNX proxy accuracy.
Use TensorRT entropy calibrators to generate:
- accurate INT8 scales
- minimal accuracy drop
- real INT8 speedups
Larger models will reveal:
- FP16 > FP32 speedups
- INT8 > FP16 speedups
- Use Post‑Training Quantization (PTQ)
- 2–4× CPU speedup
This benchmarking suite now provides a complete evaluation pipeline for YOLO models across CPU, GPU, and TensorRT backends, with clear separation between:
- ONNX‑based accuracy
- TensorRT latency
- PyTorch baseline performance
This script now builds two FAISS indexes: a scene index and an error index.

The scene index is used for natural‑language queries about scene content:
- number of cars / pedestrians / cyclists
- maximum occlusion
- maximum truncation
- fuzzy terms (crowded, busy, heavy occlusion, etc.)
Each frame produces a document containing:
- metadata (counts, occlusion, truncation)
- summary text
- image path
- FAISS embedding
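An illustrative sketch of such a per-frame document is shown below. The field names (counts, max_occlusion, summary, image_path) are assumptions for illustration, not the exact schema produced by generate_faiss_doc.py.

```python
from collections import Counter

def build_frame_doc(frame_id, objects, image_path):
    """objects: list of (class_name, occlusion: int, truncation: float).
    Returns a dict combining metadata with an embeddable summary string."""
    counts = Counter(name for name, _, _ in objects)
    max_occ = max((occ for _, occ, _ in objects), default=0)
    max_trunc = max((tr for _, _, tr in objects), default=0.0)
    summary = (f"Frame {frame_id}: {counts.get('Car', 0)} cars, "
               f"{counts.get('Pedestrian', 0)} pedestrians, "
               f"{counts.get('Cyclist', 0)} cyclists; "
               f"max occlusion {max_occ}, max truncation {max_trunc:.2f}.")
    return {"frame_id": frame_id, "counts": dict(counts),
            "max_occlusion": max_occ, "max_truncation": max_trunc,
            "summary": summary, "image_path": image_path}

doc = build_frame_doc(
    "000123",
    [("Car", 2, 0.1), ("Car", 0, 0.0), ("Pedestrian", 1, 0.3)],
    "kitti/training/image_2/000123.png")
```

The summary string is then what gets embedded (e.g. with a SentenceTransformer) and added to the FAISS index, while the full dict lands in the JSON document store.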
For the error index, the script:
- Loads KITTI ground‑truth labels
- Loads YOLO predictions converted to KITTI format
- Computes IoU for each GT–prediction pair
- Identifies false positives (FP) and false negatives (FN)
- Stores detailed error documents containing:
  - error_type (FP or FN)
  - class
  - iou
  - bbox
  - occlusion_level (from GT)
  - truncation_value (from GT)
  - summary_text
  - image_path
The following outputs are generated into data/:
- error_docs.json → list of detailed error documents with error type and class
- error_index.faiss → FAISS index for fast similarity search
- kitti_docs.json → list of documents with metadata + summary
- kitti_index.faiss → FAISS index for fast similarity search
- embedding_model.txt → name of the embedding model used
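The IoU computation at the heart of the GT-vs-prediction matching can be sketched as follows; the 0.5 threshold mentioned in the comment is a common convention and an assumption here, not necessarily the value used in the script.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [left, top, right, bottom] pixel boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A GT box with no prediction above some IoU threshold (e.g. 0.5) becomes
# an FN document; a prediction with no matching GT box becomes an FP document.
```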
This Streamlit application lets you query the KITTI dataset in two modes using natural language, combining:
- Numeric filtering (e.g., “more than 5 pedestrians”)
- Fuzzy interpretation (e.g., “crowded”, “busy”, “heavy occlusion”)
- Synonym expansion (e.g., “packed”, “dense”, “traffic heavy”)
- Semantic search using FAISS + SentenceTransformer
- Local LLM reasoning using Ollama Llama3
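The numeric-filtering step can be sketched with a small regex parser, shown below. This is a minimal illustration: the comparator phrases and the returned triple shape are assumptions, and the real app additionally applies fuzzy_rules.json and a semantic fallback.

```python
import re

# Hypothetical comparator vocabulary; llmquery_app.py may recognize more.
COMPARATORS = {"more than": "gt", "at least": "ge",
               "fewer than": "lt", "less than": "lt", "exactly": "eq"}

def parse_numeric_filters(query: str):
    """Extract (comparator, count, class) triples from a natural-language query."""
    pattern = r"(more than|at least|fewer than|less than|exactly)\s+(\d+)\s+(\w+)"
    filters = []
    for comp, num, noun in re.findall(pattern, query.lower()):
        # Singularize naively so "pedestrians" and "pedestrian" match the same key
        filters.append((COMPARATORS[comp], int(num), noun.rstrip("s")))
    return filters

print(parse_numeric_filters("more than 5 pedestrians and at least 2 cars"))
# → [('gt', 5, 'pedestrian'), ('ge', 2, 'car')]
```

Each extracted triple can then be evaluated against the per-frame counts stored in the document metadata.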
Capabilities:
- Understands multi‑condition queries
- Applies fuzzy rules from fuzzy_rules.json
- Falls back to semantic search when numeric filters match nothing
- Displays matching KITTI frames with images and metadata
- FP/FN filtering (e.g. "false positives for cyclists")
- IoU filtering (e.g. "IoU < 0.4 errors" or "IoU > 0.7")
- Occlusion/truncation filtering (e.g. "missed pedestrians with occlusion 3")
- Semantic search using FAISS + SentenceTransformer
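The semantic fallback amounts to a nearest-neighbor search over embeddings. The sketch below illustrates the idea with plain cosine similarity over toy vectors; the real app embeds the query with a SentenceTransformer and searches the FAISS index instead.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

# Toy usage: doc 1 is an exact match for the query direction.
hits = semantic_search([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], top_k=2)
```

FAISS performs the same ranking far more efficiently via an index structure, which is why it is used once the document store grows beyond a handful of frames.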
Bounding‑box colors:
- GT → green
- Predictions → yellow
- FP → red
- FN → blue
Only GT classes relevant to YOLO (Car, Pedestrian, Cyclist) are drawn.
Running the LLM Query App
- Start Ollama with Llama3
Before launching the app, you must start the local LLM server:
ollama run llama3
This downloads the model (first time only) and starts the inference engine. Leave this terminal open.
- Run the Streamlit App
In a new terminal:
streamlit run llmquery_app.py
The app will automatically:
- Load the FAISS index
- Load SentenceTransformer embeddings
- Load fuzzy rules
- Connect to Ollama
- Interpret your query
- Display matching KITTI frames, or side‑by‑side GT vs prediction views, depending on the query mode
Example Queries:
"more than 5 pedestrians"
"busy intersection with few cyclists"
"heavy occlusion and more than 2 cars"
"atleast 2 cars and more than 3 pedestrians"
"crowded and heavy occlusion"
"crowded and busy with rare cyclists"
"few cyclists and rare pedestrians"
“false positives for cyclists”
“missed pedestrians with occlusion 3”
“IoU < 0.4 errors for cars”
“FN for truncation > 0.5”
All scripts are orchestrated inside:
This notebook:
- Mounts Google Drive
- Downloads KITTI dataset
- Converts KITTI → YOLO
- Triggers YOLO training on GPU
- Evaluates the trained model
- Visualizes predictions
- Runs precision benchmarking
- Generates prediction label files (YOLO format)
- Converts prediction label files: YOLO -> KITTI
- Generates the FAISS vector index and document store
This YAML file defines:
- Train image paths
- Validation image paths
- Class names
Example: (available in data/)
```yaml
path: kitti_yolo
train: images/train
val: images/val
names:
  0: Car
  1: Pedestrian
  2: Cyclist
```
After training, inference results are saved to: runs/detect/predict/
This folder contains:
- Annotated video
- Annotated frames
- Prediction logs
Open:
google_collab_trigger_training.ipynb
Run all cells sequentially.
Scripts will run, but full training is slow without a GPU.
- Each script is modular and can run independently
- The Colab notebook ties everything together
- Supports YOLOv8, YOLOv9, YOLOv10, YOLOv11
- Results are reproducible and easy to extend
- Benchmarking module provides deeper insight into deployment performance
- generate_faiss_doc.py builds the FAISS vector index and document store, including GT and error document generation (FP/FN/IoU/occlusion/truncation)
- The Streamlit application provides a RAG‑powered LLM query engine for the KITTI dataset using natural language
- The error‑analysis query mode shows side‑by‑side GT vs prediction visualization
- Future work will focus on TensorRT and OpenVINO acceleration