Real-time perception systems usually face a simple tradeoff: run a large model for accuracy or a small model for speed. Most deployments pick one model and accept the compromise.
This project asks a different question:
Can a system dynamically decide, frame by frame, when higher accuracy is worth the extra compute?
The result is a confidence-based inference router that runs a lightweight model on every frame and selectively escalates uncertain frames to a heavier model. The system preserves near-lightweight throughput while improving detection reliability.
The pipeline runs YOLOv8n (light model) on every frame. When the model appears uncertain — based on confidence, detection continuity, or object scale — the frame is escalated to YOLOv8l (heavy model).
Only those frames are recomputed with the larger model, and the heavier model’s output replaces the lightweight result for that frame.
Frame → YOLOv8n
├── confident detection → use YOLOv8n result
└── uncertainty detected → run YOLOv8l → replace detection
This creates a compute-aware inference policy rather than a fixed model choice.
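The routing step described above can be sketched as follows. This is a minimal illustration, not the project's `scheduler.py`: the `Detection` tuple layout, the `RoutedResult` type, and the `route_frame` and `is_uncertain` names are assumptions made for the example, and the detectors are plain callables standing in for YOLOv8n and YOLOv8l.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Assumed layout: (x1, y1, x2, y2, confidence). Hypothetical, for illustration only.
Detection = Tuple[float, float, float, float, float]
Detector = Callable[[object], List[Detection]]


@dataclass
class RoutedResult:
    detections: List[Detection]
    used_heavy: bool  # True when the heavy model's output replaced the light one


def route_frame(
    frame: object,
    light: Detector,
    heavy: Detector,
    is_uncertain: Callable[[Sequence[Detection]], bool],
) -> RoutedResult:
    """Run the light model on the frame; rerun with the heavy model only if uncertain."""
    light_dets = light(frame)
    if is_uncertain(light_dets):
        # The heavy model's output replaces the light result for this frame.
        return RoutedResult(heavy(frame), used_heavy=True)
    return RoutedResult(light_dets, used_heavy=False)


if __name__ == "__main__":
    # Toy stand-ins for the two detectors (not real models).
    def light(frame):
        return [(0.0, 0.0, 10.0, 20.0, 0.9 if frame == "easy" else 0.2)]

    def heavy(frame):
        return [(0.0, 0.0, 10.0, 20.0, 0.8)]

    def low_confidence(dets):
        return any(d[4] < 0.35 for d in dets)

    routed = [route_frame(f, light, heavy, low_confidence) for f in ["easy", "hard", "easy"]]
    heavy_fraction = sum(r.used_heavy for r in routed) / len(routed)
    print(f"heavy model used on {heavy_fraction:.0%} of frames")
```

Because both detectors are passed in as callables, the same function works for any fast/accurate model pair.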
The adaptive pipeline achieves:
- 124.6 FPS inference throughput
- 0.8655 precision (highest among all tested configurations)
while invoking the heavy model on only ~27% of frames in typical sequences.
Instead of committing to one model globally, the system allocates compute selectively based on detection confidence and scene context.
Heavy model escalation is triggered when the lightweight detector shows signs of failure:
- Low confidence detection
- Detection streak break (pedestrian disappears after multiple frames)
- Very small bounding boxes (far-away pedestrians)
These signals capture common failure modes of lightweight detectors.
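The three triggers above could be combined into a single check, roughly as sketched below. The class name `EscalationPolicy`, the box layout, and the `min_streak` value of 3 are assumptions for illustration (the source only says "multiple frames"); the 1% area cutoff follows the pipeline diagram later in this document.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed layout: (x1, y1, x2, y2, confidence) in pixels. Hypothetical.
Box = Tuple[float, float, float, float, float]


@dataclass
class EscalationPolicy:
    conf_threshold: float = 0.35  # T; the default chosen here is an assumption
    min_area_frac: float = 0.01   # boxes below 1% of the image area count as "small"
    min_streak: int = 3           # assumed meaning of "multiple frames"
    streak: int = 0               # consecutive frames with at least one detection

    def should_escalate(self, boxes: List[Box], img_w: int, img_h: int) -> bool:
        """Return True if any of the three triggers fires for this frame."""
        # Trigger 2: detections vanish after a sufficiently long streak.
        streak_broke = not boxes and self.streak >= self.min_streak
        self.streak = self.streak + 1 if boxes else 0
        if streak_broke:
            return True

        img_area = float(img_w * img_h)
        for x1, y1, x2, y2, conf in boxes:
            # Trigger 1: low-confidence detection.
            if conf < self.conf_threshold:
                return True
            # Trigger 3: very small bounding box (far-away pedestrian).
            if (x2 - x1) * (y2 - y1) / img_area < self.min_area_frac:
                return True
        return False
```

The policy keeps the streak counter as state, so one instance should be used per video sequence and called once per frame, in order.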
Evaluation was performed on Caltech Pedestrian set00 (2,500 frames) with an IoU matching threshold of 0.5.
| Configuration | FPS | Recall | Precision | MD@100 |
|---|---|---|---|---|
| YOLOv8n pretrained | 124.5 | 0.2279 | 0.7671 | 143.28 |
| YOLOv8l pretrained | 99.3 | 0.2774 | 0.7602 | 134.08 |
| YOLOv8n fine-tuned | 127.9 | 0.2393 | 0.7317 | 141.16 |
| YOLOv8l fine-tuned | 94.5 | 0.2645 | 0.8319 | 136.48 |
| Adaptive Pipeline | 124.6 | 0.2442 | 0.8655 | 140.24 |
The adaptive system achieves higher precision than either standalone model, while maintaining throughput close to the lightweight detector.
Recall falls between the two base models, which is expected for cascaded systems where both detectors share similar architecture.
The scheduler primarily improves precision, not recall.
Recall increases slightly relative to the lightweight model (YOLOv8n fine-tuned: 0.2393, adaptive pipeline: 0.2442) but does not reach the recall of the large model.
Instead, the scheduler improves trustworthiness of detections by replacing uncertain predictions with higher-quality outputs.
When the system emits a detection, it is more likely to be correct.
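For reference, precision and recall at an IoU threshold of 0.5 can be computed per frame along the following lines. This is a generic sketch using greedy one-to-one matching; the project's actual evaluation code is not shown in this document, and the function names here are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_frame(
    preds: Sequence[Tuple[Box, float]],
    gts: Sequence[Box],
    iou_thr: float = 0.5,
) -> Tuple[int, int, int]:
    """Greedy matching, highest confidence first. Returns (tp, fp, fn)."""
    unmatched = list(range(len(gts)))
    tp = 0
    for box, _conf in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j in unmatched:
            overlap = iou(box, gts[j])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            unmatched.remove(best_j)  # each ground-truth box is matched at most once
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this definition, replacing an uncertain light-model detection with a better-localized heavy-model detection removes a false positive without necessarily adding a true positive elsewhere, which is consistent with precision rising more than recall.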
The threshold T controls escalation sensitivity.
Lower thresholds make escalation more conservative: fewer detections fall below T, so fewer frames are sent to the heavy model.
| T | FPS | Recall | MD@100 | Heavy Model Triggered |
|---|---|---|---|---|
| 0.25 | 129.0 | 0.2464 | 139.84 | 24.6% |
| 0.35 | 125.5 | 0.2442 | 140.24 | 27.0% |
| 0.45 | 124.2 | 0.2442 | 140.24 | 28.0% |
| 0.55 | 122.9 | 0.2442 | 140.24 | 28.8% |
The best operating point in this sweep was T = 0.25, which had the highest recall and FPS, the lowest MD@100, and the lowest heavy-model usage.
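Picking an operating point from the sweep can be automated with a simple ranking rule. The sketch below hard-codes the rows from the table above; the dictionary keys and the ranking order (recall first, then FPS, then heavy-model usage) are assumptions for illustration, not necessarily the criterion used in `sweep_thresholds.py`.

```python
# Rows copied from the threshold sweep table; key names are hypothetical.
sweep = [
    {"T": 0.25, "fps": 129.0, "recall": 0.2464, "md100": 139.84, "heavy_frac": 0.246},
    {"T": 0.35, "fps": 125.5, "recall": 0.2442, "md100": 140.24, "heavy_frac": 0.270},
    {"T": 0.45, "fps": 124.2, "recall": 0.2442, "md100": 140.24, "heavy_frac": 0.280},
    {"T": 0.55, "fps": 122.9, "recall": 0.2442, "md100": 140.24, "heavy_frac": 0.288},
]


def pick_operating_point(rows):
    """Prefer higher recall, then higher FPS, then lower heavy-model usage."""
    return max(rows, key=lambda r: (r["recall"], r["fps"], -r["heavy_frac"]))


best = pick_operating_point(sweep)
print(f"selected T = {best['T']}")
```

With these rows the rule selects T = 0.25, since that row leads on every ranked metric.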
Both models were pretrained on COCO, which contains relatively few dense urban pedestrian scenes.
Running pretrained models directly on driving footage shows a clear domain gap:
| Model | Recall |
|---|---|
| YOLOv8n pretrained | 22.8% |
| YOLOv8l pretrained | 27.7% |
Fine-tuning on CrowdHuman, which contains dense and heavily occluded pedestrians, significantly improved performance:
| Model | Recall | Precision | mAP50 |
|---|---|---|---|
| YOLOv8n fine-tuned | 0.7017 | 0.8487 | 0.8155 |
| YOLOv8l fine-tuned | 0.7922 | 0.8717 | 0.8792 |
This demonstrates how dataset domain alignment strongly affects detection quality.
Input Frame
|
v
YOLOv8n (every frame)
|
|-- All detections confident
| → use YOLOv8n result
|
|-- Any trigger fires:
| - detection confidence < T
| - detection streak breaks
| - bounding box < 1% image area
|
v
YOLOv8l
|
v
Final detection output
The scheduler captures three common failure modes:
| Trigger | Failure Mode |
|---|---|
| Low confidence | ambiguous detection |
| Detection streak break | missed pedestrian |
| Small box | distant pedestrian |
Real-world perception systems operate under strict compute budgets.
The typical solution is to deploy a lightweight model and accept lower accuracy.
This project demonstrates an alternative strategy:
allocate compute dynamically where it matters most
The system adjusts automatically to scene complexity:
- Sparse scenes → heavy model rarely used
- Crowded scenes → heavy model invoked more frequently
The policy is also model-agnostic. Any pair of fast and accurate models can be used.
CrowdHuman: used for model fine-tuning.
- 470k annotated persons
- average 23 people per image
- heavy occlusion and crowding
Caltech Pedestrian: used for evaluation.
- vehicle-mounted urban driving footage
- 30 FPS video
- detailed occlusion annotations
Evaluating on a dataset different from the training set tests whether improvements generalize beyond the training distribution.
adaptive-pedestrian-detection/
    data/
        caltech/
        crowdhuman/
    src/
        scheduler.py
    experiments/
        run_baseline.py
        run_finetuned_eval.py
        run_adaptive.py
        sweep_thresholds.py
        plot_results.py
    results/
        full_comparison.csv
        sweep_results.csv
        figures/
    demo/
        demo_video.py
- Python 3.10
- PyTorch
- Ultralytics YOLOv8
- OpenCV
- NumPy
- Pandas
- Matplotlib
Hardware: NVIDIA A100-SXM4-80GB
- Shao et al., CrowdHuman: A Benchmark for Detecting Human in a Crowd
- Dollar et al., Pedestrian Detection: An Evaluation of the State of the Art
- Ultralytics, YOLOv8
