This Jupyter notebook implements a two-stage pipeline for automatic fire detection and localization in forest imagery using state-of-the-art vision-language and self-supervised models:
- Fire Detection – Binary classification using OpenAI’s CLIP-ViT model.
- Fire Localization – Patch-level anomaly detection using Facebook’s DINOv2 model with cosine similarity for feature matching.
Fire Detection (CLIP):
- Uses CLIP (Contrastive Language–Image Pretraining) for zero-shot classification.
- Embeds the image and the prompt texts into a shared semantic space.
- Prompts used:
  - "a normal forest scene"
  - "a fire or smoke in a forest"
- Classification is based on cosine similarity between the image and text embeddings.
- Outputs a class label (fire / no fire) along with probability scores.
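The detection stage above can be sketched with the Hugging Face `transformers` API. This is a minimal illustration, not the notebook's exact code: the checkpoint name (`openai/clip-vit-base-patch32`) and the helper `detect_fire` are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The two prompts listed above
PROMPTS = ["a normal forest scene", "a fire or smoke in a forest"]

# Assumed checkpoint; the notebook may use a different CLIP-ViT variant
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def detect_fire(image: Image.Image, model, processor):
    """Embed the image and both prompts, then softmax the image-text similarities."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds the scaled cosine similarities, shape (1, 2)
    probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0).tolist()
    return probs[1] > probs[0], probs  # True when the fire prompt scores higher

# Example: fire, probs = detect_fire(Image.open("photo.jpg").convert("RGB"), model, processor)
```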
Fire Localization (DINOv2):
- Uses DINOv2 for self-supervised feature extraction.
- Extracts patch-level features from both:
  - the target image
  - a reference normal forest image (used as baseline context)
- Computes cosine similarity between patches to localize anomalous (i.e. fire-related) regions.
- Visualizes the results with anomaly heatmaps overlaid on the original image.
Notes / To Do: test a SAM (Segment Anything Model) approach for fire localization.
Required Libraries:
- Python 3.8+
- PyTorch (CUDA support recommended)
- Transformers (CLIP, DINOv2)
- OpenCV, PIL, NumPy, Matplotlib
Example usage (detection):

```python
result, probs = detectfile("/path/to/image.jpg")
print(f"Fire Detected: {bool(result)} | Probabilities: {probs}")
```

Localization steps:
- Load the DINOv2 model and a reference image (non-fire scene).
- Extract patch features from both test and reference images.
- Compute per-patch cosine similarity.
- Generate and display heatmap highlighting anomalous regions.
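The localization steps above might look like the following sketch. It is an illustration under stated assumptions: the checkpoint (`facebook/dinov2-base`) and the helpers `patch_features` and `anomaly_map` are hypothetical names, not the notebook's code.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint; the notebook may use another DINOv2 size
model = AutoModel.from_pretrained("facebook/dinov2-base")
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

def patch_features(image, model, processor):
    """Return L2-normalized DINOv2 patch embeddings (n_patches, dim); CLS token dropped."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        feats = model(**inputs).last_hidden_state[0, 1:]  # index 0 is the CLS token
    return torch.nn.functional.normalize(feats, dim=-1)

def anomaly_map(test_image, reference_image, model, processor):
    """Per-patch anomaly score: 1 minus the best cosine match to any reference patch."""
    t = patch_features(test_image, model, processor)
    r = patch_features(reference_image, model, processor)
    sim = t @ r.T                       # (n_test, n_ref) cosine similarities
    scores = 1.0 - sim.max(dim=1).values
    side = int(scores.numel() ** 0.5)   # assumes a square patch grid, e.g. 16 x 16
    return scores.reshape(side, side).numpy()
```

The resulting grid can then be upsampled to the image resolution (e.g. with `cv2.resize`) and blended over the original image to produce the heatmap overlay.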
- Developed and tested in a Kaggle notebook environment.
- Models are sourced from Hugging Face Hub.
- Includes fallback logic for CUDA initialization issues.
- Detection thresholding is static; performance can be optimized via calibration.
- Localization assumes a clean reference image (AoF06726.jpg) for anomaly comparison.
Future work:
- Train or fine-tune the models for improved robustness and generalization.
- Extend the pipeline to support:
  - video input
  - real-time inference
- Replace static thresholding with adaptive scoring (e.g., via Gaussian models or learned thresholds).
- Automate reference image selection or compute scene-level context clusters.
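The adaptive-scoring idea above could be prototyped by fitting a Gaussian to each image's patch-score distribution; `gaussian_threshold` below is a hypothetical helper, not existing notebook code.

```python
import numpy as np

def gaussian_threshold(scores: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Flag patches whose anomaly score exceeds mean + k * std of this image's scores.

    Fitting a Gaussian to the per-image score distribution yields a cut-off
    calibrated to each scene, instead of one fixed global threshold.
    """
    mu, sigma = scores.mean(), scores.std()
    return scores > mu + k * sigma
```

Applied to the heatmap, this flags only patches that are outliers relative to the rest of the same scene, which should transfer better across lighting and vegetation conditions than a single static threshold.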