Personal pipeline I use to check dataset quality before/after annotation. Runs fully local — no API calls, no cloud, everything stays on your machine (needs internet once to download model weights from HuggingFace).
Three modes depending on where you are in the pipeline:
- Pre-annotation — check if your frames are worth annotating (diversity, blur, lighting)
- Pre-training — check your annotated dataset for intra-class variation and outliers
- Post-training — check if your trained YOLO model detections make sense
Always run from project root. The launcher shows a menu: option 1 = pre-annotation, option 2 = pre-training. Post-training isn't in the menu — run it directly.
exec.bat # Windows
./exec.sh # Linux/MacOr run directly:
python "master_scripts/master_script_dinov2_PreAnn.py" # pre-annotation
python "master_scripts/master_script_dinov2_PostAnn_PreTrain.py" # pre-training
python "master_scripts/master_script_dinov2_PostTrain.py" # post-training (edit config vars at top of script)Pre-annotation and pre-training support two modes: interactive (questionary prompts, default) and headless (CLI args, for programmatic/service use). Post-training requires editing config variables at the top of the script.
Headless usage:
# PostAnn/PreTrain — minimum required args
python "master_scripts/master_script_dinov2_PostAnn_PreTrain.py" \
--imgs_path /data/images --label_path /data/labels --classes_txt /data/classes.txt
# PreAnn — minimum required args
python "master_scripts/master_script_dinov2_PreAnn.py" --frames_dir /data/framesIn headless mode: output dirs wiped fresh each run, embedding cache disabled, no stdin prompts.
Script: master_scripts/master_script_dinov2_PreAnn.py
Use this when you have raw frames and want to know if they're worth annotating. Catches: redundant frames, motion blur, bad lighting, missing scenarios.
What it does:
- DINOv2 [CLS||avg_patches] embeddings (1536d) — adaptive: adds pose features when persons detected
- PCA whitening → PaCMAP dimensionality reduction → HDBSCAN clustering
- Motion blur detection via directional gradient anisotropy
- Generates a 5-page PDF report
Output: preann_results/PreAnnotation_Quality_Report_{FolderName}.pdf
Prompts:
- Path to frames folder
- "Change default parameters?" — No = sensible defaults, Yes = exposes everything
- Clustering params (always asked):
n_components: PCA dims before HDBSCAN (default: 128)min_cluster_size: Minimum frames per activity group (default: 25)min_samples: HDBSCAN core point strictness (default: 3)
- Use embedding cache? (Yes = faster re-runs)
Key parameters:
| Parameter | Default | Notes |
|---|---|---|
n_components |
128 | Higher = finer activity separation |
min_cluster_size |
25 | Lower = more small groups detected |
min_samples |
3 | Lower = more frames included in clusters |
anisotropy_threshold |
3.6 | Motion blur cutoff (2.5 strict, 5.0 lenient) |
batch_size |
64 | DINOv2 batch size |
PDF structure:
- Page 1: Executive summary — quality score, frame breakdown, recommendations
- Page 2: Quality metrics dashboard (brightness, blur histograms, lighting distribution)
- Page 3: PaCMAP scatter plot + activity breakdown table
- Pages 4–M: Activity montages (4 columns × up to 5 rows per activity)
- Page M+1: Coverage gap heatmap (activity × lighting)
Interpreting results:
Quality score (0–10): average of three sub-scores — sharpness (max(0, 10 - median_anisotropy/5*10)), brightness centering (10 - |median_brightness - 127|/127*10), contrast (min(10, median_contrast/50*10)). 7–10 = good, 5–7 = review needed, <5 = fix before annotating.
Activity count: 1–2 = low diversity (same angle/video), 3–6 = good, 7+ = excellent
Note: Always validate montages visually. Clusters can form on viewpoint/lighting rather than semantic content.
Output files:
preann_results/
├── PreAnnotation_Quality_Report_{FolderName}.pdf
└── outliers_{FolderName}/ (optional — prompted at end)
Script: master_scripts/master_script_dinov2_PostAnn_PreTrain.py
Use after annotation to understand intra-class variation — are your classes homogeneous or do they have distinct sub-groups? Finds outliers and annotation errors.
What it does: Crops bboxes from YOLO annotations → DINOv2 768d embeddings → DBSCAN clustering → PDF report
Output: postann_pretrain_results/clustering_results_{FolderName}/PDF_REPORT_{FolderName}.pdf
Prompts:
- Images folder, labels folder, classes.txt
- Output subfolder names (smart defaults from your dataset folder name)
- "Change default parameters?" gate
- Clustering flags (always asked): auto-tune eps, per-class UMAP scatter (off by default)
- k-NN percentile (if auto-tune on) or manual epsilon (if off)
- Generate PDF?
Default values (master script defaults — override the individual script's own defaults):
| Parameter | Default | Description |
|---|---|---|
batch_size |
64 | DINOv2 batch size |
auto_tune |
True | Auto-calculate DBSCAN epsilon per class |
auto_tune_percentile |
90 | k-NN percentile (90=tight, 95=balanced, 98=loose) |
epsilon |
0.15 | Manual eps fallback (only if auto-tune off) |
save_class_scatter |
False | Per-class UMAP scatter plots (slow — runs UMAP per class) |
min_pts |
3 | Min points per cluster |
umap_min_dist |
0.05 | UMAP min_dist |
max_cluster_samples |
20 | Max sample images per cluster (overrides Script 3's own default of 5) |
pca_components |
0 | PCA dims before clustering (0=disabled; reduces 768d→Nd before DBSCAN; set 128 for large classes 10K+ samples) |
uniform_downsample_target |
4000 | Downsample target for uniform classes (overrides Script 3's default of 5000) |
uniform_min_samples |
12000 | Trigger downsampling only if class exceeds this (overrides Script 3's default of 10000) |
use_embedding_cache |
True | Skip re-generation if embeddings exist |
num_workers |
4 | Crop extraction threads (behind "Change defaults?" gate) |
PDF prompts (after basic setup, when PDF generation is on):
- "Group contrastive class pairs in distribution chart?" — if Yes, tick class names from a checkbox list to form named groups (e.g., "fit vs wrong_fit"). A second grouped horizontal bar chart is added after the plain annotation chart, showing only the grouped classes. Legend labels are auto-inferred from class name suffixes (falls back to "Variant N").
PDF structure:
- Page 1: Dataset overview table → Pipeline config table → Annotation distribution bar chart (always) → Contrastive group chart (only if groups provided) → Per-class clustering summary table (+ annotation issue files detail if any issues found)
- Pages 2–N: Centroid overview plots (
centroid_overview_N.png) - Remaining pages: Per-class montages (split at cluster row boundaries, never mid-cluster)
- Last page: Insights & Recommendations — issue-centric layout grouped by severity (🔴 Critical → 🟡 Warning → 🟢 OK → ℹ Info). Each row = one unique issue type; "Classes Affected" column lists all classes that triggered it. Max ~1 page regardless of class count. No class-imbalance row.
Pipeline steps (run automatically):
postannotation_scripts/1. ann_txt_files_crop_bbox.py— crop + validate datasetpostannotation_scripts/2. save_dinov2_embeddings_per_class.py— generate embeddingspostannotation_scripts/3. clustering_of_classes_embeddings.py— DBSCAN clusteringpostannotation_scripts/4. generate_pdf.py— PDF report (optional)
Script: master_scripts/master_script_dinov2_PostTrain.py
Use after training to check if model detections are sensible. Not in ./exec.bat — run directly.
python "master_scripts/master_script_dinov2_PostTrain.py"Same pipeline as pre-training but Step 1 runs YOLO inference on video instead of reading annotation files.
Configuration variables (at top of script):
RESULTS_DIR = "posttrain_results"
model_path = "path/to/model.pt"
video_path = "path/to/video.mp4"
classes_space_separated = ["class0", "class1", "class2"]
per_class_num_frames = 1000
conf_thresh = 0.4
frame_stride_per_video = 3 # every 3rd frame = 10fps on 30fps videoFrame stride guide (30fps video):
| Stride | Effective FPS | Use when |
|---|---|---|
| 2 | 15 | High variation scenes |
| 3 | 10 | Recommended default |
| 5 | 6 | Slow-moving objects |
python "postannotation_scripts/1. ann_txt_files_crop_bbox.py" \
--imgs_label_path "path/to/LabelledData" \
--classes 0 1 2 \
--class_ids_to_names 0 class0 1 class1 2 class2 \
--output_dir cropped_imgs_by_classValidates dataset before cropping — catches unexpected class IDs, empty labels, imbalance ratio >10x, and duplicate image basenames (same filename in multiple formats, e.g. foo.png + foo.jpg in the same folder). Output: cropped_imgs_by_class/{class}/*.jpg
| Parameter | Default | Description |
|---|---|---|
--num_workers |
4 | Crop extraction threads (4=safe NVMe laptop, 8=fast desktop) |
python "postannotation_scripts/1. yolo_model_crop_bbox_per_class.py" \
--model "path/to/model.pt" \
--video "path/to/video.mp4" \
--classes class0 class1 class2 \
--frame_stride 3 \
--num_frames 100 \
--conf_thresh 0.4 \
--output "cropped_images"| Parameter | Default | Description |
|---|---|---|
--num_frames |
100 | Target samples per class (uniform sampling) |
--frame_stride |
3 | Process every Nth frame |
--conf_thresh |
0.4 | Detection confidence threshold |
python "postannotation_scripts/2. save_dinov2_embeddings_per_class.py" \
--root ./cropped_imgs_by_class --batch 32Generates 768d DINOv2 embeddings per class. Saves embeddings_dinov2.npy + embeddings_dinov2_image_list.txt (image-embedding alignment) per class folder.
| Parameter | Default | Description |
|---|---|---|
--batch |
32 | Batch size |
--save_suffix |
embeddings_dinov2.npy |
Output filename |
--use_cache |
False | Skip if embeddings already exist |
# Auto-tune (recommended)
python "postannotation_scripts/3. clustering_of_classes_embeddings.py" \
--root ./cropped_imgs_by_class --auto_tune --min_samples 3 --save_montage
# Manual epsilon
python "postannotation_scripts/3. clustering_of_classes_embeddings.py" \
--root ./cropped_imgs_by_class --eps 0.15 --min_samples 3| Parameter | Default | Description |
|---|---|---|
--eps |
0.15 | DBSCAN epsilon (0.1 strict, 0.3 lenient) |
--min_samples |
3 | Min points to form cluster |
--auto_tune |
False | k-NN percentile auto-tune per class |
--auto_tune_percentile |
90 | 90=tight, 95=balanced, 98=loose |
--umap_min_dist |
0.05 | UMAP min_dist (0.0=tight, 0.1=loose) |
--max_samples |
5 | Sample images per cluster (outliers: all saved) |
--save_class_scatter |
False | Per-class UMAP scatter plots (slow — runs UMAP per class) |
--save_montage |
False | Generate image grid montages |
--pca_components |
128 | PCA dims before DBSCAN (0=disabled; reduces 768d→Nd, improves large-class clustering) |
--uniform_eps_threshold |
0.10 | Below this eps = class treated as uniform |
--uniform_downsample_target |
5000 | Downsample target for uniform classes |
--uniform_min_samples |
10000 | Only downsample if class has more than this |
GPU backend: Script 3 auto-detects cuML at startup. If cuML is installed and a CUDA device is available, uses GPU-accelerated DBSCAN + kNN + UMAP. Falls back to sklearn/umap-learn (CPU) silently otherwise — no config needed.
Installing cuML (WSL2 / Linux server):
RAPIDS doesn't ship pip wheels. Requires mamba (conda's fast solver) — plain conda hangs forever on RAPIDS due to SAT solver limitations (500+ packages, 3 channels).
Option A — miniforge (recommended for personal machines, no root needed):
# Install miniforge to $HOME — ships mamba built-in, no root needed
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3
$HOME/miniforge3/bin/conda init bash
# reopen shell — mamba now available directlyOption B — install mamba into existing anaconda/conda (company machines where you can't change base install):
conda install -c conda-forge mamba -y # one-time, into base envCreate env + install (same for both options):
# Create fresh env with cuML (CUDA 12.8)
mamba create -n new_eda_tool -c rapidsai -c conda-forge -c nvidia \
cuml=25.* python=3.11 cuda-version=12.8 -y
# Activate
eval "$(mamba shell hook --shell bash)" # if shell not initialized yet
mamba activate new_eda_tool # or: conda activate new_eda_tool
# Install PyTorch — use pinned version, pip path only (conda pytorch channel has no cuda=12.8)
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
--index-url https://download.pytorch.org/whl/cu128
# Install pipeline deps
pip install -r requirements.txtVerify:
python -c "import cuml; from cuml.cluster import DBSCAN; print(cuml.__version__)"
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"Key lessons learned:
mamba createnotconda create— conda solver deadlocks on RAPIDS- miniforge = mamba built-in; plain anaconda = need
conda install mambafirst - PyTorch must be pinned (eg.,
torch==2.7.1) — unpinned pulls incompatible nvidia-* pip packages that conflict with conda's CUDA libs — remember to ensure that torch version is compatible with cuda - Do NOT
pip install nvidia-cudnn-cu12separately — torch's pinned install handles it; wrong version =libcudnn.soimport error - conda pytorch channel has no
pytorch-cuda=12.8yet — pip is the only path
Output:
clustering_results/
├── {class}_clusters.png (only if --save_class_scatter)
├── {class}_montage.jpg
├── {class}_samples/
│ ├── cluster_0/
│ └── outliers/
├── centroid_overview_1.png
└── cluster_statistics.csv
Interpreting cluster counts: 1 = homogeneous class, 2–4 = normal variation, 5+ = high diversity (check montages before splitting classes — clusters can reflect viewpoint/lighting, not always semantic sub-classes).
Outlier rates: <5% = good, 5–10% = moderate variation, >20% = check data quality or loosen eps.
Usually invoked automatically by the pre-training master script. Standalone use:
python "postannotation_scripts/4. generate_pdf.py" \
--temp_txt_file postann_pretrain_results/cropped_imgs_by_class_MyDataset/temp_ann_file.txt \
--clustering_dir postann_pretrain_results/clustering_results_MyDataset \
--pdf_name "postann_pretrain_results/clustering_results_MyDataset/PDF_REPORT_MyDataset" \
--imgs_path path/to/images \
--label_path path/to/labels \
--classes_txt path/to/classes.txt \
--auto_tune --auto_tune_percentile 95| --contrastive_groups_json | None | JSON string {"Group Name": ["class_a", "class_b"]} — adds a second grouped horizontal bar chart after the plain annotation chart, showing only grouped classes. Legend labels auto-inferred from class name suffixes. |
PDF structure: Page 1 = dataset overview table (includes Duplicate Image Basenames row) → pipeline config table → annotation distribution bar chart (always) → contrastive group chart (only if --contrastive_groups_json provided, grouped classes only) → per-class clustering summary table (+ annotation issue files detail if any issues found). Pages 2–N = centroid overview plots (centroid_overview_N.png). Remaining pages = per-class montages (split at cluster row boundaries, never mid-cluster). Last page = Insights & Recommendations — issue-centric, grouped by severity (🔴 Critical → 🟡 Warning → 🟢 OK → ℹ Info); each row is one issue type listing all affected classes. No class-imbalance row.
Interactive Plotly HTML explorer. Run after Script 2, independent of Script 3.
python "master_scripts/1. interactive_cluster_viewer.py" \
--root postann_pretrain_results/cropped_imgs_by_classPre-annotation output → what to do:
| Result | Action |
|---|---|
| Quality score < 5 | Fix data before annotating — re-shoot or filter frames |
| >30% blurry frames | Drop them or re-extract with higher shutter speed |
| >40% dark frames | Add well-lit frames or adjust camera settings |
| 1–2 activities | Low diversity — capture from more angles/scenarios |
| Activity montage shows mixed unrelated content | Decrease min_cluster_size (allow smaller tighter groups) or increase n_components (finer PCA detail) and re-run |
Pre/post-training clustering output → what to do:
| Result | Action |
|---|---|
| 0 clusters (all outliers) | eps too strict — increase --eps or use --auto_tune |
| 1 cluster | Class is visually homogeneous — nothing to split |
| 5+ clusters | Check montages first — if clusters are semantic, consider sub-classes; if viewpoint/lighting artifacts, ignore |
| >20% outliers | Check outlier folder — annotation errors or genuinely rare appearances |
| Class imbalance >10x flagged | Add more data for underrepresented class before training |
| Script | When to use |
|---|---|
master_scripts/master_script_dinov2_PreAnn.py |
Before annotation — frame quality check |
master_scripts/master_script_dinov2_PostAnn_PreTrain.py |
After annotation — intra-class variation |
master_scripts/master_script_dinov2_PostTrain.py |
After training — model detection check |
postannotation_scripts/1. ann_txt_files_crop_bbox.py |
Manual: crop from YOLO annotations |
postannotation_scripts/1. yolo_model_crop_bbox_per_class.py |
Manual: crop from model detections |
postannotation_scripts/2. save_dinov2_embeddings_per_class.py |
Manual: generate embeddings |
postannotation_scripts/3. clustering_of_classes_embeddings.py |
Manual: DBSCAN clustering |
postannotation_scripts/4. generate_pdf.py |
Manual: generate PDF report |
pip install -r requirements.txtCore: PyTorch >= 2.1 with CUDA, transformers (DINOv2), ultralytics (YOLO), scikit-learn, pacmap, hdbscan, opencv-python, pillow, numpy, pandas, matplotlib, seaborn, reportlab, tqdm, questionary
- DINOv2 clustering: https://medium.com/@EnginDenizTangut/%EF%B8%8F-image-clustering-with-dinov2-and-hdbscan-a-hands-on-guide-35c6e29036f2
- DBSCAN: https://medium.com/@sachinsoni600517/clustering-like-a-pro-a-beginners-guide-to-dbscan-6c8274c362c4
- Post-training pipeline inspiration: https://medium.com/@albertferrevidal/facings-product-identifier-using-yolov8-and-image-embeddings-d3ca34463022