EDA Intra-Class Variation Scripts

Personal pipeline I use to check dataset quality before/after annotation. Runs fully local — no API calls, no cloud, everything stays on your machine (needs internet once to download model weights from HuggingFace).

Three modes depending on where you are in the pipeline:

Pre-annotation — check if your frames are worth annotating (diversity, blur, lighting)
Pre-training — check your annotated dataset for intra-class variation and outliers
Post-training — check if your trained YOLO model detections make sense

Quick Start

Always run from project root. The launcher shows a menu: option 1 = pre-annotation, option 2 = pre-training. Post-training isn't in the menu — run it directly.

exec.bat        # Windows
./exec.sh       # Linux/Mac

Or run directly:

python "master_scripts/master_script_dinov2_PreAnn.py"    # pre-annotation
python "master_scripts/master_script_dinov2_PostAnn_PreTrain.py"  # pre-training
python "master_scripts/master_script_dinov2_PostTrain.py"         # post-training (edit config vars at top of script)

Pre-annotation and pre-training support two modes: interactive (questionary prompts, default) and headless (CLI args, for programmatic/service use). Post-training requires editing config variables at the top of the script.

Headless usage:

# PostAnn/PreTrain — minimum required args
python "master_scripts/master_script_dinov2_PostAnn_PreTrain.py" \
    --imgs_path /data/images --label_path /data/labels --classes_txt /data/classes.txt

# PreAnn — minimum required args
python "master_scripts/master_script_dinov2_PreAnn.py" --frames_dir /data/frames

In headless mode: output dirs wiped fresh each run, embedding cache disabled, no stdin prompts.

Pre-Annotation: Frame Quality Assessment

Script: master_scripts/master_script_dinov2_PreAnn.py

Use this when you have raw frames and want to know if they're worth annotating. Catches: redundant frames, motion blur, bad lighting, missing scenarios.

What it does:

DINOv2 [CLS||avg_patches] embeddings (1536d) — adaptive: adds pose features when persons detected
PCA whitening → PaCMAP dimensionality reduction → HDBSCAN clustering
Motion blur detection via directional gradient anisotropy
Generates a 5-page PDF report

Output: preann_results/PreAnnotation_Quality_Report_{FolderName}.pdf

Prompts:

Path to frames folder
"Change default parameters?" — No = sensible defaults, Yes = exposes everything
Clustering params (always asked):
- n_components: PCA dims before HDBSCAN (default: 128)
- min_cluster_size: Minimum frames per activity group (default: 25)
- min_samples: HDBSCAN core point strictness (default: 3)
Use embedding cache? (Yes = faster re-runs)

Key parameters:

Parameter	Default	Notes
`n_components`	128	Higher = finer activity separation
`min_cluster_size`	25	Lower = more small groups detected
`min_samples`	3	Lower = more frames included in clusters
`anisotropy_threshold`	3.6	Motion blur cutoff (2.5 strict, 5.0 lenient)
`batch_size`	64	DINOv2 batch size

PDF structure:

Page 1: Executive summary — quality score, frame breakdown, recommendations
Page 2: Quality metrics dashboard (brightness, blur histograms, lighting distribution)
Page 3: PaCMAP scatter plot + activity breakdown table
Pages 4–M: Activity montages (4 columns × up to 5 rows per activity)
Page M+1: Coverage gap heatmap (activity × lighting)

Interpreting results:

Quality score (0–10): average of three sub-scores — sharpness (max(0, 10 - median_anisotropy/5*10)), brightness centering (10 - |median_brightness - 127|/127*10), contrast (min(10, median_contrast/50*10)). 7–10 = good, 5–7 = review needed, <5 = fix before annotating.

Activity count: 1–2 = low diversity (same angle/video), 3–6 = good, 7+ = excellent

Note: Always validate montages visually. Clusters can form on viewpoint/lighting rather than semantic content.

Output files:

preann_results/
├── PreAnnotation_Quality_Report_{FolderName}.pdf
└── outliers_{FolderName}/   (optional — prompted at end)

Pre-Training: Annotated Dataset Analysis

Script: master_scripts/master_script_dinov2_PostAnn_PreTrain.py

Use after annotation to understand intra-class variation — are your classes homogeneous or do they have distinct sub-groups? Finds outliers and annotation errors.

What it does: Crops bboxes from YOLO annotations → DINOv2 768d embeddings → DBSCAN clustering → PDF report

Output: postann_pretrain_results/clustering_results_{FolderName}/PDF_REPORT_{FolderName}.pdf

Prompts:

Images folder, labels folder, classes.txt
Output subfolder names (smart defaults from your dataset folder name)
"Change default parameters?" gate
Clustering flags (always asked): auto-tune eps, per-class UMAP scatter (off by default)
k-NN percentile (if auto-tune on) or manual epsilon (if off)
Generate PDF?

Default values (master script defaults — override the individual script's own defaults):

Parameter	Default	Description
`batch_size`	64	DINOv2 batch size
`auto_tune`	True	Auto-calculate DBSCAN epsilon per class
`auto_tune_percentile`	90	k-NN percentile (90=tight, 95=balanced, 98=loose)
`epsilon`	0.15	Manual eps fallback (only if auto-tune off)
`save_class_scatter`	False	Per-class UMAP scatter plots (slow — runs UMAP per class)
`min_pts`	3	Min points per cluster
`umap_min_dist`	0.05	UMAP min_dist
`max_cluster_samples`	20	Max sample images per cluster (overrides Script 3's own default of 5)
`pca_components`	0	PCA dims before clustering (0=disabled; reduces 768d→Nd before DBSCAN; set 128 for large classes 10K+ samples)
`uniform_downsample_target`	4000	Downsample target for uniform classes (overrides Script 3's default of 5000)
`uniform_min_samples`	12000	Trigger downsampling only if class exceeds this (overrides Script 3's default of 10000)
`use_embedding_cache`	True	Skip re-generation if embeddings exist
`num_workers`	4	Crop extraction threads (behind "Change defaults?" gate)

PDF prompts (after basic setup, when PDF generation is on):

"Group contrastive class pairs in distribution chart?" — if Yes, tick class names from a checkbox list to form named groups (e.g., "fit vs wrong_fit"). A second grouped horizontal bar chart is added after the plain annotation chart, showing only the grouped classes. Legend labels are auto-inferred from class name suffixes (falls back to "Variant N").

PDF structure:

Page 1: Dataset overview table → Pipeline config table → Annotation distribution bar chart (always) → Contrastive group chart (only if groups provided) → Per-class clustering summary table (+ annotation issue files detail if any issues found)
Pages 2–N: Centroid overview plots (centroid_overview_N.png)
Remaining pages: Per-class montages (split at cluster row boundaries, never mid-cluster)
Last page: Insights & Recommendations — issue-centric layout grouped by severity (🔴 Critical → 🟡 Warning → 🟢 OK → ℹ Info). Each row = one unique issue type; "Classes Affected" column lists all classes that triggered it. Max ~1 page regardless of class count. No class-imbalance row.

Pipeline steps (run automatically):

postannotation_scripts/1. ann_txt_files_crop_bbox.py — crop + validate dataset
postannotation_scripts/2. save_dinov2_embeddings_per_class.py — generate embeddings
postannotation_scripts/3. clustering_of_classes_embeddings.py — DBSCAN clustering
postannotation_scripts/4. generate_pdf.py — PDF report (optional)

Post-Training: Model Detection Analysis

Script: master_scripts/master_script_dinov2_PostTrain.py

Use after training to check if model detections are sensible. Not in ./exec.bat — run directly.

python "master_scripts/master_script_dinov2_PostTrain.py"

Same pipeline as pre-training but Step 1 runs YOLO inference on video instead of reading annotation files.

Configuration variables (at top of script):

RESULTS_DIR = "posttrain_results"
model_path = "path/to/model.pt"
video_path = "path/to/video.mp4"
classes_space_separated = ["class0", "class1", "class2"]
per_class_num_frames = 1000
conf_thresh = 0.4
frame_stride_per_video = 3        # every 3rd frame = 10fps on 30fps video

Frame stride guide (30fps video):

Stride	Effective FPS	Use when
2	15	High variation scenes
3	10	Recommended default
5	6	Slow-moving objects

Individual Scripts

postannotation_scripts/1. ann_txt_files_crop_bbox.py

python "postannotation_scripts/1. ann_txt_files_crop_bbox.py" \
    --imgs_label_path "path/to/LabelledData" \
    --classes 0 1 2 \
    --class_ids_to_names 0 class0 1 class1 2 class2 \
    --output_dir cropped_imgs_by_class

Validates dataset before cropping — catches unexpected class IDs, empty labels, imbalance ratio >10x, and duplicate image basenames (same filename in multiple formats, e.g. foo.png + foo.jpg in the same folder). Output: cropped_imgs_by_class/{class}/*.jpg

Parameter	Default	Description
`--num_workers`	4	Crop extraction threads (4=safe NVMe laptop, 8=fast desktop)

postannotation_scripts/1. yolo_model_crop_bbox_per_class.py

python "postannotation_scripts/1. yolo_model_crop_bbox_per_class.py" \
    --model "path/to/model.pt" \
    --video "path/to/video.mp4" \
    --classes class0 class1 class2 \
    --frame_stride 3 \
    --num_frames 100 \
    --conf_thresh 0.4 \
    --output "cropped_images"

Parameter	Default	Description
`--num_frames`	100	Target samples per class (uniform sampling)
`--frame_stride`	3	Process every Nth frame
`--conf_thresh`	0.4	Detection confidence threshold

postannotation_scripts/2. save_dinov2_embeddings_per_class.py

python "postannotation_scripts/2. save_dinov2_embeddings_per_class.py" \
    --root ./cropped_imgs_by_class --batch 32

Generates 768d DINOv2 embeddings per class. Saves embeddings_dinov2.npy + embeddings_dinov2_image_list.txt (image-embedding alignment) per class folder.

Parameter	Default	Description
`--batch`	32	Batch size
`--save_suffix`	`embeddings_dinov2.npy`	Output filename
`--use_cache`	False	Skip if embeddings already exist

postannotation_scripts/3. clustering_of_classes_embeddings.py

# Auto-tune (recommended)
python "postannotation_scripts/3. clustering_of_classes_embeddings.py" \
    --root ./cropped_imgs_by_class --auto_tune --min_samples 3 --save_montage

# Manual epsilon
python "postannotation_scripts/3. clustering_of_classes_embeddings.py" \
    --root ./cropped_imgs_by_class --eps 0.15 --min_samples 3

Parameter	Default	Description
`--eps`	0.15	DBSCAN epsilon (0.1 strict, 0.3 lenient)
`--min_samples`	3	Min points to form cluster
`--auto_tune`	False	k-NN percentile auto-tune per class
`--auto_tune_percentile`	90	90=tight, 95=balanced, 98=loose
`--umap_min_dist`	0.05	UMAP min_dist (0.0=tight, 0.1=loose)
`--max_samples`	5	Sample images per cluster (outliers: all saved)
`--save_class_scatter`	False	Per-class UMAP scatter plots (slow — runs UMAP per class)
`--save_montage`	False	Generate image grid montages
`--pca_components`	128	PCA dims before DBSCAN (0=disabled; reduces 768d→Nd, improves large-class clustering)
`--uniform_eps_threshold`	0.10	Below this eps = class treated as uniform
`--uniform_downsample_target`	5000	Downsample target for uniform classes
`--uniform_min_samples`	10000	Only downsample if class has more than this

GPU backend: Script 3 auto-detects cuML at startup. If cuML is installed and a CUDA device is available, uses GPU-accelerated DBSCAN + kNN + UMAP. Falls back to sklearn/umap-learn (CPU) silently otherwise — no config needed.

Installing cuML (WSL2 / Linux server):

RAPIDS doesn't ship pip wheels. Requires mamba (conda's fast solver) — plain conda hangs forever on RAPIDS due to SAT solver limitations (500+ packages, 3 channels).

Option A — miniforge (recommended for personal machines, no root needed):

# Install miniforge to $HOME — ships mamba built-in, no root needed
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3
$HOME/miniforge3/bin/conda init bash
# reopen shell — mamba now available directly

Option B — install mamba into existing anaconda/conda (company machines where you can't change base install):

conda install -c conda-forge mamba -y   # one-time, into base env

Create env + install (same for both options):

# Create fresh env with cuML (CUDA 12.8)
mamba create -n new_eda_tool -c rapidsai -c conda-forge -c nvidia \
    cuml=25.* python=3.11 cuda-version=12.8 -y

# Activate
eval "$(mamba shell hook --shell bash)"   # if shell not initialized yet
mamba activate new_eda_tool              # or: conda activate new_eda_tool

# Install PyTorch — use pinned version, pip path only (conda pytorch channel has no cuda=12.8)
pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 \
    --index-url https://download.pytorch.org/whl/cu128

# Install pipeline deps
pip install -r requirements.txt

Verify:

python -c "import cuml; from cuml.cluster import DBSCAN; print(cuml.__version__)"
python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"

Key lessons learned:

mamba create not conda create — conda solver deadlocks on RAPIDS
miniforge = mamba built-in; plain anaconda = need conda install mamba first
PyTorch must be pinned (eg., torch==2.7.1) — unpinned pulls incompatible nvidia-* pip packages that conflict with conda's CUDA libs — remember to ensure that torch version is compatible with cuda
Do NOT pip install nvidia-cudnn-cu12 separately — torch's pinned install handles it; wrong version = libcudnn.so import error
conda pytorch channel has no pytorch-cuda=12.8 yet — pip is the only path

Output:

clustering_results/
├── {class}_clusters.png   (only if --save_class_scatter)
├── {class}_montage.jpg
├── {class}_samples/
│   ├── cluster_0/
│   └── outliers/
├── centroid_overview_1.png
└── cluster_statistics.csv

Interpreting cluster counts: 1 = homogeneous class, 2–4 = normal variation, 5+ = high diversity (check montages before splitting classes — clusters can reflect viewpoint/lighting, not always semantic sub-classes).

Outlier rates: <5% = good, 5–10% = moderate variation, >20% = check data quality or loosen eps.

postannotation_scripts/4. generate_pdf.py

Usually invoked automatically by the pre-training master script. Standalone use:

python "postannotation_scripts/4. generate_pdf.py" \
    --temp_txt_file postann_pretrain_results/cropped_imgs_by_class_MyDataset/temp_ann_file.txt \
    --clustering_dir postann_pretrain_results/clustering_results_MyDataset \
    --pdf_name "postann_pretrain_results/clustering_results_MyDataset/PDF_REPORT_MyDataset" \
    --imgs_path path/to/images \
    --label_path path/to/labels \
    --classes_txt path/to/classes.txt \
    --auto_tune --auto_tune_percentile 95

| --contrastive_groups_json | None | JSON string {"Group Name": ["class_a", "class_b"]} — adds a second grouped horizontal bar chart after the plain annotation chart, showing only grouped classes. Legend labels auto-inferred from class name suffixes. |

PDF structure: Page 1 = dataset overview table (includes Duplicate Image Basenames row) → pipeline config table → annotation distribution bar chart (always) → contrastive group chart (only if --contrastive_groups_json provided, grouped classes only) → per-class clustering summary table (+ annotation issue files detail if any issues found). Pages 2–N = centroid overview plots (centroid_overview_N.png). Remaining pages = per-class montages (split at cluster row boundaries, never mid-cluster). Last page = Insights & Recommendations — issue-centric, grouped by severity (🔴 Critical → 🟡 Warning → 🟢 OK → ℹ Info); each row is one issue type listing all affected classes. No class-imbalance row.

master_scripts/1. interactive_cluster_viewer.py (optional)

Interactive Plotly HTML explorer. Run after Script 2, independent of Script 3.

python "master_scripts/1. interactive_cluster_viewer.py" \
    --root postann_pretrain_results/cropped_imgs_by_class

Decision Guide

Pre-annotation output → what to do:

Result	Action
Quality score < 5	Fix data before annotating — re-shoot or filter frames
>30% blurry frames	Drop them or re-extract with higher shutter speed
>40% dark frames	Add well-lit frames or adjust camera settings
1–2 activities	Low diversity — capture from more angles/scenarios
Activity montage shows mixed unrelated content	Decrease `min_cluster_size` (allow smaller tighter groups) or increase `n_components` (finer PCA detail) and re-run

Pre/post-training clustering output → what to do:

Result	Action
0 clusters (all outliers)	eps too strict — increase `--eps` or use `--auto_tune`
1 cluster	Class is visually homogeneous — nothing to split
5+ clusters	Check montages first — if clusters are semantic, consider sub-classes; if viewpoint/lighting artifacts, ignore
>20% outliers	Check outlier folder — annotation errors or genuinely rare appearances
Class imbalance >10x flagged	Add more data for underrepresented class before training

Pipeline Summary

Script	When to use
`master_scripts/master_script_dinov2_PreAnn.py`	Before annotation — frame quality check
`master_scripts/master_script_dinov2_PostAnn_PreTrain.py`	After annotation — intra-class variation
`master_scripts/master_script_dinov2_PostTrain.py`	After training — model detection check
`postannotation_scripts/1. ann_txt_files_crop_bbox.py`	Manual: crop from YOLO annotations
`postannotation_scripts/1. yolo_model_crop_bbox_per_class.py`	Manual: crop from model detections
`postannotation_scripts/2. save_dinov2_embeddings_per_class.py`	Manual: generate embeddings
`postannotation_scripts/3. clustering_of_classes_embeddings.py`	Manual: DBSCAN clustering
`postannotation_scripts/4. generate_pdf.py`	Manual: generate PDF report

Dependencies

pip install -r requirements.txt

Core: PyTorch >= 2.1 with CUDA, transformers (DINOv2), ultralytics (YOLO), scikit-learn, pacmap, hdbscan, opencv-python, pillow, numpy, pandas, matplotlib, seaborn, reportlab, tqdm, questionary

References

DINOv2 clustering: https://medium.com/@EnginDenizTangut/%EF%B8%8F-image-clustering-with-dinov2-and-hdbscan-a-hands-on-guide-35c6e29036f2
DBSCAN: https://medium.com/@sachinsoni600517/clustering-like-a-pro-a-beginners-guide-to-dbscan-6c8274c362c4
Post-training pipeline inspiration: https://medium.com/@albertferrevidal/facings-product-identifier-using-yolov8-and-image-embeddings-d3ca34463022

Name		Name	Last commit message	Last commit date
Latest commit History 66 Commits
change_log		change_log
master_scripts		master_scripts
postannotation_scripts		postannotation_scripts
utils		utils
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
exec.bat		exec.bat
exec.sh		exec.sh
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EDA Intra-Class Variation Scripts

Quick Start

Pre-Annotation: Frame Quality Assessment

Pre-Training: Annotated Dataset Analysis

Post-Training: Model Detection Analysis

Individual Scripts

postannotation_scripts/1. ann_txt_files_crop_bbox.py

postannotation_scripts/1. yolo_model_crop_bbox_per_class.py

postannotation_scripts/2. save_dinov2_embeddings_per_class.py

postannotation_scripts/3. clustering_of_classes_embeddings.py

postannotation_scripts/4. generate_pdf.py

master_scripts/1. interactive_cluster_viewer.py (optional)

Decision Guide

Pipeline Summary

Dependencies

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EDA Intra-Class Variation Scripts

Quick Start

Pre-Annotation: Frame Quality Assessment

Pre-Training: Annotated Dataset Analysis

Post-Training: Model Detection Analysis

Individual Scripts

postannotation_scripts/1. ann_txt_files_crop_bbox.py

postannotation_scripts/1. yolo_model_crop_bbox_per_class.py

postannotation_scripts/2. save_dinov2_embeddings_per_class.py

postannotation_scripts/3. clustering_of_classes_embeddings.py

postannotation_scripts/4. generate_pdf.py

master_scripts/1. interactive_cluster_viewer.py (optional)

Decision Guide

Pipeline Summary

Dependencies

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages