
[Experiment] Plan C/D/E: Multi-resolution training for optimal deployment model#361

Closed
xiaotianlou wants to merge 36 commits into main from CV_planC_544

Conversation

@xiaotianlou
Collaborator

⚠️ EXPERIMENTAL - DO NOT MERGE

This branch contains training scripts for three parallel experiments to find the best model for 544x960 FP16 TensorRT deployment on Orin:

Plans

| Plan | Model | imgsz | Strategy | GPU | Status |
|------|-------|-------|----------|-----|--------|
| C | yolo26s (from scratch) | 544 | V3 augmentations, 300+100 ep | GPU 3 | Training |
| D | smallrocket.pt (fine-tune) | 544 | Low LR, 80 ep | GPU 0 | Training |
| E | yolo26s (from scratch) | 640 | V3 augmentations, 200 ep | GPU 1 | Training |

Background

The P2 model (maxvalue.pt) achieves mAP50-95 = 0.765 during training but drops to 0.539 in DeepStream deployment due to:

  1. Resolution mismatch (960 train → 544 deploy)
  2. FP16 quantization degradation on P2 head (43K anchors)
  3. Community-confirmed issue with P2 heads under FP16 TensorRT

The standard 3-head model (smallrocket.pt) outperforms P2 in deployment (0.631 vs 0.539).

Key Files

  • src/training/train_planC_phase1.py / train_planC_phase2.py / run_planC.bash
  • src/training/train_planD_finetune.py / run_planD.bash
  • src/training/train_planE_640.py / run_planE.bash
  • docs/EXPERIMENT_LOG.md - Full experiment log with results

Decision Criteria

The winning model must achieve mAP50-95 > 0.631 in DeepStream and sustain >= 60 FPS on Orin.
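That gate can be written as a minimal sketch (the helper name is hypothetical; the thresholds are the ones stated above):

```python
# Hypothetical helper encoding the decision criteria: beat the
# smallrocket.pt DeepStream baseline (mAP50-95 = 0.631) and hold >= 60 FPS on Orin.
def meets_criteria(ds_map50_95: float, orin_fps: float) -> bool:
    BASELINE_MAP = 0.631   # smallrocket.pt in DeepStream
    MIN_FPS = 60.0         # Orin deployment floor
    return ds_map50_95 > BASELINE_MAP and orin_fps >= MIN_FPS

print(meets_criteria(0.641, 62.0))  # True: strictly beats baseline with FPS headroom
```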

xiaotianlou and others added 28 commits January 29, 2026 13:29
This script sets up a YOLO model for training with COCO dataset, including downloading and preparing negative samples.
┌────────────────────────┐
│     padding (gray)     │  ← ~208 rows of gray padding
├────────────────────────┤
│                        │
│  actual image 544×960  │  ← small-target pixel count unchanged
│                        │
├────────────────────────┤
│     padding (gray)     │  ← ~208 rows of gray padding
└────────────────────────┘
         960×960
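The padding in the diagram above follows from simple arithmetic (a sketch, assuming symmetric letterboxing into a 960×960 square):

```python
def letterbox_padding(img_h, img_w, target=960):
    """Rows/cols of gray padding added when letterboxing into a square target."""
    pad_h = target - img_h
    pad_w = target - img_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return top, bottom, left, right

# A 544x960 image letterboxed to 960x960 gains ~208 gray rows top and bottom;
# the image pixels themselves, including small targets, are untouched.
print(letterbox_padding(544, 960))  # (208, 208, 0, 0)
```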
Refactor albumentations integration to use monkey patching for custom augmentation pipeline. Update function to not require train_args and adjust model training call accordingly.
540 is not divisible by strides 8/16/32, causing feature-map
dimension errors in the P2 head. 544 (= 17 × 32) divides cleanly by all strides.

Made-with: Cursor
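The divisibility constraint behind that fix can be checked in two lines (a sketch; the stride set 8/16/32 is taken from the commit message):

```python
def valid_imgsz(size, strides=(8, 16, 32)):
    """An input size must divide evenly by every detection-head stride."""
    return all(size % s == 0 for s in strides)

print(valid_imgsz(540))  # False: 540 % 8 == 4, breaks feature-map shapes
print(valid_imgsz(544))  # True: 544 = 17 * 32
```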
Key fixes over coco.py:
- torchrun compatible (monkey patch propagates to all GPU processes)
- nbs=batch prevents weight_decay 3x amplification
- multi_scale=False (was resizing down to 480px, destroying small targets)
- patience=0 guarantees close_mosaic triggers at ep250
- COCO negatives reduced from 4000 to 2000
- Albumentations: blur_limit 7->5, added Downscale, removed BboxParams

Made-with: Cursor
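The "weight_decay 3x amplification" fixed above comes from batch-relative scaling of weight decay against the nominal batch size. A sketch mirroring that scaling rule (exact internals may differ by Ultralytics version):

```python
# With the default nbs=64 and batch=192, weight decay is scaled by
# batch * accumulate / nbs = 192 * 1 / 64 = 3x; setting nbs=batch makes it 1x.
def effective_weight_decay(weight_decay, batch, nbs=64):
    accumulate = max(round(nbs / batch), 1)   # gradient-accumulation steps
    return weight_decay * batch * accumulate / nbs

print(effective_weight_decay(0.0005, batch=192))           # 0.0015 -> 3x amplified
print(effective_weight_decay(0.0005, batch=192, nbs=192))  # 0.0005 -> unchanged
```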
Stage 2: SGD lr0=0.002, rect=True (~30% less padding), mosaic=0
Stage 3: SGD lr0=0.0002, Albumentations probabilities reduced 30%
Both use model=path (not resume=True) for clean optimizer reset.

Made-with: Cursor
evaluate.py: reports mAP broken down by small/medium/large targets
visualize_augment.py: Level 0 validation - renders mosaic=0.8 vs 0.4
  to visually confirm small targets survive augmentation pipeline

Made-with: Cursor
- Uses torchrun for DDP (fixes monkey patch + enables rect=True)
- Dynamic GPU detection with 40GB threshold
- Auto-retry up to 3x per stage with resume from last.pt
- tmux session survives terminal/SSH disconnect
- Each stage writes .stageN_result for automatic chaining

Made-with: Cursor
- torchrun caused all ranks to land on GPU 0 (OOM). Reverted to
  Ultralytics internal DDP (device="0,1,2,3") for Stage 1.
- Stage 2/3 use single-GPU for rect=True + Albumentations support.
- Added MKL_THREADING_LAYER=GNU and PYTORCH_CUDA_ALLOC_CONF.
- Reduced default batch from 192 to 128 for GPU contention safety.
- Pipeline script now uses plain python instead of torchrun.

Made-with: Cursor
Root cause: V1 Stage 2/3 used SGD while Stage 1 used MuSGD (via auto),
destroying learned conv weight distributions. Stage 2 lr0=0.002 was
6.7x too aggressive vs proven lr (train81: 0.0003).

V2 changes:
- New train_stage1b.py: extend from Stage 1 best.pt with MuSGD lr0=0.005
  mosaic=0.2 close_mosaic=30, 200ep 4-GPU DDP
- train_stage2.py: SGD->MuSGD, lr0 0.002->0.001, +warmup_epochs=5
- train_stage3.py: SGD->MuSGD, +warmup_epochs=3, patience 25->15
- run_pipeline.bash: V2 flow with stage1b/stage2_v2/stage3_v2 naming
- Robust save_dir detection in all stages (fixes "?" path bug)

Made-with: Cursor
MuSGD's Muon component proved incompatible with fine-tuning from
pre-trained checkpoints - fresh Muon state disrupted learned weights.
Tested lr0=0.005 and lr0=0.002, both showed post-warmup regression.

Switched to proven SGD approach (cf. train81: SGD lr=0.0003 -> 0.7500):
- Stage 1b: SGD lr0=0.0005, cos_lr=True (ep16: mAP50-95=0.7380, stable)
- Stage 2: SGD lr0=0.0003 (matching train81's proven fine-tune lr)
- Stage 3: SGD lr0=0.0001 (ultra-low polish)

Made-with: Cursor
Key changes from V2:
- Single GPU + gradient accumulation (nbs=128) for ALL phases, eliminating
  the DDP->single-GPU regime change that caused V2 Stage 2/3 regression
- Native `augmentations` parameter for custom Albumentations (no monkey-patch),
  confirmed working via v8_transforms getattr(hyp, "augmentations", None)
- Fix missed augmentation: scale 0.08->0.25, erasing 0.0->0.30
- Phase 1: 250ep SGD lr=0.0003, 9 custom Albumentations transforms
- Phase 2: 80ep SGD lr=0.0001, reduced augmentation, rect=False (consistent)
- Offline small target copy-paste script for bbox-only datasets

Made-with: Cursor
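The offline copy-paste idea mentioned above can be sketched in a few lines (hypothetical function and values, not the actual script): paste a small-object crop into an image and emit the corresponding normalized YOLO-format box.

```python
import numpy as np

def paste_crop(img, crop, x, y, cls=0):
    """Paste a crop at (x, y) and return the new image plus its YOLO label."""
    h, w = crop.shape[:2]
    img = img.copy()
    img[y:y + h, x:x + w] = crop
    H, W = img.shape[:2]
    # YOLO label: class, cx, cy, w, h -- all normalized to image size
    label = (cls, (x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
    return img, label

canvas = np.zeros((544, 960, 3), dtype=np.uint8)
crop = np.full((16, 16, 3), 255, dtype=np.uint8)
out, label = paste_crop(canvas, crop, x=100, y=200)
print(label)  # class 0, centered at (108/960, 208/544), 16px square
```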
Comprehensive documentation of the V3 training pipeline including:
- Model architecture (YOLO26s-P2, 9.66M params, 26.4 GFLOPs)
- Dataset statistics (31,666 train images, 12,976 boxes, 9.9% COCO-small)
- V3 design rationale (regime consistency, native augmentations, small-target focus)
- Phase 1: 250ep joint training (SGD lr=0.0003, nbs=128, 9 Albumentations)
- Phase 2: 80ep fine-tuning (SGD lr=0.0001, reduced augmentation)
- Final result: mAP50-95 = 0.7651 (all-time best)
- Full parameter comparison tables and reproduction guide

Made-with: Cursor
- train_planD_finetune.py: fine-tune smallrocket.pt at deployment
  resolution (544) with moderate augmentations and low LR
- run_planD.bash: tmux launcher for Plan D
- EXPERIMENT_LOG.md: record Plan A results (736x960 did not help),
  track Plan C and Plan D training progress

Made-with: Cursor
Records all plan states, branch maps, training configs, results,
and next steps. Enables new sessions to quickly recover context.

Made-with: Cursor
- train_planE_640.py: compromise resolution between 544 and 960
- run_planE.bash: tmux launcher, conservative batch=32 for shared GPU
- Tests if slightly larger training resolution improves generalization
  while still deploying at 544x960

Made-with: Cursor
@xiaotianlou xiaotianlou added the experiment Experimental branch - do not merge label Mar 27, 2026
- Plan A concluded (not viable): 3 resolutions tested, FP16 quantization confirmed as root cause
- Plan C: ep213/300, val mAP=0.543, DS benchmark in progress (ONNX path)
- Plan D: completed, DS mAP=0.641 but small-target regression (0.312 vs 0.341)
- Plan E: ep184/200, val mAP=0.549
- Plan F: just started (ep1/80), small-target focused fine-tuning
- Added numpy PCG64 workaround documentation
- Added benchmark code verification notes
Plan C (yolo26s from scratch @544) significantly underperforms baseline
(smallrocket mAP=0.631, small=0.341). From-scratch training at ep213 is
insufficient; fine-tuning remains the stronger approach.
Root cause analysis: all previous plans failed because they used
drastically weaker augmentation than the original train63 config:
- mosaic=0~0.30 vs train63's 1.0
- lr0=0.0002 vs train63's 0.01 (50x lower)
- cos_lr=True vs train63's linear decay
- close_mosaic=0~40 vs train63's 80
- 80 epochs vs train63's 420

Plan G replicates train63's exact augmentation recipe while fine-tuning
smallrocket.pt at deployment resolution (544).
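The recipe gap driving Plan G can be summarized as a small config comparison (a sketch; the "failed" column takes the upper end of the ranges listed above):

```python
# train63's proven recipe vs. the settings the failed plans actually used.
train63 = dict(mosaic=1.0, lr0=0.01, cos_lr=False, close_mosaic=80, epochs=420)
failed  = dict(mosaic=0.30, lr0=0.0002, cos_lr=True, close_mosaic=40, epochs=80)

print(train63["lr0"] / failed["lr0"])  # 50.0 -> the "50x lower" learning rate
```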
- Plan H: train63 recipe from scratch @544, 420ep (GPU 3)
- Plan I: tiled dataset + train63, 300ep (GPU 0)
- Plan J: custom bbox copy-paste + train63, 420ep (GPU 1)
- Plan G continues on GPU 2

Stopped Plan C/E/F (proven ineffective).
Root cause: previous plans used mosaic=0~0.30, lr0=0.0002
vs baseline's mosaic=1.0, lr0=0.01.
…atus

- Complete experiment timeline from Plan A through Phase 2
- train63 config analysis table (root cause of all failures)
- Phase 2 plans G/H/I/J all running: G=ep14, H=ep5, I=caching, J=ep4
- Plan I tiled dataset: 15832 orig + 34909 tiles = 50741 mixed
- Plan J copy-paste: 3581 small object crops extracted and injected
- Fixed prepare_tiles.py missing os import
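The tiling behind Plan I can be sketched as a sliding window over each source image (hypothetical parameters; not the actual prepare_tiles.py, which also handles edge clamping and label remapping):

```python
# Split an image into overlapping tile origins so small targets occupy a
# larger fraction of the network input.
def tile_origins(img_w, img_h, tile=544, overlap=0.25):
    step = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    return [(x, y) for y in ys for x in xs]

print(len(tile_origins(1920, 1080, tile=544)))  # 8 tiles (4 columns x 2 rows)
```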
@xiaotianlou xiaotianlou closed this Apr 6, 2026
