
[Experiment] Plan C/D/E: Multi-resolution training for optimal deployment model#361

Closed
xiaotianlou wants to merge 36 commits into main from CV_planC_544

Conversation

@xiaotianlou
Collaborator

⚠️ EXPERIMENTAL - DO NOT MERGE

This branch contains training scripts for three parallel experiments to find the best model for 544x960 FP16 TensorRT deployment on Orin:

Plans

| Plan | Model | imgsz | Strategy | GPU | Status |
|------|-------|-------|----------|-----|--------|
| C | yolo26s (from scratch) | 544 | V3 augmentations, 300+100 ep | GPU 3 | Training |
| D | smallrocket.pt (fine-tune) | 544 | Low LR, 80 ep | GPU 0 | Training |
| E | yolo26s (from scratch) | 640 | V3 augmentations, 200 ep | GPU 1 | Training |

Background

The P2 model (maxvalue.pt) achieves mAP50-95 = 0.765 during training but drops to 0.539 in DeepStream deployment due to:

  1. Resolution mismatch (960 train → 544 deploy)
  2. FP16 quantization degradation on P2 head (43K anchors)
  3. Community-confirmed issue with P2 heads under FP16 TensorRT

The standard 3-head model (smallrocket.pt) outperforms P2 in deployment (0.631 vs 0.539).

Key Files

  • src/training/train_planC_phase1.py / train_planC_phase2.py / run_planC.bash
  • src/training/train_planD_finetune.py / run_planD.bash
  • src/training/train_planE_640.py / run_planE.bash
  • docs/EXPERIMENT_LOG.md - Full experiment log with results

Decision Criteria

The winning model must achieve mAP50-95 > 0.631 in DeepStream and sustain >= 60 FPS on Orin.
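That gate can be written as a minimal sketch (the helper name is hypothetical; the thresholds are the ones stated above):

```python
# Hypothetical helper encoding the decision criteria: beat the
# smallrocket.pt DeepStream baseline (mAP50-95 = 0.631) and hold >= 60 FPS on Orin.
def meets_criteria(ds_map50_95: float, orin_fps: float) -> bool:
    BASELINE_MAP = 0.631   # smallrocket.pt in DeepStream
    MIN_FPS = 60.0         # Orin deployment floor
    return ds_map50_95 > BASELINE_MAP and orin_fps >= MIN_FPS

print(meets_criteria(0.641, 62.0))  # True: strictly beats baseline with FPS headroom
```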

xiaotianlou and others added 28 commits January 29, 2026 13:29
This script sets up a YOLO model for training with COCO dataset, including downloading and preparing negative samples.
┌────────────────────────┐
│     padding (gray)     │  ← ~208 rows of gray padding
├────────────────────────┤
│                        │
│  actual image 544×960  │  ← small-target pixel count unchanged
│                        │
├────────────────────────┤
│     padding (gray)     │  ← ~208 rows of gray padding
└────────────────────────┘
         960×960
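The padding in the diagram above follows from simple arithmetic (a sketch, assuming symmetric letterboxing into a 960×960 square):

```python
def letterbox_padding(img_h, img_w, target=960):
    """Rows/cols of gray padding added when letterboxing into a square target."""
    pad_h = target - img_h
    pad_w = target - img_w
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return top, bottom, left, right

# A 544x960 image letterboxed to 960x960 gains ~208 gray rows top and bottom;
# the image pixels themselves, including small targets, are untouched.
print(letterbox_padding(544, 960))  # (208, 208, 0, 0)
```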
Refactor albumentations integration to use monkey patching for custom augmentation pipeline. Update function to not require train_args and adjust model training call accordingly.
540 is not divisible by strides 8/16/32, causing feature-map
dimension errors in the P2 head. 544 (= 17 × 32) divides cleanly by all strides.

Made-with: Cursor
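The divisibility constraint behind that fix can be checked in two lines (a sketch; the stride set 8/16/32 is taken from the commit message):

```python
def valid_imgsz(size, strides=(8, 16, 32)):
    """An input size must divide evenly by every detection-head stride."""
    return all(size % s == 0 for s in strides)

print(valid_imgsz(540))  # False: 540 % 8 == 4, breaks feature-map shapes
print(valid_imgsz(544))  # True: 544 = 17 * 32
```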
Key fixes over coco.py:
- torchrun compatible (monkey patch propagates to all GPU processes)
- nbs=batch prevents weight_decay 3x amplification
- multi_scale=False (was resizing down to 480px, destroying small targets)
- patience=0 guarantees close_mosaic triggers at ep250
- COCO negatives reduced from 4000 to 2000
- Albumentations: blur_limit 7->5, added Downscale, removed BboxParams

Made-with: Cursor
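The "weight_decay 3x amplification" fixed above comes from batch-relative scaling of weight decay against the nominal batch size. A sketch mirroring that scaling rule (exact internals may differ by Ultralytics version):

```python
# With the default nbs=64 and batch=192, weight decay is scaled by
# batch * accumulate / nbs = 192 * 1 / 64 = 3x; setting nbs=batch makes it 1x.
def effective_weight_decay(weight_decay, batch, nbs=64):
    accumulate = max(round(nbs / batch), 1)   # gradient-accumulation steps
    return weight_decay * batch * accumulate / nbs

print(effective_weight_decay(0.0005, batch=192))           # 0.0015 -> 3x amplified
print(effective_weight_decay(0.0005, batch=192, nbs=192))  # 0.0005 -> unchanged
```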
Stage 2: SGD lr0=0.002, rect=True (~30% less padding), mosaic=0
Stage 3: SGD lr0=0.0002, Albumentations probabilities reduced 30%
Both use model=path (not resume=True) for clean optimizer reset.

Made-with: Cursor
evaluate.py: reports mAP broken down by small/medium/large targets
visualize_augment.py: Level 0 validation - renders mosaic=0.8 vs 0.4
  to visually confirm small targets survive augmentation pipeline

Made-with: Cursor
- Uses torchrun for DDP (fixes monkey patch + enables rect=True)
- Dynamic GPU detection with 40GB threshold
- Auto-retry up to 3x per stage with resume from last.pt
- tmux session survives terminal/SSH disconnect
- Each stage writes .stageN_result for automatic chaining

Made-with: Cursor
- torchrun caused all ranks to land on GPU 0 (OOM). Reverted to
  Ultralytics internal DDP (device="0,1,2,3") for Stage 1.
- Stage 2/3 use single-GPU for rect=True + Albumentations support.
- Added MKL_THREADING_LAYER=GNU and PYTORCH_CUDA_ALLOC_CONF.
- Reduced default batch from 192 to 128 for GPU contention safety.
- Pipeline script now uses plain python instead of torchrun.

Made-with: Cursor
Root cause: V1 Stage 2/3 used SGD while Stage 1 used MuSGD (via auto),
destroying learned conv weight distributions. Stage 2 lr0=0.002 was
6.7x too aggressive vs proven lr (train81: 0.0003).

V2 changes:
- New train_stage1b.py: extend from Stage 1 best.pt with MuSGD lr0=0.005
  mosaic=0.2 close_mosaic=30, 200ep 4-GPU DDP
- train_stage2.py: SGD->MuSGD, lr0 0.002->0.001, +warmup_epochs=5
- train_stage3.py: SGD->MuSGD, +warmup_epochs=3, patience 25->15
- run_pipeline.bash: V2 flow with stage1b/stage2_v2/stage3_v2 naming
- Robust save_dir detection in all stages (fixes "?" path bug)

Made-with: Cursor
MuSGD's Muon component proved incompatible with fine-tuning from
pre-trained checkpoints - fresh Muon state disrupted learned weights.
Tested lr0=0.005 and lr0=0.002, both showed post-warmup regression.

Switched to proven SGD approach (cf. train81: SGD lr=0.0003 -> 0.7500):
- Stage 1b: SGD lr0=0.0005, cos_lr=True (ep16: mAP50-95=0.7380, stable)
- Stage 2: SGD lr0=0.0003 (matching train81's proven fine-tune lr)
- Stage 3: SGD lr0=0.0001 (ultra-low polish)

Made-with: Cursor
Key changes from V2:
- Single GPU + gradient accumulation (nbs=128) for ALL phases, eliminating
  the DDP->single-GPU regime change that caused V2 Stage 2/3 regression
- Native `augmentations` parameter for custom Albumentations (no monkey-patch),
  confirmed working via v8_transforms getattr(hyp, "augmentations", None)
- Fix missed augmentation: scale 0.08->0.25, erasing 0.0->0.30
- Phase 1: 250ep SGD lr=0.0003, 9 custom Albumentations transforms
- Phase 2: 80ep SGD lr=0.0001, reduced augmentation, rect=False (consistent)
- Offline small target copy-paste script for bbox-only datasets

Made-with: Cursor
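The offline copy-paste idea mentioned above can be sketched in a few lines (hypothetical function and values, not the actual script): paste a small-object crop into an image and emit the corresponding normalized YOLO-format box.

```python
import numpy as np

def paste_crop(img, crop, x, y, cls=0):
    """Paste a crop at (x, y) and return the new image plus its YOLO label."""
    h, w = crop.shape[:2]
    img = img.copy()
    img[y:y + h, x:x + w] = crop
    H, W = img.shape[:2]
    # YOLO label: class, cx, cy, w, h -- all normalized to image size
    label = (cls, (x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
    return img, label

canvas = np.zeros((544, 960, 3), dtype=np.uint8)
crop = np.full((16, 16, 3), 255, dtype=np.uint8)
out, label = paste_crop(canvas, crop, x=100, y=200)
print(label)  # class 0, centered at (108/960, 208/544), 16px square
```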
Comprehensive documentation of the V3 training pipeline including:
- Model architecture (YOLO26s-P2, 9.66M params, 26.4 GFLOPs)
- Dataset statistics (31,666 train images, 12,976 boxes, 9.9% COCO-small)
- V3 design rationale (regime consistency, native augmentations, small-target focus)
- Phase 1: 250ep joint training (SGD lr=0.0003, nbs=128, 9 Albumentations)
- Phase 2: 80ep fine-tuning (SGD lr=0.0001, reduced augmentation)
- Final result: mAP50-95 = 0.7651 (all-time best)
- Full parameter comparison tables and reproduction guide

Made-with: Cursor
- train_planD_finetune.py: fine-tune smallrocket.pt at deployment
  resolution (544) with moderate augmentations and low LR
- run_planD.bash: tmux launcher for Plan D
- EXPERIMENT_LOG.md: record Plan A results (736x960 did not help),
  track Plan C and Plan D training progress

Made-with: Cursor
Records all plan states, branch maps, training configs, results,
and next steps. Enables new sessions to quickly recover context.

Made-with: Cursor
- train_planE_640.py: compromise resolution between 544 and 960
- run_planE.bash: tmux launcher, conservative batch=32 for shared GPU
- Tests if slightly larger training resolution improves generalization
  while still deploying at 544x960

Made-with: Cursor
@xiaotianlou xiaotianlou added the experiment Experimental branch - do not merge label Mar 27, 2026
- Plan A concluded (not viable): 3 resolutions tested, FP16 quantization confirmed as root cause
- Plan C: ep213/300, val mAP=0.543, DS benchmark in progress (ONNX path)
- Plan D: completed, DS mAP=0.641 but small-target regression (0.312 vs 0.341)
- Plan E: ep184/200, val mAP=0.549
- Plan F: just started (ep1/80), small-target focused fine-tuning
- Added numpy PCG64 workaround documentation
- Added benchmark code verification notes
Plan C (yolo26s from scratch @544) significantly underperforms baseline
(smallrocket mAP=0.631, small=0.341). From-scratch training at ep213 is
insufficient; fine-tuning remains the stronger approach.
Root cause analysis: all previous plans failed because they used
drastically weaker augmentation than the original train63 config:
- mosaic=0~0.30 vs train63's 1.0
- lr0=0.0002 vs train63's 0.01 (50x lower)
- cos_lr=True vs train63's linear decay
- close_mosaic=0~40 vs train63's 80
- 80 epochs vs train63's 420

Plan G replicates train63's exact augmentation recipe while fine-tuning
smallrocket.pt at deployment resolution (544).
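The recipe gap driving Plan G can be summarized as a small config comparison (a sketch; the "failed" column takes the upper end of the ranges listed above):

```python
# train63's proven recipe vs. the settings the failed plans actually used.
train63 = dict(mosaic=1.0, lr0=0.01, cos_lr=False, close_mosaic=80, epochs=420)
failed  = dict(mosaic=0.30, lr0=0.0002, cos_lr=True, close_mosaic=40, epochs=80)

print(train63["lr0"] / failed["lr0"])  # 50.0 -> the "50x lower" learning rate
```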
- Plan H: train63 recipe from scratch @544, 420ep (GPU 3)
- Plan I: tiled dataset + train63, 300ep (GPU 0)
- Plan J: custom bbox copy-paste + train63, 420ep (GPU 1)
- Plan G continues on GPU 2

Stopped Plan C/E/F (proven ineffective).
Root cause: previous plans used mosaic=0~0.30, lr0=0.0002
vs baseline's mosaic=1.0, lr0=0.01.
…atus

- Complete experiment timeline from Plan A through Phase 2
- train63 config analysis table (root cause of all failures)
- Phase 2 plans G/H/I/J all running: G=ep14, H=ep5, I=caching, J=ep4
- Plan I tiled dataset: 15832 orig + 34909 tiles = 50741 mixed
- Plan J copy-paste: 3581 small object crops extracted and injected
- Fixed prepare_tiles.py missing os import
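The tiling behind Plan I can be sketched as a sliding window over each source image (hypothetical parameters; not the actual prepare_tiles.py, which also handles edge clamping and label remapping):

```python
# Split an image into overlapping tile origins so small targets occupy a
# larger fraction of the network input.
def tile_origins(img_w, img_h, tile=544, overlap=0.25):
    step = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    return [(x, y) for y in ys for x in xs]

print(len(tile_origins(1920, 1080, tile=544)))  # 8 tiles (4 columns x 2 rows)
```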
@xiaotianlou xiaotianlou closed this Apr 6, 2026
