[Experiment] Plan C/D/E: Multi-resolution training for optimal deployment model#361
Closed
xiaotianlou wants to merge 36 commits into main from
Conversation
This script sets up a YOLO model for training with the COCO dataset, including downloading and preparing negative samples.
┌────────────────────────┐
│ padding (gray)         │ ← ~208 rows of gray padding
├────────────────────────┤
│                        │
│ actual image 544×960   │ ← small-target pixel count unchanged
│                        │
├────────────────────────┤
│ padding (gray)         │ ← ~208 rows of gray padding
└────────────────────────┘
          960×960
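The padding shown above can be computed directly. A minimal sketch, assuming square letterboxing to imgsz=960 with no rescaling (the image already fits the target):

```python
def letterbox_padding(img_h: int, img_w: int, target: int = 960):
    """Rows/columns of gray fill added when letterboxing to a square target.

    Assumes the image already fits inside the target, as in the
    544x960 -> 960x960 case above (hypothetical helper, not repo code)."""
    pad_h = target - img_h
    pad_w = target - img_w
    # Split evenly between top/bottom and left/right
    return pad_h // 2, pad_w // 2

top, left = letterbox_padding(544, 960)
print(top, left)  # 208 rows above and below, 0 columns left/right
```

This is why training at 960×960 buys nothing for small targets here: the extra 416 rows are pure gray fill.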
Refactor the albumentations integration to use monkey patching for the custom augmentation pipeline. Update the function so it no longer requires train_args, and adjust the model training call accordingly.
540 is not divisible by strides 8/16/32, causing feature-map dimension errors in the P2 head. 544 (= 17×32) cleanly divides all strides. Made-with: Cursor
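The divisibility constraint can be checked directly. A small sketch (the standard P3/P4/P5 strides; a P2 head adds stride 4, which any multiple of 32 also satisfies):

```python
STRIDES = (8, 16, 32)  # P3/P4/P5 detection heads; P2 adds stride 4

def check_imgsz(size: int):
    """Return per-stride feature-map sizes, or raise for non-divisible sizes."""
    bad = [s for s in STRIDES if size % s != 0]
    if bad:
        raise ValueError(f"{size} not divisible by strides {bad}")
    return [size // s for s in STRIDES]

print(check_imgsz(544))  # [68, 34, 17] -- divides cleanly
# check_imgsz(540) raises: 540 not divisible by strides [8, 16, 32]
```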
Key fixes over coco.py:
- torchrun compatible (monkey patch propagates to all GPU processes)
- nbs=batch prevents 3x weight_decay amplification
- multi_scale=False (was resizing down to 480px, destroying small targets)
- patience=0 guarantees close_mosaic triggers at ep250
- COCO negatives reduced from 4000 to 2000
- Albumentations: blur_limit 7->5, added Downscale, removed BboxParams
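The nbs=batch fix targets Ultralytics' weight-decay scaling during optimizer setup. A sketch of that scaling (exact code may differ by version; wd=0.0005 is the assumed default):

```python
def effective_weight_decay(wd: float, batch: int, nbs: int = 64):
    """Sketch of how the Ultralytics trainer scales weight decay by
    batch * accumulate / nbs, where nbs is the nominal batch size and
    accumulate is the number of gradient-accumulation steps."""
    accumulate = max(round(nbs / batch), 1)
    return wd * batch * accumulate / nbs

print(effective_weight_decay(0.0005, batch=192))           # 0.0015 -- 3x amplified
print(effective_weight_decay(0.0005, batch=192, nbs=192))  # 0.0005 -- unchanged
```

With batch=192 against the default nbs=64, the effective decay is tripled; setting nbs=batch restores the configured value.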
Stage 2: SGD lr0=0.002, rect=True (~30% less padding), mosaic=0
Stage 3: SGD lr0=0.0002, Albumentations probabilities reduced 30%
Both use model=path (not resume=True) for a clean optimizer reset.
evaluate.py: reports mAP broken down by small/medium/large targets
visualize_augment.py: Level 0 validation - renders mosaic=0.8 vs 0.4 to visually confirm small targets survive the augmentation pipeline
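The small/medium/large breakdown presumably follows the COCO convention (area thresholds of 32² and 96² px²). A sketch of how boxes might be bucketed (hypothetical helper, not the actual evaluate.py code):

```python
# COCO convention: small < 32^2 px^2, medium in [32^2, 96^2), large >= 96^2
def size_bucket(w: float, h: float) -> str:
    """Classify a box by pixel area, COCO-style."""
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(size_bucket(20, 20))    # small  (400 px^2)
print(size_bucket(50, 50))    # medium (2500 px^2)
print(size_bucket(100, 100))  # large  (10000 px^2)
```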
- Uses torchrun for DDP (fixes monkey patch + enables rect=True)
- Dynamic GPU detection with 40GB threshold
- Auto-retry up to 3x per stage with resume from last.pt
- tmux session survives terminal/SSH disconnect
- Each stage writes .stageN_result for automatic chaining
- torchrun caused all ranks to land on GPU 0 (OOM). Reverted to Ultralytics internal DDP (device="0,1,2,3") for Stage 1.
- Stage 2/3 use single-GPU for rect=True + Albumentations support.
- Added MKL_THREADING_LAYER=GNU and PYTORCH_CUDA_ALLOC_CONF.
- Reduced default batch from 192 to 128 for GPU contention safety.
- Pipeline script now uses plain python instead of torchrun.
Root cause: V1 Stage 2/3 used SGD while Stage 1 used MuSGD (via auto), destroying learned conv weight distributions. Stage 2 lr0=0.002 was 6.7x too aggressive vs the proven lr (train81: 0.0003).
V2 changes:
- New train_stage1b.py: extend from Stage 1 best.pt with MuSGD lr0=0.005, mosaic=0.2, close_mosaic=30, 200ep 4-GPU DDP
- train_stage2.py: SGD->MuSGD, lr0 0.002->0.001, +warmup_epochs=5
- train_stage3.py: SGD->MuSGD, +warmup_epochs=3, patience 25->15
- run_pipeline.bash: V2 flow with stage1b/stage2_v2/stage3_v2 naming
- Robust save_dir detection in all stages (fixes "?" path bug)
MuSGD's Muon component proved incompatible with fine-tuning from pre-trained checkpoints: fresh Muon state disrupted learned weights. Tested lr0=0.005 and lr0=0.002; both showed post-warmup regression.
Switched to the proven SGD approach (cf. train81: SGD lr=0.0003 -> 0.7500):
- Stage 1b: SGD lr0=0.0005, cos_lr=True (ep16: mAP50-95=0.7380, stable)
- Stage 2: SGD lr0=0.0003 (matching train81's proven fine-tune lr)
- Stage 3: SGD lr0=0.0001 (ultra-low polish)
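For reference, cos_lr=True in Ultralytics follows a one-cycle cosine decay from lr0 down to lr0*lrf. A sketch of that schedule, assuming the default lrf=0.01 (exact implementation may differ by version):

```python
import math

def cosine_lr(lr0: float, epoch: int, epochs: int, lrf: float = 0.01):
    """Cosine decay from lr0 (epoch 0) to lr0 * lrf (final epoch), in the
    shape of Ultralytics' one-cycle schedule."""
    frac = ((1 - math.cos(epoch * math.pi / epochs)) / 2) * (lrf - 1) + 1
    return lr0 * frac

print(cosine_lr(0.0005, 0, 200))    # 0.0005 at the start
print(cosine_lr(0.0005, 200, 200))  # 5e-06 (= lr0 * lrf) at the end
```

So Stage 1b's lr0=0.0005 with cos_lr=True decays smoothly toward ~5e-06, which is consistent with the ultra-low-lr polish philosophy of the later stages.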
Key changes from V2:
- Single GPU + gradient accumulation (nbs=128) for ALL phases, eliminating the DDP -> single-GPU regime change that caused the V2 Stage 2/3 regression
- Native `augmentations` parameter for custom Albumentations (no monkey-patch), confirmed working via v8_transforms getattr(hyp, "augmentations", None)
- Fix missed augmentation: scale 0.08->0.25, erasing 0.0->0.30
- Phase 1: 250ep SGD lr=0.0003, 9 custom Albumentations transforms
- Phase 2: 80ep SGD lr=0.0001, reduced augmentation, rect=False (consistent)
- Offline small-target copy-paste script for bbox-only datasets
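The offline copy-paste idea for bbox-only datasets reduces to pasting small-object crops into training images and emitting the corresponding box. A minimal NumPy sketch of the core operation (hypothetical helper, not the script in this PR; real usage would also avoid overlapping existing boxes):

```python
import numpy as np

def paste_crop(img: np.ndarray, crop: np.ndarray, x: int, y: int):
    """Paste a small-object crop at (x, y) and return (new image, xyxy box).

    Assumes the crop fits entirely inside the image bounds."""
    ch, cw = crop.shape[:2]
    out = img.copy()
    out[y:y + ch, x:x + cw] = crop
    return out, (x, y, x + cw, y + ch)

canvas = np.zeros((64, 64, 3), dtype=np.uint8)
crop = np.full((8, 8, 3), 255, dtype=np.uint8)
out, box = paste_crop(canvas, crop, 10, 20)
print(box)  # (10, 20, 18, 28)
```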
Comprehensive documentation of the V3 training pipeline, including:
- Model architecture (YOLO26s-P2, 9.66M params, 26.4 GFLOPs)
- Dataset statistics (31,666 train images, 12,976 boxes, 9.9% COCO-small)
- V3 design rationale (regime consistency, native augmentations, small-target focus)
- Phase 1: 250ep joint training (SGD lr=0.0003, nbs=128, 9 Albumentations)
- Phase 2: 80ep fine-tuning (SGD lr=0.0001, reduced augmentation)
- Final result: mAP50-95 = 0.7651 (all-time best)
- Full parameter comparison tables and reproduction guide
- train_planD_finetune.py: fine-tune smallrocket.pt at deployment resolution (544) with moderate augmentations and low LR
- run_planD.bash: tmux launcher for Plan D
- EXPERIMENT_LOG.md: record Plan A results (736x960 did not help), track Plan C and Plan D training progress
Records all plan states, branch maps, training configs, results, and next steps, enabling new sessions to quickly recover context.
- train_planE_640.py: compromise resolution between 544 and 960
- run_planE.bash: tmux launcher, conservative batch=32 for the shared GPU
- Tests whether a slightly larger training resolution improves generalization while still deploying at 544x960
- Plan A concluded (not viable): 3 resolutions tested, FP16 quantization confirmed as root cause
- Plan C: ep213/300, val mAP=0.543, DS benchmark in progress (ONNX path)
- Plan D: completed, DS mAP=0.641 but small-target regression (0.312 vs 0.341)
- Plan E: ep184/200, val mAP=0.549
- Plan F: just started (ep1/80), small-target focused fine-tuning
- Added numpy PCG64 workaround documentation
- Added benchmark code verification notes
Plan C (yolo26s from scratch @544) significantly underperforms the baseline (smallrocket mAP=0.631, small=0.341). From-scratch training at ep213 is insufficient; fine-tuning remains the stronger approach.
Root cause analysis: all previous plans failed because they used drastically weaker augmentation than the original train63 config:
- mosaic=0~0.30 vs train63's 1.0
- lr0=0.0002 vs train63's 0.01 (50x lower)
- cos_lr=True vs train63's linear decay
- close_mosaic=0~40 vs train63's 80
- 80 epochs vs train63's 420
Plan G replicates train63's exact augmentation recipe while fine-tuning smallrocket.pt at deployment resolution (544).
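The gap can be made concrete by diffing the two configs. An illustrative sketch, using the upper ends of the reported ranges and Ultralytics parameter names (cos_lr=False standing in for train63's linear decay):

```python
# Values taken from the root-cause analysis above; ranges collapsed to
# their upper ends for illustration.
train63 = {"mosaic": 1.0, "lr0": 0.01, "cos_lr": False, "close_mosaic": 80, "epochs": 420}
failed  = {"mosaic": 0.30, "lr0": 0.0002, "cos_lr": True, "close_mosaic": 40, "epochs": 80}

# Every key differs: the failed plans diverged from train63 on all five axes.
gap = {k: (failed[k], train63[k]) for k in train63 if failed[k] != train63[k]}
print(gap["lr0"])  # (0.0002, 0.01): a 50x difference
```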
- Plan H: train63 recipe from scratch @544, 420ep (GPU 3)
- Plan I: tiled dataset + train63, 300ep (GPU 0)
- Plan J: custom bbox copy-paste + train63, 420ep (GPU 1)
- Plan G continues on GPU 2
Stopped Plan C/E/F (proven ineffective). Root cause: previous plans used mosaic=0~0.30, lr0=0.0002 vs the baseline's mosaic=1.0, lr0=0.01.
…atus
- Complete experiment timeline from Plan A through Phase 2
- train63 config analysis table (root cause of all failures)
- Phase 2 plans G/H/I/J all running: G=ep14, H=ep5, I=caching, J=ep4
- Plan I tiled dataset: 15832 orig + 34909 tiles = 50741 mixed
- Plan J copy-paste: 3581 small-object crops extracted and injected
- Fixed prepare_tiles.py missing os import
This branch contains training scripts for three parallel experiments to find the best model for 544x960 FP16 TensorRT deployment on Orin:
Plans
Background
P2 model (maxvalue.pt) achieves mAP50-95=0.765 during training but drops to 0.539 in DeepStream deployment, with FP16 quantization confirmed as the root cause during Plan A.
Standard 3-head model (smallrocket.pt) outperforms P2 in deployment (0.631 vs 0.539).
Key Files
- src/training/train_planC_phase1.py / train_planC_phase2.py / run_planC.bash
- src/training/train_planD_finetune.py / run_planD.bash
- src/training/train_planE_640.py / run_planE.bash
- docs/EXPERIMENT_LOG.md - full experiment log with results
Decision Criteria
Winning model must: mAP50-95 > 0.631 on DeepStream, FPS >= 60 on Orin
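The decision gate is simple enough to state as code. A sketch (hypothetical helper; thresholds come from the criteria above):

```python
def passes_criteria(ds_map50_95: float, orin_fps: float) -> bool:
    """Deployment gate: beat the smallrocket DeepStream baseline (0.631)
    and sustain at least 60 FPS on Orin."""
    return ds_map50_95 > 0.631 and orin_fps >= 60

print(passes_criteria(0.641, 62))  # True  -- e.g. Plan D's DS mAP, if FPS holds
print(passes_criteria(0.543, 75))  # False -- below the baseline mAP
```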