Checklist
Motivation
Hi maintainers,
I’m trying to reproduce the official DFlash training pipeline, following the instructions from issue #465. During reproduction, I’m observing that DFlash decoding efficiency keeps improving as training steps increase, but the marginal gains become quite small in the later stage. I’d like to understand what the expected “ceiling” (if any) is when scaling training steps/epochs.
Observation:
In my runs, improvements are clearly diminishing, but even after ~12 epochs the metrics are still moving upward.
Measured results:
| Checkpoint | Speedup | τ |
| --- | --- | --- |
| Official baseline | 4.72× | 5.97 |
| Step 60,000 | 2.58× | 3.18 |
| Step 80,000 | 2.64× | 3.26 |
| Step 100,000 | 2.71× | 3.34 |
| Step 142,000 | 2.86× | 3.55 |
(Training curves attached: train/accuracy and train/loss. They look stable overall, and the improvement trend flattens later.)
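As a rough sanity check on these numbers: in speculative decoding, end-to-end speedup is often approximately proportional to the mean accepted length τ, with the constant of proportionality capturing draft/verification overhead. A minimal sketch (plain Python, using only the measurements above) checks whether my checkpoints and the baseline share a roughly constant speedup/τ ratio:

```python
# Measured (name, speedup, tau) triples from the results above.
checkpoints = [
    ("baseline", 4.72, 5.97),
    ("step 60k", 2.58, 3.18),
    ("step 80k", 2.64, 3.26),
    ("step 100k", 2.71, 3.34),
    ("step 142k", 2.86, 3.55),
]

# If overhead per verification step is constant, speedup/tau should be
# roughly flat across checkpoints.
for name, speedup, tau in checkpoints:
    print(f"{name:>9}: speedup/tau = {speedup / tau:.3f}")
```

In my runs this ratio stays near 0.80 for every checkpoint (and ~0.79 for the baseline), which I read as the remaining gap to the official numbers being almost entirely in τ (draft quality/acceptance), not in per-step overhead.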

Additionally, how strongly does this task depend on the dataset? Is multi-dataset training necessary, or are there other training techniques that help close the gap? I'd welcome any discussion.
Related resources
Here's my training shell template:

```shell
set -e
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")

export TORCHINDUCTOR_CACHE_DIR=$ROOT_DIR/cache/compiled_kernels
export SPECFORGE_DATA_NUM_PROC=32
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

NUM_GPUS=${1:-8}
ATTENTION_BACKEND=${2:-flex_attention}

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_dflash.py \
    --target-model-path /models/Qwen3-8B \
    --draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json \
    --train-data-path $ROOT_DIR/cache/dataset/ultrachat_train_regen_nonthinking.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-perfectblend-v2 \
    --num-epochs 20 \
    --batch-size 2 \
    --accumulation-steps 2 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen \
    --attention-backend $ATTENTION_BACKEND \
    --random-anchor \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 1000 \
    --cache-dir $ROOT_DIR/cache \
    --report-to tensorboard
```
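For anyone comparing configurations: the effective global batch implied by the flags above is per-GPU batch × accumulation steps × GPU count. A quick sketch of the arithmetic, under two assumptions I haven't verified against the trainer (that `--batch-size` is per GPU, and that the step counts I reported are optimizer steps):

```python
batch_size = 2          # --batch-size (assumed per GPU)
accumulation_steps = 2  # --accumulation-steps
num_gpus = 8            # default NUM_GPUS

# Sequences consumed per optimizer step.
global_batch = batch_size * accumulation_steps * num_gpus
print(f"global batch: {global_batch} sequences/step")

# If 142,000 steps correspond to ~12 epochs, the dataset size
# implied by this config would be on the order of:
steps, epochs = 142_000, 12
approx_dataset = steps // epochs * global_batch
print(f"approx dataset size: ~{approx_dataset:,} sequences")
```

If either assumption is wrong (e.g. the step counter is per micro-batch), the implied dataset size scales accordingly, so please correct me.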