Checklist
Motivation
Hi maintainers,
I’m trying to reproduce the official DFlash training pipeline, following the instructions from issue #465. During reproduction, I’m observing that DFlash decoding efficiency keeps improving as training steps increase, but the marginal gains become quite small in the later stage. I’d like to understand what the expected “ceiling” (if any) is when scaling training steps/epochs.
Observation:
In my runs, improvements are clearly diminishing, but even after ~12 epochs the metrics are still moving upward.
Measured results:
| Checkpoint | Speedup | τ |
| --- | --- | --- |
| Official baseline | 4.72× | 5.97 |
| Step 60,000 | 2.58× | 3.18 |
| Step 80,000 | 2.64× | 3.26 |
| Step 100,000 | 2.71× | 3.34 |
| Step 142,000 | 2.86× | 3.55 |
(Training curves attached: train/accuracy and train/loss. They look stable overall, and the improvement trend flattens later.)
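As a rough sanity check on these numbers: in speculative decoding, end-to-end speedup is often approximately proportional to the mean accepted length τ, with the constant of proportionality capturing draft/verification overhead. A minimal sketch (plain Python, using only the measurements above) checks whether my checkpoints and the baseline share a roughly constant speedup/τ ratio:

```python
# Measured (name, speedup, tau) triples from the results above.
checkpoints = [
    ("baseline", 4.72, 5.97),
    ("step 60k", 2.58, 3.18),
    ("step 80k", 2.64, 3.26),
    ("step 100k", 2.71, 3.34),
    ("step 142k", 2.86, 3.55),
]

# If overhead per verification step is constant, speedup/tau should be
# roughly flat across checkpoints.
for name, speedup, tau in checkpoints:
    print(f"{name:>9}: speedup/tau = {speedup / tau:.3f}")
```

In my runs this ratio stays near 0.80 for every checkpoint (and ~0.79 for the baseline), which I read as the remaining gap to the official numbers being almost entirely in τ (draft quality/acceptance), not in per-step overhead.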

Additionally, how strongly does this task depend on the dataset? Is multi-dataset training necessary, or are there other training techniques that help close the gap? I'd welcome any discussion.
Related resources
Here's my training shell template:

```shell
set -e
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")

export TORCHINDUCTOR_CACHE_DIR=$ROOT_DIR/cache/compiled_kernels
export SPECFORGE_DATA_NUM_PROC=32
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

NUM_GPUS=${1:-8}
ATTENTION_BACKEND=${2:-flex_attention}

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_dflash.py \
    --target-model-path /models/Qwen3-8B \
    --draft-config-path $ROOT_DIR/configs/qwen3-8b-dflash.json \
    --train-data-path $ROOT_DIR/cache/dataset/ultrachat_train_regen_nonthinking.jsonl \
    --output-dir $ROOT_DIR/outputs/qwen3-8b-dflash-perfectblend-v2 \
    --num-epochs 20 \
    --batch-size 2 \
    --accumulation-steps 2 \
    --learning-rate 6e-4 \
    --warmup-ratio 0.04 \
    --max-grad-norm 1.0 \
    --max-length 3072 \
    --chat-template qwen \
    --attention-backend $ATTENTION_BACKEND \
    --random-anchor \
    --num-anchors 512 \
    --loss-decay-gamma 7.0 \
    --log-interval 50 \
    --save-interval 1000 \
    --cache-dir $ROOT_DIR/cache \
    --report-to tensorboard
```
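For anyone comparing configurations: the effective global batch implied by the flags above is per-GPU batch × accumulation steps × GPU count. A quick sketch of the arithmetic, under two assumptions I haven't verified against the trainer (that `--batch-size` is per GPU, and that the step counts I reported are optimizer steps):

```python
batch_size = 2          # --batch-size (assumed per GPU)
accumulation_steps = 2  # --accumulation-steps
num_gpus = 8            # default NUM_GPUS

# Sequences consumed per optimizer step.
global_batch = batch_size * accumulation_steps * num_gpus
print(f"global batch: {global_batch} sequences/step")

# If 142,000 steps correspond to ~12 epochs, the dataset size
# implied by this config would be on the order of:
steps, epochs = 142_000, 12
approx_dataset = steps // epochs * global_batch
print(f"approx dataset size: ~{approx_dataset:,} sequences")
```

If either assumption is wrong (e.g. the step counter is per micro-batch), the implied dataset size scales accordingly, so please correct me.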