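With the same config, the actor and ref take up an enormous amount of GPU memory. I used the config below to reproduce the tir example from examples (both actor and ref have fsdp.offload_params enabled). With Qwen3-4B it runs fine on a 3090. After switching the model to Qwen3.5-4B, however, the ref and the actor each peak at close to 50GB of GPU memory, which causes an OOM on an H100 (during the actor update). I can't tell what is going wrong. Watching GPU usage, I see that the ref's memory stays at its peak the whole time, as if the offload is never executed? The actor's memory, by contrast, rises and falls dynamically. Here is the config I used: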
experiment_name: gsm8k-grpo
trial_name: tir-math-reasoning
cluster:
  fileroot: tmp/areal/experiments
  n_nodes: 1
  n_gpus_per_node: 2
  name_resolve:
    type: nfs
    nfs_record_root: tmp/areal/name_resolve
seed: 1
enable_offload: false
total_train_epochs: 3
tokenizer_path: ${actor.path}
scheduler:
  type: local
rollout:
  backend: "sglang:d1p1t1"
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  max_concurrent_rollouts: 16
  queue_size: null
  consumer_batch_size: ${train_dataset.batch_size}
  max_head_offpolicyness: 2
  enable_rollout_tracing: false
  scheduling_spec: ${actor.scheduling_spec}
  fileroot: ${cluster.fileroot}
  tokenizer_path: ${tokenizer_path}
  dump_to_file: true
gconfig:
  n_samples: 4
  min_new_tokens: 0
  max_new_tokens: 512
  greedy: false
  temperature: 1.0
actor:
  backend: "fsdp:d1p1t1"
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ckpts/Qwen3.5-4B
  init_from_scratch: false
  disable_dropout: true
  gradient_checkpointing: true
  dtype: bfloat16
  fsdp:
    offload_params: true
    per_layer_optim_step: true
  mb_spec:
    max_tokens_per_mb: 4096
  optimizer:
    type: adam
    lr: 1.70e-5
    weight_decay: 0.017
    beta1: 0.9
    beta2: 0.999
    eps: 1e-8
    lr_scheduler_type: constant
    gradient_clipping: 1.0
    warmup_steps_proportion: 0.001
  eps_clip: 0.4
  temperature: ${gconfig.temperature}
  reward_scaling: 10.0
  reward_bias: -0.5
  kl_ctl: 0.001
  ppo_n_minibatches: 1
  recompute_logprob: true
  use_decoupled_loss: true
  rejection_sampling:
    metric: ratio
    upper: 5.0
  reward_norm:
    mean_level: group
    std_level: group
    group_size: ${gconfig.n_samples}
  adv_norm:
    mean_level: batch
    std_level: batch
  max_new_tokens: ${gconfig.max_new_tokens}
  scheduling_spec:
    - task_type: worker
      port_count: 2
      gpu: 1
      mem: 32
      cmd: python3 -m areal.infra.rpc.rpc_server
      env_vars: {}
ref:
  backend: ${actor.backend}
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  path: ${actor.path}
  init_from_scratch: false
  disable_dropout: true
  dtype: ${actor.dtype}
  mb_spec:
    max_tokens_per_mb: 5120
  optimizer: null
  scheduling_strategy:
    type: colocation
    target: actor
  scheduling_spec: ${actor.scheduling_spec}
  fsdp:
    offload_params: true
sglang:
  model_path: ${actor.path}
  random_seed: ${seed}
  skip_tokenizer_init: false
  dtype: ${actor.dtype}
  allow_auto_truncate: true
  max_running_requests: null
  context_length: 16382
  mem_fraction_static: 0.8
  attention_backend: flashinfer
vllm:
  model: ${actor.path}
  seed: ${seed}
  skip_tokenizer_init: false
  dtype: ${actor.dtype}
  max_model_len: 16382
  gpu_memory_utilization: 0.9
tir:
  max_turns: 2
  max_length: 3000
  tool_timeout: 30
  enable_tools: python;calculator
train_dataset:
  batch_size: 4
  shuffle: true
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
  max_length: 500
valid_dataset:
  batch_size: 2
  pin_memory: true
  num_workers: 4
  path: openai/gsm8k
  type: rl
saver:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null
recover:
  mode: disabled
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: null
  freq_steps: null
  freq_secs: null
evaluator:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  freq_epochs: 1
  freq_steps: null
  freq_secs: null
stats_logger:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  wandb:
    mode: online
perf_tracer:
  experiment_name: ${experiment_name}
  trial_name: ${trial_name}
  fileroot: ${cluster.fileroot}
  enabled: false
  session_tracer:
    enabled: false
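
As a rough sanity check on the ~50GB figure, here is a back-of-the-envelope sketch in Python of how much GPU memory the model weights alone should need. It assumes the model has roughly 4e9 parameters (a guess from the "4B" in the name, not a measured count) and that weights are held in bfloat16 at 2 bytes each, as `dtype: bfloat16` in the config suggests. `estimate_gib` is a hypothetical helper written for this post, not part of AReaL, and the estimate ignores activations, the KV cache, and allocator overhead.

```python
GIB = 2**30  # bytes per GiB


def estimate_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory in GiB needed to hold n_params values of the given width."""
    return n_params * bytes_per_param / GIB


# Assumption: "4B" means roughly 4e9 parameters; the real count may differ.
N_PARAMS = 4e9
BF16_BYTES = 2  # bfloat16, matching `dtype: bfloat16` in the config

weights_gib = estimate_gib(N_PARAMS, BF16_BYTES)
print(f"bf16 weights only: {weights_gib:.2f} GiB")
```

Under these assumptions the weights come to about 7.45 GiB. Since the ref has `optimizer: null`, it holds no optimizer state, so a footprint that sits near 50GB looks much larger than the weights alone would explain.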