[ci] chore: npu ci use cann9.0.0 #6520
Code Review
This pull request updates the Ascend CI documentation to use a newer vLLM image version, configures the GRPO Qwen3-8B test script to calculate log probabilities, and simplifies the output log path for the PPO Qwen3-8B test script. A review comment points out that the log directory for the PPO script might not exist, which could cause the tee command to fail and result in lost CI logs. It is recommended to ensure the directory is created prior to writing the log.
  trainer.max_actor_ckpt_to_keep=1 \
  trainer.max_critic_ckpt_to_keep=1 \
- trainer.total_training_steps=15 2>&1 | tee /root/.cache/nightly_log/qwen3-8b-ppo/ppo_qwen3-8b_fsdp_npu-$(date +%Y%m%d_%H%M).log
\ No newline at end of file
+ trainer.total_training_steps=15 2>&1 | tee /root/.cache/nightly_log/ppo_qwen3_8b/ppo_qwen3-8b_fsdp_npu.log
\ No newline at end of file
If the directory /root/.cache/nightly_log/ppo_qwen3_8b does not exist, the tee command will fail to write the log file, resulting in lost CI logs. To prevent this, ensure the directory is created before writing the log. Additionally, redirect stderr to stdout using 2>&1 before piping to tee to ensure that error messages are captured in the log file.
- trainer.total_training_steps=15 2>&1 | tee /root/.cache/nightly_log/ppo_qwen3_8b/ppo_qwen3-8b_fsdp_npu.log
+ trainer.total_training_steps=15 2>&1 | { mkdir -p /root/.cache/nightly_log/ppo_qwen3_8b && tee /root/.cache/nightly_log/ppo_qwen3_8b/ppo_qwen3-8b_fsdp_npu.log; }
References
- Ensure that the log directory exists before writing to it in shell scripts.
- In shell scripts, redirect stderr to stdout using 2>&1 before piping to tee to ensure that error messages are captured in the log file.
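The two recommendations above can be tried in isolation with a minimal sketch like the one below. It is not the CI script itself: the temporary base directory stands in for /root/.cache/nightly_log, and `fake_training` is a hypothetical placeholder for the real training command.

```shell
set -eu

# Assumption: a throwaway base directory stands in for /root/.cache/nightly_log.
LOG_ROOT="$(mktemp -d)"
LOG_DIR="${LOG_ROOT}/ppo_qwen3_8b"
LOG_FILE="${LOG_DIR}/ppo_qwen3-8b_fsdp_npu.log"

# Create the log directory up front; -p also creates missing parents
# and does nothing if the directory already exists.
mkdir -p "${LOG_DIR}"

# Hypothetical stand-in for the training command: writes to both streams.
fake_training() {
  echo "step 1 done"
  echo "warning: something on stderr" >&2
}

# 2>&1 merges stderr into stdout before the pipe, so tee records both streams.
fake_training 2>&1 | tee "${LOG_FILE}"
```

Running `mkdir -p` as a separate step before the pipeline, rather than inside the `{ ...; }` group from the suggestion, is only a readability choice here; both forms create the directory before `tee` opens the file.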
permissions:
  contents: read
Add the following to the NPU unit test:
import verl.utils.device as device_module
monkeypatch.setattr(device_module, "is_npu_available", False)
What does this PR do?
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- [BREAKING] to the beginning of the title.
- [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.