Common issues and solutions for Qwen3-Coder-480B-A35B-Instruct installation and usage.
Error: CUDA out of memory
```text
RuntimeError: CUDA out of memory. Tried to allocate X GB (GPU 0; Y GB total capacity)
```
Solutions:
- Check GPU memory usage:

  ```bash
  nvidia-smi
  ```

- Kill existing processes:

  ```bash
  sudo pkill -f python
  ```

- Reset the GPU (only works when no processes are using it):

  ```bash
  sudo nvidia-smi --gpu-reset
  ```

- Reduce batch size or use model sharding
- Enable CPU offloading:

  ```python
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      device_map="auto",
      offload_folder="./offload",
      offload_state_dict=True,
  )
  ```
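Before loading, it can help to confirm how much GPU memory is actually free; a minimal sketch using `torch.cuda.mem_get_info` (available in recent PyTorch releases):

```python
import torch

# Print free/total memory for each visible GPU before attempting a load.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # values in bytes
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```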
Error: numpy.dtype size changed, may indicate binary incompatibility
Solutions:
- Clean installation:

  ```bash
  rm -rf ~/qwen480b_env
  pip cache purge
  ./install.sh
  ```

- Manual NumPy fix:

  ```bash
  source ~/qwen480b_env/bin/activate
  pip uninstall numpy -y
  pip install numpy==1.24.4
  ```
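After reinstalling, a quick sanity check that the pinned NumPy imports cleanly alongside the rest of the stack (the 1.24.4 pin comes from the fix above):

```python
# Importing transformers exercises the compiled extensions that raise
# the binary-incompatibility error when NumPy versions clash.
import numpy
import transformers

print(numpy.__version__)        # expected: 1.24.4
print(transformers.__version__)
```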
Error: NVCC not found or CUDA version mismatch
Solutions:
- Install CUDA 12.1:

  ```bash
  wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
  sudo dpkg -i cuda-keyring_1.0-1_all.deb
  sudo apt update
  sudo apt install cuda-toolkit-12-1
  ```

- Update PATH:

  ```bash
  echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
  echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
  source ~/.bashrc
  ```

- Verify installation:

  ```bash
  nvcc --version
  nvidia-smi
  ```
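It is also worth confirming that the installed PyTorch build matches the toolkit; a minimal check:

```python
import torch

print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # True only if the driver and GPU are usable
```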
Error: Connection timeout or HTTP 403/404 errors
Solutions:
- Check internet connection:

  ```bash
  ping huggingface.co
  ```

- Use a Hugging Face token (if the model requires authentication):

  ```bash
  export HF_TOKEN="your_token_here"
  ```

- Manual download with resume:

  ```bash
  git clone https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct
  ```

- Use mirror servers:

  ```bash
  export HF_ENDPOINT="https://hf-mirror.com"
  ```
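Alternatively, a minimal download sketch via the `huggingface_hub` Python client, which resumes interrupted downloads and honors both `HF_TOKEN` and `HF_ENDPOINT` (the target directory is the one used throughout this guide):

```python
import os
from huggingface_hub import snapshot_download

# Downloads all repo files, resuming any partial files from a previous attempt.
snapshot_download(
    repo_id="Qwen/Qwen3-Coder-480B-A35B-Instruct",
    local_dir=os.path.expanduser("~/qwen480b_env/models/qwen3-coder-480b"),
)
```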
Error: Can't load model or Config file not found
Solutions:
- Verify model directory:

  ```bash
  ls -la ~/qwen480b_env/models/qwen3-coder-480b/
  ```

- Check file permissions:

  ```bash
  chmod -R 755 ~/qwen480b_env/models/
  ```

- Re-download corrupted files:

  ```bash
  cd ~/qwen480b_env/models/qwen3-coder-480b/
  git lfs pull
  ```
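A cheap way to confirm the directory is loadable before committing to a full weight load is to parse just the config:

```python
import os
from transformers import AutoConfig

# Reads config.json only; fails fast if the download is incomplete or corrupted.
model_path = os.path.expanduser("~/qwen480b_env/models/qwen3-coder-480b")
config = AutoConfig.from_pretrained(model_path)
print(config.model_type)
```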
Issue: Model takes too long to generate responses
Solutions:
- Enable optimizations:

  ```python
  # Use torch.compile (PyTorch 2.0+)
  model = torch.compile(model, mode="reduce-overhead")

  # Use lower-precision data types
  model = model.half()  # FP16
  ```

- Adjust generation parameters:

  ```python
  outputs = model.generate(
      **inputs,
      max_new_tokens=100,  # Reduce if too slow
      do_sample=False,     # Use greedy decoding
      num_beams=1,         # Disable beam search
  )
  ```

- Use vLLM (if installed):

  ```python
  from vllm import LLM, SamplingParams

  llm = LLM(model="path/to/model")
  outputs = llm.generate(prompts, SamplingParams(temperature=0.7))
  ```
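A model of this size will not fit on a single GPU, so vLLM's tensor parallelism is usually required; a minimal sketch (the GPU count of 8 is an assumption, set it to what your node actually has):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the weights across GPUs on one node.
# 8 is an assumed GPU count; match it to the GPUs actually available.
llm = LLM(model="path/to/model", tensor_parallel_size=8)
outputs = llm.generate(
    ["Write a quicksort in Python."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
```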
Issue: Memory usage keeps increasing
Solutions:
- Clear GPU cache regularly:

  ```python
  import torch
  torch.cuda.empty_cache()
  ```

- Use context managers:

  ```python
  with torch.no_grad():
      outputs = model.generate(**inputs)
  ```

- Delete unused variables:

  ```python
  del outputs, inputs
  import gc
  gc.collect()
  ```
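These steps can be bundled into a small helper so cleanup is never skipped; a sketch (the helper name is illustrative, not part of this project):

```python
import gc
import torch

def generate_and_free(model, tokenizer, prompt, **gen_kwargs):
    """Generate a completion, then release the tensors it allocated."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    del inputs, outputs
    gc.collect()
    torch.cuda.empty_cache()
    return text
```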
Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)

# For transformers
import transformers
transformers.logging.set_verbosity_debug()
```

Monitor system resources:

```bash
# GPU usage
nvidia-smi -l 1

# Memory usage
htop

# Disk usage
df -h

# Process monitoring
ps aux | grep python
```

Inspect the model:

```python
# Check model config
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path)
print(config)

# Check model size
model_size = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {model_size:,}")
```

Speed and memory optimizations:
- Use FP16 precision:

  ```python
  model = model.half()
  ```

- Enable torch.compile:

  ```python
  model = torch.compile(model)
  ```

- Optimize the tokenizer:

  ```python
  tokenizer.pad_token = tokenizer.eos_token
  tokenizer.padding_side = "left"  # decoder-only models need left padding for batched generation
  ```

- Use model sharding:

  ```python
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      device_map="auto",
      torch_dtype=torch.float16,
      low_cpu_mem_usage=True,
  )
  ```

- Enable gradient checkpointing (training/fine-tuning only; it trades compute for lower memory):

  ```python
  model.gradient_checkpointing_enable()
  ```

- Use CPU offloading:

  ```python
  model = AutoModelForCausalLM.from_pretrained(
      model_path,
      device_map="auto",
      offload_folder="./offload",
  )
  ```
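To verify what these options actually save, `transformers` models expose a footprint helper:

```python
# Reports the in-memory size of the loaded weights and buffers.
footprint_gb = model.get_memory_footprint() / 1e9
print(f"Model memory footprint: {footprint_gb:.1f} GB")
```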
Before reporting issues, collect this information:
```bash
# System info
./scripts/system_check.sh > debug_info.txt

# Python environment
pip list >> debug_info.txt

# GPU info
nvidia-smi >> debug_info.txt

# Error logs
tail -100 ~/qwen480b_install.log >> debug_info.txt
```

- Check existing issues: GitHub Issues
- Create a new issue with:
  - Clear description of the problem
  - Steps to reproduce
  - Error messages and logs
  - System information from debug_info.txt
- Discussions: GitHub Discussions
- Documentation: Project Wiki
Profile memory usage:

```python
import torch.profiler

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CUDA],
    profile_memory=True,  # required for the memory columns in the table below
    record_shapes=True,
) as prof:
    outputs = model.generate(**inputs)

print(prof.key_averages().table(sort_by="cuda_memory_usage"))
```

Benchmark generation speed:

```python
import time
import torch

# Warm up
for _ in range(3):
    _ = model.generate(**inputs, max_new_tokens=10)

# Benchmark
times = []
for _ in range(10):
    start = time.time()
    outputs = model.generate(**inputs, max_new_tokens=50)
    times.append(time.time() - start)

print(f"Average time: {sum(times)/len(times):.2f}s")
```

Test network connectivity:
```bash
# Test Hugging Face connectivity
curl -v https://huggingface.co

# Test download speed
wget --progress=bar --timeout=30 https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/resolve/main/README.md

# Check DNS resolution
nslookup huggingface.co
```
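From Python, a minimal reachability check via `huggingface_hub` (this also respects `HF_ENDPOINT` if you switched to a mirror):

```python
from huggingface_hub import HfApi

# Fetches repo metadata only; a lightweight way to confirm the Hub is
# reachable and the repo is visible with the current token/endpoint settings.
info = HfApi().model_info("Qwen/Qwen3-Coder-480B-A35B-Instruct")
print(info.sha)
```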