Describe the issue
I'm trying to run the multivariate anomaly detection experiment on the Weather dataset, but I hit a CUDA out-of-memory (OOM) error on a 24GB GPU. The model appears to always load the full DeepSeek-Qwen2 architecture, even after I change DEEPSEEK_PATH to a BERT checkpoint, and that exceeds the available memory.
Environment
GPU: RTX 4090D (24GB)
CUDA: 12.4
PyTorch: 2.4.1+cu121
Python: 3.10
OS: Ubuntu 22.04
Steps to reproduce:
Run the multivariate anomaly detection experiment on the Weather dataset on a single 24GB GPU. The run fails while loading Qwen2ForCausalLM with the error below.
Error message (relevant part):
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.38 GiB.
GPU 0 has a total capacity of 23.52 GiB of which 2.83 GiB is free.
... (loading Qwen2ForCausalLM)
What I tried
Changed DEEPSEEK_PATH from deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B to bert-base-uncased in ts_benchmark/baselines/utils.py
Set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Cleared previous results
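For reference, this is roughly how the allocator option can be set from Python instead of the shell. It is a minimal sketch: the helper name `enable_expandable_segments` is my own, not something from this repository, and as far as I understand the variable has to be set before the first CUDA allocation, so I set it before importing torch.

```python
import os


def enable_expandable_segments(env=os.environ):
    """Add expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF, keeping other options.

    Hypothetical helper for illustration; not part of the repository.
    """
    key = "PYTORCH_CUDA_ALLOC_CONF"
    option = "expandable_segments:True"
    current = env.get(key, "")
    # The variable is a comma-separated list of option:value pairs.
    parts = [
        p for p in current.split(",")
        if p and not p.startswith("expandable_segments:")
    ]
    parts.append(option)
    env[key] = ",".join(parts)
    return env[key]


enable_expandable_segments()
# import torch  # import torch only after the variable has been set
```

This only reduces fragmentation; it does not lower the amount of memory the model itself needs, which matches what I saw.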
Observation
The tokenizer downloads successfully from bert-base-uncased, but the model still loads the Qwen2ForCausalLM architecture (the DeepSeek stack) and consumes more than 20 GB of VRAM before the OOM.
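One way to narrow this down would be to search the repository for places where the checkpoint is referenced outside of DEEPSEEK_PATH. The sketch below assumes the model name or a `from_pretrained` call is hard-coded somewhere else, which is only a guess on my part. The function name and the search pattern are my own choices.

```python
import re
from pathlib import Path

# Assumed search terms; adjust them to whatever the code base actually uses.
PATTERN = re.compile(r"deepseek|qwen|from_pretrained", re.IGNORECASE)


def find_model_references(root, pattern=PATTERN):
    """Return (path, line number, line) for every matching line in *.py files under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


# Usage, run from the repository root:
#   for path, lineno, line in find_model_references("."):
#       print(f"{path}:{lineno}: {line}")
```

If the only hit for the checkpoint name is the DEEPSEEK_PATH line in ts_benchmark/baselines/utils.py, the model class itself is probably fixed elsewhere, and changing the path alone would not switch the architecture.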
Questions
Is there a way to run the model using the lightweight BERT architecture instead of DeepSeek?
If DeepSeek is required, could you provide guidance on memory optimization (e.g., reduced batch_size, seq_len, gradient checkpointing, or mixed precision) to fit into 24GB VRAM?
Are there any configuration flags or scripts specifically designed for consumer GPUs (24GB)?
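To make the memory question concrete, here is a back-of-the-envelope estimate. It is a rough sketch under stated assumptions: I take 1.5 billion parameters from the model name (the real count may differ), I ignore activations entirely, and I do not know whether the backbone is fine-tuned or frozen in this code base. The 16 bytes per parameter figure is the usual rule of thumb for fp32 training with Adam (4 for weights, 4 for gradients, 8 for the two optimizer moments).

```python
GIB = 1024 ** 3


def estimate_gib(n_params, bytes_per_param):
    """Memory in GiB for n_params values stored at bytes_per_param each."""
    return n_params * bytes_per_param / GIB


N_PARAMS = 1.5e9  # assumed from the "1.5B" in the model name

weights_fp32 = estimate_gib(N_PARAMS, 4)    # fp32 weights only
weights_bf16 = estimate_gib(N_PARAMS, 2)    # bf16/fp16 weights only
train_fp32_adam = estimate_gib(N_PARAMS, 16)  # weights + grads + Adam moments, fp32

print(f"fp32 weights:            {weights_fp32:.2f} GiB")
print(f"bf16 weights:            {weights_bf16:.2f} GiB")
print(f"fp32 training with Adam: {train_fp32_adam:.2f} GiB")
```

If the backbone is trained in full fp32, parameters and optimizer state alone would be close to the 24GB limit before any activations, which would be consistent with the OOM. If it is frozen, half-precision weights should leave far more headroom, so knowing which case applies would help.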
Additional context
Your paper's Table 3 shows BERT achieves competitive performance, so supporting BERT as a lightweight backbone would greatly benefit users without A100/H800 GPUs.
Thank you for your great work!