Hi, I am trying to train on a custom dataset. After writing my config and dataset file, I try to train the baseline model with:
python3 train.py --config_file configs/SportMOT/vit_base.yml MODEL.DEVICE_ID "('0')"
I get:
===========building transformer===========
using soft triplet loss for training
2025-08-12 20:14:50,259 transreid.train INFO: start training
/space/users//TransReID/processor/processor.py:41: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
scaler = amp.GradScaler()
/space/users//TransReID/processor/processor.py:57: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with amp.autocast(enabled=True):
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [16,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [17,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [18,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [19,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "/space/users//TransReID/train.py", line 76, in <module>
do_train(
File "/space/users//TransReID/processor/processor.py", line 59, in do_train
loss = loss_fn(score, feat, target, target_cam)
File "/space/users//TransReID/loss/make_loss.py", line 69, in loss_func
TRI_LOSS = triplet(feat, target)[0]
File "/space/users//TransReID/loss/triplet_loss.py", line 124, in __call__
dist_mat = euclidean_dist(global_feat, global_feat)
File "/space/users//TransReID/loss/triplet_loss.py", line 25, in euclidean_dist
xx = torch.pow(x, 2).sum(1, keepdim=True).expand(m, n)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Any input is appreciated. Thanks!
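
The assertion `t >= 0 && t < n_classes` is raised by the NLL (cross-entropy) loss kernel, so it suggests that some identity labels fall outside `[0, num_classes)`. The traceback points at the triplet loss only because CUDA errors are reported asynchronously. Below is a minimal pure-Python sketch of how the labels could be checked and remapped. The helper names `find_out_of_range` and `relabel_contiguous` and the sample IDs are hypothetical and not part of TransReID; it assumes the dataset file yields raw integer person IDs.

```python
def find_out_of_range(labels, num_classes):
    """Return the sorted, de-duplicated labels outside [0, num_classes)."""
    return sorted({t for t in labels if not 0 <= t < num_classes})


def relabel_contiguous(pids):
    """Map raw person IDs to contiguous labels 0..N-1 (sorted by raw ID)."""
    pid2label = {pid: label for label, pid in enumerate(sorted(set(pids)))}
    return [pid2label[pid] for pid in pids], pid2label


# Hypothetical raw IDs as they might come out of a custom dataset file.
raw_pids = [3, 7, 7, 12, 3]
num_classes = len(set(raw_pids))  # 3 distinct identities

# Raw IDs used directly as targets would trip the assertion.
bad = find_out_of_range(raw_pids, num_classes)

# After remapping, every label lies in [0, num_classes).
labels, pid2label = relabel_contiguous(raw_pids)
still_bad = find_out_of_range(labels, num_classes)

print("out of range before:", bad)
print("labels after:", labels)
print("out of range after:", still_bad)
```

If a check like this reports out-of-range values for the real training set, the likely candidates are the ID relabeling in the custom dataset file or a class count that does not match the number of training identities. Rerunning with `CUDA_LAUNCH_BLOCKING=1`, as the error message suggests, should make the traceback point at the call that actually failed.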