When trying to run auto caption, the script fails with:

```
Windows detected, using asyncio.WindowsSelectorEventLoopPolicy
starting
input_dir: input
Downloading model to .cache/model_base_caption_capfilt_large.pth... please wait
Model cached to: .cache/model_base_caption_capfilt_large.pth
Downloading (…)solve/main/vocab.txt: 100%|██████████████████████████████| 232k/232k [00:00<00:00, 6.17MB/s]
Downloading (…)okenizer_config.json: 100%|██████████████████████████████| 28.0/28.0 [00:00<00:00, 14.0kB/s]
Downloading (…)lve/main/config.json: 100%|█████████████████████████████████| 570/570 [00:00<00:00, 228kB/s]
load checkpoint from .cache/model_base_caption_capfilt_large.pth
loading model to cuda
working image: input\00012-1722407061-gigapixel-standard-height-1024px.jpg
Traceback (most recent call last):
  File ".\scripts\auto_caption.py", line 217, in <module>
    asyncio.run(main(opt))
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\asyncio\runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\asyncio\base_events.py", line 608, in run_until_complete
    return future.result()
  File ".\scripts\auto_caption.py", line 157, in main
    captions = blip_decoder.generate(image, sample=sample, num_beams=16, min_length=opt.min_length, \
  File "scripts/BLIP\models\blip.py", line 156, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\transformers\generation\utils.py", line 1524, in generate
    return self.beam_search(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\transformers\generation\utils.py", line 2810, in beam_search
    outputs = self(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 886, in forward
    outputs = self.bert(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 781, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 445, in forward
    layer_outputs = layer_module(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 361, in forward
    cross_attention_outputs = self.crossattention(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 277, in forward
    self_outputs = self.self(
  File "C:\Users\ssuuk\anaconda3\envs\dl\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "scripts/BLIP\models\med.py", line 178, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (16) must match the size of tensor b (256) at non-singleton dimension 0
(dl) PS D:\Projekty\EveryDream>
```
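A plausible reading of the size mismatch (an assumption, not confirmed from the log): the failing `matmul` is the cross-attention between decoder queries and image embeddings, and 16 × 16 = 256 suggests a batch of 16 images whose decoder states were expanded by `num_beams=16` while the image embeddings were left un-tiled. BLIP's `blip.py` repeat-interleaves `image_embeds` by `num_beams` only on its non-sampling path, so calling `generate` with `sample=True` together with `num_beams=16` (or running a `transformers` version that routes the call differently than BLIP expects) can leave the encoder states with batch 16 against 256 beam-expanded queries. A minimal sketch of the shape problem and the tiling fix, using toy sequence/hidden sizes rather than BLIP's actual dimensions:

```python
import torch

# Toy sizes chosen so that batch * beams = 256, matching the 16-vs-256 error.
batch, beams, seq, dim = 16, 16, 4, 8
image_embeds = torch.randn(batch, seq, dim)    # encoder states: batch dim 16
query = torch.randn(batch * beams, seq, dim)   # beam-expanded decoder states: batch dim 256

# This reproduces the failure mode: batch dims 256 vs 16 cannot broadcast.
try:
    torch.matmul(query, image_embeds.transpose(-1, -2))
except RuntimeError as e:
    print(e)  # size mismatch at non-singleton dimension 0

# Tiling the encoder states once per beam makes the batch dims agree,
# which is what BLIP does via image_embeds.repeat_interleave(num_beams, dim=0).
tiled = image_embeds.repeat_interleave(beams, dim=0)
scores = torch.matmul(query, tiled.transpose(-1, -2))
print(scores.shape)  # torch.Size([256, 4, 4])
```

If this diagnosis holds, likely workarounds are dropping `num_beams` to 1 when sampling, running the script's non-sampling (pure beam search) mode, or pinning `transformers` to the version BLIP was developed against.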