
Command to reproduce results on MuST-C failing #4

@Chaitanya-git

Running the command provided in the README to reproduce the MuST-C results of the paper "Adapting Transformer to End-to-End Spoken Language Translation" fails with the following error:

| distributed init (rank 1): tcp://localhost:18735
| distributed init (rank 0): tcp://localhost:18735
| distributed init (rank 2): tcp://localhost:18735
| distributed init (rank 3): tcp://localhost:18735
Namespace(adam_betas='(0.9, 0.999)', adam_eps=1e-08, arch='speechconvtransformer_big', attention_dropout=0.1, attn_2d=True, audio_input=True, bucket_cap_mb=150, clip_norm=20.0, criterion='label_smoothed_cross_entropy', data=['bin/'], ddp_backend='no_c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, decoder_out_embed_dim=512, decoder_output_dim=512, device_id=0, distance_penalty='gauss', distributed_backend='nccl', distributed_init_host='localhost', distributed_init_method='tcp://localhost:18735', distributed_port=18736, distributed_rank=0, distributed_world_size=4, dropout=0.1, encoder_attention_heads=8, encoder_convolutions='[(64, 3, 3)] * 2', encoder_embed_dim=512, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=True, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_window=None, freeze_encoder=False, init_variance=1.0, keep_interval_updates=-1, label_smoothing=0.1, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.005], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=100, max_sentences=8, max_sentences_valid=8, max_source_positions=1400, max_target_positions=300, max_tokens=12000, max_update=0, min_loss_scale=0.0001, min_lr=1e-08, momentum=0.99, no_attn_2d=False, no_cache_source=False, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, normalization_constant=1.0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.1, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='models', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=True, skip_invalid_size_inputs_valid_test=True, source_lang=None, target_lang=None, task='translation', train_subset='train', update_freq=[16], upsample_primary=1, valid_subset='valid', validate_interval=1, 
warmup_init_lr=0.0003, warmup_updates=4000, weight_decay=0.0)
| [h5] dictionary: 4 types
| [de] dictionary: 192 types
| bin/ train 229703 examples
| bin/ valid 1423 examples
Exception ignored in: <function IndexedDataset.__del__ at 0x7f0de0de5790>
Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/indexed_dataset.py", line 85, in __del__
Traceback (most recent call last):
  File "../../train.py", line 365, in <module>
Exception ignored in: <function IndexedDataset.__del__ at 0x7f9f0b8f3790>
Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/indexed_dataset.py", line 85, in __del__
    def __del__(self):
KeyboardInterrupt: 
    multiprocessing_main(args)
    def __del__(self):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 42, in main
KeyboardInterrupt: 
    p.join()
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/amit/.pyenv/versions/3.8.2/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 84, in signal_handler
    raise Exception(msg)
Exception: 

-- Tracebacks above this line can probably be ignored --

Traceback (most recent call last):
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/multiprocessing_train.py", line 48, in run
    single_process_main(args)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/train.py", line 53, in main
    dummy_batch = task.dataset('train').get_dummy_batch(args.max_tokens, max_positions)
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/language_pair_dataset.py", line 221, in get_dummy_batch
    return self.collater([
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/language_pair_dataset.py", line 224, in <listcomp>
    'source': self.src_dict.dummy_sentence(src_len) if self.src_dict is not None else None,
  File "/home/amit/amit/pruning/FBK-Fairseq-ST/fairseq/data/dictionary.py", line 302, in dummy_sentence
    t = torch.Tensor(length).new_empty((length, self.audio_features)).uniform_(self.nspecial + 1, len(self))
RuntimeError: Expected a_in <= b_in to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)

I've tried running the command with both Python 3.5 and Python 3.8 and get the same error each time.
I believe the error is caused by the bounds passed to Tensor.uniform_ in dummy_sentence: the lower bound (self.nspecial + 1) ends up greater than the upper bound (len(self)), which uniform_ rejects.
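To make the failure concrete, here is the bound arithmetic as a small sketch. It assumes the standard fairseq Dictionary, whose four special symbols (bos, pad, eos, unk) give nspecial = 4; the log reports "[h5] dictionary: 4 types", so the source dictionary would contain only those special symbols. The helper names below are hypothetical and only mirror the from <= to check that uniform_ performs.

```python
# Hypothetical helpers mirroring the bounds used by dummy_sentence
# (fairseq/data/dictionary.py, line 302).
# Assumption: nspecial == 4 (bos, pad, eos, unk), as in the standard fairseq Dictionary.


def dummy_sentence_bounds(nspecial, dict_len):
    """Return the (from, to) pair the original line passes to uniform_."""
    return nspecial + 1, dict_len


def bounds_are_valid(low, high):
    # Tensor.uniform_(from, to) rejects from > to; that is the
    # "Expected a_in <= b_in" RuntimeError in the traceback.
    return low <= high


# "[h5] dictionary: 4 types" in the log: only the special symbols are present.
low, high = dummy_sentence_bounds(nspecial=4, dict_len=4)
print(low, high, bounds_are_valid(low, high))  # prints: 5 4 False
```

Under these assumptions the original call is effectively uniform_(5, 4), which explains why it fails regardless of the Python version.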

I tried to fix the error myself by changing self.nspecial + 1 to self.nspecial in the following line of fairseq/data/dictionary.py (line 302, in dummy_sentence):

t = torch.Tensor(length).new_empty((length, self.audio_features)).uniform_(self.nspecial + 1, len(self))
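A torch-free sketch of what the proposed change does, assuming the same 4-symbol dictionary as above. With self.nspecial as the lower bound the call becomes uniform_(4, 4), which passes the from <= to check but fills every feature with 4.0. That is probably harmless if the dummy batch is only used for tensor shapes, but this is an assumption. The function dummy_audio_sentence and the audio_features=40 value are hypothetical and illustrative; the sketch clamps the lower bound instead of always dropping the + 1.

```python
import random

# Hypothetical, torch-free sketch of a guarded dummy_sentence for audio input.
# Assumptions: length and audio_features are positive ints, and the values only
# need to be finite floats because the batch is a dummy.


def dummy_audio_sentence(length, audio_features, nspecial, dict_len, seed=None):
    rng = random.Random(seed)
    # The original lower bound is nspecial + 1; clamp it so it never exceeds
    # the upper bound when the dictionary holds only special symbols.
    low = min(nspecial + 1, dict_len)
    high = dict_len
    return [[rng.uniform(low, high) for _ in range(audio_features)]
            for _ in range(length)]


frames = dummy_audio_sentence(length=3, audio_features=40, nspecial=4, dict_len=4, seed=0)
print(len(frames), len(frames[0]), frames[0][0])  # prints: 3 40 4.0
```

In dictionary.py the equivalent guard would pass min(self.nspecial + 1, len(self)) as the lower bound of uniform_. For this 4-type dictionary it reduces to the same uniform_(4, 4) call as the proposed change, while larger dictionaries keep the original nspecial + 1 lower bound.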

Is this a valid fix?

Thanks in advance,
Chaitanya
