Error while loading Mixtral-8x7B-Instruct-v0.1 tokenizer with AutoTokenizer.from_pretrained #134

@will-johnstunning

It looks like the latest transformers version supported here is 4.36. However, as shown in this issue in transformers, a newer version of transformers is needed to load the model correctly.

    estimator = HuggingFace(
        entry_point='sagemaker_train.py',
        source_dir=str(Path()),
        instance_type=instance_configs[strategy]["instance_type"],
        instance_count=instance_configs[strategy]["instance_count"],
        role=role,
        transformers_version='4.36.0',
        pytorch_version='2.1.0',
        py_version='py310',
        hyperparameters=hyperparameters,
        debugger_hook_config=False,
        disable_profiler=True,
        max_run=24 * 3600,  # 24 hours max
        keep_alive_period_in_seconds=1800,  
        environment={
            'HF_TOKEN': huggingface_token,
        }
    )

    estimator.fit(
        inputs={
            'train': f"s3://{s3_config.bucket}/{s3_config.data_prefix}/train.csv",
            'validation': f"s3://{s3_config.bucket}/{s3_config.data_prefix}/val.csv",
            'test': f"s3://{s3_config.bucket}/{s3_config.data_prefix}/test.csv"
        },
        wait=False
    )

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

    Traceback (most recent call last):
      File "", line 1, in
      File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
        return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
        return cls._from_pretrained(
      File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/ruser/py310/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama_fast.py", line 133, in __init__
        super().__init__(
      File "/home/ruser/py310/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3
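
To help confirm which library versions are actually installed inside the container, a small check like the one below could be added at the top of the training script. This is only a sketch: the helper names (`version_tuple`, `installed_version`, `meets_minimum`) are made up for illustration, and `ASSUMED_MINIMUM` is a placeholder, since the exact version that resolves the error is not established in this issue.

```python
import re
from importlib import metadata

# Placeholder threshold (an assumption, not a confirmed fix version).
ASSUMED_MINIMUM = "4.37.0"


def version_tuple(version):
    """Convert '4.36.0' or '4.41.0.dev0' into a 3-item tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    # Pad so that '4.36' and '4.36.0' compare as equal.
    return tuple((parts + [0, 0, 0])[:3])


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    if installed is None:
        return False
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for package in ("transformers", "tokenizers"):
        found = installed_version(package)
        print(f"{package}: {found}")
    ok = meets_minimum(installed_version("transformers"), ASSUMED_MINIMUM)
    print(f"transformers >= {ASSUMED_MINIMUM} (assumed threshold): {ok}")
```

Printing both `transformers` and `tokenizers` is deliberate, because the exception is raised while the fast tokenizer file is being parsed in `TokenizerFast.from_file`, so the `tokenizers` version may matter as much as the `transformers` one.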
