Feature/add translategemma #44

Merged
tanhaow merged 13 commits into develop from feature/add-translategemma
Mar 6, 2026

Conversation

@tanhaow tanhaow commented Mar 3, 2026

Associated Issue(s): resolves #43

Changes in this PR

  • Add gemma_translate() function with google/translategemma-4b-it as default model
  • Call get_max_new_tokens() as required
  • Update translate() router to support model="gemma"
  • Add HuggingFace authentication documentation to DEVELOPERNOTES.md
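The router change described above can be sketched as follows. This is a minimal illustration based only on this PR's description; the real signatures in muse.translation.translate may differ, and the gemma_translate body here is a placeholder:

```python
# Minimal sketch of the translate() router dispatch described in this PR
# (signatures and module layout are assumptions, not the repo's actual code).
def gemma_translate(src_lang, tgt_lang, text,
                    model_name="google/translategemma-4b-it"):
    # Placeholder body; the real implementation loads the model and generates.
    return f"[{model_name}] {src_lang}->{tgt_lang}: {text}"

def translate(model, src_lang, tgt_lang, text):
    """Route a translation request to the backend selected by `model`."""
    if model == "gemma":
        return gemma_translate(src_lang, tgt_lang, text)
    raise ValueError(f"Unknown model: {model}")
```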

Notes

  • Requires HuggingFace authentication and license acceptance
  • Tested with Spanish, Chinese, Japanese ↔ English translations

Reviewer Checklist

  • Verify get_max_new_tokens() is called correctly
  • Test with at least one language pair (e.g., PYTHONPATH=src .venv/bin/python -c "from muse.translation.translate import translate; print(translate('gemma', 'es', 'en', 'Hola mundo'))")

tanhaow added 4 commits March 3, 2026 12:45
- Add gemma_translate() function with google/translategemma-4b-it as default
- Implement language validation for 55 supported languages
- Call get_max_new_tokens() as required by issue
- Add HuggingFace authentication error handling with clear messages
- Update translate() router to support model='gemma'
- Add comprehensive HF authentication documentation to DEVELOPERNOTES.md

Closes #43
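The commits above note that get_max_new_tokens() must be called as the issue requires. As a rough illustration of what such a helper typically does, here is a hypothetical token-budget heuristic; the actual function in this repo may use entirely different logic:

```python
def get_max_new_tokens(input_len, factor=2.0, floor=64):
    """Hypothetical token-budget heuristic (NOT the repo's real function):
    allow the output to be up to `factor` times the input length,
    with a minimum budget of `floor` tokens."""
    return max(floor, int(input_len * factor))
```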
@tanhaow tanhaow requested a review from laurejt March 3, 2026 19:48
@laurejt laurejt left a comment

This is looking pretty good, just needs a few changes before it's ready to go.

  • Revise the developer notes, making sure our documentation is consistent with HuggingFace's.
  • Remove gemma_langs.py since we don't need it for the current use case. The model will throw an error if an unsupported language is requested.
  • Update the gemma_translate method to (1) remove language validation, (2) update the AutoModel class, and (3) update model input handling


We don't need this since TranslateGemma accepts ISO-639-1 language codes. We need these indices for NLLB and HY-MT 1.5 because we need a mapping from ISO-639-1 language codes to whatever form the model is expecting for that language.
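The distinction above can be shown with a toy mapping. NLLB expects FLORES-200-style codes rather than ISO-639-1, which is why an index module is needed for it but not for TranslateGemma (the mapping below is a small illustrative excerpt, not the repo's actual table):

```python
# Why NLLB needs an index: it takes FLORES-200-style codes, not ISO-639-1.
# Toy excerpt for illustration; the real tables live in the repo's *_langs modules.
NLLB_LANG_IDX = {"es": "spa_Latn", "en": "eng_Latn", "zh": "zho_Hans"}

def to_nllb_code(iso_639_1):
    """Map an ISO-639-1 code to the form NLLB expects."""
    return NLLB_LANG_IDX[iso_639_1]
```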

Comment on lines +266 to +270
# Validate input languages
if src_lang not in gemma_lang_idx:
    raise ValueError(f"Source language '{src_lang}' is not supported")
if tgt_lang not in gemma_lang_idx:
    raise ValueError(f"Target language '{tgt_lang}' is not supported")

Remove this validation. According to the model card, an error will be thrown if an unsupported language code is provided.
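Deferring validation to the model, as the review suggests, can look like the sketch below. The exact exception type the model raises is an assumption here; the wrapper and its name are hypothetical:

```python
# Sketch of relying on the model to reject unsupported codes, per the review.
# The exception type (ValueError) is an assumption; check the model card.
def translate_checked(translate_fn, src_lang, tgt_lang, text):
    try:
        return translate_fn(src_lang, tgt_lang, text)
    except ValueError as err:
        raise ValueError(
            f"Model rejected language pair {src_lang}->{tgt_lang}: {err}"
        ) from err
```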


1. Visit [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Click "New token" and select "Read" access type
3. Copy the token and use it with `huggingface-cli login`

Noting again, this was an invalid command, at least for me; I needed to use `hf auth login` instead.
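For reference, the working auth setup looks like this (a one-time interactive step; recent huggingface_hub versions expose the `hf` entry point in place of the older `huggingface-cli` alias):

```shell
# One-time setup: authenticate with a "Read" access token from
# https://huggingface.co/settings/tokens (prompted interactively).
hf auth login
```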

Comment on lines +279 to +280
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(model_name)

This model is actually an ImageTextToText model, so we need to load a different type of model. In this case we can still use the AutoTokenizer (which loads the Gemma3 Tokenizer) because its input is compatible with the TranslateGemma models.

Suggested change
 LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
-LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(model_name)
+LOADED_MODEL["model"] = AutoModelForImageTextToText.from_pretrained(model_name)

Comment on lines +317 to +331
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
input_len = tokenized_chat[0].size()[0]
if verbose:
    print(f"Input length: {input_len} tokens")

# Generate translation
if verbose:
    start = timer()
outputs = model.generate(
    tokenized_chat.to(model.device),

Model raises some warnings about input attention masks and padding, so follow the example from the model card instead.

Suggested change
 tokenized_chat = tokenizer.apply_chat_template(
     messages,
     tokenize=True,
     add_generation_prompt=True,
+    return_dict=True,
     return_tensors="pt",
-)
-input_len = tokenized_chat[0].size()[0]
+).to(model.device)
+input_len = len(tokenized_chat["input_ids"][0])
 if verbose:
     print(f"Input length: {input_len} tokens")

 # Generate translation
 if verbose:
     start = timer()
 outputs = model.generate(
-    tokenized_chat.to(model.device),
+    **tokenized_chat,
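The `messages` variable fed into apply_chat_template above could be built as follows. This is a hypothetical prompt shape for illustration only; the actual content schema should be taken from the TranslateGemma model card:

```python
# Hypothetical construction of the `messages` list passed to apply_chat_template;
# the exact content format must follow the TranslateGemma model card.
def build_messages(src_lang, tgt_lang, text):
    prompt = f"Translate from {src_lang} to {tgt_lang}: {text}"
    return [{"role": "user", "content": prompt}]
```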

    AutoTokenizer,
)

from muse.translation.gemma_langs import lang_index as gemma_lang_idx

Remove

tanhaow and others added 3 commits March 5, 2026 12:51
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
@tanhaow tanhaow requested a review from laurejt March 6, 2026 14:30

@laurejt laurejt left a comment


This is ready to go once the latest code-breaking revisions to translate_hymt are removed.

Comment on lines +106 to +114
input_len = len(tokenized_chat["input_ids"][0])
if verbose:
    print(f"Input length: {input_len} tokens")

# Generate translation
if verbose:
    start = timer()
outputs = model.generate(
    tokenized_chat.to(model.device),
    **tokenized_chat,

Undo these changes because they cause the code to break. These changes are not compatible with HYMT-1.5 models.

Suggested change
-input_len = len(tokenized_chat["input_ids"][0])
+input_len = tokenized_chat[0].size()[0]
 if verbose:
     print(f"Input length: {input_len} tokens")

 # Generate translation
 if verbose:
     start = timer()
 outputs = model.generate(
     tokenized_chat.to(model.device),
-    **tokenized_chat,


@laurejt laurejt Mar 6, 2026


Although HYMT-1.5 and TranslateGemma both use the apply_chat_template tokenizer method, the models themselves have different input requirements. Additionally, the tokenized_chat variables produced in hymt_translate and gemma_translate are not the same type, so L106 cannot be the same.
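The type difference described above can be shown with toy stand-ins (plain lists/dicts in place of tensors and BatchEncoding objects, so no transformers install is needed):

```python
# Toy stand-ins for the two tokenizer outputs the reviewer contrasts:
# hymt_translate's apply_chat_template returns a tensor (rows indexed directly),
# while gemma_translate's return_dict=True call yields a dict-like BatchEncoding.
hymt_chat = [[101, 2023, 2003, 102]]                  # tensor-like: 1 x seq_len
gemma_chat = {"input_ids": [[101, 2023, 2003, 102]]}  # BatchEncoding-like mapping

hymt_len = len(hymt_chat[0])                 # mirrors tokenized_chat[0].size()[0]
gemma_len = len(gemma_chat["input_ids"][0])  # mirrors len(...["input_ids"][0])
```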

@tanhaow (Author)


Thanks for catching this!

@tanhaow tanhaow merged commit 71ed40b into develop Mar 6, 2026
1 check passed
@tanhaow tanhaow deleted the feature/add-translategemma branch March 6, 2026 16:14