Feature/add translategemma #44

Merged
tanhaow merged 13 commits into develop from feature/add-translategemma
Mar 6, 2026

Conversation

@tanhaow tanhaow commented Mar 3, 2026

Associated Issue(s): resolves #43

Changes in this PR

  • Add gemma_translate() function with google/translategemma-4b-it as default model
  • Call get_max_new_tokens() as required
  • Update translate() router to support model="gemma"
  • Add HuggingFace authentication documentation to DEVELOPERNOTES.md
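The router change described above can be sketched as follows. This is a minimal illustration based only on this PR's description; the real signatures in muse.translation.translate may differ, and the gemma_translate body here is a placeholder:

```python
# Minimal sketch of the translate() router dispatch described in this PR
# (signatures and module layout are assumptions, not the repo's actual code).
def gemma_translate(src_lang, tgt_lang, text,
                    model_name="google/translategemma-4b-it"):
    # Placeholder body; the real implementation loads the model and generates.
    return f"[{model_name}] {src_lang}->{tgt_lang}: {text}"

def translate(model, src_lang, tgt_lang, text):
    """Route a translation request to the backend selected by `model`."""
    if model == "gemma":
        return gemma_translate(src_lang, tgt_lang, text)
    raise ValueError(f"Unknown model: {model}")
```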

Notes

  • Requires HuggingFace authentication and license acceptance
  • Tested with Spanish, Chinese, Japanese ↔ English translations

Reviewer Checklist

  • Verify get_max_new_tokens() is called correctly
  • Test with at least one language pair (e.g., PYTHONPATH=src .venv/bin/python -c "from muse.translation.translate import translate; print(translate('gemma', 'es', 'en', 'Hola mundo'))")

tanhaow added 4 commits March 3, 2026 12:45
- Add gemma_translate() function with google/translategemma-4b-it as default
- Implement language validation for 55 supported languages
- Call get_max_new_tokens() as required by issue
- Add HuggingFace authentication error handling with clear messages
- Update translate() router to support model='gemma'
- Add comprehensive HF authentication documentation to DEVELOPERNOTES.md

Closes #43
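The commits above note that get_max_new_tokens() must be called as the issue requires. As a rough illustration of what such a helper typically does, here is a hypothetical token-budget heuristic; the actual function in this repo may use entirely different logic:

```python
def get_max_new_tokens(input_len, factor=2.0, floor=64):
    """Hypothetical token-budget heuristic (NOT the repo's real function):
    allow the output to be up to `factor` times the input length,
    with a minimum budget of `floor` tokens."""
    return max(floor, int(input_len * factor))
```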
@tanhaow tanhaow requested a review from laurejt March 3, 2026 19:48
@laurejt laurejt left a comment

This is looking pretty good, just needs a few changes before it's ready to go.

  • Revise the developer notes, making sure our documentation is consistent with HuggingFace's.
  • Remove gemma_langs.py since we don't need it for the current use case. The model will throw an error if an unsupported language is requested.
  • Update the gemma_translate method to (1) remove language validation, (2) update the AutoModel class, and (3) update model input handling


We don't need this since TranslateGemma accepts ISO-639-1 language codes. We need these indices for NLLB and HY-MT 1.5 because we need a mapping from ISO-639-1 language codes to whatever form the model is expecting for that language.
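The distinction above can be shown with a toy mapping. NLLB expects FLORES-200-style codes rather than ISO-639-1, which is why an index module is needed for it but not for TranslateGemma (the mapping below is a small illustrative excerpt, not the repo's actual table):

```python
# Why NLLB needs an index: it takes FLORES-200-style codes, not ISO-639-1.
# Toy excerpt for illustration; the real tables live in the repo's *_langs modules.
NLLB_LANG_IDX = {"es": "spa_Latn", "en": "eng_Latn", "zh": "zho_Hans"}

def to_nllb_code(iso_639_1):
    """Map an ISO-639-1 code to the form NLLB expects."""
    return NLLB_LANG_IDX[iso_639_1]
```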

Comment on lines +266 to +270
# Validate input languages
if src_lang not in gemma_lang_idx:
    raise ValueError(f"Source language '{src_lang}' is not supported")
if tgt_lang not in gemma_lang_idx:
    raise ValueError(f"Target language '{tgt_lang}' is not supported")

Remove this validation. According to the model card, an error will be thrown if an unsupported language code is provided.
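Deferring validation to the model, as the review suggests, can look like the sketch below. The exact exception type the model raises is an assumption here; the wrapper and its name are hypothetical:

```python
# Sketch of relying on the model to reject unsupported codes, per the review.
# The exception type (ValueError) is an assumption; check the model card.
def translate_checked(translate_fn, src_lang, tgt_lang, text):
    try:
        return translate_fn(src_lang, tgt_lang, text)
    except ValueError as err:
        raise ValueError(
            f"Model rejected language pair {src_lang}->{tgt_lang}: {err}"
        ) from err
```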


1. Visit [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Click "New token" and select "Read" access type
3. Copy the token and use it with `huggingface-cli login`

Noting again, this was an invalid command, at least for me; I needed to use `hf auth login` instead.
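For reference, the working auth setup looks like this (a one-time interactive step; recent huggingface_hub versions expose the `hf` entry point in place of the older `huggingface-cli` alias):

```shell
# One-time setup: authenticate with a "Read" access token from
# https://huggingface.co/settings/tokens (prompted interactively).
hf auth login
```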

Comment on lines +279 to +280
LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(model_name)

This model is actually an ImageTextToText model, so we need to load a different type of model. In this case we can still use the AutoTokenizer (which loads the Gemma3 Tokenizer) because its input is compatible with the TranslateGemma models.

Suggested change
 LOADED_MODEL["tokenizer"] = AutoTokenizer.from_pretrained(model_name)
-LOADED_MODEL["model"] = AutoModelForCausalLM.from_pretrained(model_name)
+LOADED_MODEL["model"] = AutoModelForImageTextToText.from_pretrained(model_name)

Comment on lines +317 to +331
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
input_len = tokenized_chat[0].size()[0]
if verbose:
    print(f"Input length: {input_len} tokens")

# Generate translation
if verbose:
    start = timer()
outputs = model.generate(
    tokenized_chat.to(model.device),

Model raises some warnings about input attention masks and padding, so follow the example from the model card instead.

Suggested change
 tokenized_chat = tokenizer.apply_chat_template(
     messages,
     tokenize=True,
     add_generation_prompt=True,
+    return_dict=True,
     return_tensors="pt",
-)
-input_len = tokenized_chat[0].size()[0]
+).to(model.device)
+input_len = len(tokenized_chat["input_ids"][0])
 if verbose:
     print(f"Input length: {input_len} tokens")

 # Generate translation
 if verbose:
     start = timer()
 outputs = model.generate(
-    tokenized_chat.to(model.device),
+    **tokenized_chat,
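The `messages` variable fed into apply_chat_template above could be built as follows. This is a hypothetical prompt shape for illustration only; the actual content schema should be taken from the TranslateGemma model card:

```python
# Hypothetical construction of the `messages` list passed to apply_chat_template;
# the exact content format must follow the TranslateGemma model card.
def build_messages(src_lang, tgt_lang, text):
    prompt = f"Translate from {src_lang} to {tgt_lang}: {text}"
    return [{"role": "user", "content": prompt}]
```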

    AutoTokenizer,
)

from muse.translation.gemma_langs import lang_index as gemma_lang_idx

Remove

tanhaow and others added 3 commits March 5, 2026 12:51
Co-authored-by: Laure Thompson <602628+laurejt@users.noreply.github.com>
@tanhaow tanhaow requested a review from laurejt March 6, 2026 14:30

@laurejt laurejt left a comment


This is ready to go once the latest code-breaking revisions to translate_hymt are removed.

Comment on lines +106 to +114
input_len = len(tokenized_chat["input_ids"][0])
if verbose:
    print(f"Input length: {input_len} tokens")

# Generate translation
if verbose:
    start = timer()
outputs = model.generate(
    tokenized_chat.to(model.device),
    **tokenized_chat,

Undo these changes because they cause the code to break. These changes are not compatible with HYMT-1.5 models.

Suggested change
-input_len = len(tokenized_chat["input_ids"][0])
+input_len = tokenized_chat[0].size()[0]
 if verbose:
     print(f"Input length: {input_len} tokens")

 # Generate translation
 if verbose:
     start = timer()
 outputs = model.generate(
     tokenized_chat.to(model.device),
-    **tokenized_chat,


@laurejt laurejt Mar 6, 2026


Although HYMT-1.5 and TranslateGemma both use the apply_chat_template tokenizer method, the models themselves have different input requirements. Additionally, the tokenized_chat variables produced in hymt_translate and gemma_translate are not the same type, so L106 cannot be the same.
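The type difference described above can be shown with toy stand-ins (plain lists/dicts in place of tensors and BatchEncoding objects, so no transformers install is needed):

```python
# Toy stand-ins for the two tokenizer outputs the reviewer contrasts:
# hymt_translate's apply_chat_template returns a tensor (rows indexed directly),
# while gemma_translate's return_dict=True call yields a dict-like BatchEncoding.
hymt_chat = [[101, 2023, 2003, 102]]                  # tensor-like: 1 x seq_len
gemma_chat = {"input_ids": [[101, 2023, 2003, 102]]}  # BatchEncoding-like mapping

hymt_len = len(hymt_chat[0])                 # mirrors tokenized_chat[0].size()[0]
gemma_len = len(gemma_chat["input_ids"][0])  # mirrors len(...["input_ids"][0])
```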

@tanhaow (Author)


Thanks for catching this!

@tanhaow tanhaow merged commit 71ed40b into develop Mar 6, 2026
1 check passed
@tanhaow tanhaow deleted the feature/add-translategemma branch March 6, 2026 16:14