8 changes: 5 additions & 3 deletions README.md

Move the setup notes to DEVELOPERNOTES.md

Update instructions to use gcloud: https://docs.cloud.google.com/translate/docs/authentication#client-libs

@@ -2,6 +2,8 @@

This repository contains in-progress experimental research software for the CDH project [MuSE (Multilingual Semantic Embeddings)](https://cdh.princeton.edu/projects/muse/).

For developer setup instructions, including Google Cloud Translation configuration, see [docs/DEVELOPERNOTES.md](docs/DEVELOPERNOTES.md).

## Phase 1

In the first phase of the project, we will assess how well off-the-shelf multilingual translation models perform in the music-theoretical domain.
@@ -10,11 +12,11 @@ The first phase of the project, we will assess how well off-the-shelf multilingu

Three models will be evaluated: a commercial state-of-the-art model and two open-weights models available on 🤗 Hugging Face.

1. **TTLM**. Googles [Translation LLM (TTLM) model](https://docs.cloud.google.com/translate/docs/translation-llm) available through Google Cloud Translation.
1. **TLLM**. Google's [Translation LLM (TLLM) model](https://docs.cloud.google.com/translate/docs/translation-llm) available through Google Cloud Translation.

2. **HY-MT1.5**. Tencents Hunyuan Translation Model Version 1.5. We use the [1.8B parameter model](https://huggingface.co/tencent/HY-MT1.5-1.8B).
2. **HY-MT1.5**. Tencent's Hunyuan Translation Model Version 1.5. We use the [1.8B parameter model](https://huggingface.co/tencent/HY-MT1.5-1.8B).

3. **MADLAD-400**. Googles MADLAD-400 translation model that supports over 400 languages. We use the [3B parameter model](https://huggingface.co/google/madlad400-3b-mt).
3. **MADLAD-400**. Google's MADLAD-400 translation model that supports over 400 languages. We use the [3B parameter model](https://huggingface.co/google/madlad400-3b-mt).

### Software Pipeline

24 changes: 24 additions & 0 deletions docs/DEVELOPERNOTES.md
@@ -0,0 +1,24 @@
# Developer Notes

## Google Cloud Translation Setup

The MuSE project supports Google Cloud's Translation LLM (TLLM) model for machine translation. This requires Google Cloud CLI (gcloud) setup and authentication.

### Prerequisites

1. **Install Google Cloud CLI**

- Follow instructions at: https://cloud.google.com/sdk/docs/install
- Verify installation: `gcloud --version`

2. **Authenticate with Application Default Credentials**

```bash
gcloud auth application-default login
```

3. **Set required environment variables**

```bash
export GOOGLE_CLOUD_PROJECT="cdh-muse"
```
1 change: 1 addition & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ dependencies = [
"numpy",
"ipython", # Required by transformers for trainer functionality
"transformers[torch, sentencepiece, tiktoken]",
"google-cloud-translate",
"orjsonl",
"ftfy",
]
84 changes: 78 additions & 6 deletions src/muse/translation/translate.py
@@ -5,11 +5,13 @@

The translate() function provides a unified interface for translating text across
multiple models. Model-specific functions (hymt_translate, nllb_translate,
madlad_translate) are also available for direct use.
madlad_translate, google_cloud_translate) are also available for direct use.
"""

import os
from timeit import default_timer as timer

from google.cloud import translate_v3
from transformers import (
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
@@ -28,6 +30,7 @@
"tencent/HY-MT1.5-7B": "hymt",
"facebook/nllb-200-3.3B": "nllb",
"google/madlad400-7b-mt": "madlad",
"google/translation-llm": "google_cloud",
}


@@ -217,6 +220,72 @@ def madlad_translate(
return tr_text


def google_cloud_translate(
src_lang: str,
tgt_lang: str,
text: str,
verbose: bool = False,
) -> str:
"""
Translate text using Google Cloud Translate API with Translation LLM (TLLM) model.
Languages are specified with their ISO 639-1 codes (e.g., "zh", "ja", "es", "en").

Requires gcloud CLI authentication. See docs/DEVELOPERNOTES.md for setup.

Args:
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
text: Text to translate from source to target language
verbose: If True, print timing information

Returns:
Translated text as a string

Raises:
ValueError: If GOOGLE_CLOUD_PROJECT environment variable is not set
"""

Rewrite to describe assumed gcloud usage

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
raise ValueError(
"GOOGLE_CLOUD_PROJECT environment variable is not set. "
"Set it with: export GOOGLE_CLOUD_PROJECT='cdh-muse'"
)

region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

if verbose:
start = timer()

client = translate_v3.TranslationServiceClient()

if verbose:
print(
f"Initialized Google Cloud Translate client in {timer() - start:.2f} seconds"
)

parent = f"projects/{project_id}/locations/{region}"
model_path = f"{parent}/models/general/translation-llm"

if verbose:
start = timer()

response = client.translate_text(
contents=[text],
target_language_code=tgt_lang,
source_language_code=src_lang,
parent=parent,
model=model_path,
mime_type="text/plain",
)

if verbose:
print(f"Received translation response in {timer() - start:.2f} seconds")

translated_text = response.translations[0].translated_text

return translated_text


def translate(
model: str,
src_lang: str,
@@ -225,23 +294,24 @@ def translate(
verbose: bool = False,
) -> str:
"""
Translate text using a specified HuggingFace translation model. This function
provides a unified interface for translating text across multiple translation
models by routing to the appropriate model-specific implementation based on the
model parameter.
Translate text using a specified translation model. This function provides a
unified interface for translating text across multiple translation models by
routing to the appropriate model-specific implementation based on the model
parameter.

Supported models:
- tencent/HY-MT1.5-7B: Tencent's Hunyuan Translation Model v1.5 (7B)
- facebook/nllb-200-3.3B: Meta's No Language Left Behind (3.3B)
- google/madlad400-7b-mt: Google's MADLAD-400 (7B)
- google/translation-llm: Google Cloud Translation LLM (TLLM)

Languages are specified using ISO 639-1 codes (e.g., "zh", "ja", "es", "en").
Language validation is delegated to the model-specific functions, so supported
languages vary by model. The MADLAD model does not use the source language
parameter internally, but it is accepted for API consistency.

Args:
model: HuggingFace model identifier (must be one of the supported models)
model: Model identifier (must be one of the supported models)
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
text: Text to translate from source to target language
@@ -269,6 +339,8 @@ def translate(
elif model_type == "madlad":
# MADLAD does not use src_lang parameter
return madlad_translate(tgt_lang, text, model, verbose)
elif model_type == "google_cloud":
return google_cloud_translate(src_lang, tgt_lang, text, verbose=verbose)
else:
# This should never happen if SUPPORTED_MODELS is correctly maintained
raise ValueError(f"Unknown model type: {model_type}")
12 changes: 6 additions & 6 deletions src/muse/translation/translate_corpus.py
@@ -2,7 +2,7 @@
Generate machine translation corpus from parallel text corpus.

This script processes a parallel text corpus (JSONL format) and generates
machine translations using HuggingFace models. For each input record, it
machine translations using supported translation models. For each input record, it
produces two translations: original→English and English→original.

Each output record represents a single translation with fields: tr_id, pair_id,
@@ -35,7 +35,7 @@ def validate_model(model: str) -> None:
Validate that the specified model is supported.

Args:
model: HuggingFace model identifier
model: Model identifier

Raises:
ValueError: If model is not supported
@@ -61,7 +61,7 @@ def generate_translation_record(

Args:
pair_id: ID of the source parallel text pair
model: HuggingFace model identifier
model: Model identifier
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
src_text: Source text to translate
@@ -111,7 +111,7 @@ def generate_translations(

Args:
input_path: Path to input parallel corpus JSONL file
model: HuggingFace model identifier
model: Model identifier
verbose: If True, print timing and token information during translation

Yields:
@@ -186,7 +186,7 @@ def save_translated_corpus(
Args:
input_path: Path to input parallel corpus JSONL file
output_path: Path to output machine translation corpus JSONL file
model: HuggingFace model identifier
model: Model identifier
verbose: If True, print timing and token information during translation
"""
# Count total records for progress bar
@@ -224,7 +224,7 @@ def main():
)
args.add_argument(
"model",
help="HuggingFace model identifier (e.g., tencent/HY-MT1.5-7B)",
help="Model identifier (e.g., tencent/HY-MT1.5-7B, google/translation-llm)",
)
args.add_argument(
"input", type=pathlib.Path, help="Input parallel corpus JSONL file"
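
The two-translations-per-record behavior described in `translate_corpus.py` (original→English and English→original) can be sketched as a small generator. The field names beyond `pair_id` are illustrative assumptions, not the project's actual output schema, and `translate_fn` stands in for `translate()`:

```python
def direction_records(pair_id, orig_lang, orig_text, en_text, translate_fn):
    """Yield one record per translation direction for a parallel pair."""
    # Direction 1: original language -> English.
    yield {
        "pair_id": pair_id,
        "src_lang": orig_lang,
        "tgt_lang": "en",
        "tr_text": translate_fn(orig_lang, "en", orig_text),
    }
    # Direction 2: English -> original language.
    yield {
        "pair_id": pair_id,
        "src_lang": "en",
        "tgt_lang": orig_lang,
        "tr_text": translate_fn("en", orig_lang, en_text),
    }
```

Writing one output record per direction (rather than one per pair) keeps the JSONL corpus flat, so downstream evaluation can filter by `src_lang`/`tgt_lang` without unpacking nested structures.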