8 changes: 5 additions & 3 deletions README.md

Move the setup notes to DEVELOPERNOTES.md

Update instructions to use gcloud: https://docs.cloud.google.com/translate/docs/authentication#client-libs

@@ -2,6 +2,8 @@

This repository contains in-progress experimental research software for the CDH project [MuSE (Multilingual Semantic Embeddings)](https://cdh.princeton.edu/projects/muse/).

For developer setup instructions, including Google Cloud Translation configuration, see [docs/DEVELOPERNOTES.md](docs/DEVELOPERNOTES.md).

## Phase 1

In the first phase of the project, we will assess how well off-the-shelf multilingual translation models perform in the music-theoretical domain.
@@ -10,11 +12,11 @@ The first phase of the project, we will assess how well off-the-shelf multilingu

Three models will be evaluated: a commercial state-of-the-art model and two open-weights models available on 🤗 Hugging Face.

1. **TTLM**. Googles [Translation LLM (TTLM) model](https://docs.cloud.google.com/translate/docs/translation-llm) available through Google Cloud Translation.
1. **TLLM**. Google's [Translation LLM (TLLM) model](https://docs.cloud.google.com/translate/docs/translation-llm) available through Google Cloud Translation.

2. **HY-MT1.5**. Tencents Hunyuan Translation Model Version 1.5. We use the [1.8B parameter model](https://huggingface.co/tencent/HY-MT1.5-1.8B).
2. **HY-MT1.5**. Tencent's Hunyuan Translation Model Version 1.5. We use the [1.8B parameter model](https://huggingface.co/tencent/HY-MT1.5-1.8B).

3. **MADLAD-400**. Googles MADLAD-400 translation model that supports over 400 languages. We use the [3B parameter model](https://huggingface.co/google/madlad400-3b-mt).
3. **MADLAD-400**. Google's MADLAD-400 translation model that supports over 400 languages. We use the [3B parameter model](https://huggingface.co/google/madlad400-3b-mt).

### Software Pipeline

24 changes: 24 additions & 0 deletions docs/DEVELOPERNOTES.md
@@ -0,0 +1,24 @@
# Developer Notes

## Google Cloud Translation Setup

The MuSE project supports Google Cloud's Translation LLM (TLLM) model for machine translation. This requires Google Cloud CLI (gcloud) setup and authentication.

### Prerequisites

1. **Install Google Cloud CLI**

- Follow instructions at: https://cloud.google.com/sdk/docs/install
- Verify installation: `gcloud --version`

2. **Authenticate with Application Default Credentials**

```bash
gcloud auth application-default login
```

3. **Set required environment variables**

```bash
export GOOGLE_CLOUD_PROJECT="cdh-muse"
```
1 change: 1 addition & 0 deletions pyproject.toml
@@ -28,6 +28,7 @@ dependencies = [
"numpy",
"ipython", # Required by transformers for trainer functionality
"transformers[torch, sentencepiece, tiktoken]",
"google-cloud-translate",
"orjsonl",
"ftfy",
]
84 changes: 78 additions & 6 deletions src/muse/translation/translate.py
@@ -5,11 +5,13 @@

The translate() function provides a unified interface for translating text across
multiple models. Model-specific functions (hymt_translate, nllb_translate,
madlad_translate) are also available for direct use.
madlad_translate, google_cloud_translate) are also available for direct use.
"""

import os
from timeit import default_timer as timer

from google.cloud import translate_v3
from transformers import (
AutoModelForCausalLM,
AutoModelForSeq2SeqLM,
@@ -28,6 +30,7 @@
"tencent/HY-MT1.5-7B": "hymt",
"facebook/nllb-200-3.3B": "nllb",
"google/madlad400-7b-mt": "madlad",
"google/translation-llm": "google_cloud",
}


@@ -217,6 +220,72 @@ def madlad_translate(
return tr_text


def google_cloud_translate(
src_lang: str,
tgt_lang: str,
text: str,
verbose: bool = False,
) -> str:
"""
Translate text using Google Cloud Translate API with Translation LLM (TLLM) model.
Languages are specified with their ISO 639-1 codes (e.g., "zh", "ja", "es", "en").

Requires gcloud CLI authentication. See docs/DEVELOPERNOTES.md for setup.

Args:
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
text: Text to translate from source to target language
verbose: If True, print timing information

Returns:
Translated text as a string

Raises:
ValueError: If GOOGLE_CLOUD_PROJECT environment variable is not set
"""

Rewrite to describe assumed gcloud usage

project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
if not project_id:
raise ValueError(
"GOOGLE_CLOUD_PROJECT environment variable is not set. "
"Set it with: export GOOGLE_CLOUD_PROJECT='cdh-muse'"
)

region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

if verbose:
start = timer()

client = translate_v3.TranslationServiceClient()

if verbose:
print(
f"Initialized Google Cloud Translate client in {timer() - start:.2f} seconds"
)

parent = f"projects/{project_id}/locations/{region}"
model_path = f"{parent}/models/general/translation-llm"

if verbose:
start = timer()

response = client.translate_text(
contents=[text],
target_language_code=tgt_lang,
source_language_code=src_lang,
parent=parent,
model=model_path,
mime_type="text/plain",
)

if verbose:
print(f"Received translation response in {timer() - start:.2f} seconds")

translated_text = response.translations[0].translated_text

return translated_text


def translate(
model: str,
src_lang: str,
@@ -225,23 +294,24 @@ def translate(
verbose: bool = False,
) -> str:
"""
Translate text using a specified HuggingFace translation model. This function
provides a unified interface for translating text across multiple translation
models by routing to the appropriate model-specific implementation based on the
model parameter.
Translate text using a specified translation model. This function provides a
unified interface for translating text across multiple translation models by
routing to the appropriate model-specific implementation based on the model
parameter.

Supported models:
- tencent/HY-MT1.5-7B: Tencent's Hunyuan Translation Model v1.5 (7B)
- facebook/nllb-200-3.3B: Meta's No Language Left Behind (3.3B)
- google/madlad400-7b-mt: Google's MADLAD-400 (7B)
- google/translation-llm: Google Cloud Translation LLM (TLLM)

Languages are specified using ISO 639-1 codes (e.g., "zh", "ja", "es", "en").
Language validation is delegated to the model-specific functions, so supported
languages vary by model. The MADLAD model does not use the source language
parameter internally, but it is accepted for API consistency.

Args:
model: HuggingFace model identifier (must be one of the supported models)
model: Model identifier (must be one of the supported models)
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
text: Text to translate from source to target language
@@ -269,6 +339,8 @@ def translate(
elif model_type == "madlad":
# MADLAD does not use src_lang parameter
return madlad_translate(tgt_lang, text, model, verbose)
elif model_type == "google_cloud":
return google_cloud_translate(src_lang, tgt_lang, text, verbose=verbose)
else:
# This should never happen if SUPPORTED_MODELS is correctly maintained
raise ValueError(f"Unknown model type: {model_type}")
12 changes: 6 additions & 6 deletions src/muse/translation/translate_corpus.py
@@ -2,7 +2,7 @@
Generate machine translation corpus from parallel text corpus.

This script processes a parallel text corpus (JSONL format) and generates
machine translations using HuggingFace models. For each input record, it
machine translations using supported translation models. For each input record, it
produces two translations: original→English and English→original.

Each output record represents a single translation with fields: tr_id, pair_id,
@@ -35,7 +35,7 @@ def validate_model(model: str) -> None:
Validate that the specified model is supported.

Args:
model: HuggingFace model identifier
model: Model identifier

Raises:
ValueError: If model is not supported
@@ -61,7 +61,7 @@ def generate_translation_record(

Args:
pair_id: ID of the source parallel text pair
model: HuggingFace model identifier
model: Model identifier
src_lang: Source language ISO 639-1 code
tgt_lang: Target language ISO 639-1 code
src_text: Source text to translate
@@ -111,7 +111,7 @@ def generate_translations(

Args:
input_path: Path to input parallel corpus JSONL file
model: HuggingFace model identifier
model: Model identifier
verbose: If True, print timing and token information during translation

Yields:
@@ -186,7 +186,7 @@ def save_translated_corpus(
Args:
input_path: Path to input parallel corpus JSONL file
output_path: Path to output machine translation corpus JSONL file
model: HuggingFace model identifier
model: Model identifier
verbose: If True, print timing and token information during translation
"""
# Count total records for progress bar
@@ -224,7 +224,7 @@ def main():
)
args.add_argument(
"model",
help="HuggingFace model identifier (e.g., tencent/HY-MT1.5-7B)",
help="Model identifier (e.g., tencent/HY-MT1.5-7B, google/translation-llm)",
)
args.add_argument(
"input", type=pathlib.Path, help="Input parallel corpus JSONL file"
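
The two-translations-per-record behavior described in `translate_corpus.py` (original→English and English→original) can be sketched as a small generator. The field names beyond `pair_id` are illustrative assumptions, not the project's actual output schema, and `translate_fn` stands in for `translate()`:

```python
def direction_records(pair_id, orig_lang, orig_text, en_text, translate_fn):
    """Yield one record per translation direction for a parallel pair."""
    # Direction 1: original language -> English.
    yield {
        "pair_id": pair_id,
        "src_lang": orig_lang,
        "tgt_lang": "en",
        "tr_text": translate_fn(orig_lang, "en", orig_text),
    }
    # Direction 2: English -> original language.
    yield {
        "pair_id": pair_id,
        "src_lang": "en",
        "tgt_lang": orig_lang,
        "tr_text": translate_fn("en", orig_lang, en_text),
    }
```

Writing one output record per direction (rather than one per pair) keeps the JSONL corpus flat, so downstream evaluation can filter by `src_lang`/`tgt_lang` without unpacking nested structures.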