Feature/add google cloud translate #31

Merged
laurejt merged 10 commits into develop from feature/add-google-cloud-translate
Feb 19, 2026

@tanhaow tanhaow commented Feb 17, 2026

Associated Issue(s): resolves #3 and #4 (The GCP API's performance is stable and doesn't have many errors. I think it's better to integrate GCP translation into the unified translate script that we already have.)

Changes in this PR

  • Added google_cloud_translate() function and registered google/translation-llm in the unified translate() interface to support translating text with Google Cloud's Translation LLM (TLLM) model
  • Added google-cloud-translate dependency to pyproject.toml
  • Added test script test_scripts/test_google_cloud_translate.py for API infrastructure testing
  • Added Google Cloud setup documentation to README.md

Notes

  • Authentication requires service account credentials via GOOGLE_APPLICATION_CREDENTIALS environment variable. Follow the instructions in README and configure it before running.

Reviewer Checklist

  • Follow the instructions in README to download Google Cloud credentials
  • Set up Google Cloud credentials: export GOOGLE_APPLICATION_CREDENTIALS="cdh-muse-6950d66acf83.json"
  • Run API infrastructure tests: python test_scripts/test_google_cloud_translate.py
  • Generate translation results with TLLM using the test script (test_scripts/test_translate.py) and the test dataset (test_cases.jsonl; downloaded from Google Drive): python test_scripts/test_translate.py google/translation-llm test_cases.jsonl
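
Before running the checklist commands, it can help to confirm that the credentials variable is set and points at an existing file. A minimal sketch using only the standard library; the helper name check_credentials_env is hypothetical and is not part of this PR:

```python
import os
from pathlib import Path


def check_credentials_env(env=None):
    """Return the credentials path if the variable is set and the file exists.

    Hypothetical preflight helper; not part of the muse package.
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

Calling check_credentials_env() with no arguments reads the real environment; passing a dict makes the check easy to exercise without touching the shell.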

Implements translation method using Google Cloud's Translation LLM (TLLM) model,
providing state-of-the-art translation quality for Chinese, Japanese, Spanish,
and English. The implementation is language-agnostic and uses ISO 639-1 codes.

Changes:
- Add google_cloud_translate() function in src/muse/translation/translate.py
- Register google/translation-llm in unified translate() interface
- Add google-cloud-translate>=3.15.0 dependency to pyproject.toml
- Create test_google_cloud_translate.py for API infrastructure testing
- Add Google Cloud setup documentation to README.md
- Generate google_translate_results.jsonl with test translations

Authentication uses service account credentials via GOOGLE_APPLICATION_CREDENTIALS
environment variable. All tests passing with response times of 0.3-0.6 seconds.
Update script name and documentation to better reflect its purpose as an API
connectivity test rather than a general Google Cloud Translate test.
Consolidate all test sections into a single streamlined output showing:
- Credentials validation
- Client initialization
- API call success with timing
- Model information
- Source text and translation result
@tanhaow tanhaow self-assigned this Feb 17, 2026
@tanhaow tanhaow requested a review from laurejt February 17, 2026 19:12
@tanhaow tanhaow force-pushed the feature/add-google-cloud-translate branch from a306709 to 3738da8 Compare February 17, 2026 20:07
@laurejt laurejt left a comment

Thanks for looking into the robustness of the google cloud translation API, but be sure to document this in the issue. Remember also that this issue should include a cost estimate for the Notion parallel sentence corpus.

Note: This PR will not capture #4. Additional updates will be needed to the translate_corpus.py script to fail early if errors are encountered with Google Translate. Do not begin working on #4 until this PR and #24 are merged/closed.

Requested Changes:

  1. Move the google cloud translation code to a separate file and do not integrate it into the HuggingFace translation script.

  2. Update the code so that credentials are assumed. There should be no information about the google cloud project hard-coded in the file. Instead use something like this line from the google cloud translate documentation:

PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")

  3. Update the documentation to describe the expected credential setup and assumptions.
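
One way requested change 2 could look is sketched below: the project comes from GOOGLE_CLOUD_PROJECT and nothing project-specific is written into the file. The helper name resolve_gcp_settings, the GOOGLE_CLOUD_REGION variable, and the us-central1 fallback are assumptions for illustration, and the model path follows the general/translation-llm resource pattern as I understand it from the Cloud Translation docs, so verify it before relying on it.

```python
import os


def resolve_gcp_settings(env=None):
    """Read project and region from the environment instead of hard-coding them.

    Hypothetical helper; returns the (parent, model_path) resource strings.
    """
    env = os.environ if env is None else env
    project_id = env.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set")
    # Assumption: region is read from GOOGLE_CLOUD_REGION with a fallback.
    region = env.get("GOOGLE_CLOUD_REGION", "us-central1")
    parent = f"projects/{project_id}/locations/{region}"
    # Assumption: resource path format for the Translation LLM model.
    model_path = f"{parent}/models/general/translation-llm"
    return parent, model_path
```

Failing on a missing project up front keeps the error close to its cause rather than surfacing later as an opaque API failure.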

Comment on lines 27 to 33
# Supported models for the unified translate() function
SUPPORTED_MODELS = {
    "tencent/HY-MT1.5-7B": "hymt",
    "facebook/nllb-200-3.3B": "nllb",
    "google/madlad400-7b-mt": "madlad",
    "google/translation-llm": "google_cloud",
}

Get rid of this global index. The underlying models default to the desired version. Instead the short names (i.e., the values of SUPPORTED_MODELS) should be the expected forms for specifying the model in the unified translate method.

However, this change should occur in a different PR (#24)
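
To illustrate the suggestion, here is a rough sketch of a unified translate() that accepts the short names directly. The translate() signature, the default model, and the stub backends are all hypothetical; the real model-specific functions would replace the stubs, and the actual design belongs to #24.

```python
from typing import Callable


def _translate_stub(backend: str) -> Callable[[str, str, str], str]:
    """Build a placeholder backend that tags its output with the backend name."""

    def run(text: str, src_lang: str, tgt_lang: str) -> str:
        return f"[{backend}:{src_lang}->{tgt_lang}] {text}"

    return run


# Short names come from the values of SUPPORTED_MODELS above; the stubs
# stand in for the real per-model translation functions.
BACKENDS = {
    name: _translate_stub(name)
    for name in ("hymt", "nllb", "madlad", "google_cloud")
}


def translate(text: str, src_lang: str, tgt_lang: str, model: str = "nllb") -> str:
    """Dispatch on the short model name; reject anything else early."""
    try:
        backend = BACKENDS[model]
    except KeyError:
        choices = ", ".join(sorted(BACKENDS))
        raise ValueError(f"Unknown model {model!r}; expected one of: {choices}") from None
    return backend(text, src_lang, tgt_lang)
```

With this shape, callers never pass a full identifier such as google/translation-llm, and each backend decides which underlying model version it loads.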

Comment on lines +226 to +228
project_id: str = "cdh-muse",
region: str = "us-central1",
credentials_path: str | None = None,

Remove these inputs; they should be inferred from the active gcloud config.

1. Service account JSON file (via credentials_path parameter)
2. GOOGLE_APPLICATION_CREDENTIALS environment variable
3. Application Default Credentials (gcloud auth application-default login)
"""

Rewrite to describe assumed gcloud usage

Move the setup notes to DEVELOPERNOTES.md

Update instructions to use gcloud: https://docs.cloud.google.com/translate/docs/authentication#client-libs

pyproject.toml Outdated
"numpy",
"ipython", # Required by transformers for trainer functionality
"transformers[torch, sentencepiece, tiktoken]",
"google-cloud-translate>=3.15.0",

Why this specific version? Only specify specific versions when it is absolutely necessary.

Delete this. As we've discussed previously, no data files should be checked into the repo.

Remove this test script since it is not testing any code from the muse package.

Comment on lines +273 to +283
try:
    response = client.translate_text(
        contents=[text],
        target_language_code=tgt_lang,
        source_language_code=src_lang,
        parent=parent,
        model=model_path,
        mime_type="text/plain",
    )
except Exception as e:
    raise Exception(f"Google Cloud Translate API call failed: {e}") from e

This looks like an unnecessary try/except, remove it.
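
For reference, a sketch of the same call without the wrapper. Re-raising as a bare Exception hides the original exception type from callers, whereas letting the client's own error propagate keeps it intact. Passing the client, parent, and model path as arguments is an assumption made here so the snippet stands alone; it is not the signature used in this PR.

```python
def google_cloud_translate(client, text, src_lang, tgt_lang, parent, model_path):
    """Translate one string, letting any API exception propagate unchanged.

    `client` is expected to behave like a Translation v3 service client.
    """
    response = client.translate_text(
        contents=[text],
        target_language_code=tgt_lang,
        source_language_code=src_lang,
        parent=parent,
        model=model_path,
        mime_type="text/plain",
    )
    # One input string was sent, so the first translation is the result.
    return response.translations[0].translated_text
```

Callers that need to fail early on API errors can still catch the specific exception at the level where they can act on it.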

laurejt commented Feb 18, 2026

@tanhaow As we discussed in the whiteboarding session, the google cloud translate code can also be folded into the general translate function. This comment supersedes my earlier comments.

@tanhaow tanhaow requested a review from laurejt February 19, 2026 17:18

tanhaow commented Feb 19, 2026

@laurejt Thanks for your review! It's ready for your second review. Changes:

  • Uses environment variables to configure GOOGLE_CLOUD_PROJECT (cdh-muse) and GOOGLE_CLOUD_REGION (us-1)
  • Removed unnecessary try/except wrapper around API call
  • Removed version constraint from google-cloud-translate dependency
  • Deleted API connectivity test script
  • Moved setup documentation to docs/DEVELOPERNOTES.md with gcloud CLI authentication instructions
  • Updated all "HuggingFace model identifier" references in translate_corpus.py to "Model identifier (HuggingFace or Google Cloud model)"

@laurejt laurejt left a comment

🚀

@laurejt laurejt merged commit 4013379 into develop Feb 19, 2026
1 check passed
@laurejt laurejt deleted the feature/add-google-cloud-translate branch February 19, 2026 19:34