Implements translation method using Google Cloud's Translation LLM (TLLM) model, providing state-of-the-art translation quality for Chinese, Japanese, Spanish, and English. The implementation is language-agnostic and uses ISO 639-1 codes.
Changes:
- Add google_cloud_translate() function in src/muse/translation/translate.py
- Register google/translation-llm in unified translate() interface
- Add google-cloud-translate>=3.15.0 dependency to pyproject.toml
- Create test_google_cloud_translate.py for API infrastructure testing
- Add Google Cloud setup documentation to README.md
- Generate google_translate_results.jsonl with test translations
Authentication uses service account credentials via GOOGLE_APPLICATION_CREDENTIALS environment variable. All tests passing with response times of 0.3-0.6 seconds.
Update script name and documentation to better reflect its purpose as an API connectivity test rather than a general Google Cloud Translate test.
Consolidate all test sections into a single streamlined output showing:
- Credentials validation
- Client initialization
- API call success with timing
- Model information
- Source text and translation result
a306709 to 3738da8
laurejt left a comment
Thanks for looking into the robustness of the google cloud translation API, but be sure to document this in the issue. Remember also that this issue should include a cost estimate for the Notion parallel sentence corpus.
Note: This PR will not capture #4. Additional updates will be needed to the translate_corpus.py script to fail early if errors are encountered with Google Translate. Do not begin working on #4 until this PR and #24 are merged/closed.
Requested Changes:
- Move the google cloud translation code to a separate file and do not integrate it into the HuggingFace translation script.
- Update the code so that credentials are assumed. There should be no information about the google cloud project hard-coded in the file. Instead use something like this line from the google cloud translate documentation:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT")
- Update the documentation to describe the expected credentials set up / assumptions.
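A minimal sketch of what the second request could look like, assuming the project is read from GOOGLE_CLOUD_PROJECT as in the quoted line. The helper names resolve_project_id and build_parent are hypothetical and do not come from the PR; the location is supplied by the caller rather than hard-coded.

```python
import os


def resolve_project_id() -> str:
    """Read the project ID from the environment (no hard-coded default)."""
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT")
    if not project_id:
        # Fail with a clear message instead of silently using a baked-in project.
        raise RuntimeError(
            "GOOGLE_CLOUD_PROJECT is not set; configure the Google Cloud "
            "environment before running translation."
        )
    return project_id


def build_parent(project_id: str, location: str) -> str:
    """Build the resource parent string passed to the Translation API."""
    return f"projects/{project_id}/locations/{location}"
```

With this shape, nothing about the project appears in the source file; a missing environment variable surfaces as an explicit error at start-up.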
    # Supported models for the unified translate() function
    SUPPORTED_MODELS = {
        "tencent/HY-MT1.5-7B": "hymt",
        "facebook/nllb-200-3.3B": "nllb",
        "google/madlad400-7b-mt": "madlad",
        "google/translation-llm": "google_cloud",
    }
Get rid of this global index. The underlying models default to the desired version. Instead the short names (i.e., the values of SUPPORTED_MODELS) should be the expected forms for specifying the model in the unified translate method.
However, this change should occur in a different PR (#24)
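One possible reading of this suggestion, sketched under assumptions: callers pass the short name directly and translate() dispatches on it. The backend functions below are placeholders that only tag the input so the sketch runs; they are not the package's real implementations, and the signature of translate() here is hypothetical.

```python
from typing import Callable, Dict

TranslateFn = Callable[[str, str, str], str]


def _make_placeholder(name: str) -> TranslateFn:
    """Return a stand-in backend that tags the text instead of translating it."""
    def backend(text: str, src_lang: str, tgt_lang: str) -> str:
        return f"[{name} {src_lang}->{tgt_lang}] {text}"
    return backend


# Short names are the only identifiers callers need to know (assumed set,
# taken from the values of SUPPORTED_MODELS in the diff).
_BACKENDS: Dict[str, TranslateFn] = {
    name: _make_placeholder(name)
    for name in ("hymt", "nllb", "madlad", "google_cloud")
}


def translate(model: str, text: str, src_lang: str, tgt_lang: str) -> str:
    """Dispatch on the short model name; reject anything else."""
    try:
        backend = _BACKENDS[model]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS))
        raise ValueError(f"Unknown model {model!r}; expected one of: {known}") from None
    return backend(text, src_lang, tgt_lang)
```

A call would then look like translate("google_cloud", "hello", "en", "es"), with no full model identifier involved.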
src/muse/translation/translate.py
Outdated
    project_id: str = "cdh-muse",
    region: str = "us-central1",
    credentials_path: str | None = None,
Remove these inputs, these should be inferred from the active gcloud config
    1. Service account JSON file (via credentials_path parameter)
    2. GOOGLE_APPLICATION_CREDENTIALS environment variable
    3. Application Default Credentials (gcloud auth application-default login)
    """
Rewrite to describe assumed gcloud usage
Move the setup notes to DEVELOPERNOTES.md
Update instructions to use gcloud: https://docs.cloud.google.com/translate/docs/authentication#client-libs
pyproject.toml
Outdated
    "numpy",
    "ipython", # Required by transformers for trainer functionality
    "transformers[torch, sentencepiece, tiktoken]",
    "google-cloud-translate>=3.15.0",
Why this specific version? Only specify specific versions when it is absolutely necessary.
google_translate_results.jsonl
Outdated
Delete this. As we've discussed previously, no data files should be checked into the repo.
Remove this test script since it is not testing any code from the muse package.
src/muse/translation/translate.py
Outdated
    try:
        response = client.translate_text(
            contents=[text],
            target_language_code=tgt_lang,
            source_language_code=src_lang,
            parent=parent,
            model=model_path,
            mime_type="text/plain",
        )
    except Exception as e:
        raise Exception(f"Google Cloud Translate API call failed: {e}") from e
This looks like an unnecessary try/except, remove it.
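A sketch of the call without the wrapper, assuming the same keyword arguments as in the diff. The function name translate_one is hypothetical, and client is expected to be whatever object the PR already constructs; any error raised by the call simply propagates with its original type and traceback.

```python
def translate_one(client, text, src_lang, tgt_lang, parent, model_path):
    """Translate a single string; API errors propagate unchanged to the caller."""
    response = client.translate_text(
        contents=[text],
        target_language_code=tgt_lang,
        source_language_code=src_lang,
        parent=parent,
        model=model_path,
        mime_type="text/plain",
    )
    # One input string was sent, so the first translation is the result.
    return response.translations[0].translated_text
```

Letting the original exception surface keeps its specific type available to callers, which the generic re-raise in the diff would have hidden.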
@tanhaow As we discussed in the whiteboarding session, the google cloud translate code can also be folded into the general translate function. This comment supersedes my earlier comments.
@laurejt Thanks for your review! It's ready for your second review. Changes:
Associated Issue(s): resolves #3 and #4 (The GCP API's performance is stable and doesn't have many errors. I think it's better to integrate GCP translation into the unified translate script that we already have.)
Changes in this PR
- Added google_cloud_translate() function and registered google/translation-llm in the unified translate() interface to support translating text using Google Cloud's Translation LLM (TLLM) model
- Added google-cloud-translate dependency to pyproject.toml
- Added test_scripts/test_google_cloud_translate.py for API infrastructure testing
Notes
- Authentication uses the GOOGLE_APPLICATION_CREDENTIALS environment variable. Follow the instructions in README and configure it before running.
Reviewer Checklist
- export GOOGLE_APPLICATION_CREDENTIALS="cdh-muse-6950d66acf83.json"
- python test_scripts/test_google_cloud_translate.py
- Using the test script (test_scripts/test_translate.py) and the test dataset (test_cases.jsonl; downloaded from Google Drive):
    python test_scripts/test_translate.py google/translation-llm test_cases.jsonl