Feature/add google cloud translate #31
Merged
Commits (10)
- 9a9577e Add Google Cloud Translate API integration with Translation LLM (tanhaow)
- 4f3f188 Delete google_translate_results.jsonl (tanhaow)
- 43f57d0 Add Google Cloud Translate API integration with Translation LLM (tanhaow)
- 10cf85b Rename test script to test_google_api_connectivity.py (tanhaow)
- 67fe2b2 Simplify connectivity test to single unified output (tanhaow)
- ece44b2 Update test_google_api_connectivity.py (tanhaow)
- 25c47f6 Update comments (tanhaow)
- 3738da8 Update test_google_api_connectivity.py (tanhaow)
- 5adcb95 Revise per @laurejt's review (tanhaow)
- d4d5fcc Merge remote-tracking branch 'origin/develop' into feature/add-google… (tanhaow)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| # Developer Notes | ||
|
|
||
| ## Google Cloud Translation Setup | ||
|
|
||
| The MUSE project supports Google Cloud's Translation LLM (TLLM) model for machine translation. This requires Google Cloud CLI (gcloud) setup and authentication. | ||
|
|
||
| ### Prerequisites | ||
|
|
||
| 1. **Install Google Cloud CLI** | ||
|
|
||
| - Follow instructions at: https://cloud.google.com/sdk/docs/install | ||
| - Verify installation: `gcloud --version` | ||
|
|
||
| 2. **Authenticate with Application Default Credentials** | ||
|
|
||
| ```bash | ||
| gcloud auth application-default login | ||
| ``` | ||
|
|
||
| 3. **Set required environment variables** | ||
|
|
||
| ```bash | ||
| export GOOGLE_CLOUD_PROJECT="cdh-muse" | ||
| ``` |
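The prerequisites above can be checked from Python before any translation call is attempted. The following is a minimal sketch, not part of this PR: `check_translation_setup` is a hypothetical helper name, and it only confirms that a `gcloud` executable is on `PATH` and that `GOOGLE_CLOUD_PROJECT` is set. It does not verify that `gcloud auth application-default login` has actually been run.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional


def check_translation_setup(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of setup problems; an empty list means none were found.

    Hypothetical helper: `env` and `which` are injectable so the check can
    be exercised without touching the real environment.
    """
    problems: List[str] = []
    # Step 1 of the notes: the gcloud CLI must be installed and on PATH.
    if which("gcloud") is None:
        problems.append("gcloud CLI not found on PATH")
    # Step 3 of the notes: the project ID must be exported.
    if not env.get("GOOGLE_CLOUD_PROJECT"):
        problems.append("GOOGLE_CLOUD_PROJECT is not set")
    return problems


if __name__ == "__main__":
    for problem in check_translation_setup():
        print(f"Setup problem: {problem}")
```

Passing `env` and `which` explicitly keeps the check deterministic in tests; called with no arguments it inspects the real environment.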
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,11 +5,13 @@ | |
|
|
||
| The translate() function provides a unified interface for translating text across | ||
| multiple models. Model-specific functions (hymt_translate, nllb_translate, | ||
| madlad_translate) are also available for direct use. | ||
| madlad_translate, google_cloud_translate) are also available for direct use. | ||
| """ | ||
|
|
||
| import os | ||
| from timeit import default_timer as timer | ||
|
|
||
| from google.cloud import translate_v3 | ||
| from transformers import ( | ||
| AutoModelForCausalLM, | ||
| AutoModelForSeq2SeqLM, | ||
|
|
@@ -28,6 +30,7 @@ | |
| "tencent/HY-MT1.5-7B": "hymt", | ||
| "facebook/nllb-200-3.3B": "nllb", | ||
| "google/madlad400-7b-mt": "madlad", | ||
| "google/translation-llm": "google_cloud", | ||
| } | ||
|
|
||
|
|
||
|
|
@@ -217,6 +220,72 @@ def madlad_translate( | |
| return tr_text | ||
|
|
||
|
|
||
| def google_cloud_translate( | ||
| src_lang: str, | ||
| tgt_lang: str, | ||
| text: str, | ||
| verbose: bool = False, | ||
| ) -> str: | ||
| """ | ||
| Translate text using Google Cloud Translate API with Translation LLM (TLLM) model. | ||
| Languages are specified with their ISO 639-1 codes (e.g., "zh", "ja", "es", "en"). | ||
|
|
||
| Requires gcloud CLI authentication. See docs/DEVELOPERNOTES.md for setup. | ||
|
|
||
| Args: | ||
| src_lang: Source language ISO 639-1 code | ||
| tgt_lang: Target language ISO 639-1 code | ||
| text: Text to translate from source to target language | ||
| verbose: If True, print timing information | ||
|
|
||
| Returns: | ||
| Translated text as a string | ||
|
|
||
| Raises: | ||
| ValueError: If GOOGLE_CLOUD_PROJECT environment variable is not set | ||
| """ | ||
|
Review comment: Rewrite to describe assumed gcloud usage
||
| project_id = os.environ.get("GOOGLE_CLOUD_PROJECT") | ||
| if not project_id: | ||
| raise ValueError( | ||
| "GOOGLE_CLOUD_PROJECT environment variable is not set. " | ||
| "Set it with: export GOOGLE_CLOUD_PROJECT='cdh-muse'" | ||
| ) | ||
|
|
||
| region = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1") | ||
|
|
||
| if verbose: | ||
| start = timer() | ||
|
|
||
| client = translate_v3.TranslationServiceClient() | ||
|
|
||
| if verbose: | ||
| print( | ||
| f"Initialized Google Cloud Translate client in {timer() - start:.2f} seconds" | ||
| ) | ||
|
|
||
| parent = f"projects/{project_id}/locations/{region}" | ||
| model_path = f"{parent}/models/general/translation-llm" | ||
|
|
||
| if verbose: | ||
| start = timer() | ||
|
|
||
| response = client.translate_text( | ||
| contents=[text], | ||
| target_language_code=tgt_lang, | ||
| source_language_code=src_lang, | ||
| parent=parent, | ||
| model=model_path, | ||
| mime_type="text/plain", | ||
| ) | ||
|
|
||
| if verbose: | ||
| print(f"Received translation response in {timer() - start:.2f} seconds") | ||
|
|
||
| translated_text = response.translations[0].translated_text | ||
|
|
||
| return translated_text | ||
|
|
||
|
|
||
| def translate( | ||
| model: str, | ||
| src_lang: str, | ||
|
|
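The `google_cloud_translate` function in the hunk above builds two resource names and makes a single `translate_text` call. The sketch below isolates those two steps so they can be exercised without network access or credentials. It is an illustration under assumptions, not the merged code: `build_tllm_paths`, `translate_with_client`, and `FakeClient` are hypothetical names, and the merged function constructs its own `TranslationServiceClient` rather than accepting one as a parameter.

```python
from types import SimpleNamespace
from typing import Tuple


def build_tllm_paths(project_id: str, region: str = "us-central1") -> Tuple[str, str]:
    """Return (parent, model_path) in the same format the diff uses."""
    parent = f"projects/{project_id}/locations/{region}"
    return parent, f"{parent}/models/general/translation-llm"


def translate_with_client(client, src_lang, tgt_lang, text, project_id,
                          region="us-central1") -> str:
    """Hypothetical variant of google_cloud_translate with an injected client."""
    parent, model_path = build_tllm_paths(project_id, region)
    response = client.translate_text(
        contents=[text],
        target_language_code=tgt_lang,
        source_language_code=src_lang,
        parent=parent,
        model=model_path,
        mime_type="text/plain",
    )
    # One input string was sent, so the first translation is the result.
    return response.translations[0].translated_text


class FakeClient:
    """Stand-in for the real client; records calls and echoes the input."""

    def __init__(self):
        self.calls = []

    def translate_text(self, **kwargs):
        self.calls.append(kwargs)
        echoed = f"[{kwargs['target_language_code']}] {kwargs['contents'][0]}"
        return SimpleNamespace(
            translations=[SimpleNamespace(translated_text=echoed)]
        )


if __name__ == "__main__":
    fake = FakeClient()
    print(translate_with_client(fake, "zh", "en", "你好", "cdh-muse"))
    print(fake.calls[0]["model"])
```

Injecting the client is one way to unit-test the request shape; whether the project wants that extra parameter is a design decision this PR does not address.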
@@ -225,23 +294,24 @@ def translate( | |
| verbose: bool = False, | ||
| ) -> str: | ||
| """ | ||
| Translate text using a specified HuggingFace translation model. This function | ||
| provides a unified interface for translating text across multiple translation | ||
| models by routing to the appropriate model-specific implementation based on the | ||
| model parameter. | ||
| Translate text using a specified translation model. This function provides a | ||
| unified interface for translating text across multiple translation models by | ||
| routing to the appropriate model-specific implementation based on the model | ||
| parameter. | ||
|
|
||
| Supported models: | ||
| - tencent/HY-MT1.5-7B: Tencent's Hunyuan Translation Model v1.5 (7B) | ||
| - facebook/nllb-200-3.3B: Meta's No Language Left Behind (3.3B) | ||
| - google/madlad400-7b-mt: Google's MADLAD-400 (7B) | ||
| - google/translation-llm: Google Cloud Translation LLM (TLLM) | ||
|
|
||
| Languages are specified using ISO 639-1 codes (e.g., "zh", "ja", "es", "en"). | ||
| Language validation is delegated to the model-specific functions, so supported | ||
| languages vary by model. The MADLAD model does not use the source language | ||
| parameter internally, but it is accepted for API consistency. | ||
|
|
||
| Args: | ||
| model: HuggingFace model identifier (must be one of the supported models) | ||
| model: Model identifier (must be one of the supported models) | ||
| src_lang: Source language ISO 639-1 code | ||
| tgt_lang: Target language ISO 639-1 code | ||
| text: Text to translate from source to target language | ||
|
|
@@ -269,6 +339,8 @@ def translate( | |
| elif model_type == "madlad": | ||
| # MADLAD does not use src_lang parameter | ||
| return madlad_translate(tgt_lang, text, model, verbose) | ||
| elif model_type == "google_cloud": | ||
| return google_cloud_translate(src_lang, tgt_lang, text, verbose=verbose) | ||
| else: | ||
| # This should never happen if SUPPORTED_MODELS is correctly maintained | ||
| raise ValueError(f"Unknown model type: {model_type}") | ||
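To make the routing in `translate()` concrete, here is a self-contained sketch of the same idea: a model identifier is mapped to a model type, and the type selects a handler. The `SUPPORTED_MODELS` entries are copied from the diff; everything else is an assumption for illustration. The handlers are stubs with a uniform signature (the real functions differ, for example `madlad_translate` takes no source language), `route` is a hypothetical name, and the error raised for an unsupported model is a guess, since that part of `translate()` is not shown in the diff.

```python
from typing import Callable, Dict

# Copied from the diff: model identifier -> model type.
SUPPORTED_MODELS: Dict[str, str] = {
    "tencent/HY-MT1.5-7B": "hymt",
    "facebook/nllb-200-3.3B": "nllb",
    "google/madlad400-7b-mt": "madlad",
    "google/translation-llm": "google_cloud",
}


def _stub(name: str) -> Callable[..., str]:
    """Build a stand-in handler that only reports how it was called."""
    def handler(src_lang: str, tgt_lang: str, text: str, verbose: bool = False) -> str:
        return f"{name}:{src_lang}->{tgt_lang}:{text}"
    return handler


# Hypothetical handler table; the real code uses an if/elif chain.
HANDLERS: Dict[str, Callable[..., str]] = {
    model_type: _stub(model_type) for model_type in SUPPORTED_MODELS.values()
}


def route(model: str, src_lang: str, tgt_lang: str, text: str,
          verbose: bool = False) -> str:
    """Look up the model type and call the matching stub handler."""
    model_type = SUPPORTED_MODELS.get(model)
    if model_type is None:
        # Assumed behavior for unsupported identifiers.
        raise ValueError(f"Unsupported model: {model}")
    return HANDLERS[model_type](src_lang, tgt_lang, text, verbose=verbose)


if __name__ == "__main__":
    print(route("google/translation-llm", "zh", "en", "hello"))
```

A dictionary lookup trades the explicit `elif` branches for a table. The merged code keeps the chain, which makes the per-model signature differences easier to see.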
Review comment:
- Move the setup notes to DEVELOPERNOTES.md
- Update instructions to use gcloud: https://docs.cloud.google.com/translate/docs/authentication#client-libs