iriis-research/translation-pipeline

Environment Setup

Step I:  `python -m venv venv`

Step II:  activate the environment (e.g. `source venv/bin/activate` on Unix)

Step III:  `pip install -r requirements.txt`

.env file

  • Get a GEMINI_API_KEY from Google AI Studio.

  • Create a .env file in the repository root and add the key as GEMINI_API_KEY.
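The .env file is a plain KEY=VALUE text file. A minimal sketch of reading it by hand (the pipeline may instead use a library such as python-dotenv; `load_env` is a hypothetical helper for illustration):

```python
# Minimal sketch of reading GEMINI_API_KEY from a .env file.
# Assumes simple KEY=VALUE lines; '#' lines are treated as comments.
def load_env(path=".env"):
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env
```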

Translation Dataset

Place the data you want to translate in the data directory, in JSON format.

Example data:

[
    {
        "query": "cruise portland maine",
        "ad_title": "New England Cruises",
        "ad_description": "Your New England Cruise Awaits! Holland America Line Official Site.",
        "relevance_label": 1
    },
    {
        "query": "transportation to cruise port miami",
        "ad_title": "Holland America Line\u00ae",
        "ad_description": "Explore Your World with Four Extraordinary Offers.",
        "relevance_label": 0
    },
    {
        "query": "transportation to cruise port miami",
        "ad_title": "Holland America Line\u00ae",
        "ad_description": "Cruise to Your Own Private Island In the Caribbean. Learn More Now.",
        "relevance_label": 1
    }
]
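Before translating, a data file can be sanity-checked against the shape shown above (the field names follow the QADSM example; other datasets may use different ones, and `validate_dataset` is a hypothetical helper, not part of the repo):

```python
import json

# Fields from the QADSM example above; other datasets may differ.
REQUIRED_FIELDS = {"query", "ad_title", "ad_description", "relevance_label"}

def validate_dataset(path):
    """Load a JSON data file and check it is a list whose elements
    carry the expected fields."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list), "top-level structure must be a JSON list"
    for i, item in enumerate(data):
        missing = REQUIRED_FIELDS - item.keys()
        assert not missing, f"element {i} is missing fields: {missing}"
    return data
```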

***A notebook for the QADSM data is available in `data/QADSM.ipynb`***

You can load data in this format from Hugging Face.

Fields may differ depending on the dataset.

If your data points have different fields, create a new branch before working on them.

The JSON file should contain at most 50*250 = 12500 elements, because the rate limit for Gemini Flash is currently 250 requests per day.
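That cap can be enforced before kicking off a run. A sketch, assuming 50 elements per request as the 50*250 figure implies (`check_size` is a hypothetical helper, not part of the repo):

```python
import json

# 250 requests per day * 50 elements per request, per the limit above.
MAX_ELEMENTS = 50 * 250  # = 12500

def check_size(path):
    """Refuse data files that exceed the daily Gemini Flash budget."""
    with open(path) as f:
        data = json.load(f)
    if len(data) > MAX_ELEMENTS:
        raise ValueError(f"{len(data)} elements exceeds the {MAX_ELEMENTS} limit")
    return len(data)
```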

The main branch is the default branch and it is for QADSM dataset.

For a different dataset

  1. First create a new branch

    • git checkout -b <dataset_name>
  2. Add the new prompt to system_instruction.yaml

    • Name it: gemini_translation_system_instruction_<dataset_name>
    • Little needs to change, usually just the field names and output format.
  3. Then change the system instruction used in the create_gemini_prompt function in the util/utils.py file, line 19.
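The steps above boil down to selecting the instruction by its per-dataset name. A sketch, assuming PyYAML and the naming convention from step 2 (`load_system_instruction` is a hypothetical helper; the real code hardcodes the key inside create_gemini_prompt):

```python
import yaml  # assumes PyYAML is installed

def load_system_instruction(dataset_name, path="system_instruction.yaml"):
    """Look up the per-dataset prompt by the naming convention
    gemini_translation_system_instruction_<dataset_name>."""
    with open(path) as f:
        instructions = yaml.safe_load(f)
    return instructions[f"gemini_translation_system_instruction_{dataset_name}"]
```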


Running Batch inference

Run gemini/gemini_calls.ipynb

Before running, update the INPUT_PATH and OUTPUT_PATH:

INPUT_PATH should have the form: data/<dataset_name>/<data_file_name>.json
OUTPUT_PATH should have the form: translated_output/<dataset_name>/<output_file_name>.json

NOTE: DO NOT CHANGE THE BATCH SIZE FROM 500, because the API cannot handle more than 600-800 elements per batch; 500 leaves a safety margin.
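The 500-element batching can be sketched as a simple chunking helper (`batches` is hypothetical; the loop in gemini/gemini_calls.ipynb may be structured differently):

```python
BATCH_SIZE = 500  # per the note above: the API fails somewhere past 600-800

def batches(items, size=BATCH_SIZE):
    """Yield consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```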

Pushing the code

git push --set-upstream origin <branch_name>

  • if you are pushing for the first time.

git push

  • for subsequent pushes.

Push to the branch after the translated JSON is saved in translated_output/<dataset_name>/<output_file_name>.json.

Leave each branch as it is; do not merge!
