iriis-research/translation-pipeline

Environment Setup

Step I:  `python -m venv venv`

Step II:  activate the environment (e.g. `source venv/bin/activate` on Unix)

Step III:  `pip install -r requirements.txt`

.env file

  • Get a GEMINI_API_KEY from Google AI Studio.

  • Create a .env file in the repository root and add the key as GEMINI_API_KEY.
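The .env file is a plain KEY=VALUE text file. A minimal sketch of reading it by hand (the pipeline may instead use a library such as python-dotenv; `load_env` is a hypothetical helper for illustration):

```python
# Minimal sketch of reading GEMINI_API_KEY from a .env file.
# Assumes simple KEY=VALUE lines; '#' lines are treated as comments.
def load_env(path=".env"):
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip()
    return env
```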

Translation Dataset

Place the data you want to translate in the data directory, in JSON format.

Example data:

[
    {
        "query": "cruise portland maine",
        "ad_title": "New England Cruises",
        "ad_description": "Your New England Cruise Awaits! Holland America Line Official Site.",
        "relevance_label": 1
    },
    {
        "query": "transportation to cruise port miami",
        "ad_title": "Holland America Line\u00ae",
        "ad_description": "Explore Your World with Four Extraordinary Offers.",
        "relevance_label": 0
    },
    {
        "query": "transportation to cruise port miami",
        "ad_title": "Holland America Line\u00ae",
        "ad_description": "Cruise to Your Own Private Island In the Caribbean. Learn More Now.",
        "relevance_label": 1
    }
]
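Before translating, a data file can be sanity-checked against the shape shown above (the field names follow the QADSM example; other datasets may use different ones, and `validate_dataset` is a hypothetical helper, not part of the repo):

```python
import json

# Fields from the QADSM example above; other datasets may differ.
REQUIRED_FIELDS = {"query", "ad_title", "ad_description", "relevance_label"}

def validate_dataset(path):
    """Load a JSON data file and check it is a list whose elements
    carry the expected fields."""
    with open(path) as f:
        data = json.load(f)
    assert isinstance(data, list), "top-level structure must be a JSON list"
    for i, item in enumerate(data):
        missing = REQUIRED_FIELDS - item.keys()
        assert not missing, f"element {i} is missing fields: {missing}"
    return data
```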

***A notebook for the QADSM data is available in `data/QADSM.ipynb`***

You can load data in this format from Hugging Face.

Fields may differ depending on the dataset.

If your data points have different fields, create a new branch before working on them.

The JSON file should contain at most 50*250 = 12500 elements, because the rate limit for Gemini Flash is currently 250 requests per day.
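That cap can be enforced before kicking off a run. A sketch, assuming 50 elements per request as the 50*250 figure implies (`check_size` is a hypothetical helper, not part of the repo):

```python
import json

# 250 requests per day * 50 elements per request, per the limit above.
MAX_ELEMENTS = 50 * 250  # = 12500

def check_size(path):
    """Refuse data files that exceed the daily Gemini Flash budget."""
    with open(path) as f:
        data = json.load(f)
    if len(data) > MAX_ELEMENTS:
        raise ValueError(f"{len(data)} elements exceeds the {MAX_ELEMENTS} limit")
    return len(data)
```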

The main branch is the default branch and it is for QADSM dataset.

For a different dataset

  1. First create a new branch

    • git checkout -b <dataset_name>
  2. Add the new prompt to system_instruction.yaml

    • Name it: gemini_translation_system_instruction_<dataset_name>
    • Little needs to change, usually just the field names and output format.
  3. Then change the system instruction used in the create_gemini_prompt function in the util/utils.py file, line 19.
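The steps above boil down to selecting the instruction by its per-dataset name. A sketch, assuming PyYAML and the naming convention from step 2 (`load_system_instruction` is a hypothetical helper; the real code hardcodes the key inside create_gemini_prompt):

```python
import yaml  # assumes PyYAML is installed

def load_system_instruction(dataset_name, path="system_instruction.yaml"):
    """Look up the per-dataset prompt by the naming convention
    gemini_translation_system_instruction_<dataset_name>."""
    with open(path) as f:
        instructions = yaml.safe_load(f)
    return instructions[f"gemini_translation_system_instruction_{dataset_name}"]
```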


Running Batch inference

Run gemini/gemini_calls.ipynb

Before running, update the INPUT_PATH and OUTPUT_PATH:

INPUT_PATH should have the form: data/<dataset_name>/<data_file_name>.json
OUTPUT_PATH should have the form: translated_output/<dataset_name>/<output_file_name>.json

NOTE: DO NOT CHANGE THE BATCH SIZE FROM 500, because the API cannot handle more than 600-800 elements per batch; 500 leaves a safety margin.
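The 500-element batching can be sketched as a simple chunking helper (`batches` is hypothetical; the loop in gemini/gemini_calls.ipynb may be structured differently):

```python
BATCH_SIZE = 500  # per the note above: the API fails somewhere past 600-800

def batches(items, size=BATCH_SIZE):
    """Yield consecutive chunks of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```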

Pushing the code

git push --set-upstream origin <branch_name>

  • if you are pushing for the first time.

git push

  • for subsequent pushes.

Push to the branch after the translated JSON is saved in translated_output/<dataset_name>/<output_file_name>.json.

Leave each branch as it is; do not merge!
