
Run Whisper training with Google Cloud buckets#70

Open
huwenjie333 wants to merge 16 commits into main from whisper_gcp

Conversation

@huwenjie333

@huwenjie333 huwenjie333 commented Mar 6, 2026

This PR adds support for Google Cloud buckets to the Whisper training pipeline, along with several other changes:

  • load the parquet datasets from a gcs:// path with datasets.load_dataset and cast the audio column to the datasets.Audio format.
  • create a setup shell script that installs the dependencies and configures the Google Cloud credentials.
  • load modules such as salt.datasets from the current repo instead of from https://github.com/jqug/salt.git
  • move the YAML config from the training notebook into a separate file
  • fix several training errors with the following changes:
    • use BF16 instead of FP16 for training
    • set gradient_checkpointing=False
    • add torch_dtype=torch.float32 when loading the model weights
    • update model.generation_config to satisfy the requirements of the new library version.

Overfit experiment

An overfit experiment with just 100 examples was run to verify the changes:
MLflow run 1 with evaluation metrics: https://mlflow-sunbird-ce0ecfc14244.herokuapp.com/#/experiments/0/runs/2d488acdc39146e9af9da07c00128d49/model-metrics
MLflow run 2 with GPU utilization: https://mlflow.sunbird.ai/#/experiments/0/runs/811bbdf051f44597bd90c3376cfc9309/system-metrics

(Screenshots of the training runs, 2026-03-06 and 2026-03-05)

TODO

  • we need to update the salt.constants.SALT_LANGUAGE_TOKENS_WHISPER to support new languages. Currently we only have the following:
SALT_LANGUAGE_TOKENS_WHISPER = {
    # Exact/close mapping
    'eng': 50259,
    'swa': 50318,
    # Overwrite unused language tokens
    'ach': 50357,
    'lgg': 50356,
    'lug': 50355,
    'nyn': 50354,
    'teo': 50353,
    'xog': 50352,
    'ttj': 50351,
    'kin': 50350,
    'myx': 50349,
    'kik': 50348,
}
  • currently each evaluation step takes 3-4 minutes; I'm not sure whether that is expected

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@huwenjie333 huwenjie333 changed the title [WIP] Run Whisper training with Google Cloud buckets Run Whisper training with Google Cloud buckets Mar 6, 2026
@huwenjie333 huwenjie333 requested review from ak3ra, evie-8 and jqug March 6, 2026 12:02
Contributor

@jqug left a comment


Thanks for this, looks good.
Just one thing, let's take out the gcloud auth for now and maybe mention in a comment in the file that this may be necessary.

sudo apt-get install -y google-cloud-cli

gcloud init
gcloud auth application-default login
Contributor


If we load from a public GCS bucket, is this step necessary?
We could include in the comments for the file that these gcloud lines should be run in order to access nonpublic buckets (but should only be done on trusted machines).

@ak3ra
Contributor

ak3ra commented Mar 9, 2026

We should consider merging this notebook into the dedicated sunbird-speech repo; I have been working on refactoring it here: https://github.com/SunbirdAI/sunbird-speech

gradient_accumulation_steps: 4
learning_rate: 1.0e-05
warmup_steps: 500
max_steps: 7500
Contributor


Maybe you already caught this, but I think this argument is why the training time looked like it would only be ~13 hours: training stops prematurely after this number of steps. (With our previous dataset this was about 5 epochs, after which the model converged pretty well.)



3 participants