
Run Whisper training with Google Cloud buckets#70

Open
huwenjie333 wants to merge 16 commits into main from whisper_gcp

Conversation

@huwenjie333

@huwenjie333 huwenjie333 commented Mar 6, 2026

This PR adds support for Google Cloud buckets to the Whisper training pipeline, along with several other changes:

  • load the parquet datasets from a gcs:// path with datasets.load_dataset and cast the audio column to the datasets.Audio format.
  • create a setup shell script that installs the dependencies and configures the Google Cloud credentials.
  • load modules such as salt.datasets from the current repo instead of from https://github.com/jqug/salt.git
  • move the YAML config from the training notebook into a separate file
  • fix several training errors with the following changes:
    • use BF16 instead of FP16 for training
    • set gradient_checkpointing=False
    • add torch_dtype=torch.float32 when loading the model weights
    • update model.generation_config to satisfy the requirements of the new library version.

Overfit experiment

An overfit experiment with just 100 examples was run to verify the changes:
MLflow run 1 with evaluation metrics: https://mlflow-sunbird-ce0ecfc14244.herokuapp.com/#/experiments/0/runs/2d488acdc39146e9af9da07c00128d49/model-metrics
MLflow run 2 with GPU utilization: https://mlflow.sunbird.ai/#/experiments/0/runs/811bbdf051f44597bd90c3376cfc9309/system-metrics

(Screenshots of the training runs, 2026-03-06 and 2026-03-05)

TODO

  • we need to update the salt.constants.SALT_LANGUAGE_TOKENS_WHISPER to support new languages. Currently we only have the following:
SALT_LANGUAGE_TOKENS_WHISPER = {
    # Exact/close mapping
    'eng': 50259,
    'swa': 50318,
    # Overwrite unused language tokens
    'ach': 50357,
    'lgg': 50356,
    'lug': 50355,
    'nyn': 50354,
    'teo': 50353,
    'xog': 50352,
    'ttj': 50351,
    'kin': 50350,
    'myx': 50349,
    'kik': 50348,
}
  • currently each evaluation step takes 3-4 minutes; I'm not sure whether that is expected

@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@huwenjie333 huwenjie333 changed the title [WIP] Run Whisper training with Google Cloud buckets Run Whisper training with Google Cloud buckets Mar 6, 2026
@huwenjie333 huwenjie333 requested review from ak3ra, evie-8 and jqug March 6, 2026 12:02
Contributor

@jqug left a comment


Thanks for this, looks good.
Just one thing, let's take out the gcloud auth for now and maybe mention in a comment in the file that this may be necessary.

sudo apt-get install -y google-cloud-cli

gcloud init
gcloud auth application-default login
Contributor


If we load from a public GCS bucket, is this step necessary?
We could include in the comments for the file that these gcloud lines should be run in order to access nonpublic buckets (but should only be done on trusted machines).

@ak3ra
Contributor

ak3ra commented Mar 9, 2026

We should consider merging this notebook into the dedicated sunbird-speech repo; I have been working on refactoring it here: https://github.com/SunbirdAI/sunbird-speech

gradient_accumulation_steps: 4
learning_rate: 1.0e-05
warmup_steps: 500
max_steps: 7500
Contributor


Maybe you already caught this, but I think this argument is why the training time looked like it would only be ~13 hours: training stops prematurely after this number of steps. (With our previous dataset this was about 5 epochs, after which the model converged pretty well.)



3 participants