Adding Classic InstructLab Multi-Phase Training Hardware Configs to Examples #3
Maxusmusti wants to merge 3 commits into
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@@ -0,0 +1,16 @@
# InstructLab Multi-Phase Training Configurations
I'd suggest we start grouping the examples under a sub-directory, for example `examples/training_hub`, to better organize the examples from different parts of the product.
yeah, I think that makes sense. One question is whether we will have other components of the instructlab pipeline / notebooks in this repo. @aditisaluja5 was a decision made as to whether the full end-to-end pipeline example will live here? And if so, will that be as a series of notebooks (including the training notebook), or a single all-encompassing notebook?
If it is just the training configs living here, then I think @astefanutti's suggestions are best, where we can have a training_hub sub-directory and include the training notebook itself in there as well. If we are including the full pipeline in here, however, then we should probably create an instructlab sub-directory, add all of the step notebooks in there (data prep, sdg, training, eval, etc.) and have this config notebook live alongside them. LMK what the plan is and I can proceed accordingly
Sorry for the late response. The end-to-end examples will also live in this repo.
It looks like we are leaning toward the latter option. Would `examples/ai_hub` be more appropriate as a sub-directory than `instructlab`?
{
"cell_type": "markdown",
"metadata": {},
"source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations are optimized for `granite-3.1-8b-starter-v2.1`\n\nEach configuration below specifies the optimal parameters for different GPU setups, including:\n- `max_tokens_per_gpu`: Memory limit per GPU to prevent OOM errors\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: FSDP CPU offloading configuration for memory optimization\n\n**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware."
It would be better to have lab_multiphase_training_tutorial.ipynb added here so the example is standalone. WDYT?
It might make sense to have the end-to-end InstructLab pipeline example here instead of just the training example.
See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including:

- **H200**: 1x, 2x, 4x, 8x GPU configurations
Is this always single-node, or is multi-node supported? It would be useful to mention that explicitly.
Good point, added a disclaimer to the README
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@@ -0,0 +1,16 @@
# InstructLab Multi-Phase Training Configurations

Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.

Note: The values are all set assuming a single node with the above GPU resources. For multi-node, note that the default sharding strategy is FSDP [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Begin with the cooresponding settings that align with one of your given nodes, or switch to `FULL_SHARD` if required due to memory constraints.
nit: "cooresponding" -> "corresponding"
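The multi-node guidance in the README note above can be sketched as a small decision helper. This is only an illustration: `choose_sharding_strategy` and its `hit_oom_with_hybrid` argument are hypothetical names, not part of the notebook or of any training library. The returned strings mirror the `HYBRID_SHARD` and `FULL_SHARD` members of `torch.distributed.fsdp.ShardingStrategy`, and plain strings are used so the sketch runs without PyTorch installed.

```python
# Strategy names mirror members of torch.distributed.fsdp.ShardingStrategy.
HYBRID_SHARD = "HYBRID_SHARD"
FULL_SHARD = "FULL_SHARD"


def choose_sharding_strategy(num_nodes: int, hit_oom_with_hybrid: bool = False) -> str:
    """Hypothetical helper: pick a sharding strategy name per the README note.

    The published values assume a single node, so the default is left alone
    there. On multiple nodes, start from the default (HYBRID_SHARD) and fall
    back to FULL_SHARD only if memory constraints require it.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes > 1 and hit_oom_with_hybrid:
        return FULL_SHARD
    return HYBRID_SHARD
```

In this sketch the per-node settings (`max_tokens_per_gpu`, `nproc_per_node`, and so on) would still come from the configuration matching a single node, as the note describes.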
Feature/refactor
@coderabbitai full review
✅ Actions performed: Full review triggered.
📝 Walkthrough
Adds documentation and a Jupyter notebook for InstructLab Multi-Phase Training Configurations, providing hardware-specific GPU setup guidance, including configurations for various GPUs (H200, H100, A100, L40S, L4) with memory-optimized settings and FSDP parameters.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks: ✅ 3 passed
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 Ruff (0.15.5), examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb: Unexpected end of JSON input
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb`:
- Around line 147-153: The H100 1x GPU Configuration block is incomplete and not
runnable: fill in realistic, tested values for max_tokens_per_gpu (replace 0),
define max_seq_len, and set cpu_offload_params to the validated boolean (or
explicit tuning value) so the snippet can be copy/pasted (e.g., set
max_tokens_per_gpu to a non-zero value, add max_seq_len with the intended
sequence length, leave nproc_per_node as 1 if correct); alternatively replace
the entire H100 1x GPU Configuration section with a clear non-runnable note
stating "not yet validated" so users don't try to execute it.
📒 Files selected for processing (2)
examples/instructlab-multiphase-configs/README.md
examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb
"source": [
"# H100 1x GPU Configuration\n",
"max_tokens_per_gpu = 0 # TO BE FILLED\n",
"nproc_per_node = 1\n",
"\n",
"# FSDP CPU offloading configuration\n",
"cpu_offload_params = False # TO BE FILLED - True to enable CPU offloading, False to disable"
Remove or complete the unfinished H100 1x config before merging.
Line 149 publishes max_tokens_per_gpu = 0, and Lines 147-153 never define max_seq_len. That makes this block unusable as a copy/paste example and can break outside a stateful notebook session. Either add tested values for all parameters or replace this with a non-runnable “not yet validated” note.
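One way to catch placeholder blocks like this before they are published is a small completeness check. The following is a minimal sketch, not part of the PR: `find_config_problems` and `REQUIRED_KEYS` are hypothetical names, and the set of required parameters is an assumption taken from the four settings the README lists (`max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and the CPU offloading flag).

```python
from typing import Any

# Assumed required parameters, taken from the settings named in the README.
REQUIRED_KEYS = ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node", "cpu_offload_params")
POSITIVE_INT_KEYS = ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node")


def find_config_problems(config: dict[str, Any]) -> list[str]:
    """Hypothetical check: return human-readable problems, empty if none."""
    problems = []
    # Every required parameter must be defined at all.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"{key} is not defined")
    # Numeric parameters must be real positive integers (0 marks a placeholder).
    for key in POSITIVE_INT_KEYS:
        if key in config:
            value = config[key]
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"{key} must be a positive integer, got {value!r}")
    # The offloading flag is a plain boolean in the notebook cells.
    if "cpu_offload_params" in config and not isinstance(config["cpu_offload_params"], bool):
        problems.append("cpu_offload_params must be True or False")
    return problems


# The H100 1x block exactly as published in the diff above.
h100_1x = {"max_tokens_per_gpu": 0, "nproc_per_node": 1, "cpu_offload_params": False}
problems = find_config_problems(h100_1x)
```

Run against the H100 1x block, the check flags the two issues raised in this review: the missing `max_seq_len` and the zero `max_tokens_per_gpu`.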
Adds the existing set of values for running the InstructLab multi-phase training pipeline on various GPU hardware configurations. Adds a notebook that references our multi-phase cookbook and includes both the configurations themselves and how to use them in the multi-phase cookbooks.