Adding Classic InstructLab Multi-Phase Training Hardware Configs to Examples #3

Open
Maxusmusti wants to merge 3 commits into red-hat-data-services:main from Maxusmusti:main

@Maxusmusti

Adds the existing set of values for running the InstructLab multi-phase training pipeline on various GPU hardware configurations. The new notebook references our multiphase cookbook and covers both the configurations themselves and how to use them in the cookbook.

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@@ -0,0 +1,16 @@
# InstructLab Multi-Phase Training Configurations

I'd suggest we start grouping the examples under a sub-directory, for example `examples/training_hub`, to better organize examples from different parts of the product.

Author

yeah, I think that makes sense. One question is whether we will have other components of the instructlab pipeline / notebooks in this repo. @aditisaluja5 was a decision made as to whether the full end-to-end pipeline example will live here? And if so, will that be as a series of notebooks (including the training notebook), or a single all-encompassing notebook?

If it is just the training configs living here, then I think @astefanutti's suggestions are best, where we can have a training_hub sub-directory and include the training notebook itself in there as well. If we are including the full pipeline in here, however, then we should probably create an instructlab sub-directory, add all of the step notebooks in there (data prep, sdg, training, eval, etc.) and have this config notebook live alongside them. LMK what the plan is and I can proceed accordingly

Sorry for the late response. The end-to-end examples will also live in this repo.

It looks like we are leaning toward the latter option. Would `examples/ai_hub` be a more appropriate sub-directory than `instructlab`?

{
"cell_type": "markdown",
"metadata": {},
"source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations are optimized for `granite-3.1-8b-starter-v2.1`\n\nEach configuration below specifies the optimal parameters for different GPU setups, including:\n- `max_tokens_per_gpu`: Memory limit per GPU to prevent OOM errors\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: FSDP CPU offloading configuration for memory optimization\n\n**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware."

It would be better to have lab_multiphase_training_tutorial.ipynb added here so the example is standalone. WDYT?

It might make sense to have the end-to-end InstructLab pipeline example here instead of just the training example.

Author

also related to above comment: #3 (comment)
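
To make the parameters discussed in this thread concrete, here is a minimal sketch of how the per-hardware values from the notebook could be bundled and merged into the keyword arguments of a training call. `HardwareConfig`, `to_training_kwargs`, and the `ckpt_output_dir` key are hypothetical names used only for illustration, and the numbers are placeholders, not the validated values from the notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class HardwareConfig:
    """Hardware-specific parameters named in the notebook (hypothetical wrapper)."""
    max_tokens_per_gpu: int   # per-GPU token budget, tuned to avoid OOM errors
    max_seq_len: int          # maximum sequence length in tokens
    nproc_per_node: int       # number of GPU processes on the node
    cpu_offload_params: bool  # True to enable FSDP CPU offloading


def to_training_kwargs(config: HardwareConfig, **common) -> dict:
    """Merge hardware-specific values with settings shared across hardware."""
    overlap = set(common) & set(asdict(config))
    if overlap:
        # Refuse to silently override a tuned hardware value.
        raise ValueError(f"keys defined twice: {sorted(overlap)}")
    return {**common, **asdict(config)}


# Placeholder numbers for illustration only; copy real values from the notebook.
example = HardwareConfig(
    max_tokens_per_gpu=10_000,
    max_seq_len=8_192,
    nproc_per_node=2,
    cpu_offload_params=False,
)
kwargs = to_training_kwargs(example, ckpt_output_dir="./checkpoints")
print(kwargs)
```

Keeping the hardware values in one frozen object means a different GPU setup can be swapped in without touching the shared settings.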


See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including:

- **H200**: 1x, 2x, 4x, 8x GPU configurations

Is this always single-node, or is multi-node supported? It would be useful to mention that explicitly.

Author

Good point, added a disclaimer to the README

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>
@@ -0,0 +1,16 @@
# InstructLab Multi-Phase Training Configurations

Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.

Note: The values are all set assuming a single node with the above GPU resources. For multi-node, note that the default sharding strategy is FSDP [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Begin with the cooresponding settings that align with one of your given nodes, or switch to `FULL_SHARD` if required due to memory constraints.

nit: "cooresponding" -> "corresponding"
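
The multi-node guidance in the README note can be sketched as a small decision helper. This is only an illustration under stated assumptions: `ShardingChoice` and `choose_sharding` are hypothetical names, the enum members simply mirror the `HYBRID_SHARD` and `FULL_SHARD` members of `torch.distributed.fsdp.ShardingStrategy`, and whether a run is memory constrained is something the user has to determine.

```python
from enum import Enum


class ShardingChoice(Enum):
    # Names mirror members of torch.distributed.fsdp.ShardingStrategy.
    HYBRID_SHARD = "HYBRID_SHARD"
    FULL_SHARD = "FULL_SHARD"


def choose_sharding(num_nodes: int, memory_constrained: bool) -> ShardingChoice:
    """Follow the README note: keep the default unless multi-node memory is tight."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes > 1 and memory_constrained:
        # Shard across all nodes instead of replicating across them.
        return ShardingChoice.FULL_SHARD
    # Default per the README: shard within a node, replicate across nodes.
    return ShardingChoice.HYBRID_SHARD


for nodes, constrained in [(1, False), (2, False), (2, True)]:
    print(nodes, constrained, choose_sharding(nodes, constrained).name)
```

The helper returns a plain enum so it runs without PyTorch installed; mapping the result onto the real strategy object is left to the training script.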

tarun-etikala referenced this pull request in tarun-e/red-hat-ai-examples Nov 10, 2025
@jiridanek
Member

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Mar 12, 2026

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai Bot commented Mar 12, 2026

📝 Walkthrough

Adds documentation and a Jupyter notebook for InstructLab Multi-Phase Training Configurations, providing hardware-specific GPU setup guidance including configurations for various GPUs (H200, H100, A100, L40S, L4) with memory-optimized settings and FSDP parameters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation & Tutorial Files**<br>`examples/instructlab-multiphase-configs/README.md`, `examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb` | Adds README describing InstructLab Multi-Phase Training configurations with Quick Start guidance and hardware-specific setups. Includes Jupyter notebook with FSDP CPU offload configuration reference, integration examples with `training_hub.sft` and `FSDPOptions`, and comprehensive configuration blocks for multiple GPU architectures (H200, H100, A100, L40S) with tuning parameters (`max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, `cpu_offload_params`). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding InstructLab multi-phase training hardware configurations to the examples directory, which directly corresponds to the added README and Jupyter notebook files. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining that existing hardware configuration values are being added for various GPU setups along with a notebook demonstrating their usage in the multiphase cookbooks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.5)
examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb

Unexpected end of JSON input


@jiridanek
Member

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Mar 12, 2026

✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb`:
- Around line 147-153: The H100 1x GPU Configuration block is incomplete and not
runnable: fill in realistic, tested values for max_tokens_per_gpu (replace 0),
define max_seq_len, and set cpu_offload_params to the validated boolean (or
explicit tuning value) so the snippet can be copy/pasted (e.g., set
max_tokens_per_gpu to a non-zero value, add max_seq_len with the intended
sequence length, leave nproc_per_node as 1 if correct); alternatively replace
the entire H100 1x GPU Configuration section with a clear non-runnable note
stating "not yet validated" so users don't try to execute it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ecc3c6b4-bed9-4a5f-a7d3-4f1c88262851

📥 Commits

Reviewing files that changed from the base of the PR and between 37af793 and 52debb4.

📒 Files selected for processing (2)
  • examples/instructlab-multiphase-configs/README.md
  • examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb

Comment on lines +147 to +153
"source": [
"# H100 1x GPU Configuration\n",
"max_tokens_per_gpu = 0 # TO BE FILLED\n",
"nproc_per_node = 1\n",
"\n",
"# FSDP CPU offloading configuration\n",
"cpu_offload_params = False # TO BE FILLED - True to enable CPU offloading, False to disable"

⚠️ Potential issue | 🟠 Major

Remove or complete the unfinished H100 1x config before merging.

Line 149 publishes max_tokens_per_gpu = 0, and Lines 147-153 never define max_seq_len. That makes this block unusable as a copy/paste example and can break outside a stateful notebook session. Either add tested values for all parameters or replace this with a non-runnable “not yet validated” note.
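
One way to catch a block like this before it is copied into a training script is a small pre-flight check. The sketch below is an assumption about how such a check might look, not part of the notebook: `REQUIRED_KEYS` and `find_config_problems` are hypothetical names, and the `h100_1x` dictionary simply restates the three assignments from the flagged cell.

```python
from typing import List

# Parameters the README says every configuration should define.
REQUIRED_KEYS = ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node", "cpu_offload_params")


def find_config_problems(config: dict) -> List[str]:
    """Return human-readable problems; an empty list means the block looks usable."""
    problems = [f"{key} is not defined" for key in REQUIRED_KEYS if key not in config]
    for key in ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node"):
        value = config.get(key)
        if value is None:
            continue  # already reported as missing above
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    offload = config.get("cpu_offload_params")
    if offload is not None and not isinstance(offload, bool):
        problems.append(f"cpu_offload_params must be True or False, got {offload!r}")
    return problems


# Values exactly as they appear in the flagged H100 1x cell.
h100_1x = {"max_tokens_per_gpu": 0, "nproc_per_node": 1, "cpu_offload_params": False}
for problem in find_config_problems(h100_1x):
    print(problem)
```

Run against the flagged cell, the check reports the missing `max_seq_len` and the zero `max_tokens_per_gpu`, which are the two issues raised in this comment.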

4 participants