Adding Classic InstructLab Multi-Phase Training Hardware Configs to Examples by Maxusmusti · Pull Request #3 · red-hat-data-services/red-hat-ai-examples

Maxusmusti · 2025-08-28T18:46:05Z

Adding the existing set of values for running the instructlab multi-phase training pipeline on various GPU hardware configurations. Adds a notebook referencing our multiphase cookbook, including both the configurations themselves and how to use them in the multiphase cookbooks.

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

astefanutti · 2025-08-29T07:58:51Z

@@ -0,0 +1,16 @@
+# InstructLab Multi-Phase Training Configurations


I'd suggest we start grouping the examples into sub-directories, for example examples/training_hub, to better organize the examples from different parts of the product.

yeah, I think that makes sense. One question is whether we will have other components of the instructlab pipeline / notebooks in this repo. @aditisaluja5 was a decision made as to whether the full end-to-end pipeline example will live here? And if so, will that be as a series of notebooks (including the training notebook), or a single all-encompassing notebook?

If it is just the training configs living here, then I think @astefanutti's suggestions are best, where we can have a training_hub sub-directory and include the training notebook itself in there as well. If we are including the full pipeline in here, however, then we should probably create an instructlab sub-directory, add all of the step notebooks in there (data prep, sdg, training, eval, etc.) and have this config notebook live alongside them. LMK what the plan is and I can proceed accordingly

@aditisaluja5

Sorry for the late response. The end-to-end examples will also live in this repo.

It looks like we are leaning toward the latter option. Would examples/ai_hub be more appropriate as the sub-directory name instead of instructlab?

astefanutti · 2025-08-29T08:05:29Z

+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# InstructLab Multi-Phase Training Hardware Configurations\n\nThis notebook contains hardware-specific parameter configurations for use with the [LAB Multi-Phase Training Tutorial](https://github.com/Red-Hat-AI-Innovation-Team/training_hub/blob/main/examples/notebooks/lab_multiphase_training_tutorial.ipynb).\n\n**Model**: These configurations are optimized for `granite-3.1-8b-starter-v2.1`\n\nEach configuration below specifies the optimal parameters for different GPU setups, including:\n- `max_tokens_per_gpu`: Memory limit per GPU to prevent OOM errors\n- `nproc_per_node`: Number of GPUs per node for distributed training\n- `cpu_offload_params`: FSDP CPU offloading configuration for memory optimization\n\n**Usage**: Copy the appropriate configuration parameters from the sections below into your training script based on your available hardware."


It would be better to have lab_multiphase_training_tutorial.ipynb added here so the example is standalone. WDYT?

It might make sense to have the end-to-end InstructLab pipeline example here instead of just the training example.

Also related to the comment above: #3 (comment)
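For readers following this thread, the parameters described in the notebook cell could be grouped as shown here. This is only a minimal sketch: the `HardwareConfig` class and its `as_kwargs` helper are hypothetical, the numeric values are placeholders rather than the validated values from the notebook, and nothing here reflects the actual training_hub API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HardwareConfig:
    """Hypothetical container for the per-hardware parameters named in the notebook."""

    max_tokens_per_gpu: int   # token budget per GPU, kept low enough to avoid OOM errors
    max_seq_len: int          # longest sequence length allowed for a sample
    nproc_per_node: int       # number of GPUs (processes) on the node
    cpu_offload_params: bool  # whether FSDP offloads parameters to CPU

    def as_kwargs(self) -> dict:
        """Return the fields as a plain dict that can be unpacked into a training call."""
        return asdict(self)


# Placeholder values for illustration only; copy the real ones from the notebook.
example = HardwareConfig(
    max_tokens_per_gpu=10_000,
    max_seq_len=4_096,
    nproc_per_node=8,
    cpu_offload_params=False,
)
print(example.as_kwargs())
```

Keeping the four values in one object makes it harder to copy only part of a configuration into a training script.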

astefanutti · 2025-08-29T08:09:12Z

+
+See [`lab_multiphase_configs.ipynb`](./lab_multiphase_configs.ipynb) for optimized training parameters for various hardware configurations including:
+
+- **H200**: 1x, 2x, 4x, 8x GPU configurations


Is this always single-node, or is multi-node supported? It would be useful to mention that explicitly.

Good point, added a disclaimer to the README

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

astefanutti · 2025-10-10T09:17:06Z

@@ -0,0 +1,16 @@
+# InstructLab Multi-Phase Training Configurations


It looks like we are leaning toward the latter option. Would examples/ai_hub be more appropriate as the sub-directory name instead of instructlab?

astefanutti · 2025-10-10T09:17:41Z

+
+Each configuration includes memory-optimized settings for `max_tokens_per_gpu`, `max_seq_len`, `nproc_per_node`, and FSDP CPU offloading parameters.
+
+Note: The values are all set assuming a single node with the above GPU resources. For multi-node, note that the default sharding strategy is FSDP [HYBRID_SHARD](https://docs.pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy). Begin with the cooresponding settings that align with one of your given nodes, or switch to `FULL_SHARD` if required due to memory constraints.


nit: "cooresponding" -> "corresponding"
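The multi-node guidance in the README note can be read as a small decision rule. The sketch below is an illustration under assumptions: `pick_sharding_strategy` and its `fits_in_memory_with_default` flag are hypothetical, and the function returns plain strings that match the member names of `torch.distributed.fsdp.ShardingStrategy` rather than importing PyTorch.

```python
# Default strategy as stated in the README note.
DEFAULT_STRATEGY = "HYBRID_SHARD"


def pick_sharding_strategy(num_nodes: int, fits_in_memory_with_default: bool) -> str:
    """Return the name of the FSDP sharding strategy to try first.

    Hypothetical helper: start from the default, and fall back to FULL_SHARD
    only for multi-node runs that do not fit in memory with the default.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes > 1 and not fits_in_memory_with_default:
        return "FULL_SHARD"
    return DEFAULT_STRATEGY


print(pick_sharding_strategy(num_nodes=1, fits_in_memory_with_default=True))
print(pick_sharding_strategy(num_nodes=4, fits_in_memory_with_default=False))
```

Whether a given run fits in memory still has to be determined empirically; the flag only records that outcome.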

jiridanek · 2026-03-12T19:34:40Z

@coderabbitai full review

coderabbitai · 2026-03-12T19:34:54Z

✅ Actions performed

Full review triggered.

coderabbitai · 2026-03-12T19:34:56Z

📝 Walkthrough

Adds documentation and a Jupyter notebook for InstructLab Multi-Phase Training Configurations, providing hardware-specific GPU setup guidance including configurations for various GPUs (H200, H100, A100, L40S, L4) with memory-optimized settings and FSDP parameters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation & Tutorial Files: `examples/instructlab-multiphase-configs/README.md`, `examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb` | Adds README describing InstructLab Multi-Phase Training configurations with Quick Start guidance and hardware-specific setups. Includes Jupyter notebook with FSDP CPU offload configuration reference, integration examples with training_hub.sft and FSDPOptions, and comprehensive configuration blocks for multiple GPU architectures (H200, H100, A100, L40S) with tuning parameters (max_tokens_per_gpu, max_seq_len, nproc_per_node, cpu_offload_params). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding InstructLab multi-phase training hardware configurations to the examples directory, which directly corresponds to the added README and Jupyter notebook files. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining that existing hardware configuration values are being added for various GPU setups along with a notebook demonstrating their usage in the multiphase cookbooks. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.5)

examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb

Unexpected end of JSON input
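The Ruff error above may or may not point to a problem in the notebook file itself, but a quick local check is easy to run. The following is a minimal sketch using only the Python standard library; `notebook_json_error` is a hypothetical helper, and the truncated notebook written to a temporary directory is fabricated purely to demonstrate the failure case.

```python
import json
import tempfile
from pathlib import Path


def notebook_json_error(path):
    """Return None if the notebook parses as JSON with a 'cells' key, else a message."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"{path}: invalid JSON ({exc.msg} at line {exc.lineno})"
    if not isinstance(data, dict) or "cells" not in data:
        return f"{path}: parsed, but has no top-level 'cells' key"
    return None


# Demonstration on throwaway files, not on the notebook from this PR.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.ipynb"
    good.write_text(json.dumps({"cells": [], "nbformat": 4, "nbformat_minor": 5}))
    truncated = Path(tmp) / "truncated.ipynb"
    truncated.write_text('{"cells": [')  # cut off mid-document

    good_result = notebook_json_error(good)
    truncated_result = notebook_json_error(truncated)

print(good_result)
print(truncated_result)
```

Running the same function against `lab_multiphase_configs.ipynb` in a local checkout would show whether the file or the tool configuration is at fault.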


jiridanek · 2026-03-12T20:06:24Z

@coderabbitai full review

coderabbitai · 2026-03-12T20:06:36Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb`:
- Around line 147-153: The H100 1x GPU Configuration block is incomplete and not
runnable: fill in realistic, tested values for max_tokens_per_gpu (replace 0),
define max_seq_len, and set cpu_offload_params to the validated boolean (or
explicit tuning value) so the snippet can be copy/pasted (e.g., set
max_tokens_per_gpu to a non-zero value, add max_seq_len with the intended
sequence length, leave nproc_per_node as 1 if correct); alternatively replace
the entire H100 1x GPU Configuration section with a clear non-runnable note
stating "not yet validated" so users don't try to execute it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ecc3c6b4-bed9-4a5f-a7d3-4f1c88262851

📥 Commits

Reviewing files that changed from the base of the PR and between 37af793 and 52debb4.

📒 Files selected for processing (2)

examples/instructlab-multiphase-configs/README.md
examples/instructlab-multiphase-configs/lab_multiphase_configs.ipynb

coderabbitai · 2026-03-12T20:10:17Z

+   "source": [
+    "# H100 1x GPU Configuration\n",
+    "max_tokens_per_gpu = 0  # TO BE FILLED\n",
+    "nproc_per_node = 1\n",
+    "\n",
+    "# FSDP CPU offloading configuration\n",
+    "cpu_offload_params = False  # TO BE FILLED - True to enable CPU offloading, False to disable"


⚠️ Potential issue | 🟠 Major

Remove or complete the unfinished H100 1x config before merging.

Line 149 publishes max_tokens_per_gpu = 0, and Lines 147-153 never define max_seq_len. That makes this block unusable as a copy/paste example and can break outside a stateful notebook session. Either add tested values for all parameters or replace this with a non-runnable “not yet validated” note.
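A check along these lines could catch such placeholder blocks before they are published. This is a sketch under assumptions: `find_config_problems` and `REQUIRED_KEYS` are hypothetical names, the rule that numeric parameters must be positive integers is inferred from the review comment rather than taken from the training library, and the dictionary mirrors the unfinished H100 1x cell.

```python
REQUIRED_KEYS = ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node", "cpu_offload_params")
POSITIVE_INT_KEYS = ("max_tokens_per_gpu", "max_seq_len", "nproc_per_node")


def find_config_problems(config: dict) -> list:
    """Return human-readable problems found in one hardware config dict."""
    problems = []
    # Every parameter named in the README must be present.
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing {key}")
    # Numeric parameters must be positive integers (bool is excluded on purpose).
    for key in POSITIVE_INT_KEYS:
        if key in config:
            value = config[key]
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"{key} must be a positive integer, got {value!r}")
    if "cpu_offload_params" in config and not isinstance(config["cpu_offload_params"], bool):
        problems.append("cpu_offload_params must be True or False")
    return problems


# Mirrors the unfinished H100 1x cell quoted above.
h100_1x = {"max_tokens_per_gpu": 0, "nproc_per_node": 1, "cpu_offload_params": False}
problems = find_config_problems(h100_1x)
for problem in problems:
    print(problem)
```

For the H100 1x block, the check reports both issues raised in the review: the missing `max_seq_len` and the zero `max_tokens_per_gpu`.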

Maxusmusti added 2 commits August 28, 2025 14:34

Adding baseline hardware values for lab multiphase configs

af37594

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

Add README for lab configs

30b3c2b

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

astefanutti reviewed Aug 29, 2025

Add disclaimer for single-node, multi-node configs

52debb4

Signed-off-by: Mustafa Eyceoz <meyceoz@redhat.com>

astefanutti reviewed Oct 10, 2025

tarun-etikala referenced this pull request in tarun-e/red-hat-ai-examples Nov 10, 2025

Merge pull request #3 from shricharan-ks/feature/refactor

85428f0

Feature/refactor

coderabbitai Bot reviewed Mar 12, 2026
