Conversation
```diff
-loader = get_loader(get_train_dataset(
+# Set repeat=False to avoid repeating the dataset.
+# Also add RedistributeLoader to synchronize the end of rank exhaustion. Only works with initialized torch distributed.
+loader = RedistributeLoader(get_loader(get_train_dataset(
```
I think it would be better if this were transparent to the user, so make `get_loader` handle this internally.
And since the choice of `RedistributeLoader` vs. `StopFirstLoader` actually changes the data that's being iterated, this choice should be made in the metadataset and not in the code, I think.
As a property of `blend_epochized`.
I.e. `blend_epochized` can either be a list as before (which chooses the default `RedistributeLoader`), or it can be a dict for more customization, like:
```yaml
blend_epochized:
  phase_out_behavior: stop_first_loader
  datasets:
    - repetitions: 5
      path: ./coco
      # ... Other parameters
    - repetitions: 2
      path: ./coyo
    - repetitions: 1
      path: ./coyo
      split_part: val
```
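Roughly, this could look like the following inside `get_loader` (a hypothetical sketch; the `phase_out_behavior` keyword, its default, and `_build_inner_loader` are made-up names for illustration, not the actual API):

```python
# Hypothetical sketch only: the phase_out_behavior keyword, its default,
# and _build_inner_loader are assumptions for illustration.
def get_loader(dataset, *, phase_out_behavior: str = "redistribute_loader", **kwargs):
    loader = _build_inner_loader(dataset, **kwargs)  # assumed existing helper
    if getattr(dataset, "repeat", True):
        # Phase-out handling only matters when the dataset does not repeat.
        return loader
    if phase_out_behavior == "redistribute_loader":
        return RedistributeLoader(loader)
    if phase_out_behavior == "stop_first_loader":
        return StopFirstLoader(loader)
    raise ValueError(f"unknown phase_out_behavior: {phase_out_behavior!r}")
```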
Okay, we can handle this in `get_loader`. I'd personally prefer it to be separate though, as it's a feature on top?
Regarding moving the configuration to the metadataset:
I see your point that this slightly modifies how the data is iterated, but I'd also argue:
- So far we don't really rely on torch distributed, whereas this piece of code is tightly bound to it.
- It would also disable nesting of `blend_epochized`, because you cannot nest different (or unconfigured) `phase_out_behavior` settings.
- It depends on `repeat=False` and doesn't make sense if `repeat=True`, so it's based on what the user sets in the code.
- At least for `RedistributeLoader`, it should not really change the data frequency (so far the settings in the metadataset mainly focus on data frequency / blend).
- If we move the boundary of the metadataset to include this, then we should also put gradient accumulation, seeds, batch size, handling of incomplete batches, etc. in the config. I wouldn't want that, tbh.
Thus voting for keeping this in code, not in the metadataset config.
```python
        except StopIteration:
            # print(f"[r={rank}]: StopIteration\n", end="")
            self.exhausted_states[rank] = self_exhausted = 1
            dist.all_reduce(
```
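The general pattern here, as a self-contained sketch (illustrative only; the PR's actual state handling and redistribution logic differ):

```python
import torch
import torch.distributed as dist

def next_with_exhaustion_sync(iterator, exhausted_states: torch.Tensor):
    """Fetch the next sample and synchronize exhaustion flags across ranks.

    exhausted_states is an int64 tensor of shape (world_size,); after the
    all_reduce, entry i is nonzero iff rank i has raised StopIteration.
    """
    sample = None
    try:
        sample = next(iterator)
    except StopIteration:
        exhausted_states[dist.get_rank()] = 1
    # MAX keeps the flags sticky: once a rank is exhausted, every rank
    # keeps seeing it as exhausted on all later iterations.
    dist.all_reduce(exhausted_states, op=dist.ReduceOp.MAX)
    return sample, exhausted_states
```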
We should evaluate the impact of this synchronization (which happens in every iteration!) in a real-world training run to see whether training speed suffers when we have many nodes and ranks.
Yes, agreed. Didn't benchmark the impact of this yet.
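A micro-benchmark sketch to isolate the collective's cost (to be launched via torchrun; this measures only the all_reduce latency, not end-to-end loader throughput):

```python
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # or "nccl" with CUDA tensors
flags = torch.zeros(dist.get_world_size(), dtype=torch.int64)

warmup, iters = 10, 1000
for _ in range(warmup):
    dist.all_reduce(flags, op=dist.ReduceOp.MAX)

start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(flags, op=dist.ReduceOp.MAX)
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    print(f"avg all_reduce latency: {elapsed / iters * 1e6:.1f} us")
dist.destroy_process_group()
```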
```python
        return f"RedistributeLoaderState(inner_state={self.inner_state!r}, exhausted_state={self.exhausted_state!r}, overuse_count={self.overuse_count!r})"


class RedistributeLoader(Generic[T]):
```
I think this will break our reproducible scaling!
It seems the whole redistribution loader will not work if we stop and resume with a different number of ranks.
See #80
That's a problem we need to discuss possible solutions for.
Okay, I hadn't thought of this, tbh. Yes, let's discuss offline.
```python
        return f"StopFirstDataLoaderState(inner_state={self.inner_state!r}, iterating_from_start={self.iterating_from_start!r}, next_sample_restore_key={self.next_sample_restore_key!r})"


class StopFirstLoader(Generic[T]):
```
Same issue with reproducible scaling here? I think this one could be easier to get working without breaking repro scaling.
Likely easier, but it's also the less useful of the two methods 😅
Provides two strategies for synchronizing the end of the loader for `repeat=False`.
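A minimal usage sketch of the two strategies, assuming torch distributed is initialized and that both wrappers take the inner loader as their only required argument (an assumption based on the diff above):

```python
# Usage sketch; constructor signatures are assumed from the diff above.
loader = get_loader(get_train_dataset(..., repeat=False))

# Strategy 1 (the suggested default): redistribute leftover samples so
# that all ranks exhaust together.
train_loader = RedistributeLoader(loader)

# Strategy 2: stop every rank as soon as the first rank runs out of data.
# train_loader = StopFirstLoader(loader)

for batch in train_loader:
    ...  # training step
```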