Saving loading hook state by NaziaHossain066 · Pull Request #400 · tgm-team/tgm

NaziaHossain066 · 2026-04-11T07:18:23Z

Summary / Description

Adds checkpointing support for stateful hooks in TGM, enabling training to be resumed from mid-epoch interruptions without losing hook state or RNG reproducibility. Demonstrates end-to-end checkpoint save/resume in examples/linkproppred/tgat.py.
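Resuming mid-epoch means the loader has to start past the batches that were already processed, without re-running hooks on them. A minimal sketch of that idea follows; `TinyLoader` and `recording_hook` are hypothetical stand-ins, not the PR's `DGDataLoader`, and only the `skip_batches` name is taken from the description.

```python
from typing import Callable, Iterator, List, Sequence


class TinyLoader:
    """Hypothetical batch loader that applies a hook to every yielded batch."""

    def __init__(
        self,
        data: Sequence[int],
        batch_size: int,
        hook: Callable[[List[int]], List[int]],
        skip_batches: int = 0,
    ) -> None:
        self.data = data
        self.batch_size = batch_size
        self.hook = hook
        self.skip_batches = skip_batches

    def __iter__(self) -> Iterator[List[int]]:
        # Start past the batches handled before the interruption, so the hook
        # never runs on them a second time.
        start = self.skip_batches * self.batch_size
        for i in range(start, len(self.data), self.batch_size):
            yield self.hook(list(self.data[i:i + self.batch_size]))


calls: List[List[int]] = []


def recording_hook(batch: List[int]) -> List[int]:
    """Record every batch the hook is invoked on."""
    calls.append(batch)
    return batch


# Resume an epoch of 10 items after the first batch of 4 was already processed.
resumed = list(TinyLoader(range(10), batch_size=4, hook=recording_hook, skip_batches=1))
```

In this sketch the hook only ever sees the two remaining batches; the skipped batch is never materialized.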

Changes:

Added state_dict() / load_state_dict() to DGHook, StatelessHook, StatefulHook, RecencyNeighborHook, EdgeEventsSeenNodesTrackHook, NodeAnalyticsHook, and HookManager (with id()-based deduplication for shared hooks)
Added skip_batches to DGDataLoader to skip already-processed batches on resume without running hooks on them
Overrode _get_iterator in DGDataLoader to neutralize PyTorch's internal RNG consumption, ensuring resumed runs are bit-exact
Updated examples/linkproppred/tgat.py with save_checkpoint, load_checkpoint, and --resume / --checkpoint-dir args
Added 9 unit tests in test/unit/test_hooks/test_state_hook.py
Related Issues: #395
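The id()-based deduplication in the list above can be sketched as follows. This is an illustration under assumptions, not the PR's actual HookManager: `CounterHook`, `TinyHookManager`, and the key-to-hooks registry layout are hypothetical. The idea is that a hook registered under several keys is serialized once, and each key stores only an index into the shared list of states.

```python
from typing import Any, Dict, List


class CounterHook:
    """Hypothetical stateful hook that counts the batches it has seen."""

    def __init__(self) -> None:
        self.seen = 0

    def __call__(self, batch: Any) -> Any:
        self.seen += 1
        return batch

    def state_dict(self) -> Dict[str, Any]:
        return {'seen': self.seen}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.seen = state['seen']


class TinyHookManager:
    """Hypothetical manager mapping a key (e.g. 'train') to a list of hooks."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[CounterHook]] = {}

    def register(self, key: str, hook: CounterHook) -> None:
        self.hooks.setdefault(key, []).append(hook)

    def state_dict(self) -> Dict[str, Any]:
        index_of: Dict[int, int] = {}  # id(hook) -> position in `states`
        states: List[Dict[str, Any]] = []
        layout: Dict[str, List[int]] = {}
        for key, hooks in self.hooks.items():
            layout[key] = []
            for hook in hooks:
                if id(hook) not in index_of:  # first time this object is met
                    index_of[id(hook)] = len(states)
                    states.append(hook.state_dict())
                layout[key].append(index_of[id(hook)])
        return {'states': states, 'layout': layout}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        loaded = set()
        for key, indices in state['layout'].items():
            for hook, idx in zip(self.hooks[key], indices):
                if id(hook) not in loaded:  # restore each shared hook once
                    hook.load_state_dict(state['states'][idx])
                    loaded.add(id(hook))


shared = CounterHook()
hm = TinyHookManager()
hm.register('train', shared)
hm.register('val', shared)
shared(None)
shared(None)
snapshot = hm.state_dict()
```

Here `snapshot['states']` holds a single entry even though the hook is registered twice, so a shared hook cannot be saved or restored with two diverging copies of its state.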

Type of Change

Test Evidence

Describe how this PR has been tested.

Unit tests
Integration tests
Performance tests

Questions / Discussion Points

Should checkpointing be extended to other example files beyond tgat.py?

codecov · 2026-04-11T22:20:21Z

Codecov Report

❌ Patch coverage is 77.90698% with 19 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
tgm/hooks/node_analytics.py	33.33%	12 Missing ⚠️
tgm/hooks/hook_manager.py	89.28%	3 Missing ⚠️
tgm/hooks/node_tracks.py	50.00%	2 Missing ⚠️
tgm/hooks/base.py	85.71%	1 Missing ⚠️
tgm/hooks/neighbors.py	91.66%	1 Missing ⚠️

ntgbaoo

Hey @NaziaHossain066

Thanks for the PR. Great contribution!

Overall, the PR looks pretty good. I left a couple of comments and raised some concerns for us to discuss.

I just approved the CI workflows to run for this PR. It is worth checking Codecov and adding some unit tests to cover the lines that are currently missing tests, such as:

state_dict and load_state_dict for node analytics and node tracks
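A round-trip test is one way to cover these methods. The sketch below shows the shape such a test could take; `SeenNodesHook` is a hypothetical stand-in written for this example, not the real NodeAnalyticsHook or EdgeEventsSeenNodesTrackHook, whose constructors and state keys are not shown in this thread.

```python
from typing import Any, Dict, Iterable, Set


class SeenNodesHook:
    """Hypothetical stand-in for a stateful hook that tracks seen node ids."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()

    def update(self, node_ids: Iterable[int]) -> None:
        self._seen.update(node_ids)

    def state_dict(self) -> Dict[str, Any]:
        # Return a copy so later updates do not mutate the snapshot.
        return {'_seen': sorted(self._seen)}

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self._seen = set(state['_seen'])


def test_state_dict_round_trip() -> None:
    hook = SeenNodesHook()
    hook.update([3, 1, 2])
    snapshot = hook.state_dict()

    hook.update([99])  # must not leak into the snapshot taken earlier
    assert snapshot == {'_seen': [1, 2, 3]}

    restored = SeenNodesHook()
    restored.load_state_dict(snapshot)
    assert restored.state_dict() == snapshot


test_state_dict_round_trip()
```

The same pattern (populate, snapshot, mutate, restore into a fresh instance, compare) should carry over to the real hooks once their fixtures are in place.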

ntgbaoo · 2026-04-11T22:24:12Z

    decoder: nn.Module,
    opt: torch.optim.Optimizer,
+    epoch: int,
+    hm: object,


Suggested change

-    hm: object,
+    hm: HookManager,

ntgbaoo · 2026-04-11T22:26:04Z

    opt: torch.optim.Optimizer,
+    epoch: int,
+    hm: object,
+    nbr_hook: object,


Suggested change

-    nbr_hook: object,
+    nbr_hook: StatefulHook,

ntgbaoo · 2026-04-11T22:27:15Z

+        print(
+            f'epoch {epoch} batch {batch_idx} loss {float(loss):.6f} write_pos_sum {nbr_hook._write_pos.sum().item():.0f}'
+        )


Suggested change

-        print(
-            f'epoch {epoch} batch {batch_idx} loss {float(loss):.6f} write_pos_sum {nbr_hook._write_pos.sum().item():.0f}'
-        )

ntgbaoo · 2026-04-11T22:32:14Z

+            f'epoch {epoch} batch {batch_idx} loss {float(loss):.6f} write_pos_sum {nbr_hook._write_pos.sum().item():.0f}'
+        )
+
+        if batch_idx % 100 == 0:


It would be nice to read a checkpoint_interval value from the parser instead of hardcoding 100.

I think another flag is needed to indicate whether we want to checkpoint the model as well, such as train_checkpoint. The condition would then be:

if batch_idx % checkpoint_interval == 0 and args.train_checkpoint:
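A minimal sketch of how these two options could be wired through argparse. The flag spellings `--checkpoint-interval` and `--train-checkpoint` and the helper `should_checkpoint` are assumptions based on the names in this comment; they are not part of the PR.

```python
import argparse

parser = argparse.ArgumentParser()
# Both flag names are hypothetical, following the reviewer's wording.
parser.add_argument(
    '--checkpoint-interval',
    type=int,
    default=100,
    help='save a checkpoint every N batches',
)
parser.add_argument(
    '--train-checkpoint',
    action='store_true',
    help='enable mid-epoch checkpointing during training',
)


def should_checkpoint(batch_idx: int, args: argparse.Namespace) -> bool:
    """Return True when checkpointing is enabled and due for this batch."""
    return args.train_checkpoint and batch_idx % args.checkpoint_interval == 0


args = parser.parse_args(['--train-checkpoint', '--checkpoint-interval', '50'])
default_args = parser.parse_args([])
```

Note that, as with the hardcoded version, batch index 0 also satisfies the modulo condition.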

ntgbaoo · 2026-04-11T22:36:05Z


-for epoch in range(1, args.epochs + 1):
+
+def save_checkpoint(epoch, batch_idx, encoder, decoder, opt, hm):


Can we add a type annotation for each parameter of this function? It would also be nice to have a comment describing the purpose of this method.
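One possible shape for the annotated function, as a sketch only. To keep it runnable without PyTorch it uses a `SupportsStateDict` protocol and `pickle` in place of `torch.save`; in tgat.py the natural annotations would be `nn.Module`, `torch.optim.Optimizer`, and `HookManager`. The `checkpoint_dir` parameter and the file naming are assumptions, not taken from the PR.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Protocol


class SupportsStateDict(Protocol):
    """Anything exposing state_dict(): a module, optimizer, or hook manager."""

    def state_dict(self) -> Dict[str, Any]: ...


def save_checkpoint(
    epoch: int,
    batch_idx: int,
    encoder: SupportsStateDict,
    decoder: SupportsStateDict,
    opt: SupportsStateDict,
    hm: SupportsStateDict,
    checkpoint_dir: Path,  # assumed extra parameter, not in the PR signature
) -> Path:
    """Write a mid-epoch training checkpoint and return its path.

    The checkpoint records the position in training (epoch, batch_idx) plus
    the state of the models, the optimizer, and the hook manager.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    path = checkpoint_dir / f'ckpt_epoch{epoch}_batch{batch_idx}.pkl'
    payload = {
        'epoch': epoch,
        'batch_idx': batch_idx,
        'encoder': encoder.state_dict(),
        'decoder': decoder.state_dict(),
        'opt': opt.state_dict(),
        'hm': hm.state_dict(),
    }
    with path.open('wb') as f:
        pickle.dump(payload, f)  # stand-in for torch.save
    return path


class _Dummy:
    """Placeholder object used only to exercise the function."""

    def state_dict(self) -> Dict[str, Any]:
        return {'w': 1}


with tempfile.TemporaryDirectory() as tmp:
    saved = save_checkpoint(2, 100, _Dummy(), _Dummy(), _Dummy(), _Dummy(), Path(tmp))
    with saved.open('rb') as f:
        loaded = pickle.load(f)
```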

ntgbaoo · 2026-04-11T23:11:18Z

+        # PyTorch's _BaseDataLoaderIter.__init__ consumes 2 global RNG samples for
+        # an internal base_seed (used only for worker processes). Save and restore
+        # around iterator creation so the global RNG state is unaffected, making
+        # checkpoint resume produce identical results to an uninterrupted run.


Suggested change

-        # PyTorch's _BaseDataLoaderIter.__init__ consumes 2 global RNG samples for
-        # an internal base_seed (used only for worker processes). Save and restore
-        # around iterator creation so the global RNG state is unaffected, making
-        # checkpoint resume produce identical results to an uninterrupted run.
+        """PyTorch's _BaseDataLoaderIter.__init__ consumes 2 global RNG samples for
+        an internal base_seed (used only for worker processes). Save and restore
+        around iterator creation so the global RNG state is unaffected, making
+        checkpoint resume produce identical results to an uninterrupted run.
+        """
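The save-and-restore pattern described in this comment can be illustrated with the standard library. The sketch below uses Python's `random` module as a stand-in for the global PyTorch RNG so that it runs without PyTorch; the torch analogue would use `torch.get_rng_state()` and `torch.set_rng_state()`. `preserve_rng_state` and `make_iterator` are hypothetical names.

```python
import random
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def preserve_rng_state() -> Iterator[None]:
    """Restore the global RNG state on exit, undoing any draws made inside."""
    state = random.getstate()
    try:
        yield
    finally:
        random.setstate(state)


def make_iterator() -> int:
    # Stand-in for iterator construction that draws from the global RNG.
    return random.getrandbits(64)


random.seed(0)
expected = random.random()  # first draw when no iterator is created

random.seed(0)
with preserve_rng_state():
    make_iterator()
actual = random.random()  # same first draw despite the extra consumption
```

Without the context manager, the draw inside `make_iterator` would shift every later random number, which is the kind of drift that breaks bit-exact resume.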

ntgbaoo · 2026-04-11T23:16:14Z

+parser.add_argument(
+    '--checkpoint-dir',
+    type=str,
+    default='outputs/checkpoints',


nit:

Suggested change

-    default='outputs/checkpoints',
+    default='artifact/checkpoints',

ntgbaoo · 2026-04-11T23:16:53Z

 # Remove previous ipynb_checkpoints
 #   git rm -r .ipynb_checkpoints/
-
+outputs/


nit

Suggested change

-outputs/
+artifact/

ntgbaoo · 2026-04-12T00:31:20Z


    def reset_state(self) -> None: ...

+    def state_dict(self) -> dict: ...


Suggested change

-    def state_dict(self) -> dict: ...
+    @property
+    def state_dict(self) -> dict: ...
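For context, the two spellings differ at the call site. A method is invoked as `hook.state_dict()`, which matches the `state_dict()` / `load_state_dict()` naming in the PR summary and `torch.nn.Module.state_dict()`, while a property is read as `hook.state_dict`. The classes below are hypothetical and only illustrate that difference.

```python
from typing import Any, Dict


class MethodHook:
    """Hypothetical hook exposing its state through a method."""

    def __init__(self) -> None:
        self._count = 3

    def state_dict(self) -> Dict[str, Any]:
        return {'count': self._count}


class PropertyHook:
    """Hypothetical hook exposing its state through a property."""

    def __init__(self) -> None:
        self._count = 3

    @property
    def state_dict(self) -> Dict[str, Any]:
        return {'count': self._count}


method_state = MethodHook().state_dict()    # called with parentheses
property_state = PropertyHook().state_dict  # read as an attribute
```

With the property form, any existing call written as `hook.state_dict()` would raise a `TypeError`, because the returned dict is not callable.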

ntgbaoo · 2026-04-12T00:51:27Z

+            '_all_neighbors': self._all_neighbors,
+            '_engagement_sum': self._engagement_sum,
+            '_seen_edges': self._seen_edges,
+            '_tracked_mask': self._tracked_mask.cpu().clone(),


Do we actually need this?

shenyangHuang · 2026-05-08T14:40:35Z

yes I agree

shenyangHuang · 2026-05-08T14:41:51Z

+            f'epoch {epoch} batch {batch_idx} loss {float(loss):.6f} write_pos_sum {nbr_hook._write_pos.sum().item():.0f}'
+        )
+
+        if batch_idx % 100 == 0:


Should we checkpoint only after each epoch? Though mid-epoch checkpointing also makes sense for larger datasets.

shenyangHuang · 2026-05-08T14:43:14Z

+    return ckpt['epoch'], ckpt['batch_idx'], ckpt.get('best_val', 0.0)
+
+
+def find_latest_checkpoint(directory):


this should be a helper function in util as Bao suggested
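A sketch of what such a utility helper could look like. The file naming scheme `ckpt_epoch<E>_batch<B>.pt` is an assumption made for this example; the PR's actual naming is not shown in this thread.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

# Assumed file naming: ckpt_epoch<E>_batch<B>.pt (hypothetical, not from the PR).
_CKPT_RE = re.compile(r'ckpt_epoch(\d+)_batch(\d+)\.pt$')


def find_latest_checkpoint(directory: Path) -> Optional[Path]:
    """Return the checkpoint with the highest (epoch, batch), or None."""
    if not directory.is_dir():
        return None
    best: Optional[Path] = None
    best_key = (-1, -1)
    for path in directory.iterdir():
        match = _CKPT_RE.match(path.name)
        if match is None:
            continue  # ignore unrelated files
        key = (int(match.group(1)), int(match.group(2)))
        if key > best_key:
            best, best_key = path, key
    return best


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ('ckpt_epoch1_batch900.pt', 'ckpt_epoch2_batch100.pt', 'notes.txt'):
        (root / name).touch()
    latest_name = find_latest_checkpoint(root).name
    missing = find_latest_checkpoint(root / 'does_not_exist')
```

Comparing parsed `(epoch, batch)` tuples rather than file names avoids the lexicographic trap where `batch900` would sort after `batch1000`.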

shenyangHuang · 2026-05-08T14:44:52Z

+    )
+
+
+def load_checkpoint(path, encoder, decoder, opt, hm):


We potentially need to make this verbose by default; users will want to know when and where the checkpoint was resumed from.
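One way to do this is a `verbose` parameter that defaults to `True` and logs the resume point. The sketch below is an assumption-laden illustration: it uses `pickle` in place of `torch.load`, omits restoring the encoder, decoder, optimizer, and hook manager, and the `verbose` parameter is hypothetical. The returned tuple mirrors the `epoch`, `batch_idx`, `best_val` values returned in the diff.

```python
import logging
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict, Tuple

logger = logging.getLogger('checkpoint')


def load_checkpoint(path: Path, verbose: bool = True) -> Tuple[int, int, float]:
    """Load a checkpoint and report where training resumes from."""
    with path.open('rb') as f:
        ckpt: Dict[str, Any] = pickle.load(f)  # stand-in for torch.load
    epoch, batch_idx = ckpt['epoch'], ckpt['batch_idx']
    best_val = ckpt.get('best_val', 0.0)
    if verbose:
        logger.info(
            'Resumed from %s (epoch %d, batch %d, best_val %.4f)',
            path, epoch, batch_idx, best_val,
        )
    return epoch, batch_idx, best_val


logging.basicConfig(level=logging.INFO)
with tempfile.TemporaryDirectory() as tmp:
    ckpt_path = Path(tmp) / 'ckpt.pkl'
    with ckpt_path.open('wb') as f:
        pickle.dump({'epoch': 3, 'batch_idx': 200}, f)
    resumed = load_checkpoint(ckpt_path)
```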

NaziaHossain066 added 3 commits April 10, 2026 20:33

Add state_dict/load_state_dict to stateful hooks and checkpoint save/resume to tgat.py (7927840)

Add unit tests for stateful hook checkpointing and fix checkpoint resume reproducibility (93490a2)

Add --resume/--checkpoint-dir args and save best_val in checkpoints (cbbf899)

ntgbaoo requested review from Jacob-Chmura, ntgbaoo and shenyangHuang April 11, 2026 22:18

ntgbaoo assigned NaziaHossain066 Apr 11, 2026

ntgbaoo linked an issue Apr 11, 2026 that may be closed by this pull request

Saving and Loading Hook State #395

Open

ntgbaoo reviewed Apr 12, 2026

shenyangHuang requested changes May 8, 2026
