Skip to content

Historical negative sampling#406

Merged
ntgbaoo merged 8 commits into
mainfrom
405-HistoricalSampling
May 15, 2026
Merged

Historical negative sampling#406
ntgbaoo merged 8 commits into
mainfrom
405-HistoricalSampling

Conversation

@ntgbaoo
Copy link
Copy Markdown
Member

@ntgbaoo ntgbaoo commented May 11, 2026

Summary / Description

This PR handles:

  • Added historical negative sampling
  • Reorganized hooks into sub-modules (if needed)
  • Split negative samplers into files
  • Added base class for tgb negative samplers (tgbl, tkgl, thgl) to avoid duplicated code

Related Issues: #405

Type of Change

  • Bug fix
  • New feature
  • Breaking Change
  • Refactoring
  • Documentation update

Test Evidence

Describe how this PR has been tested.

  • Unit tests
  • Integration tests
  • Performance tests

Questions / Discussion Points

List any areas where you’d like reviewer input or have open questions.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 96.95946% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
tgm/hooks/negatives/tgb_sampler.py 91.42% 9 Missing ⚠️

📢 Thoughts on this report? Let us know!

Comment thread tgm/hooks/negatives.py Outdated
Comment on lines +237 to +238
self._memory[0, self._count : self._count + batch_size] = batch.edge_src
self._memory[1, self._count : self._count + batch_size] = batch.edge_dst
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One drawback of this is that the amortized space complexity is O(number of observed edge events) rather than O(number of edges).

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this better? you mean we have to keep a buffer in space now?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we want the space complexity of this to scale linearly w.r.t the number of unique edges in the graphs. However, the amortized space complexity of this implementation is O(number of observed edge events), which means the memory will contain duplicated edges. One benefit of this implementation: it naturally enforces that edges, which appear more frequently in the past, would have higher probabilities to be sampled.

@ntgbaoo ntgbaoo self-assigned this May 11, 2026
@ntgbaoo ntgbaoo added enhancement New feature or request DGHooks Related to hooks or hook management DGBatch labels May 11, 2026
Copy link
Copy Markdown
Collaborator

@shenyangHuang shenyangHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bao, two major things:

  1. historical negative and random negative should be two hooks, one is stateful, one is stateless
  2. I think we should now seriously consider having a folder or a base class that handles negative sampling and shouldn't throw everything in a single script anymore

Comment thread tgm/hooks/negatives.py Outdated


class NegativeEdgeSamplerHook(StatelessHook):
class NegativeEdgeSamplerHook(StatefulHook):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have two classes, one for random negative, one for historical negative, it is easier conceptually for me to understand as well. Then I can mix and match the two by having two hooks in my hm.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and also one of them is stateless, one of them is stateful, we should definitely separate them. Otherwise, by same logic, we should merge uniform sampling with recency sampling

Comment thread tgm/hooks/negatives.py Outdated

For each source node in the batch, randomly selects a destination node from
its past interactions stored in memory. If a source node has no recorded past
interactions, its corresponding negative sample is set to PADDED_NODE_ID as
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

negative sample set to PADDED_NODE_ID is fine, but we need to remind users to mask those out correctly. Alternatively the hook can let you know how much is padded?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will return PADDED_NODE_ID for nodes that don't have past interactions. However, PADDED_NODE_ID will be replaced with random dsts before returning. Here is the logic from __call__:

elif self.strategy == 'hist_rnd':
  if self._count == 0:
      neg, neg_time = self._random_sampling(dg, batch)
      neg, neg_time = neg[: size[0]], neg_time[: size[0]]

  else: #replace PADDED_NODE_ID with random dst
      rnd_size = round(size[0] * 0.5)
      hst_size = size[0] - rnd_size
      neg_rnd, neg_time_rnd = self._random_sampling(dg, batch)
      neg_hst, neg_time_hst = self._random_hist_sampling(dg, batch)

      original_valid_mask = neg_hst != PADDED_NODE_ID
      valid_idx = torch.where(original_valid_mask)[0]
      cutoff = min(hst_size, valid_idx.size(0))

      neg = neg_rnd.clone()
      neg_time = neg_time_rnd.clone()

      chosen = valid_idx[:cutoff]
      neg[chosen] = neg_hst[chosen]
      neg_time[chosen] = neg_time_hst[chosen]

So PADDED_NODE_ID won't be propagated to downstream, and for nodes that don't have past interactions, we use random sampling

Comment thread tgm/hooks/negatives.py Outdated
Comment on lines +237 to +238
self._memory[0, self._count : self._count + batch_size] = batch.edge_src
self._memory[1, self._count : self._count + batch_size] = batch.edge_dst
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this better? you mean we have to keep a buffer in space now?

@ntgbaoo ntgbaoo requested a review from shenyangHuang May 14, 2026 00:45
Copy link
Copy Markdown
Collaborator

@shenyangHuang shenyangHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bao, looks good on my end

@ntgbaoo ntgbaoo merged commit aadb103 into main May 15, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DGBatch DGHooks Related to hooks or hook management enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants