Support hyperedge-based sampling in Dataset #66

@tizianocitro

Description

Dataset.__getitem__ currently samples by node: given a node index, it returns all hyperedges incident to that node. This works for node-centric tasks but is suboptimal for HLP, where the prediction unit is a hyperedge.

Problems with node-based sampling for HLP:

  • __len__ returns num_nodes, so the batch count is independent of the number of hyperedges (adding negative samples doesn't change the batch count).
  • A sampled node pulls in all of its incident hyperedges, giving unpredictable and uneven hyperedge coverage per batch.
  • Multiple sampled nodes can pull in the same hyperedge, requiring deduplication in collate.
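To make the deduplication point concrete, here is a minimal sketch of what node-based collate has to do today. The representation (a hyperedge as a `(hyperedge_id, node_ids)` pair) and the function name are illustrative, not the project's actual API:

```python
def collate_node_samples(samples):
    """Merge node-based samples, dropping hyperedges already seen.

    Each sample is the list of hyperedges incident to one sampled node;
    a hyperedge is represented as a (hyperedge_id, node_ids) pair.
    """
    seen = set()
    batch = []
    for hyperedges in samples:
        for he_id, node_ids in hyperedges:
            if he_id in seen:  # same hyperedge pulled in by two sampled nodes
                continue
            seen.add(he_id)
            batch.append((he_id, node_ids))
    return batch

# Nodes 0 and 1 both belong to hyperedge 10, so it appears in both
# samples but must be emitted only once.
sample_a = [(10, (0, 1)), (11, (0, 2))]
sample_b = [(10, (0, 1)), (12, (1, 3))]
print(collate_node_samples([sample_a, sample_b]))
# -> [(10, (0, 1)), (11, (0, 2)), (12, (1, 3))]
```

With hyperedge-based sampling this `seen` bookkeeping disappears, since each hyperedge is yielded exactly once.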

I want to add a configurable sampling strategy to Dataset so that __getitem__ can sample either by node or by hyperedge.

When sampling by hyperedge:

  • __len__ returns num_hyperedges.
  • __getitem__(i) returns the i-th hyperedge (with all its incidences).
  • Each batch contains a fixed number of hyperedges, giving direct control over positive/negative ratio per batch.
  • No deduplication needed in collate since each hyperedge is returned exactly once.

When sampling by node (current behavior):

  • __len__ returns num_nodes.
  • __getitem__(i) returns all hyperedges incident to node i.
  • Deduplication in collate handles overlapping hyperedges.
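The two behaviors above could be sketched as a single dataset with a strategy switch. This assumes the hypergraph is stored as a list of hyperedges (tuples of node ids); the real Dataset's storage, constructor, and return types will differ:

```python
class HypergraphDataset:
    """Sketch of a Dataset that samples by node or by hyperedge."""

    def __init__(self, hyperedges, num_nodes, sample_by="hyperedge"):
        assert sample_by in ("node", "hyperedge")
        self.hyperedges = hyperedges  # list of tuples of node ids
        self.num_nodes = num_nodes
        self.sample_by = sample_by
        # Precompute node -> incident hyperedge indices for node sampling.
        self._incidence = [[] for _ in range(num_nodes)]
        for he_idx, nodes in enumerate(hyperedges):
            for v in nodes:
                self._incidence[v].append(he_idx)

    def __len__(self):
        if self.sample_by == "hyperedge":
            return len(self.hyperedges)  # batch count tracks hyperedges
        return self.num_nodes            # current node-based behavior

    def __getitem__(self, i):
        if self.sample_by == "hyperedge":
            return [self.hyperedges[i]]  # exactly one hyperedge, no dedup needed
        # Node-based: all hyperedges incident to node i (may overlap
        # with those returned for other sampled nodes).
        return [self.hyperedges[j] for j in self._incidence[i]]

# Three hyperedges over four nodes.
ds = HypergraphDataset([(0, 1), (0, 2), (1, 3)], num_nodes=4)
print(len(ds))       # -> 3
print(ds[1])         # -> [(0, 2)]

ds_node = HypergraphDataset([(0, 1), (0, 2), (1, 3)], num_nodes=4,
                            sample_by="node")
print(len(ds_node))  # -> 4
print(ds_node[0])    # -> [(0, 1), (0, 2)]
```

A plain string flag is used here for brevity; an enum or two Dataset subclasses would work equally well, as long as `__len__` and `__getitem__` switch together.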
