`Dataset.__getitem__` currently samples by node: given a node index, it returns all hyperedges incident to that node. This works for node-centric tasks but is suboptimal for HLP, where the prediction unit is a hyperedge.
Problems with node-based sampling for HLP:
- `__len__` returns `num_nodes`, so batch count is independent of the number of hyperedges (adding negative samples doesn't change the batch count).
- A sampled node pulls in all of its hyperedges, giving unpredictable and uneven hyperedge coverage per batch.
- Multiple sampled nodes can pull in the same hyperedge, requiring deduplication in `collate`.
I want to add a configurable sampling strategy to `Dataset` so that `__getitem__` can sample either by node or by hyperedge.
When sampling by hyperedge:
- `__len__` returns `num_hyperedges`.
- `__getitem__(i)` returns the i-th hyperedge (with all its incidences).
- Each batch contains a fixed number of hyperedges, giving direct control over the positive/negative ratio per batch.
- No deduplication is needed in `collate`, since each hyperedge is returned exactly once.
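The hyperedge-based path could look roughly like this. This is a minimal sketch, not the project's actual API: the class name `HyperedgeDataset` and the `hyperedges` list-of-node-tuples representation are assumptions for illustration.

```python
class HyperedgeDataset:
    """Sketch: sample by hyperedge. `hyperedges` is an illustrative
    representation (a list of node-index tuples), not the real storage."""

    def __init__(self, hyperedges):
        self.hyperedges = hyperedges

    def __len__(self):
        # Batch count now tracks the number of hyperedges, so adding
        # negative samples grows the epoch length accordingly.
        return len(self.hyperedges)

    def __getitem__(self, i):
        # The i-th hyperedge with all its incidences, returned exactly once
        # per epoch, so no deduplication is needed downstream.
        return self.hyperedges[i]
```

With this shape, a batch of size `k` always contains exactly `k` hyperedges, which is what makes a fixed positive/negative ratio per batch straightforward.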
When sampling by node (current behavior):
- `__len__` returns `num_nodes`.
- `__getitem__(i)` returns all hyperedges incident to node `i`.
- Deduplication in `collate` handles overlapping hyperedges.
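For contrast, a sketch of the node-based behavior and the deduplication it forces on `collate`. The names `NodeDataset`, `incident`, and the list-of-tuples representation are assumptions for illustration, not the real implementation.

```python
class NodeDataset:
    """Sketch: sample by node (current behavior)."""

    def __init__(self, num_nodes, hyperedges):
        self.num_nodes = num_nodes
        self.hyperedges = hyperedges
        # Build node -> indices of incident hyperedges.
        self.incident = {v: [] for v in range(num_nodes)}
        for e_idx, edge in enumerate(hyperedges):
            for v in edge:
                self.incident[v].append(e_idx)

    def __len__(self):
        # Epoch length is the node count, independent of hyperedge count.
        return self.num_nodes

    def __getitem__(self, i):
        # All hyperedges incident to node i; different nodes can return
        # overlapping sets.
        return [self.hyperedges[e] for e in self.incident[i]]


def collate(items):
    # Deduplicate hyperedges pulled in by multiple sampled nodes,
    # preserving first-seen order.
    seen, batch = set(), []
    for edges in items:
        for edge in edges:
            key = tuple(edge)
            if key not in seen:
                seen.add(key)
                batch.append(edge)
    return batch
```

Note how the batch size after `collate` depends on how much the sampled nodes' incidence sets overlap, which is exactly the uneven-coverage problem described above.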