Support hyperedge-based sampling in Dataset #66

@tizianocitro

Description

Dataset.__getitem__ currently samples by node: given a node index, it returns all hyperedges incident to that node. This works for node-centric tasks but is suboptimal for HLP, where the prediction unit is a hyperedge.

Problems with node-based sampling for HLP:

  • __len__ returns num_nodes, so the batch count is independent of the number of hyperedges (adding negative samples doesn't change the batch count).
  • A sampled node pulls in all of its incident hyperedges, giving unpredictable and uneven hyperedge coverage per batch.
  • Multiple sampled nodes can pull in the same hyperedge, requiring deduplication in collate.
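To make the deduplication point concrete, here is a minimal sketch of what node-based collate has to do today. The representation (a hyperedge as a `(hyperedge_id, node_ids)` pair) and the function name are illustrative, not the project's actual API:

```python
def collate_node_samples(samples):
    """Merge node-based samples, dropping hyperedges already seen.

    Each sample is the list of hyperedges incident to one sampled node;
    a hyperedge is represented as a (hyperedge_id, node_ids) pair.
    """
    seen = set()
    batch = []
    for hyperedges in samples:
        for he_id, node_ids in hyperedges:
            if he_id in seen:  # same hyperedge pulled in by two sampled nodes
                continue
            seen.add(he_id)
            batch.append((he_id, node_ids))
    return batch

# Nodes 0 and 1 both belong to hyperedge 10, so it appears in both
# samples but must be emitted only once.
sample_a = [(10, (0, 1)), (11, (0, 2))]
sample_b = [(10, (0, 1)), (12, (1, 3))]
print(collate_node_samples([sample_a, sample_b]))
# -> [(10, (0, 1)), (11, (0, 2)), (12, (1, 3))]
```

With hyperedge-based sampling this `seen` bookkeeping disappears, since each hyperedge is yielded exactly once.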

I want to add a configurable sampling strategy to Dataset so that __getitem__ can sample either by node or by hyperedge.

When sampling by hyperedge:

  • __len__ returns num_hyperedges.
  • __getitem__(i) returns the i-th hyperedge (with all its incidences).
  • Each batch contains a fixed number of hyperedges, giving direct control over positive/negative ratio per batch.
  • No deduplication needed in collate since each hyperedge is returned exactly once.

When sampling by node (current behavior):

  • __len__ returns num_nodes.
  • __getitem__(i) returns all hyperedges incident to node i.
  • Deduplication in collate handles overlapping hyperedges.
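The two behaviors above could be sketched as a single dataset with a strategy switch. This assumes the hypergraph is stored as a list of hyperedges (tuples of node ids); the real Dataset's storage, constructor, and return types will differ:

```python
class HypergraphDataset:
    """Sketch of a Dataset that samples by node or by hyperedge."""

    def __init__(self, hyperedges, num_nodes, sample_by="hyperedge"):
        assert sample_by in ("node", "hyperedge")
        self.hyperedges = hyperedges  # list of tuples of node ids
        self.num_nodes = num_nodes
        self.sample_by = sample_by
        # Precompute node -> incident hyperedge indices for node sampling.
        self._incidence = [[] for _ in range(num_nodes)]
        for he_idx, nodes in enumerate(hyperedges):
            for v in nodes:
                self._incidence[v].append(he_idx)

    def __len__(self):
        if self.sample_by == "hyperedge":
            return len(self.hyperedges)  # batch count tracks hyperedges
        return self.num_nodes            # current node-based behavior

    def __getitem__(self, i):
        if self.sample_by == "hyperedge":
            return [self.hyperedges[i]]  # exactly one hyperedge, no dedup needed
        # Node-based: all hyperedges incident to node i (may overlap
        # with those returned for other sampled nodes).
        return [self.hyperedges[j] for j in self._incidence[i]]

# Three hyperedges over four nodes.
ds = HypergraphDataset([(0, 1), (0, 2), (1, 3)], num_nodes=4)
print(len(ds))       # -> 3
print(ds[1])         # -> [(0, 2)]

ds_node = HypergraphDataset([(0, 1), (0, 2), (1, 3)], num_nodes=4,
                            sample_by="node")
print(len(ds_node))  # -> 4
print(ds_node[0])    # -> [(0, 1), (0, 2)]
```

A plain string flag is used here for brevity; an enum or two Dataset subclasses would work equally well, as long as `__len__` and `__getitem__` switch together.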
