Bench Shuffle Pinned Memory Debugging #1010

@quasiben

When running the current bench_shuffle benchmarks on an NVL72, I'm seeing an out-of-memory error when pinned memory is enabled (the default):

APP_ARGS="-C ucxx -w 3 -r 10 -g -s -x -l $((40 * 1024)) -p ${NUM_PARTITIONS} -o 4 -c ${NUM_COLUMNS} -n ${NUM_ROWS} -x -m async"

  -c ucxx (communicator)
  -r 10 (number of runs)
  -w 3 (number of warmup runs)
  -c 4 (number of columns)
  -n 67108864 (number of rows per rank)
  -p 50 (number of input partitions per rank)
  -o 4 (number of output partitions per rank)
  -m async (RMM memory resource)
  -l 40960 (device memory limit in MiB)
  -s (enable output discard to simulate streaming)
  -x (enable memory profiling)
  -g (use pre-partitioned input tables)
Local size: 50 GiB

terminate called after throwing an instance of 'cuda::__4::cuda_error'
  what():  /opt/conda/envs/rapidsmpf/include/rapids/cuda/__driver/driver_api.h:152 out of memory(2): Failed to allocate memory from a memory pool
[presto-gb200-gcn-07:1222606] *** Process received signal ***
[presto-gb200-gcn-07:1222606] Signal: Aborted (6)

When I disable pinned memory with -L, the job completes without issue. One interesting note: this does not seem to reproduce on a cluster made up of DGX H100s.
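
For an A/B comparison, the two argument sets could be built as in the sketch below. The partition, column, and row counts are taken from the argument dump above; `DEVICE_LIMIT_MIB` and `APP_ARGS_NO_PINNED` are hypothetical variable names used only for illustration, and the launcher command is left out because it is not shown in this report.

```shell
#!/usr/bin/env bash
# Values copied from the argument dump printed by the failing run.
NUM_PARTITIONS=50
NUM_COLUMNS=4
NUM_ROWS=67108864
DEVICE_LIMIT_MIB=$((40 * 1024))   # 40960 MiB, matches "-l 40960"

# Failing configuration: pinned memory is enabled by default.
APP_ARGS="-C ucxx -w 3 -r 10 -g -s -x -l ${DEVICE_LIMIT_MIB} -p ${NUM_PARTITIONS} -o 4 -c ${NUM_COLUMNS} -n ${NUM_ROWS} -x -m async"

# Passing configuration: identical arguments plus -L (disable pinned memory).
APP_ARGS_NO_PINNED="${APP_ARGS} -L"

echo "pinned (default): ${APP_ARGS}"
echo "pinned disabled:  ${APP_ARGS_NO_PINNED}"
```

Keeping every other flag identical means any difference between the two runs can be attributed to the pinned memory setting alone.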
