Checkpointing Benchmark: Invalid Submission Due to Incorrect Operation Count #365

@humengyu2012

Overview

When running the MLPStorage checkpointing benchmark with separate write and read phases, the report generator incorrectly counts the total number of checkpoint operations, resulting in an INVALID submission.

Test Script

export TEST_TIME=05082006

# Phase 1: Write 10 checkpoints (read disabled)
~/storage/mlpstorage checkpointing run --model llama3-8b \
  --closed \
  --hosts localhost \
  --num-processes 8 \
  --num-checkpoints-read 0 \
  --checkpoint-folder /mnt/alluxio/ckpts/8acc-${TEST_TIME} \
  --results-dir ~/mlps_results \
  --client-host-memory-in-gb 768 --file

# Clear page cache between phases
sudo sync
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"

sleep 10

# Phase 2: Read 10 checkpoints (write disabled)
~/storage/mlpstorage checkpointing run --model llama3-8b \
  --closed \
  --hosts localhost \
  --num-processes 8 \
  --num-checkpoints-write 0 \
  --checkpoint-folder /mnt/alluxio/ckpts/8acc-${TEST_TIME} \
  --results-dir ~/mlps_results \
  --client-host-memory-in-gb 768 --file

Intent

  • Run 10 checkpoint writes in the first phase (with --num-checkpoints-read 0 to skip reads).
  • Clear the OS page cache to ensure cold-cache read performance.
  • Run 10 checkpoint reads in the second phase (with --num-checkpoints-write 0 to skip writes).

Results Directory Structure

The two runs produce two separate result directories:

.
└── checkpointing
    └── llama3-8b
        ├── 20260508_103731    # Phase 1 (write)
        │   ├── {0..7}_output.json
        │   ├── {0..7}_per_epoch_stats.json
        │   ├── checkpointing_20260508_103731_metadata.json
        │   ├── checkpointing_20260508_103731_timeseries.json
        │   ├── checkpointing_run.stderr.log
        │   ├── checkpointing_run.stdout.log
        │   ├── dlio.log
        │   ├── dlio_config/
        │   └── summary.json
        └── 20260508_104334    # Phase 2 (read)
            ├── {0..7}_output.json
            ├── {0..7}_per_epoch_stats.json
            ├── checkpointing_20260508_104334_metadata.json
            ├── checkpointing_20260508_104334_timeseries.json
            ├── checkpointing_run.stderr.log
            ├── checkpointing_run.stdout.log
            ├── dlio.log
            ├── dlio_config/
            └── summary.json

Report Generation & Error

Command

~/storage/mlpstorage reports reportgen --file --results-dir ./mlps_results_test

Output

2026-05-08 12:56:45|INFO: Directory validation passed: found 2 runs in 1 benchmark types
2026-05-08 12:56:45|INFO: Created benchmark run: checkpointing_run_llama3-8b_20260508_103731
2026-05-08 12:56:45|INFO: Created benchmark run: checkpointing_run_llama3-8b_20260508_104334
2026-05-08 12:56:45|INFO: Accumulating results from 2 runs
...
2026-05-08 12:56:45|ERROR: INVALID: [INVALID] Expected 10 total read operations, but found 20
2026-05-08 12:56:45|ERROR: INVALID: [INVALID] Expected 10 total write operations, but found 20

Validation Report (Excerpt)

[INVALID] Checkpointing - llama3-8b
    Issues:
      [INVALID] checkpoint.num_checkpoints_read: Expected 10 total read operations, but found 20
      [INVALID] checkpoint.num_checkpoints_write: Expected 10 total write operations, but found 20

Checkpointing Benchmark CLOSED Requirements
===========================================
Requirements:
  [ ] 10 checkpoint write operations total
  [ ] 10 checkpoint read operations total
  [ ] Valid LLM model (llama3-8b, llama3-70b, llama3-405b)
  [ ] Only allowed parameter overrides used

Problem Analysis

The benchmark expects a combined total of 10 writes and 10 reads across all runs. However, it appears that setting --num-checkpoints-read 0 or --num-checkpoints-write 0 does not actually result in 0 operations being recorded for that phase. Instead, each run still reports the default of 10 for both read and write operations.

Since there are 2 runs, the report generator sums the per-run counts:

Operation   Run 1 (Write Phase)   Run 2 (Read Phase)   Total   Expected
Writes      10                    10                   20      10
Reads       10                    10                   20      10

Root Cause: The --num-checkpoints-read 0 and --num-checkpoints-write 0 flags appear to be ignored, or at least not reflected in the recorded metadata. Each run still reports 10 read and 10 write operations, so the aggregated totals double.
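
The doubling can be reproduced with plain arithmetic. The following is a minimal sketch, not the actual report generator code: sum_counts and check_total are hypothetical helpers, and the per-run counts are the ones listed in the table above.

```shell
# Hypothetical helpers for illustration only; they are not part of mlpstorage.

# Sum a list of per-run operation counts.
sum_counts() (
  total=0
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
)

# Compare the summed per-run counts against the expected total.
# Usage: check_total <label> <expected> <count-run-1> [<count-run-2> ...]
check_total() (
  label=$1
  expected=$2
  shift 2
  found=$(sum_counts "$@")
  if [ "$found" -eq "$expected" ]; then
    echo "[OK] Expected $expected total $label operations, found $found"
  else
    echo "[INVALID] Expected $expected total $label operations, but found $found"
  fi
)

# Observed: both runs record 10 writes and 10 reads.
check_total write 10 10 10   # [INVALID] Expected 10 total write operations, but found 20
check_total read  10 10 10   # [INVALID] Expected 10 total read operations, but found 20

# If the zero flags were honored, each disabled phase would contribute 0.
check_total write 10 10 0    # [OK] Expected 10 total write operations, found 10
check_total read  10 0 10    # [OK] Expected 10 total read operations, found 10
```

The first two calls reproduce the messages in the reportgen output; the last two show the totals that would pass validation.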

Expected Behavior

When --num-checkpoints-read 0 is set, the run should record 0 read operations; likewise, --num-checkpoints-write 0 should record 0 write operations. The totals across both runs would then be:

Operation   Run 1 (Write Phase)   Run 2 (Read Phase)   Total   Expected
Writes      10                    0                    10      10 ✅
Reads       0                     10                   10      10 ✅
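
One way to check what each run actually recorded is to search its metadata file for the checkpoint count fields. This is a sketch under assumptions: show_recorded_counts is a hypothetical helper, and it assumes the metadata JSON contains entries whose names include num_checkpoints_read and num_checkpoints_write (the names shown in the validation report). That has not been verified against the real file format.

```shell
# Hypothetical diagnostic helper; not part of mlpstorage.
# Prints every line mentioning num_checkpoints_read/write in each run's
# metadata file under <results-dir>/checkpointing/llama3-8b/<timestamp>/.
show_recorded_counts() (
  results_dir=${1:-"$HOME/mlps_results"}
  for meta in "$results_dir"/checkpointing/llama3-8b/*/checkpointing_*_metadata.json; do
    # Skip the unexpanded glob pattern when no metadata files exist.
    [ -f "$meta" ] || continue
    echo "== $meta"
    # Assumption: the counts are stored under keys containing these names.
    grep -n -E 'num_checkpoints_(read|write)' "$meta" \
      || echo "(no num_checkpoints_* entries found)"
  done
)
```

Calling show_recorded_counts ~/mlps_results after both phases would show whether the write-only run recorded 0 or 10 for reads, and the read-only run 0 or 10 for writes.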
