Overview
When running the MLPStorage checkpointing benchmark with separate write and read phases, the report generator incorrectly counts the total number of checkpoint operations, resulting in an INVALID submission.
Test Script
export TEST_TIME=05082006
# Phase 1: Write 10 checkpoints (read disabled)
~/storage/mlpstorage checkpointing run --model llama3-8b \
--closed \
--hosts localhost \
--num-processes 8 \
--num-checkpoints-read 0 \
--checkpoint-folder /mnt/alluxio/ckpts/8acc-${TEST_TIME} \
--results-dir ~/mlps_results \
--client-host-memory-in-gb 768 --file
# Clear page cache between phases
sudo sync
sudo bash -c "echo 3 > /proc/sys/vm/drop_caches"
sleep 10
# Phase 2: Read 10 checkpoints (write disabled)
~/storage/mlpstorage checkpointing run --model llama3-8b \
--closed \
--hosts localhost \
--num-processes 8 \
--num-checkpoints-write 0 \
--checkpoint-folder /mnt/alluxio/ckpts/8acc-${TEST_TIME} \
--results-dir ~/mlps_results \
--client-host-memory-in-gb 768 --file
Intent
- Run 10 checkpoint writes in the first phase (with --num-checkpoints-read 0 to skip reads).
- Clear the OS page cache to ensure cold-cache read performance.
- Run 10 checkpoint reads in the second phase (with --num-checkpoints-write 0 to skip writes).
Results Directory Structure
The two runs produce two separate result directories:
.
└── checkpointing
    └── llama3-8b
        ├── 20260508_103731    # Phase 1 (write)
        │   ├── {0..7}_output.json
        │   ├── {0..7}_per_epoch_stats.json
        │   ├── checkpointing_20260508_103731_metadata.json
        │   ├── checkpointing_20260508_103731_timeseries.json
        │   ├── checkpointing_run.stderr.log
        │   ├── checkpointing_run.stdout.log
        │   ├── dlio.log
        │   ├── dlio_config/
        │   └── summary.json
        └── 20260508_104334    # Phase 2 (read)
            ├── {0..7}_output.json
            ├── {0..7}_per_epoch_stats.json
            ├── checkpointing_20260508_104334_metadata.json
            ├── checkpointing_20260508_104334_timeseries.json
            ├── checkpointing_run.stderr.log
            ├── checkpointing_run.stdout.log
            ├── dlio.log
            ├── dlio_config/
            └── summary.json
Report Generation & Error
Command
~/storage/mlpstorage reports reportgen --file --results-dir ./mlps_results_test
Output
2026-05-08 12:56:45|INFO: Directory validation passed: found 2 runs in 1 benchmark types
2026-05-08 12:56:45|INFO: Created benchmark run: checkpointing_run_llama3-8b_20260508_103731
2026-05-08 12:56:45|INFO: Created benchmark run: checkpointing_run_llama3-8b_20260508_104334
2026-05-08 12:56:45|INFO: Accumulating results from 2 runs
...
2026-05-08 12:56:45|ERROR: INVALID: [INVALID] Expected 10 total read operations, but found 20
2026-05-08 12:56:45|ERROR: INVALID: [INVALID] Expected 10 total write operations, but found 20
Validation Report (Excerpt)
[INVALID] Checkpointing - llama3-8b
Issues:
[INVALID] checkpoint.num_checkpoints_read: Expected 10 total read operations, but found 20
[INVALID] checkpoint.num_checkpoints_write: Expected 10 total write operations, but found 20
Checkpointing Benchmark CLOSED Requirements
===========================================
Requirements:
[ ] 10 checkpoint write operations total
[ ] 10 checkpoint read operations total
[ ] Valid LLM model (llama3-8b, llama3-70b, llama3-405b)
[ ] Only allowed parameter overrides used
Problem Analysis
The benchmark expects a combined total of 10 writes and 10 reads across all runs. However, it appears that setting --num-checkpoints-read 0 or --num-checkpoints-write 0 does not actually result in 0 operations being recorded for that phase. Instead, each run still reports the default of 10 for both read and write operations.
Since there are 2 runs, the report generator sums them up:
| Operation | Run 1 (Write Phase) | Run 2 (Read Phase) | Total | Expected |
|-----------|---------------------|--------------------|-------|----------|
| Writes    | 10                  | 10                 | 20    | 10       |
| Reads     | 10                  | 10                 | 20    | 10       |
Root Cause: The --num-checkpoints-read 0 and --num-checkpoints-write 0 flags appear to be ignored or not properly reflected in the recorded metadata. Each run still reports 10 read and 10 write operations regardless of these flags, causing the aggregated totals to double.
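The doubling can be reproduced with a few lines of arithmetic. The sketch below is not the MLPStorage report generator. It is a minimal stand-in that assumes the validator simply sums a per-run count and compares it with the required total of 10. The function names and the dictionary layout are hypothetical. Only the key names (taken from the validation report) and the message text (taken from the log above) come from the actual output.

```python
from typing import Dict, List

REQUIRED_TOTAL = 10  # CLOSED requirement: 10 writes and 10 reads in total

def total_ops(runs: List[Dict[str, int]], key: str) -> int:
    """Sum one per-run counter across all runs (assumed aggregation rule)."""
    return sum(run[key] for run in runs)

def validate(runs: List[Dict[str, int]]) -> List[str]:
    """Return one issue string per counter whose total differs from the requirement."""
    issues = []
    for key, label in (("num_checkpoints_read", "read"),
                       ("num_checkpoints_write", "write")):
        found = total_ops(runs, key)
        if found != REQUIRED_TOTAL:
            issues.append(
                f"Expected {REQUIRED_TOTAL} total {label} operations, but found {found}"
            )
    return issues

# What the two runs appear to record today: the zero override is not reflected.
observed_runs = [
    {"num_checkpoints_write": 10, "num_checkpoints_read": 10},  # Phase 1 (write)
    {"num_checkpoints_write": 10, "num_checkpoints_read": 10},  # Phase 2 (read)
]

for issue in validate(observed_runs):
    print(issue)
```

With two runs that each report 10 and 10, this stand-in produces the same two messages as the reportgen log.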
Expected Behavior
When --num-checkpoints-read 0 is set, the run should record 0 read operations (and vice versa for writes). The correct totals should be:
| Operation | Run 1 (Write Phase) | Run 2 (Read Phase) | Total | Expected |
|-----------|---------------------|--------------------|-------|----------|
| Writes    | 10                  | 0                  | 10    | 10 ✅    |
| Reads     | 0                   | 10                 | 10    | 10 ✅    |
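One common way an explicit 0 gets lost is a truthiness-based default, where 0 is treated the same as "not set". Whether this is what happens in MLPStorage is not confirmed. The sketch below is purely illustrative, and every name in it (`DEFAULT_NUM_CHECKPOINTS`, `resolve_with_or`, `resolve_with_none_check`) is hypothetical.

```python
from typing import Optional

DEFAULT_NUM_CHECKPOINTS = 10  # assumed default, matching the CLOSED requirement

def resolve_with_or(cli_value: Optional[int]) -> int:
    """Pitfall: 0 is falsy, so an explicit 0 silently falls back to the default."""
    return cli_value or DEFAULT_NUM_CHECKPOINTS

def resolve_with_none_check(cli_value: Optional[int]) -> int:
    """Only fall back to the default when the flag was not supplied at all."""
    return DEFAULT_NUM_CHECKPOINTS if cli_value is None else cli_value

print(resolve_with_or(0))            # explicit 0 is replaced by the default
print(resolve_with_none_check(0))    # explicit 0 is preserved
print(resolve_with_none_check(None)) # unset flag still gets the default
```

If the recorded value were resolved with a `None` check of this kind, the write phase would record 0 reads and the read phase 0 writes, giving the totals in the table above.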