Conversation
- Add dependency_check.py module with fail-fast validation for MPI and DLIO
- Fix RunRulesChecker initialization order bug (set benchmark_run before super().__init__; see the sketch below)
- Rename reporting.py to report_generator.py to avoid a module/package naming conflict
- Add unit tests for dependency validation (16 tests)
- Add import validation tests to catch module naming issues (14 tests)
- Add rules checker initialization tests (4 tests)
- Add integration tests for the verification flow and dependency validation (7 tests)
- Update CLAUDE.md with the test environment directory
- Fix test patch paths for renamed modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
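For context on the initialization-order fix: if a base constructor calls a hook that the subclass overrides, and that hook reads an instance attribute, the attribute must be assigned before `super().__init__()` runs. A minimal sketch — the base-class shape here is an assumption for illustration; only the names RunRulesChecker and benchmark_run come from the commit message:

```python
class RulesChecker:
    def __init__(self):
        # The base constructor calls a hook that subclasses override.
        self.checks = self._collect_checks()

    def _collect_checks(self):
        return []


class RunRulesChecker(RulesChecker):
    def __init__(self, benchmark_run):
        # Set benchmark_run BEFORE super().__init__(), because the base
        # constructor invokes _collect_checks(), which reads it.
        self.benchmark_run = benchmark_run
        super().__init__()

    def _collect_checks(self):
        # Would raise AttributeError if benchmark_run were not yet set.
        return [f"check:{self.benchmark_run}"]
```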
Make the RunID dataclass hashable by adding frozen=True, allowing instances to be used as dictionary keys in report_generator.py. Also fix the type annotation for run_results from Dict[str, Result] to Dict[RunID, Result].

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
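For reference, a frozen dataclass gets an auto-generated `__hash__` (frozen=True with the default eq=True), which is what makes the Dict[RunID, Result] annotation work. A minimal sketch — RunID's fields are assumptions here:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class RunID:
    benchmark: str
    run_number: int


@dataclass
class Result:
    status: str


# With frozen=True, RunID instances can key a dict:
run_results: Dict[RunID, Result] = {
    RunID("training", 1): Result("PASSED"),
}
```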
Update the test environment section with the new data and results directories at /databases/mlps-v3.0/. Add example commands for running datagen and training benchmarks with the unet3d model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
7 codebase documents generated:
- STACK.md: Python 3.10+, DLIO benchmark, MPI integration
- INTEGRATIONS.md: DLIO, MPI, hydra-core, PyArrow
- ARCHITECTURE.md: Benchmark system with registry pattern
- STRUCTURE.md: mlpstorage package layout
- CONVENTIONS.md: Code style and patterns
- TESTING.md: pytest-based test structure
- CONCERNS.md: Technical debt and improvement areas

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MLPerf Storage Benchmark Suite v3.0: an orchestration framework for MLCommons storage benchmarks with KV cache and VectorDB integration and improved dependency management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mode: yolo
Depth: comprehensive
Parallelization: enabled
Workflow agents: research=on, plan_check=on, verifier=on
Model profile: quality

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
21 requirements across 5 categories:
- Package Management (3): lockfile, remove GPU deps, version validation
- Benchmark Integration (5): KV cache, VectorDB classes and CLI
- Training Updates (4): dlrm, retinanet, flux, parquet support
- Host Collection (5): SSH collection, /proc/ data, time-series
- Error Handling & UX (4): detection, suggestions, validation, progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phases:
1. Package Management Foundation: PKG-01, PKG-02, PKG-03
2. Environment Validation and Fail-Fast: UX-01, UX-02, UX-03
3. KV Cache Benchmark Integration: BENCH-01, BENCH-02
4. VectorDB Benchmark Integration: BENCH-03, BENCH-04
5. Benchmark Validation Pipeline Integration: BENCH-05
6. SSH-Based Host Collection: HOST-01, HOST-02, HOST-03
7. Time-Series Host Data Collection: HOST-04, HOST-05
8. New Training Models: TRAIN-01, TRAIN-02, TRAIN-03
9. DLIO Parquet Support: TRAIN-04
10. Progress Indication: UX-04

All 21 v1 requirements mapped to phases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1: Package Management Foundation
- Standard stack identified (uv, importlib.metadata, packaging)
- Architecture patterns documented for lockfile CLI integration
- GPU dependency exclusion strategy via CPU-only indexes
- Runtime version validation patterns catalogued
Phase 01: Package Management Foundation - 5 plans in 3 waves
- Wave 1: 01-01 (models), 01-02 (pyproject) - parallel
- Wave 2: 01-03 (generator), 01-04 (validator) - parallel
- Wave 3: 01-05 (CLI integration) - depends on 01-03 and 01-04
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Task 4 to implement the --verify-lockfile flag on benchmark commands. This addresses the plan checker finding that Success Criterion 4 (benchmark fails on version mismatch) was not covered.

Changes:
- Add Task 4: --verify-lockfile flag implementation
- Add mlpstorage/cli/common_args.py to files_modified
- Update verification checklist with benchmark validation tests
- Add success criteria 6 and 7 for benchmark integration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add [[tool.uv.index]] section directing PyTorch to CPU wheels
- Configure torch, torchvision, torchaudio to use the pytorch-cpu index
- Update the `full` extra's comment explaining CPU-only PyTorch usage
- Prevents accidental CUDA/GPU dependencies for a storage benchmark
- Add packaging>=21.0 to core dependencies
- Enables PEP 440 version comparison in the lockfile validator
- Lightweight pure-Python dependency with no transitive deps
- Add mlpstorage/lockfile/ module with __init__.py and models.py
- Define LockedPackage dataclass for package entries
- Define ValidationResult dataclass for validation outcomes
- Define LockfileMetadata dataclass for lockfile info
- Add parse_lockfile() function for parsing requirements.txt-format lockfiles (see the sketch below)
- Support package==version, hashes, markers, VCS/URL deps
- Follow existing codebase patterns (dataclasses, type hints, docstrings)
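A rough illustration of the parsing described here, assuming a requirements.txt-style lockfile of `name==version ; marker` lines with `--hash=...` continuations. The field names and exact line grammar are assumptions, not the module's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LockedPackage:
    name: str
    version: str
    marker: Optional[str] = None
    hashes: List[str] = field(default_factory=list)


def parse_lockfile(text: str) -> List[LockedPackage]:
    """Parse pinned entries; skip comments, hashes, and VCS/URL deps."""
    packages: List[LockedPackage] = []
    for raw in text.splitlines():
        line = raw.strip().rstrip("\\")
        if not line or line.startswith("#"):
            continue
        if line.startswith("--hash="):
            if packages:  # hash continuation belongs to the previous entry
                packages[-1].hashes.append(line.split("=", 1)[1])
            continue
        if line.startswith(("git+", "hg+", "svn+", "-e ", "http")):
            continue  # VCS/URL deps carry no comparable pinned version
        spec, _, marker = line.partition(";")
        if "==" not in spec:
            continue
        name, _, version = spec.partition("==")
        packages.append(LockedPackage(name.strip(), version.strip(),
                                      marker.strip() or None))
    return packages
```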
Tasks completed: 2/2
- Add uv CPU-only index configuration
- Add packaging dependency for version parsing

SUMMARY: .planning/phases/01-package-management-foundation/01-02-SUMMARY.md
Tasks completed: 1/1
- Create lockfile module structure with data models

SUMMARY: .planning/phases/01-package-management-foundation/01-01-SUMMARY.md
- Add LockfileGenerationError exception class with stderr/return_code tracking
- Add GenerationOptions dataclass for configurable generation parameters
- Implement check_uv_available() to verify the uv installation
- Implement generate_lockfile() wrapping `uv pip compile` (see the sketch below)
- Implement generate_lockfiles_for_project() for base and full lockfiles
- Support extras, hashes, universal, python-version, exclude-newer options
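A hedged sketch of a generator that shells out to `uv pip compile`, as the commit describes. The `-o` and `--generate-hashes` flags are real uv options; the function and exception shapes here are assumptions:

```python
import shutil
import subprocess


class LockfileGenerationError(Exception):
    def __init__(self, message: str, stderr: str = "", return_code: int = 1):
        super().__init__(message)
        self.stderr = stderr
        self.return_code = return_code


def check_uv_available() -> bool:
    """True if the uv binary is on PATH."""
    return shutil.which("uv") is not None


def generate_lockfile(pyproject: str, output: str,
                      generate_hashes: bool = False) -> str:
    if not check_uv_available():
        raise LockfileGenerationError("uv not found on PATH")
    cmd = ["uv", "pip", "compile", pyproject, "-o", output]
    if generate_hashes:
        cmd.append("--generate-hashes")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Preserve uv's stderr so the caller can surface an actionable error
        raise LockfileGenerationError("uv pip compile failed",
                                      stderr=result.stderr,
                                      return_code=result.returncode)
    return output
```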
- Add validate_package() for single-package version checking (see the sketch below)
- Add validate_lockfile() for full lockfile validation
- Add format_validation_report() for human-readable output
- Use importlib.metadata for runtime version checking
- Support VCS dependency skipping (git+, hg+, svn+)
- Skip mpi4py by default (system MPI dependency)
- Return LockfileValidationResult with detailed metrics
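A sketch of single-package validation using the stdlib/PEP 440 combination the commit names. The return shape is an assumed simplification of the module's LockfileValidationResult:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

from packaging.version import Version


def validate_package(name: str, locked_version: str) -> Optional[str]:
    """Return None if the installed version matches, else an error string."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed (lockfile pins {locked_version})"
    # PEP 440 comparison, so "1.0" compares equal to "1.0.0"
    if Version(installed) != Version(locked_version):
        return f"{name}: installed {installed}, lockfile pins {locked_version}"
    return None


# Example: print(validate_package("packaging", "21.0"))
```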
- Add generator imports to __init__.py
- Export generate_lockfile, generate_lockfiles_for_project
- Export check_uv_available, LockfileGenerationError, GenerationOptions
- Update module docstring to document generation features
Tasks completed: 2/2
- Task 1: Create lockfile generator module
- Task 2: Update lockfile module exports

SUMMARY: .planning/phases/01-package-management-foundation/01-03-SUMMARY.md
Tasks completed: 2/2
- Create lockfile validator module
- Update lockfile module exports

SUMMARY: .planning/phases/01-package-management-foundation/01-04-SUMMARY.md
- Create lockfile_args.py with add_lockfile_arguments function
- Supports a 'generate' subcommand with --output, --extra, --hashes, --python-version, --pyproject, --all
- Supports a 'verify' subcommand with --lockfile, --skip, --allow-missing, --strict
- Follows existing CLI patterns from utility_args.py
- Add lockfile_arguments export to cli/__init__.py
- Add lockfile_parsers subparser in cli_parser.py
- Register the add_lockfile_arguments builder
- Lockfile subcommand now accessible via 'mlpstorage lockfile'
- Import lockfile module functions in main.py
- Add handle_lockfile_command function to process generate/verify commands
- Integrate lockfile handler into _main_impl flow
- Generate command creates single or all lockfiles
- Verify command validates packages and shows a report
- Clear error messages with actionable suggestions
- Add --verify-lockfile argument to add_universal_arguments
- Validate the lockfile before benchmark execution in run_benchmark (see the sketch below)
- Fail with a clear error message on version mismatch
- Show helpful suggestions (pip install -r, uv pip sync)
- Display the number of verified packages on success
- Available on all benchmark commands (training, checkpointing, vectordb, kvcache)
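A hypothetical sketch of the pre-run gate. add_universal_arguments and run_benchmark are named in the commit; their bodies here, the lockfile path, and the validate_lockfile stub are illustrative assumptions:

```python
import argparse
import sys


def validate_lockfile(path: str) -> list:
    """Stub standing in for the real validator (plan 01-04)."""
    return []


def add_universal_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--verify-lockfile", action="store_true",
        help="Validate installed package versions against the lockfile "
             "before running the benchmark")


def run_benchmark(args: argparse.Namespace) -> None:
    if args.verify_lockfile:
        errors = validate_lockfile("requirements.lock")  # assumed path
        if errors:
            for err in errors:
                print(f"ERROR: {err}", file=sys.stderr)
            print("Hint: pip install -r requirements.lock, or: uv pip sync",
                  file=sys.stderr)
            sys.exit(1)
    # ... proceed with benchmark execution ...
```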
Tasks completed: 4/4
- Task 1: Create lockfile CLI argument builder
- Task 2: Update CLI module exports and parser
- Task 3: Add lockfile command handler to main
- Task 4: Add --verify-lockfile flag to benchmark commands

SUMMARY: .planning/phases/01-package-management-foundation/01-05-SUMMARY.md

Phase 1 Package Management Foundation: COMPLETE
- PKG-01: Lockfile generation ✓
- PKG-02: CPU-only PyTorch config ✓
- PKG-03: Runtime version validation ✓
- CLI integration ✓
Phase 1: Package Management Foundation is now complete with all requirements delivered (PKG-01, PKG-02, PKG-03).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 2: Environment Validation and Fail-Fast
- Standard stack: platform, shutil.which, distro, importlib.metadata
- Architecture patterns for OS detection and a fail-fast validation chain
- OS-specific installation instructions mapping
- Integration points with existing dependency_check.py and validation_helpers.py
- Common pitfalls: SSH keys, PATH issues, MPI version mismatch
- Code examples extending existing codebase infrastructure

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 02: Environment Validation and Fail-Fast - 5 plans in 4 waves
- Wave 1: OS detection module (parallel)
- Wave 2: Dependency checking + SSH validation (parallel)
- Wave 3: Comprehensive validator
- Wave 4: Integration into execution path
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address 4 blocker issues from plan verification:

1. Plan 02-02 Task 1: Explicitly specify that check functions call detect_os() to get OSInfo, then get_install_instruction() with the dependency name and OSInfo to get an OS-specific install command
2. Plan 02-02 Task 2: Add verification that templates work with the format_error() function, not just that they exist
3. Plan 02-03 Task 1: Add an SSH binary existence check using shutil.which('ssh') BEFORE running connectivity tests, with an OS-specific install hint if missing (satisfies UX-03; see the sketch below)
4. Plan 02-05 Task 1: Clarify that validate_benchmark_environment REPLACES validate_pre_run as the main entry point; validate_pre_run is deprecated but remains for backward compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
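A sketch of the fail-fast SSH check described in item 3. detect_os(), OSInfo, and get_install_instruction() are named in the plan; their shapes here are simplified assumptions (a string stands in for OSInfo):

```python
import platform
import shutil


def detect_os() -> str:
    """Coarse OS identifier; the real OSInfo is richer than a string."""
    return platform.system().lower()  # e.g. "linux", "darwin"


def get_install_instruction(dependency: str, os_id: str) -> str:
    hints = {
        "linux": f"sudo apt install {dependency}  # or your distro's equivalent",
        "darwin": f"brew install {dependency}",
    }
    return hints.get(os_id, f"install {dependency} via your package manager")


def check_ssh_available() -> None:
    """Fail fast, before any connectivity test, if ssh is not on PATH."""
    if shutil.which("ssh") is None:
        hint = get_install_instruction("openssh-client", detect_os())
        raise RuntimeError(f"ssh binary not found on PATH. Try: {hint}")
```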
Changes made:
- 09-01: Clarify all DLIO work happens in the local dlio_parquet_fork/
- 09-01: Add user_setup for GitHub fork push (not autonomous)
- 09-01: Fix files_modified to use actual paths (not clone instructions)
- 09-01: Clarify the pyproject.toml update happens AFTER the user pushes the fork
- 09-02: Add Task 3 for end-to-end verification (TRAIN-04 requirement)
- 09-02: Add truth about parquet datagen creating actual files
- Both plans now clearly separate autonomous work from user actions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 3/3
- Clone DLIO and add PARQUET to the FormatType enum
- Create ParquetReader and ParquetGenerator classes
- Register the parquet format in factories and commit changes

SUMMARY: .planning/phases/09-dlio-parquet-support/09-01-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dlrm_parquet_h100.yaml with format: parquet, computation_time: 0.005
- Add dlrm_parquet_a100.yaml with format: parquet, computation_time: 0.007
- Add dlrm_parquet_datagen.yaml for parquet data generation
- Use a dlrm_parquet/ data folder to distinguish from the npz-based dlrm/

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TestTrainingParquetFormat test class with 5 tests
- Verify dataset.format is in OPEN_ALLOWED_PARAMS
- Verify a parquet format override results in the OPEN category
- Verify the DLRM model is recognized with parquet format
- Confirm validation rules handle parquet correctly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 3/3
- Created DLRM parquet config files (h100, a100, datagen)
- Added 5 parquet validation tests
- End-to-end verification: parquet datagen creates valid files

SUMMARY: .planning/phases/09-dlio-parquet-support/09-02-SUMMARY.md

TRAIN-04 requirement satisfied

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 10: Progress Indication
- Standard stack identified: Rich library (already in the dependency tree)
- Architecture patterns documented for TTY detection and fallbacks
- Integration points mapped across the codebase
- Pitfalls catalogued for terminal-handling edge cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 10: Progress Indication - 3 plans in 2 waves
- 1 parallel (10-01), 2 dependent (10-02, 10-03)
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Plan 10-02:
- Split Task 1 into Task 1 (stage indicators) and Task 2 (spinners)
- Added truth: "User sees elapsed time during 'Running benchmark...' stage"
- Clarified that stage progress shows TimeElapsedColumn during the benchmark

Plan 10-03:
- Added explicit error-handling preservation notes in Task 1
- Added truth: "Error handling is preserved when operations are wrapped in progress"
- Added a verification test for error handling (lockfile not found)
- Updated the human verification checkpoint to test error propagation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add rich>=13.0 to pyproject.toml dependencies with a comment
- Create mlpstorage/progress.py with TTY-aware progress utilities (see the sketch below)
- Implement is_interactive_terminal() for TTY detection
- Implement progress_context() for determinate/indeterminate progress
- Implement create_stage_progress() for multi-stage operations
- Non-interactive mode logs via a logger.status() fallback
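A minimal sketch of TTY-aware progress utilities built on Rich, matching the function names above; the actual module's signatures may differ, and the print() stands in for logger.status():

```python
import sys
from contextlib import contextmanager

from rich.progress import (Progress, SpinnerColumn, TextColumn,
                           TimeElapsedColumn)


def is_interactive_terminal() -> bool:
    """Only show live progress when stdout is a real TTY."""
    return sys.stdout.isatty()


@contextmanager
def progress_context(description: str, total=None):
    """Yield a callable that advances progress; no-op when non-interactive.

    total=None gives an indeterminate spinner; a number gives a bar.
    """
    if not is_interactive_terminal():
        print(f"STATUS: {description}")  # stand-in for logger.status()
        yield lambda advance=1: None
        return
    with Progress(SpinnerColumn(), TextColumn("{task.description}"),
                  TimeElapsedColumn()) as progress:
        task = progress.add_task(description, total=total)
        yield lambda advance=1: progress.advance(task, advance)


# Usage:
# with progress_context("Validating environment...") as tick:
#     do_validation(); tick()
```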
- Add 20 tests covering TTY detection and interactive/non-interactive modes
- Test progress_context with determinate and indeterminate progress
- Test create_stage_progress for multi-stage operations
- Test cleanup on exceptions for both context managers
- Test no-op functions in non-interactive mode
- Test empty-stages handling
Tasks completed: 2/2
- Add Rich dependency and create progress module
- Unit tests for progress module

SUMMARY: .planning/phases/10-progress-indication/10-01-SUMMARY.md
- Import create_stage_progress from mlpstorage.progress
- Wrap run() operations in a 4-stage progress indicator:
  * Validating environment...
  * Collecting cluster info...
  * Running benchmark...
  * Processing results...
- Stage progress shows elapsed time during benchmark execution
- DLIO output flows through directly (no wrapping)
- Import progress_context from mlpstorage.progress
- Add a spinner to _collect_cluster_start():
  * Shows the host count in the description
  * Updates the description to show the collection method (SSH/MPI)
- Add a spinner to _collect_cluster_end():
  * Shows an indeterminate spinner during collection
  * Updates the description based on the collection method
- Spinners use total=None for indeterminate progress (see the sketch below)
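A sketch of the description-updating spinner pattern described above. The function name, hosts, and method strings are illustrative; only the total=None detail comes from the commit:

```python
import time

from rich.progress import Progress, SpinnerColumn, TextColumn


def collect_cluster_info(hosts, method: str) -> None:
    with Progress(SpinnerColumn(), TextColumn("{task.description}")) as progress:
        # total=None keeps the spinner indeterminate: no bar, no percentage
        task = progress.add_task(
            f"Collecting cluster info from {len(hosts)} host(s)...", total=None)
        # Update the description once the collection method is known
        progress.update(task, description=f"Collecting via {method}...")
        time.sleep(1)  # stands in for the actual SSH/MPI collection


collect_cluster_info(["node1", "node2"], method="SSH")
```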
- Import progress_context from the mlpstorage.progress module
- Add a spinner to environment validation in run_benchmark()
- Add a spinner to lockfile validation in run_benchmark()
- Add a spinner to lockfile generate in handle_lockfile_command()
- Add a spinner to lockfile verify in handle_lockfile_command()
- Fix test_tracks_runtime to patch the time module specifically in the base module (needed because progress indicators log status messages, which call time.time)

All existing error handling is preserved: exceptions propagate after progress context cleanup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TestBenchmarkProgress test class with 8 tests:
  * test_run_shows_stage_progress: Verify 4-stage progress in run()
  * test_run_non_interactive_logs_stages: Verify logger.status fallback
  * test_cluster_collection_shows_spinner: Verify spinner (total=None)
  * test_cluster_collection_updates_description_ssh: Verify SSH description
  * test_cluster_collection_updates_description_mpi: Verify MPI description
  * test_run_progress_cleanup_on_exception: Verify context cleanup
  * test_end_cluster_collection_shows_spinner: Verify end-collection spinner
  * test_cluster_collection_skipped_logs_debug: Verify skip debug log
Tasks completed: 3/3
- Add stage indicators to benchmark run() method
- Add spinners to cluster collection methods
- Unit tests for progress integration

SUMMARY: .planning/phases/10-progress-indication/10-02-SUMMARY.md
Phase 11: Comprehensive Parquet Support
- Standard stack identified: PyArrow + Hydra/OmegaConf
- Architecture patterns documented: config extension, memory-efficient reader, schema-driven generator
- Pitfalls catalogued: memory_map misunderstanding, the non-existence of Table.to_numpy(), flat-config assumption
- PyArrow APIs verified: column filtering, row-group iteration, compression options, Hive partitioning (see the sketch below)
- Integration points mapped: ConfigArguments extension, LoadConfig pattern, reader/generator interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
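A sketch of the memory-efficient read pattern this research points at: iterate row groups instead of loading the whole file, and push column filtering into the read. File and column names are illustrative:

```python
import pyarrow.parquet as pq


def iter_batches(path: str, columns):
    """Yield per-row-group column data without materializing the full table."""
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # Column filtering happens at read time, so unused columns
        # are never decoded
        table = pf.read_row_group(i, columns=columns)
        # The catalogued pitfall: pyarrow.Table has no to_numpy();
        # convert per column instead
        yield {name: table.column(name).to_numpy() for name in columns}


# Usage (assuming a DLRM-style file):
# for batch in iter_batches("train.parquet", ["dense_features", "label"]):
#     process(batch["dense_features"], batch["label"])
```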
Phase 11: Comprehensive Parquet Support - 3 plans in 2 waves
- 2 parallel (wave 1), 1 sequential (wave 2)
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace argonne-lcf/dlio_benchmark@mlperf_storage_v2.0 with wvaske/dlio_benchmark@parquet-support
- Updated in both the dlio and full optional dependency groups

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 1/1
- Update DLIO dependency to the fork URL

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-03-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 2/2
- Add LZ4 and ZSTD to the Compression enum
- Add parquet config fields to ConfigArguments and LoadConfig

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-01-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 2/2
- Rewrite ParquetReader with column filtering and schema validation
- Rewrite ParquetGenerator with schema-driven generation

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-02-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a dataset.parquet section with DLRM-realistic columns (dense_features, sparse_features, label) and snappy compression to all three configs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reduce accelerator/process counts in test/run_tests.sh for smaller systems
- Update results directory paths to /mnt/nvme/results
- Fix file permissions across the repository

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>