Conversation
- Add dependency_check.py module with fail-fast validation for MPI and DLIO
- Fix RunRulesChecker initialization order bug (set benchmark_run before super().__init__; see the sketch below)
- Rename reporting.py to report_generator.py to avoid a module/package naming conflict
- Add unit tests for dependency validation (16 tests)
- Add import validation tests to catch module naming issues (14 tests)
- Add rules checker initialization tests (4 tests)
- Add integration tests for the verification flow and dependency validation (7 tests)
- Update CLAUDE.md with the test environment directory
- Fix test patch paths for renamed modules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
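For context on the initialization-order fix: if a base constructor calls a hook that the subclass overrides, and that hook reads an instance attribute, the attribute must be assigned before `super().__init__()` runs. A minimal sketch — the base-class shape here is an assumption for illustration; only the names RunRulesChecker and benchmark_run come from the commit message:

```python
class RulesChecker:
    def __init__(self):
        # The base constructor calls a hook that subclasses override.
        self.checks = self._collect_checks()

    def _collect_checks(self):
        return []


class RunRulesChecker(RulesChecker):
    def __init__(self, benchmark_run):
        # Set benchmark_run BEFORE super().__init__(), because the base
        # constructor invokes _collect_checks(), which reads it.
        self.benchmark_run = benchmark_run
        super().__init__()

    def _collect_checks(self):
        # Would raise AttributeError if benchmark_run were not yet set.
        return [f"check:{self.benchmark_run}"]
```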
Make the RunID dataclass hashable by adding frozen=True, allowing instances to be used as dictionary keys in report_generator.py. Also fix the type annotation for run_results from Dict[str, Result] to Dict[RunID, Result].

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
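For reference, a frozen dataclass gets an auto-generated `__hash__` (frozen=True with the default eq=True), which is what makes the Dict[RunID, Result] annotation work. A minimal sketch — RunID's fields are assumptions here:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class RunID:
    benchmark: str
    run_number: int


@dataclass
class Result:
    status: str


# With frozen=True, RunID instances can key a dict:
run_results: Dict[RunID, Result] = {
    RunID("training", 1): Result("PASSED"),
}
```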
Update the test environment section with the new data and results directories at /databases/mlps-v3.0/. Add example commands for running datagen and training benchmarks with the unet3d model.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
7 codebase documents generated:
- STACK.md: Python 3.10+, DLIO benchmark, MPI integration
- INTEGRATIONS.md: DLIO, MPI, hydra-core, PyArrow
- ARCHITECTURE.md: Benchmark system with registry pattern
- STRUCTURE.md: mlpstorage package layout
- CONVENTIONS.md: Code style and patterns
- TESTING.md: pytest-based test structure
- CONCERNS.md: Technical debt and improvement areas

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MLPerf Storage Benchmark Suite v3.0: an orchestration framework for MLCommons storage benchmarks with KV cache and VectorDB integration and improved dependency management.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Mode: yolo
Depth: comprehensive
Parallelization: enabled
Workflow agents: research=on, plan_check=on, verifier=on
Model profile: quality

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
21 requirements across 5 categories:
- Package Management (3): lockfile, remove GPU deps, version validation
- Benchmark Integration (5): KV cache, VectorDB classes and CLI
- Training Updates (4): dlrm, retinanet, flux, parquet support
- Host Collection (5): SSH collection, /proc/ data, time-series
- Error Handling & UX (4): detection, suggestions, validation, progress

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phases:
1. Package Management Foundation: PKG-01, PKG-02, PKG-03
2. Environment Validation and Fail-Fast: UX-01, UX-02, UX-03
3. KV Cache Benchmark Integration: BENCH-01, BENCH-02
4. VectorDB Benchmark Integration: BENCH-03, BENCH-04
5. Benchmark Validation Pipeline Integration: BENCH-05
6. SSH-Based Host Collection: HOST-01, HOST-02, HOST-03
7. Time-Series Host Data Collection: HOST-04, HOST-05
8. New Training Models: TRAIN-01, TRAIN-02, TRAIN-03
9. DLIO Parquet Support: TRAIN-04
10. Progress Indication: UX-04

All 21 v1 requirements mapped to phases.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1: Package Management Foundation
- Standard stack identified (uv, importlib.metadata, packaging)
- Architecture patterns documented for lockfile CLI integration
- GPU dependency exclusion strategy via CPU-only indexes
- Runtime version validation patterns catalogued
Phase 01: Package Management Foundation - 5 plans in 3 waves
- Wave 1: 01-01 (models), 01-02 (pyproject) - parallel
- Wave 2: 01-03 (generator), 01-04 (validator) - parallel
- Wave 3: 01-05 (CLI integration) - depends on 01-03 and 01-04
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Task 4 to implement the --verify-lockfile flag on benchmark commands. This addresses the plan checker finding that Success Criterion 4 (benchmark fails on version mismatch) was not covered.

Changes:
- Add Task 4: --verify-lockfile flag implementation
- Add mlpstorage/cli/common_args.py to files_modified
- Update verification checklist with benchmark validation tests
- Add success criteria 6 and 7 for benchmark integration

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add [[tool.uv.index]] section directing PyTorch to CPU wheels
- Configure torch, torchvision, torchaudio to use the pytorch-cpu index
- Update the `full` extra's comment explaining CPU-only PyTorch usage
- Prevents accidental CUDA/GPU dependencies for a storage benchmark
- Add packaging>=21.0 to core dependencies
- Enables PEP 440 version comparison in the lockfile validator
- Lightweight pure-Python dependency with no transitive deps
- Add mlpstorage/lockfile/ module with __init__.py and models.py
- Define LockedPackage dataclass for package entries
- Define ValidationResult dataclass for validation outcomes
- Define LockfileMetadata dataclass for lockfile info
- Add parse_lockfile() function for parsing requirements.txt-format lockfiles (see the sketch below)
- Support package==version, hashes, markers, VCS/URL deps
- Follow existing codebase patterns (dataclasses, type hints, docstrings)
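A rough illustration of the parsing described here, assuming a requirements.txt-style lockfile of `name==version ; marker` lines with `--hash=...` continuations. The field names and exact line grammar are assumptions, not the module's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LockedPackage:
    name: str
    version: str
    marker: Optional[str] = None
    hashes: List[str] = field(default_factory=list)


def parse_lockfile(text: str) -> List[LockedPackage]:
    """Parse pinned entries; skip comments, hashes, and VCS/URL deps."""
    packages: List[LockedPackage] = []
    for raw in text.splitlines():
        line = raw.strip().rstrip("\\")
        if not line or line.startswith("#"):
            continue
        if line.startswith("--hash="):
            if packages:  # hash continuation belongs to the previous entry
                packages[-1].hashes.append(line.split("=", 1)[1])
            continue
        if line.startswith(("git+", "hg+", "svn+", "-e ", "http")):
            continue  # VCS/URL deps carry no comparable pinned version
        spec, _, marker = line.partition(";")
        if "==" not in spec:
            continue
        name, _, version = spec.partition("==")
        packages.append(LockedPackage(name.strip(), version.strip(),
                                      marker.strip() or None))
    return packages
```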
Tasks completed: 2/2
- Add uv CPU-only index configuration
- Add packaging dependency for version parsing

SUMMARY: .planning/phases/01-package-management-foundation/01-02-SUMMARY.md
Tasks completed: 1/1
- Create lockfile module structure with data models

SUMMARY: .planning/phases/01-package-management-foundation/01-01-SUMMARY.md
- Add LockfileGenerationError exception class with stderr/return_code tracking
- Add GenerationOptions dataclass for configurable generation parameters
- Implement check_uv_available() to verify the uv installation
- Implement generate_lockfile() wrapping `uv pip compile` (see the sketch below)
- Implement generate_lockfiles_for_project() for base and full lockfiles
- Support extras, hashes, universal, python-version, exclude-newer options
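A hedged sketch of a generator that shells out to `uv pip compile`, as the commit describes. The `-o` and `--generate-hashes` flags are real uv options; the function and exception shapes here are assumptions:

```python
import shutil
import subprocess


class LockfileGenerationError(Exception):
    def __init__(self, message: str, stderr: str = "", return_code: int = 1):
        super().__init__(message)
        self.stderr = stderr
        self.return_code = return_code


def check_uv_available() -> bool:
    """True if the uv binary is on PATH."""
    return shutil.which("uv") is not None


def generate_lockfile(pyproject: str, output: str,
                      generate_hashes: bool = False) -> str:
    if not check_uv_available():
        raise LockfileGenerationError("uv not found on PATH")
    cmd = ["uv", "pip", "compile", pyproject, "-o", output]
    if generate_hashes:
        cmd.append("--generate-hashes")
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Preserve uv's stderr so the caller can surface an actionable error
        raise LockfileGenerationError("uv pip compile failed",
                                      stderr=result.stderr,
                                      return_code=result.returncode)
    return output
```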
- Add validate_package() for single-package version checking (see the sketch below)
- Add validate_lockfile() for full lockfile validation
- Add format_validation_report() for human-readable output
- Use importlib.metadata for runtime version checking
- Support VCS dependency skipping (git+, hg+, svn+)
- Skip mpi4py by default (system MPI dependency)
- Return LockfileValidationResult with detailed metrics
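A sketch of single-package validation using the stdlib/PEP 440 combination the commit names. The return shape is an assumed simplification of the module's LockfileValidationResult:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

from packaging.version import Version


def validate_package(name: str, locked_version: str) -> Optional[str]:
    """Return None if the installed version matches, else an error string."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return f"{name}: not installed (lockfile pins {locked_version})"
    # PEP 440 comparison, so "1.0" compares equal to "1.0.0"
    if Version(installed) != Version(locked_version):
        return f"{name}: installed {installed}, lockfile pins {locked_version}"
    return None


# Example: print(validate_package("packaging", "21.0"))
```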
- Add generator imports to __init__.py
- Export generate_lockfile, generate_lockfiles_for_project
- Export check_uv_available, LockfileGenerationError, GenerationOptions
- Update module docstring to document generation features
Tasks completed: 2/2
- Task 1: Create lockfile generator module
- Task 2: Update lockfile module exports

SUMMARY: .planning/phases/01-package-management-foundation/01-03-SUMMARY.md
Tasks completed: 2/2
- Create lockfile validator module
- Update lockfile module exports

SUMMARY: .planning/phases/01-package-management-foundation/01-04-SUMMARY.md
- Create lockfile_args.py with add_lockfile_arguments function
- Supports a 'generate' subcommand with --output, --extra, --hashes, --python-version, --pyproject, --all
- Supports a 'verify' subcommand with --lockfile, --skip, --allow-missing, --strict
- Follows existing CLI patterns from utility_args.py
- Add lockfile_arguments export to cli/__init__.py
- Add lockfile_parsers subparser in cli_parser.py
- Register the add_lockfile_arguments builder
- Lockfile subcommand now accessible via 'mlpstorage lockfile'
- Import lockfile module functions in main.py
- Add handle_lockfile_command function to process generate/verify commands
- Integrate lockfile handler into _main_impl flow
- Generate command creates single or all lockfiles
- Verify command validates packages and shows a report
- Clear error messages with actionable suggestions
- Add --verify-lockfile argument to add_universal_arguments
- Validate the lockfile before benchmark execution in run_benchmark (see the sketch below)
- Fail with a clear error message on version mismatch
- Show helpful suggestions (pip install -r, uv pip sync)
- Display the number of verified packages on success
- Available on all benchmark commands (training, checkpointing, vectordb, kvcache)
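A hypothetical sketch of the pre-run gate. add_universal_arguments and run_benchmark are named in the commit; their bodies here, the lockfile path, and the validate_lockfile stub are illustrative assumptions:

```python
import argparse
import sys


def validate_lockfile(path: str) -> list:
    """Stub standing in for the real validator (plan 01-04)."""
    return []


def add_universal_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "--verify-lockfile", action="store_true",
        help="Validate installed package versions against the lockfile "
             "before running the benchmark")


def run_benchmark(args: argparse.Namespace) -> None:
    if args.verify_lockfile:
        errors = validate_lockfile("requirements.lock")  # assumed path
        if errors:
            for err in errors:
                print(f"ERROR: {err}", file=sys.stderr)
            print("Hint: pip install -r requirements.lock, or: uv pip sync",
                  file=sys.stderr)
            sys.exit(1)
    # ... proceed with benchmark execution ...
```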
Tasks completed: 4/4
- Task 1: Create lockfile CLI argument builder
- Task 2: Update CLI module exports and parser
- Task 3: Add lockfile command handler to main
- Task 4: Add --verify-lockfile flag to benchmark commands

SUMMARY: .planning/phases/01-package-management-foundation/01-05-SUMMARY.md

Phase 1 Package Management Foundation: COMPLETE
- PKG-01: Lockfile generation ✓
- PKG-02: CPU-only PyTorch config ✓
- PKG-03: Runtime version validation ✓
- CLI integration ✓
Phase 1: Package Management Foundation is now complete with all requirements delivered (PKG-01, PKG-02, PKG-03).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 2: Environment Validation and Fail-Fast
- Standard stack: platform, shutil.which, distro, importlib.metadata
- Architecture patterns for OS detection and a fail-fast validation chain
- OS-specific installation instructions mapping
- Integration points with existing dependency_check.py and validation_helpers.py
- Common pitfalls: SSH keys, PATH issues, MPI version mismatch
- Code examples extending existing codebase infrastructure

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 02: Environment Validation and Fail-Fast - 5 plans in 4 waves
- Wave 1: OS detection module (parallel)
- Wave 2: Dependency checking + SSH validation (parallel)
- Wave 3: Comprehensive validator
- Wave 4: Integration into execution path
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address 4 blocker issues from plan verification:

1. Plan 02-02 Task 1: Explicitly specify that check functions call detect_os() to get OSInfo, then get_install_instruction() with the dependency name and OSInfo to get an OS-specific install command
2. Plan 02-02 Task 2: Add verification that templates work with the format_error() function, not just that they exist
3. Plan 02-03 Task 1: Add an SSH binary existence check using shutil.which('ssh') BEFORE running connectivity tests, with an OS-specific install hint if missing (satisfies UX-03; see the sketch below)
4. Plan 02-05 Task 1: Clarify that validate_benchmark_environment REPLACES validate_pre_run as the main entry point; validate_pre_run is deprecated but remains for backward compatibility

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
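A sketch of the fail-fast SSH check described in item 3. detect_os(), OSInfo, and get_install_instruction() are named in the plan; their shapes here are simplified assumptions (a string stands in for OSInfo):

```python
import platform
import shutil


def detect_os() -> str:
    """Coarse OS identifier; the real OSInfo is richer than a string."""
    return platform.system().lower()  # e.g. "linux", "darwin"


def get_install_instruction(dependency: str, os_id: str) -> str:
    hints = {
        "linux": f"sudo apt install {dependency}  # or your distro's equivalent",
        "darwin": f"brew install {dependency}",
    }
    return hints.get(os_id, f"install {dependency} via your package manager")


def check_ssh_available() -> None:
    """Fail fast, before any connectivity test, if ssh is not on PATH."""
    if shutil.which("ssh") is None:
        hint = get_install_instruction("openssh-client", detect_os())
        raise RuntimeError(f"ssh binary not found on PATH. Try: {hint}")
```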
Changes made:
- 09-01: Clarify all DLIO work happens in the local dlio_parquet_fork/
- 09-01: Add user_setup for GitHub fork push (not autonomous)
- 09-01: Fix files_modified to use actual paths (not clone instructions)
- 09-01: Clarify the pyproject.toml update happens AFTER the user pushes the fork
- 09-02: Add Task 3 for end-to-end verification (TRAIN-04 requirement)
- 09-02: Add truth about parquet datagen creating actual files
- Both plans now clearly separate autonomous work from user actions

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 3/3
- Clone DLIO and add PARQUET to the FormatType enum
- Create ParquetReader and ParquetGenerator classes
- Register the parquet format in factories and commit changes

SUMMARY: .planning/phases/09-dlio-parquet-support/09-01-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add dlrm_parquet_h100.yaml with format: parquet, computation_time: 0.005
- Add dlrm_parquet_a100.yaml with format: parquet, computation_time: 0.007
- Add dlrm_parquet_datagen.yaml for parquet data generation
- Use a dlrm_parquet/ data folder to distinguish from the npz-based dlrm/

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TestTrainingParquetFormat test class with 5 tests
- Verify dataset.format is in OPEN_ALLOWED_PARAMS
- Verify a parquet format override results in the OPEN category
- Verify the DLRM model is recognized with parquet format
- Confirm validation rules handle parquet correctly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 3/3
- Created DLRM parquet config files (h100, a100, datagen)
- Added 5 parquet validation tests
- End-to-end verification: parquet datagen creates valid files

SUMMARY: .planning/phases/09-dlio-parquet-support/09-02-SUMMARY.md

TRAIN-04 requirement satisfied

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 10: Progress Indication
- Standard stack identified: Rich library (already in the dependency tree)
- Architecture patterns documented for TTY detection and fallbacks
- Integration points mapped across the codebase
- Pitfalls catalogued for terminal-handling edge cases

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 10: Progress Indication - 3 plans in 2 waves
- 1 parallel (10-01), 2 dependent (10-02, 10-03)
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Plan 10-02:
- Split Task 1 into Task 1 (stage indicators) and Task 2 (spinners)
- Added truth: "User sees elapsed time during 'Running benchmark...' stage"
- Clarified that stage progress shows TimeElapsedColumn during the benchmark

Plan 10-03:
- Added explicit error-handling preservation notes in Task 1
- Added truth: "Error handling is preserved when operations are wrapped in progress"
- Added a verification test for error handling (lockfile not found)
- Updated the human verification checkpoint to test error propagation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add rich>=13.0 to pyproject.toml dependencies with a comment
- Create mlpstorage/progress.py with TTY-aware progress utilities (see the sketch below)
- Implement is_interactive_terminal() for TTY detection
- Implement progress_context() for determinate/indeterminate progress
- Implement create_stage_progress() for multi-stage operations
- Non-interactive mode logs via a logger.status() fallback
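A minimal sketch of TTY-aware progress utilities built on Rich, matching the function names above; the actual module's signatures may differ, and the print() stands in for logger.status():

```python
import sys
from contextlib import contextmanager

from rich.progress import (Progress, SpinnerColumn, TextColumn,
                           TimeElapsedColumn)


def is_interactive_terminal() -> bool:
    """Only show live progress when stdout is a real TTY."""
    return sys.stdout.isatty()


@contextmanager
def progress_context(description: str, total=None):
    """Yield a callable that advances progress; no-op when non-interactive.

    total=None gives an indeterminate spinner; a number gives a bar.
    """
    if not is_interactive_terminal():
        print(f"STATUS: {description}")  # stand-in for logger.status()
        yield lambda advance=1: None
        return
    with Progress(SpinnerColumn(), TextColumn("{task.description}"),
                  TimeElapsedColumn()) as progress:
        task = progress.add_task(description, total=total)
        yield lambda advance=1: progress.advance(task, advance)


# Usage:
# with progress_context("Validating environment...") as tick:
#     do_validation(); tick()
```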
- Add 20 tests covering TTY detection and interactive/non-interactive modes
- Test progress_context with determinate and indeterminate progress
- Test create_stage_progress for multi-stage operations
- Test cleanup on exceptions for both context managers
- Test no-op functions in non-interactive mode
- Test empty-stages handling
Tasks completed: 2/2
- Add Rich dependency and create progress module
- Unit tests for progress module

SUMMARY: .planning/phases/10-progress-indication/10-01-SUMMARY.md
- Import create_stage_progress from mlpstorage.progress
- Wrap run() operations in a 4-stage progress indicator:
  * Validating environment...
  * Collecting cluster info...
  * Running benchmark...
  * Processing results...
- Stage progress shows elapsed time during benchmark execution
- DLIO output flows through directly (no wrapping)
- Import progress_context from mlpstorage.progress
- Add a spinner to _collect_cluster_start():
  * Shows the host count in the description
  * Updates the description to show the collection method (SSH/MPI)
- Add a spinner to _collect_cluster_end():
  * Shows an indeterminate spinner during collection
  * Updates the description based on the collection method
- Spinners use total=None for indeterminate progress (see the sketch below)
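A sketch of the description-updating spinner pattern described above. The function name, hosts, and method strings are illustrative; only the total=None detail comes from the commit:

```python
import time

from rich.progress import Progress, SpinnerColumn, TextColumn


def collect_cluster_info(hosts, method: str) -> None:
    with Progress(SpinnerColumn(), TextColumn("{task.description}")) as progress:
        # total=None keeps the spinner indeterminate: no bar, no percentage
        task = progress.add_task(
            f"Collecting cluster info from {len(hosts)} host(s)...", total=None)
        # Update the description once the collection method is known
        progress.update(task, description=f"Collecting via {method}...")
        time.sleep(1)  # stands in for the actual SSH/MPI collection


collect_cluster_info(["node1", "node2"], method="SSH")
```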
- Import progress_context from the mlpstorage.progress module
- Add a spinner to environment validation in run_benchmark()
- Add a spinner to lockfile validation in run_benchmark()
- Add a spinner to lockfile generate in handle_lockfile_command()
- Add a spinner to lockfile verify in handle_lockfile_command()
- Fix test_tracks_runtime to patch the time module specifically in the base module (needed because progress indicators log status messages, which call time.time)

All existing error handling is preserved: exceptions propagate after progress context cleanup.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add TestBenchmarkProgress test class with 8 tests:
  * test_run_shows_stage_progress: Verify 4-stage progress in run()
  * test_run_non_interactive_logs_stages: Verify logger.status fallback
  * test_cluster_collection_shows_spinner: Verify spinner (total=None)
  * test_cluster_collection_updates_description_ssh: Verify SSH description
  * test_cluster_collection_updates_description_mpi: Verify MPI description
  * test_run_progress_cleanup_on_exception: Verify context cleanup
  * test_end_cluster_collection_shows_spinner: Verify end-collection spinner
  * test_cluster_collection_skipped_logs_debug: Verify skip debug log
Tasks completed: 3/3
- Add stage indicators to benchmark run() method
- Add spinners to cluster collection methods
- Unit tests for progress integration

SUMMARY: .planning/phases/10-progress-indication/10-02-SUMMARY.md
Phase 11: Comprehensive Parquet Support
- Standard stack identified: PyArrow + Hydra/OmegaConf
- Architecture patterns documented: config extension, memory-efficient reader, schema-driven generator
- Pitfalls catalogued: memory_map misunderstanding, the non-existence of Table.to_numpy(), flat-config assumption
- PyArrow APIs verified: column filtering, row-group iteration, compression options, Hive partitioning (see the sketch below)
- Integration points mapped: ConfigArguments extension, LoadConfig pattern, reader/generator interface

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
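A sketch of the memory-efficient read pattern this research points at: iterate row groups instead of loading the whole file, and push column filtering into the read. File and column names are illustrative:

```python
import pyarrow.parquet as pq


def iter_batches(path: str, columns):
    """Yield per-row-group column data without materializing the full table."""
    pf = pq.ParquetFile(path)
    for i in range(pf.num_row_groups):
        # Column filtering happens at read time, so unused columns
        # are never decoded
        table = pf.read_row_group(i, columns=columns)
        # The catalogued pitfall: pyarrow.Table has no to_numpy();
        # convert per column instead
        yield {name: table.column(name).to_numpy() for name in columns}


# Usage (assuming a DLRM-style file):
# for batch in iter_batches("train.parquet", ["dense_features", "label"]):
#     process(batch["dense_features"], batch["label"])
```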
Phase 11: Comprehensive Parquet Support - 3 plans in 2 waves
- 2 parallel (wave 1), 1 sequential (wave 2)
Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace argonne-lcf/dlio_benchmark@mlperf_storage_v2.0 with wvaske/dlio_benchmark@parquet-support
- Updated in both the dlio and full optional dependency groups

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 1/1
- Update DLIO dependency to the fork URL

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-03-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 2/2
- Add LZ4 and ZSTD to the Compression enum
- Add parquet config fields to ConfigArguments and LoadConfig

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-01-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tasks completed: 2/2
- Rewrite ParquetReader with column filtering and schema validation
- Rewrite ParquetGenerator with schema-driven generation

SUMMARY: .planning/phases/11-comprehensive-parquet-support/11-02-SUMMARY.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a dataset.parquet section with DLRM-realistic columns (dense_features, sparse_features, label) and snappy compression to all three configs.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Reduce accelerator/process counts in test/run_tests.sh for smaller systems
- Update results directory paths to /mnt/nvme/results
- Fix file permissions across the repository

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>