
Implement Phase 8: Error Handling and User Messaging#23

Merged
wvaske merged 8 commits into main from claude/code-improvement-UGEAO on Jan 15, 2026
Conversation

@wvaske (Owner) commented Jan 15, 2026

No description provided.

Create clear interfaces that define contracts for benchmarks, validators,
and collectors. This establishes the foundation for the code improvement plan.

New files:
- mlpstorage/interfaces/__init__.py - Package exports
- mlpstorage/interfaces/benchmark.py - BenchmarkInterface, BenchmarkConfig, BenchmarkCommand
- mlpstorage/interfaces/validator.py - ValidatorInterface, ValidationResult, ClosedRequirements
- mlpstorage/interfaces/collector.py - ClusterCollectorInterface, LocalCollectorInterface

Modified:
- mlpstorage/benchmarks/base.py
  - Implement BenchmarkInterface
  - Add dependency injection for cluster_collector and validator
  - Add default implementations for interface methods
  - Improve documentation with docstrings

Benefits:
- Testability: Components can be mocked/stubbed for unit testing
- Extensibility: New implementations without modifying core code
- Consistency: All implementations follow the same contract
- Documentation: Interfaces clearly document expected behavior
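A contract like the one described above might look roughly like this minimal sketch; the method names here are illustrative and not taken from the actual mlpstorage/interfaces/benchmark.py:

```python
from abc import ABC, abstractmethod


class BenchmarkInterface(ABC):
    """Sketch of a contract every benchmark implementation must satisfy.

    Hypothetical method names for illustration only; the real interface
    lives in mlpstorage/interfaces/benchmark.py.
    """

    @abstractmethod
    def generate_command(self) -> str:
        """Build the shell command that runs the benchmark."""

    @abstractmethod
    def run(self) -> int:
        """Execute the benchmark and return its exit code."""


class EchoBenchmark(BenchmarkInterface):
    """Trivial concrete implementation used only to demonstrate the contract."""

    def generate_command(self) -> str:
        return "echo hello"

    def run(self) -> int:
        # A real implementation would hand this off to a command executor.
        print(self.generate_command())
        return 0
```

Because the base class is an ABC, forgetting to implement a required method fails loudly at instantiation time rather than at call time, which is what makes mocked implementations safe to substitute in tests.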

Split the monolithic rules.py (1,434 LOC) into focused modules with clear
separation between single-run rule checkers and multi-run submission checkers.

New package structure:
mlpstorage/rules/
├── __init__.py              # Public API exports (backwards compatible)
├── base.py                  # RulesChecker ABC, RuleState enum
├── issues.py                # Issue dataclass
├── models.py                # Data classes (HostInfo, ClusterInformation, BenchmarkRun, etc.)
├── run_checkers/            # Single-run validation checkers
│   ├── __init__.py
│   ├── base.py              # RunRulesChecker base class
│   ├── training.py          # TrainingRunRulesChecker
│   └── checkpointing.py     # CheckpointingRunRulesChecker
├── submission_checkers/     # Multi-run submission validation checkers
│   ├── __init__.py
│   ├── base.py              # MultiRunRulesChecker base class
│   ├── training.py          # TrainingSubmissionRulesChecker
│   └── checkpointing.py     # CheckpointSubmissionRulesChecker
├── verifier.py              # BenchmarkVerifier orchestrator
└── utils.py                 # calculate_training_data_size, generate_output_location

Benefits:
- Clear separation of concerns (single-run vs multi-run validation)
- Each module is <300 LOC (down from 1,434 LOC monolith)
- Better testability with focused, isolated checkers
- Backwards compatible - existing imports continue to work
- Original rules.py preserved as rules_legacy.py for reference
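The shared pieces at the top of the package (the RuleState enum and the Issue dataclass) might be shaped roughly like this sketch; all field and state names here are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RuleState(Enum):
    """Hypothetical states; the real enum lives in mlpstorage/rules/base.py."""
    CLOSED = "closed"    # meets all CLOSED-division requirements
    OPEN = "open"        # valid, but only for the OPEN division
    INVALID = "invalid"  # run cannot be submitted at all


@dataclass
class Issue:
    """Hypothetical fields for the Issue dataclass in mlpstorage/rules/issues.py."""
    message: str
    severity: RuleState
    parameter: Optional[str] = None


def worst_state(issues: List[Issue]) -> RuleState:
    """Collapse a list of issues to the most severe state found."""
    order = [RuleState.CLOSED, RuleState.OPEN, RuleState.INVALID]
    severities = [issue.severity for issue in issues]
    if not severities:
        return RuleState.CLOSED
    return max(severities, key=order.index)
```

Keeping these small shared types in their own modules is what lets both run_checkers/ and submission_checkers/ depend on them without importing each other.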

This commit refactors the CLI architecture into a modular structure:

- Add BenchmarkRegistry class (mlpstorage/registry.py) for dynamic
  benchmark registration and CLI construction
- Create mlpstorage/cli/ package with modular argument builders:
  - common_args.py: Shared help messages and universal arguments
  - training_args.py: Training benchmark arguments
  - checkpointing_args.py: Checkpointing benchmark arguments
  - vectordb_args.py: VectorDB benchmark arguments
  - utility_args.py: Reports and history arguments
- Rename cli.py to cli_parser.py to avoid package name conflict
- Update benchmarks/__init__.py to register benchmarks at import time
- Update imports in main.py, history.py, and tests

The registry pattern enables:
- Dynamic benchmark registration without modifying core CLI code
- Easier extensibility for new benchmark types
- Cleaner separation of concerns
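The registry pattern described above can be sketched as a class-level mapping plus a registration decorator; this is a minimal illustration, not the actual mlpstorage/registry.py implementation:

```python
from typing import Callable, Dict, Type


class BenchmarkRegistry:
    """Minimal registry sketch: benchmarks self-register by name."""

    _benchmarks: Dict[str, Type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type], Type]:
        """Decorator that records a benchmark class under `name`."""
        def decorator(benchmark_cls: Type) -> Type:
            cls._benchmarks[name] = benchmark_cls
            return benchmark_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> Type:
        return cls._benchmarks[name]

    @classmethod
    def names(cls):
        return sorted(cls._benchmarks)


# Registering at import time, as benchmarks/__init__.py does in this PR:
@BenchmarkRegistry.register("training")
class TrainingBenchmark:
    pass
```

With this shape, adding a new benchmark type is a new module plus one decorator line, and the CLI can build its subcommands by iterating `BenchmarkRegistry.names()` instead of hard-coding each benchmark.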

Add test fixtures package for improved testability:

- tests/fixtures/__init__.py: Package exports
- tests/fixtures/mock_logger.py: MockLogger class with message capture
  and assertion helpers for testing logging behavior
- tests/fixtures/mock_executor.py: MockCommandExecutor for testing
  command execution without subprocess calls
- tests/fixtures/mock_collector.py: MockClusterCollector implementing
  ClusterCollectorInterface for testing without MPI
- tests/fixtures/sample_data.py: Sample /proc file contents and
  factory functions for creating test data

Update conftest.py with new fixtures:
- mock_executor, mock_executor_with_dlio, mock_executor_failure
- mock_collector, mock_collector_multi_host, mock_collector_failure
- test_logger using MockLogger

Add integration test templates:
- tests/integration/test_benchmark_flow.py: Tests for training and
  checkpointing benchmark flow with mocked dependencies
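A mock executor of the kind described above might have roughly this shape; the constructor arguments and return tuple are assumptions chosen to mimic a typical subprocess wrapper, not the real tests/fixtures/mock_executor.py:

```python
from typing import List, Tuple


class MockCommandExecutor:
    """Records commands instead of spawning subprocesses (hypothetical shape)."""

    def __init__(self, returncode: int = 0, stdout: str = "", stderr: str = ""):
        self.returncode = returncode
        self.stdout = stdout
        self.stderr = stderr
        self.calls: List[str] = []  # every command "executed", in order

    def execute(self, command: str) -> Tuple[str, str, int]:
        """Pretend to run `command`; return canned (stdout, stderr, returncode)."""
        self.calls.append(command)
        return self.stdout, self.stderr, self.returncode


# Usage in a test: inject the mock, then assert on what the benchmark ran.
executor = MockCommandExecutor(returncode=0, stdout="ok")
out, err, rc = executor.execute("mpirun -np 2 hostname")
```

The key design point is that the fixture captures the exact command string, so tests can assert on the generated command line without ever touching MPI or the shell.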

Add comprehensive type annotations and documentation:

Type Annotations:
- mlpstorage/utils.py: Full type annotations for all functions
  including is_valid_datetime_format, read_config_from_file,
  update_nested_dict, create_nested_dict, flatten_nested_dict,
  remove_nan_values, generate_mpi_prefix_cmd
- mlpstorage/benchmarks/base.py: Enhanced type hints for
  __init__, _execute_command, metadata property, and all methods

Module Documentation:
- Added comprehensive module-level docstrings to utils.py and base.py
  explaining module purpose, classes, and usage

Architecture Documentation:
- docs/ARCHITECTURE.md: Complete system architecture documentation
  covering directory structure, core components, data flow,
  design patterns, and extension points

Developer Guide:
- docs/ADDING_BENCHMARKS.md: Step-by-step guide for adding new
  benchmark types including code examples for benchmark class,
  CLI arguments, validation rules, and testing
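As an example of the annotation style, a fully typed `update_nested_dict` might look like this; the implementation below is a plausible sketch consistent with the function name, not a copy of mlpstorage/utils.py:

```python
from typing import Any, Dict


def update_nested_dict(original: Dict[str, Any],
                       updates: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `updates` into a copy of `original`.

    Nested dicts are merged key by key; any other value in `updates`
    replaces the corresponding value in `original`. The input dicts
    are not mutated.
    """
    result = dict(original)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = update_nested_dict(result[key], value)
        else:
            result[key] = value
    return result
```

Annotating utility functions like this pays off doubly: type checkers catch misuse at call sites, and the signature itself documents that the function returns a new dict rather than mutating its arguments.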

Add KV Cache benchmark for LLM inference storage testing:

- Create KVCacheBenchmark class wrapping kv-cache.py script
- Add KVCACHE_MODELS, KVCACHE_PERFORMANCE_PROFILES, and other config constants
- Create kvcache_args.py with CLI argument builders for run/datasize commands
- Add KVCacheRunRulesChecker for benchmark validation
- Register kvcache benchmark with BenchmarkRegistry
- Update ARCHITECTURE.md to document new benchmark

The KV Cache benchmark tests storage performance for:
- Multi-tier caching (GPU → CPU → NVMe)
- Multi-user LLM inference simulation
- Phase-aware processing (prefill/decode phases)
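The multi-tier lookup order (GPU → CPU → NVMe) can be illustrated with a toy cache; this is purely a sketch of the access pattern being benchmarked, not the kv-cache.py implementation:

```python
from typing import Any, Optional, Tuple


class TieredKVCache:
    """Toy illustration of GPU -> CPU -> NVMe tiered lookup order."""

    TIER_ORDER = ("gpu", "cpu", "nvme")  # fastest to slowest

    def __init__(self):
        self.tiers = {tier: {} for tier in self.TIER_ORDER}

    def get(self, key: str) -> Tuple[Optional[str], Any]:
        """Search tiers fastest-first; return (tier_hit, value) or (None, None)."""
        for tier in self.TIER_ORDER:
            if key in self.tiers[tier]:
                return tier, self.tiers[tier][key]
        return None, None

    def put(self, key: str, value: Any, tier: str = "gpu") -> None:
        """Place a KV-cache entry directly into the given tier."""
        self.tiers[tier][key] = value
```

In the real benchmark, the interesting storage load comes from the NVMe tier: entries evicted from GPU and CPU memory must be re-read from disk during the decode phase, which is exactly the access pattern the benchmark measures.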

Refactor the reporting system to provide clear validation feedback:

- Create mlpstorage/reporting/ package with:
  - directory_validator.py: Validates results directory structure
    before processing, with clear error messages and suggestions
  - formatters.py: ValidationMessageFormatter with colored output,
    ClosedRequirementsFormatter for requirement checklists,
    ReportSummaryFormatter for summary headers

- Refactor ReportGenerator class:
  - Add directory structure validation before processing
  - Add error isolation for individual run processing
  - Improve print_results() with clear OPEN vs CLOSED messaging
  - Show INVALID runs first (most critical), then OPEN, then CLOSED
  - Add requirements checklists for non-CLOSED submissions

- Update ARCHITECTURE.md with new reporting/ package structure

This improves the user experience by providing actionable feedback
when validation fails or the directory structure is malformed.
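A formatter with optional color support might be shaped like this; the status names match the OPEN/CLOSED/INVALID messaging above, but the class layout is a hypothetical sketch rather than the real mlpstorage/reporting/formatters.py:

```python
class ValidationMessageFormatter:
    """Sketch of a status formatter with optional ANSI color output."""

    # Red for INVALID (most critical), yellow for OPEN, green for CLOSED.
    COLORS = {"INVALID": "\033[31m", "OPEN": "\033[33m", "CLOSED": "\033[32m"}
    RESET = "\033[0m"

    def __init__(self, use_color: bool = True):
        # Callers would typically pass use_color=sys.stdout.isatty()
        # so piped output stays free of escape codes.
        self.use_color = use_color

    def format(self, status: str, message: str) -> str:
        if self.use_color and status in self.COLORS:
            return f"{self.COLORS[status]}[{status}]{self.RESET} {message}"
        return f"[{status}] {message}"
```

Gating the escape codes on a flag (rather than always emitting them) keeps report files and CI logs readable while still giving terminals the INVALID-first visual emphasis described above.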

Add comprehensive error handling framework with user-friendly messages:

- Create mlpstorage/errors.py:
  - ErrorCode enum for machine-readable error codes
  - MLPSError dataclass for structured error information
  - MLPStorageException base class with code, message, suggestion
  - ConfigurationError, BenchmarkExecutionError, ValidationError,
    FileSystemError, MPIError, DependencyError specialized exceptions

- Create mlpstorage/error_messages.py:
  - Centralized error message templates with placeholders
  - format_error() function for template substitution
  - ErrorFormatter class with color support

- Create mlpstorage/validation_helpers.py:
  - validate_pre_run() for pre-flight configuration checks
  - _validate_required_params() for parameter validation
  - _validate_paths() for filesystem checks
  - _validate_hosts() for MPI host reachability
  - _validate_dependencies() for tool availability
  - check_disk_space() utility function

- Update mlpstorage/main.py:
  - Add KVCacheBenchmark to program switch dict
  - Wrap run_benchmark() with proper exception handling
  - Add main() wrapper with catch-all exception handling
  - Show suggestions for each error type
  - Show traceback only in debug mode

- Update ARCHITECTURE.md with new error handling modules
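Putting the pieces together, the exception hierarchy and the `main()` wrapper might interact roughly like this; the specific error codes and message strings below are invented for illustration, not copied from mlpstorage/errors.py:

```python
from enum import Enum
from typing import Optional


class ErrorCode(Enum):
    """Hypothetical machine-readable codes; real enum in mlpstorage/errors.py."""
    CONFIG_INVALID = "MLPS-001"
    DEPENDENCY_MISSING = "MLPS-002"


class MLPStorageException(Exception):
    """Base exception carrying a code, a message, and an optional suggestion."""

    def __init__(self, code: ErrorCode, message: str,
                 suggestion: Optional[str] = None):
        super().__init__(message)
        self.code = code
        self.suggestion = suggestion


class ConfigurationError(MLPStorageException):
    pass


def main() -> int:
    """Sketch of the catch-all wrapper: print code, message, and suggestion."""
    try:
        # Stand-in for run_benchmark(); a real pre-flight check would
        # raise this from validate_pre_run().
        raise ConfigurationError(
            ErrorCode.CONFIG_INVALID,
            "bad --hosts value",
            suggestion="Pass a comma-separated list of reachable hosts",
        )
    except MLPStorageException as exc:
        print(f"[{exc.code.value}] {exc}")
        if exc.suggestion:
            print(f"Suggestion: {exc.suggestion}")
        return 1
```

The point of the structured hierarchy is that the top-level handler never needs to know which phase failed: every subclass carries its own code and suggestion, so one `except MLPStorageException` block can render actionable output for all of them.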
wvaske merged commit 81fc7f4 into main Jan 15, 2026
1 of 5 checks passed
