Claude/cluster info collection mpi x cxtk #20
Conversation
This plan outlines a comprehensive approach to collect system information (`/proc/meminfo`, `/proc/cpuinfo`, `/proc/diskstats`, etc.) from all cluster nodes using MPI before benchmark execution.

Key components:
- New `cluster_collector.py` module with an `MPIClusterCollector` class
- Extended data classes for disk and network stats
- Integration with `Benchmark` classes for pre-execution collection
- Metadata storage for post-execution verification
- Cluster consistency validation rules
- Backwards-compatible CLI interface

The implementation maintains backwards compatibility with existing CLI args while adding MPI-based collection as the preferred method.
This commit introduces the core cluster information collection module, which uses MPI to gather system information from all nodes in a cluster.

New module `mlpstorage/cluster_collector.py` includes:

Data Classes:
- `HostDiskInfo`: Disk statistics from `/proc/diskstats`
- `HostNetworkInfo`: Network interface stats from `/proc/net/dev`
- `HostSystemInfo`: Extended system info (kernel, OS, uptime, load)

/proc File Parsers:
- `parse_proc_meminfo`: Parse `/proc/meminfo`
- `parse_proc_cpuinfo`: Parse `/proc/cpuinfo`
- `parse_proc_diskstats`: Parse `/proc/diskstats`
- `parse_proc_net_dev`: Parse `/proc/net/dev`
- `parse_proc_version`: Parse `/proc/version`
- `parse_proc_loadavg`: Parse `/proc/loadavg`
- `parse_proc_uptime`: Parse `/proc/uptime`
- `parse_os_release`: Parse `/etc/os-release`

Collection Functions:
- `collect_local_system_info()`: Collect all system info from the local node
- `summarize_cpuinfo()`: Summarize CPU info (cores, sockets, model)

MPI Collector:
- `MPIClusterCollector`: Class that generates and executes the MPI collection
- `MPI_COLLECTOR_SCRIPT`: Embedded Python script for MPI execution
- `collect_cluster_info()`: High-level function with fallback support

Features:
- Supports both `mpirun` and `mpiexec`
- Automatic fallback to local-only collection when MPI fails
- Handles kernels 4.18+ (discard fields) and 5.5+ (flush fields) in `/proc/diskstats`
- JSON-serializable output for metadata storage
- Comprehensive error handling

Tests:
- 58 unit tests covering all parsers, data classes, and collectors
- Integration tests for JSON serialization roundtrips
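The `/proc` parsers listed above share a common shape: take the file's text, return a JSON-serializable dict. A minimal sketch of what `parse_proc_meminfo` might look like (the real module's signature and unit handling may differ; this version normalizes `kB` values to bytes):

```python
def parse_proc_meminfo(content: str) -> dict:
    """Parse /proc/meminfo text into a dict of field name -> integer.

    Lines look like 'MemTotal:       16333852 kB'; values carrying a
    'kB' unit are normalized to bytes, unit-less values (e.g.
    HugePages_Total) are kept as plain integers.
    """
    result = {}
    for line in content.splitlines():
        if ':' not in line:
            continue
        key, _, rest = line.partition(':')
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == 'kB':
            value *= 1024  # normalize kB to bytes
        result[key.strip()] = value
    return result

sample = "MemTotal:       16333852 kB\nHugePages_Total:       0\n"
info = parse_proc_meminfo(sample)
```

Returning plain dicts of ints keeps the output trivially JSON-serializable, which matters for the metadata storage described below.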
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3165ebc9e5
```python
except ImportError:
    # No MPI available, just collect local info
    local_info = collect_local_info()
    output = {local_info['hostname']: local_info}
    with open(output_file, 'w') as f:
```
Surface mpi4py absence instead of silent local-only output
If mpirun launches this script on multiple hosts but mpi4py is missing on any of them, each rank takes this ImportError branch and writes a local-only JSON to the shared output_file. The collector then reads a single file on the launcher and treats the run as a successful MPI collection, silently dropping remote hosts and misreporting cluster size. This can invalidate downstream consistency checks. Consider exiting non-zero or emitting an explicit failure marker so collect_cluster_info can fall back to local-only collection.
This commit extends the existing data classes in `rules.py` to support MPI-collected cluster information.

HostInfo Extensions:
- Added optional fields: `disks`, `network`, `system`, `collection_timestamp`
- Added `from_collected_data()` classmethod for MPI collector output
- Added `to_dict()` method for JSON serialization
- Integrated with `HostDiskInfo`, `HostNetworkInfo`, `HostSystemInfo`

ClusterInformation Extensions:
- Added new aggregated attributes:
  - `num_hosts`: Number of hosts in the cluster
  - `min_memory_bytes` / `max_memory_bytes`: Memory range
  - `collection_method`: How data was collected (`mpi`, `dlio_summary`, `args`)
  - `collection_timestamp`: When data was collected
  - `host_consistency_issues`: List of detected inconsistencies
- Added `from_mpi_collection()` classmethod
- Added `validate_cluster_consistency()` method that checks:
  - Memory variance (>10% difference warning)
  - CPU core count consistency
  - OS version consistency
  - Kernel version consistency
- Extended `as_dict()` to include all new fields and host details
- Extended `from_dict()` to restore all extended fields

Bug Fixes:
- Fixed `HostMemoryInfo.from_proc_meminfo_dict()` to handle integer values from `parse_proc_meminfo` (it was incorrectly calling `.split()` on integers)

Tests:
- Added 13 new tests for Phase 2 features
- All 90 tests pass (58 cluster_collector + 32 rules)
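The memory-variance check in `validate_cluster_consistency()` could be implemented roughly as below. This is a hedged sketch: the function and parameter names are illustrative, only the >10% threshold comes from the commit message.

```python
def check_memory_variance(host_memory_bytes: dict, threshold: float = 0.10) -> list:
    """Return warning strings when per-host MemTotal differs by more
    than `threshold` (default 10%) relative to the cluster maximum.

    `host_memory_bytes` maps hostname -> MemTotal in bytes.
    """
    issues = []
    if not host_memory_bytes:
        return issues
    max_mem = max(host_memory_bytes.values())
    min_mem = min(host_memory_bytes.values())
    if max_mem and (max_mem - min_mem) / max_mem > threshold:
        issues.append(
            f"Memory varies by more than {threshold:.0%} across hosts: "
            f"min={min_mem} bytes, max={max_mem} bytes"
        )
    return issues

warnings = check_memory_variance({'node1': 64 << 30, 'node2': 32 << 30})
```

Returning a list of strings rather than raising lets the caller accumulate issues from all checks into `host_consistency_issues` and decide later whether they are fatal.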
This commit integrates the MPI-based cluster information collection with the `Benchmark` classes, enabling automatic collection of system information before benchmark execution.

Base `Benchmark` class (`base.py`):
- Added `_should_collect_cluster_info()`: Determines whether collection should run
  - Returns `False` for `datagen`/`configview` commands
  - Returns `False` if `skip_cluster_collection` is set
  - Returns `False` if no hosts are specified
- Added `_collect_cluster_information()`: Collects via MPI when available
  - Uses `collect_cluster_info()` from the `cluster_collector` module
  - Creates `ClusterInformation` from the collected data
  - Logs consistency warnings if detected
  - Returns `None` on failure (allows fallback)
- Added `write_cluster_info()`: Writes detailed cluster info to a JSON file

`DLIOBenchmark` class (`dlio.py`):
- Updated `accumulate_host_info()` to use MPI collection when available
- Falls back to CLI args (`client_host_memory_in_gb`) if MPI fails
- Sets `collection_method` to `'args'` for fallback collection

Tests:
- Added 13 new tests in `test_benchmarks.py` covering:
  - `_should_collect_cluster_info()` conditions
  - `_collect_cluster_information()` MPI integration
  - `accumulate_host_info()` MPI/fallback behavior
  - `write_cluster_info()` file output
- All 103 tests pass (58 cluster_collector + 32 rules + 13 benchmarks)
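The gating logic for `_should_collect_cluster_info()` described above reduces to three early-out checks; a standalone sketch (the real method lives on the `Benchmark` class and reads these values from `self.args`, so treat the signature as an assumption):

```python
def should_collect_cluster_info(command: str,
                                skip_cluster_collection: bool,
                                hosts: list) -> bool:
    """Decide whether pre-execution cluster collection should run.

    Skips collection for data-generation and config-view commands,
    when the user explicitly opted out, or when no hosts were given.
    """
    if command in ('datagen', 'configview'):
        return False
    if skip_cluster_collection:
        return False
    if not hosts:
        return False
    return True
```

Keeping the decision in one small predicate makes each condition independently testable, which is presumably why the commit covers it with dedicated tests.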
Address PR review comment: when mpi4py was not available on remote hosts, the script silently fell back to local-only collection without any indication of failure. This could corrupt multi-host collection output and misreport the cluster size.

Changes:
- The MPI collector script now exits with code 1 when the mpi4py import fails
- Writes an error marker `_mpi_import_error` to the output file for detection
- The collector now detects the error marker and raises a `RuntimeError` with an informative message about installing mpi4py
- Added 4 new tests for the error handling behavior

This ensures failures are clearly reported rather than silently producing incorrect cluster information.
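The detection side of that fix might look like the sketch below. The `_mpi_import_error` key comes from the commit message; the function name and file layout are illustrative assumptions.

```python
import json
import os
import tempfile

def load_collection_output(output_file: str) -> dict:
    """Read the MPI collector's JSON output and fail loudly if any
    rank recorded an mpi4py import failure instead of host data."""
    with open(output_file) as f:
        data = json.load(f)
    if '_mpi_import_error' in data:
        raise RuntimeError(
            "mpi4py was missing on one or more hosts; install mpi4py "
            "on every node or use local-only collection: "
            f"{data['_mpi_import_error']}"
        )
    return data

# Demo: a clean output file round-trips; an error marker raises.
path = os.path.join(tempfile.mkdtemp(), 'cluster_info.json')
with open(path, 'w') as f:
    json.dump({'node1': {'hostname': 'node1'}}, f)
hosts = load_collection_output(path)

with open(path, 'w') as f:
    json.dump({'_mpi_import_error': 'No module named mpi4py'}, f)
try:
    load_collection_output(path)
    raised = False
except RuntimeError:
    raised = True
```

Raising instead of returning partial data is what lets `collect_cluster_info()` distinguish "MPI ran but was broken" from "MPI succeeded", and fall back to local-only collection deliberately rather than by accident.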
Create a detailed 7-phase plan for improving the mlpstorage codebase:
- Phase 1: Interfaces and abstractions for dependency injection
- Phase 2: Modular rules engine (split the 1,434-LOC `rules.py`)
- Phase 3: CLI refactoring with a benchmark registry
- Phase 4: Test infrastructure with shared fixtures and mocks
- Phase 5: Documentation and type annotations
- Phase 6: KV Cache benchmark integration
- Phase 7: Plugin architecture for extensibility

The plan prioritizes testability, modularity for new benchmarks (VectorDB, KV Cache), and code organization improvements.
…-xCXTK Claude/cluster info collection mpi x cxtk