Claude/cluster info collection mpi x cxtk #20
Conversation
This plan outlines a comprehensive approach to collect system information (`/proc/meminfo`, `/proc/cpuinfo`, `/proc/diskstats`, etc.) from all cluster nodes using MPI before benchmark execution.

Key components:
- New `cluster_collector.py` module with an `MPIClusterCollector` class
- Extended data classes for disk and network stats
- Integration with `Benchmark` classes for pre-execution collection
- Metadata storage for post-execution verification
- Cluster consistency validation rules
- Backwards-compatible CLI interface

The implementation maintains backwards compatibility with existing CLI args while adding MPI-based collection as the preferred method.
This commit introduces the core cluster information collection module, which uses MPI to gather system information from all nodes in a cluster.

New module `mlpstorage/cluster_collector.py` includes:

Data Classes:
- `HostDiskInfo`: Disk statistics from `/proc/diskstats`
- `HostNetworkInfo`: Network interface stats from `/proc/net/dev`
- `HostSystemInfo`: Extended system info (kernel, OS, uptime, load)

/proc File Parsers:
- `parse_proc_meminfo`: Parse `/proc/meminfo`
- `parse_proc_cpuinfo`: Parse `/proc/cpuinfo`
- `parse_proc_diskstats`: Parse `/proc/diskstats`
- `parse_proc_net_dev`: Parse `/proc/net/dev`
- `parse_proc_version`: Parse `/proc/version`
- `parse_proc_loadavg`: Parse `/proc/loadavg`
- `parse_proc_uptime`: Parse `/proc/uptime`
- `parse_os_release`: Parse `/etc/os-release`

Collection Functions:
- `collect_local_system_info()`: Collect all system info from the local node
- `summarize_cpuinfo()`: Summarize CPU info (cores, sockets, model)

MPI Collector:
- `MPIClusterCollector`: Class that generates and executes the MPI collection
- `MPI_COLLECTOR_SCRIPT`: Embedded Python script for MPI execution
- `collect_cluster_info()`: High-level function with fallback support

Features:
- Supports both `mpirun` and `mpiexec`
- Automatic fallback to local-only collection when MPI fails
- Handles kernels 4.18+ (discard fields) and 5.5+ (flush fields) in `/proc/diskstats`
- JSON-serializable output for metadata storage
- Comprehensive error handling

Tests:
- 58 unit tests covering all parsers, data classes, and collectors
- Integration tests for JSON serialization roundtrips
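The `/proc` parsers listed above share a common shape: take the file's text, return a JSON-serializable dict. A minimal sketch of what `parse_proc_meminfo` might look like (the real module's signature and unit handling may differ; this version normalizes `kB` values to bytes):

```python
def parse_proc_meminfo(content: str) -> dict:
    """Parse /proc/meminfo text into a dict of field name -> integer.

    Lines look like 'MemTotal:       16333852 kB'; values carrying a
    'kB' unit are normalized to bytes, unit-less values (e.g.
    HugePages_Total) are kept as plain integers.
    """
    result = {}
    for line in content.splitlines():
        if ':' not in line:
            continue
        key, _, rest = line.partition(':')
        parts = rest.split()
        if not parts:
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == 'kB':
            value *= 1024  # normalize kB to bytes
        result[key.strip()] = value
    return result

sample = "MemTotal:       16333852 kB\nHugePages_Total:       0\n"
info = parse_proc_meminfo(sample)
```

Returning plain dicts of ints keeps the output trivially JSON-serializable, which matters for the metadata storage described below.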
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3165ebc9e5
```python
except ImportError:
    # No MPI available, just collect local info
    local_info = collect_local_info()
    output = {local_info['hostname']: local_info}
    with open(output_file, 'w') as f:
```
Surface mpi4py absence instead of silent local-only output
If mpirun launches this script on multiple hosts but mpi4py is missing on any of them, each rank takes this ImportError branch and writes a local-only JSON to the shared output_file. The collector then reads a single file on the launcher and treats the run as a successful MPI collection, silently dropping remote hosts and misreporting cluster size. This can invalidate downstream consistency checks. Consider exiting non-zero or emitting an explicit failure marker so collect_cluster_info can fall back to local-only collection.
This commit extends the existing data classes in `rules.py` to support MPI-collected cluster information.

HostInfo Extensions:
- Added optional fields: `disks`, `network`, `system`, `collection_timestamp`
- Added `from_collected_data()` classmethod for MPI collector output
- Added `to_dict()` method for JSON serialization
- Integrated with `HostDiskInfo`, `HostNetworkInfo`, `HostSystemInfo`

ClusterInformation Extensions:
- Added new aggregated attributes:
  - `num_hosts`: Number of hosts in the cluster
  - `min_memory_bytes` / `max_memory_bytes`: Memory range
  - `collection_method`: How data was collected (`mpi`, `dlio_summary`, `args`)
  - `collection_timestamp`: When data was collected
  - `host_consistency_issues`: List of detected inconsistencies
- Added `from_mpi_collection()` classmethod
- Added `validate_cluster_consistency()` method that checks:
  - Memory variance (>10% difference warning)
  - CPU core count consistency
  - OS version consistency
  - Kernel version consistency
- Extended `as_dict()` to include all new fields and host details
- Extended `from_dict()` to restore all extended fields

Bug Fixes:
- Fixed `HostMemoryInfo.from_proc_meminfo_dict()` to handle integer values from `parse_proc_meminfo` (it was incorrectly calling `.split()` on integers)

Tests:
- Added 13 new tests for Phase 2 features
- All 90 tests pass (58 cluster_collector + 32 rules)
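The memory-variance check in `validate_cluster_consistency()` could be implemented roughly as below. This is a hedged sketch: the function and parameter names are illustrative, only the >10% threshold comes from the commit message.

```python
def check_memory_variance(host_memory_bytes: dict, threshold: float = 0.10) -> list:
    """Return warning strings when per-host MemTotal differs by more
    than `threshold` (default 10%) relative to the cluster maximum.

    `host_memory_bytes` maps hostname -> MemTotal in bytes.
    """
    issues = []
    if not host_memory_bytes:
        return issues
    max_mem = max(host_memory_bytes.values())
    min_mem = min(host_memory_bytes.values())
    if max_mem and (max_mem - min_mem) / max_mem > threshold:
        issues.append(
            f"Memory varies by more than {threshold:.0%} across hosts: "
            f"min={min_mem} bytes, max={max_mem} bytes"
        )
    return issues

warnings = check_memory_variance({'node1': 64 << 30, 'node2': 32 << 30})
```

Returning a list of strings rather than raising lets the caller accumulate issues from all checks into `host_consistency_issues` and decide later whether they are fatal.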
This commit integrates the MPI-based cluster information collection with the `Benchmark` classes, enabling automatic collection of system information before benchmark execution.

Base `Benchmark` class (`base.py`):
- Added `_should_collect_cluster_info()`: Determines whether collection should run
  - Returns `False` for `datagen`/`configview` commands
  - Returns `False` if `skip_cluster_collection` is set
  - Returns `False` if no hosts are specified
- Added `_collect_cluster_information()`: Collects via MPI when available
  - Uses `collect_cluster_info()` from the `cluster_collector` module
  - Creates `ClusterInformation` from the collected data
  - Logs consistency warnings if detected
  - Returns `None` on failure (allows fallback)
- Added `write_cluster_info()`: Writes detailed cluster info to a JSON file

`DLIOBenchmark` class (`dlio.py`):
- Updated `accumulate_host_info()` to use MPI collection when available
- Falls back to CLI args (`client_host_memory_in_gb`) if MPI fails
- Sets `collection_method` to `'args'` for fallback collection

Tests:
- Added 13 new tests in `test_benchmarks.py` covering:
  - `_should_collect_cluster_info()` conditions
  - `_collect_cluster_information()` MPI integration
  - `accumulate_host_info()` MPI/fallback behavior
  - `write_cluster_info()` file output
- All 103 tests pass (58 cluster_collector + 32 rules + 13 benchmarks)
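The gating logic for `_should_collect_cluster_info()` described above reduces to three early-out checks; a standalone sketch (the real method lives on the `Benchmark` class and reads these values from `self.args`, so treat the signature as an assumption):

```python
def should_collect_cluster_info(command: str,
                                skip_cluster_collection: bool,
                                hosts: list) -> bool:
    """Decide whether pre-execution cluster collection should run.

    Skips collection for data-generation and config-view commands,
    when the user explicitly opted out, or when no hosts were given.
    """
    if command in ('datagen', 'configview'):
        return False
    if skip_cluster_collection:
        return False
    if not hosts:
        return False
    return True
```

Keeping the decision in one small predicate makes each condition independently testable, which is presumably why the commit covers it with dedicated tests.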
Address PR review comment: when mpi4py was not available on remote hosts, the script silently fell back to local-only collection without any indication of failure. This could corrupt multi-host collection output and misreport the cluster size.

Changes:
- The MPI collector script now exits with code 1 when the mpi4py import fails
- Writes an error marker `_mpi_import_error` to the output file for detection
- The collector now detects the error marker and raises a `RuntimeError` with an informative message about installing mpi4py
- Added 4 new tests for the error handling behavior

This ensures failures are clearly reported rather than silently producing incorrect cluster information.
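The detection side of that fix might look like the sketch below. The `_mpi_import_error` key comes from the commit message; the function name and file layout are illustrative assumptions.

```python
import json
import os
import tempfile

def load_collection_output(output_file: str) -> dict:
    """Read the MPI collector's JSON output and fail loudly if any
    rank recorded an mpi4py import failure instead of host data."""
    with open(output_file) as f:
        data = json.load(f)
    if '_mpi_import_error' in data:
        raise RuntimeError(
            "mpi4py was missing on one or more hosts; install mpi4py "
            "on every node or use local-only collection: "
            f"{data['_mpi_import_error']}"
        )
    return data

# Demo: a clean output file round-trips; an error marker raises.
path = os.path.join(tempfile.mkdtemp(), 'cluster_info.json')
with open(path, 'w') as f:
    json.dump({'node1': {'hostname': 'node1'}}, f)
hosts = load_collection_output(path)

with open(path, 'w') as f:
    json.dump({'_mpi_import_error': 'No module named mpi4py'}, f)
try:
    load_collection_output(path)
    raised = False
except RuntimeError:
    raised = True
```

Raising instead of returning partial data is what lets `collect_cluster_info()` distinguish "MPI ran but was broken" from "MPI succeeded", and fall back to local-only collection deliberately rather than by accident.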
Create a detailed 7-phase plan for improving the mlpstorage codebase:
- Phase 1: Interfaces and abstractions for dependency injection
- Phase 2: Modular rules engine (split the 1,434-LOC `rules.py`)
- Phase 3: CLI refactoring with a benchmark registry
- Phase 4: Test infrastructure with shared fixtures and mocks
- Phase 5: Documentation and type annotations
- Phase 6: KV Cache benchmark integration
- Phase 7: Plugin architecture for extensibility

The plan prioritizes testability, modularity for new benchmarks (VectorDB, KV Cache), and code organization improvements.
…-xCXTK Claude/cluster info collection mpi x cxtk