Claude/cluster info collection mpi x cxtk #20

Merged: wvaske merged 6 commits into main from claude/cluster-info-collection-mpi-xCXTK on Jan 15, 2026

Conversation

@wvaske (Owner) commented on Jan 15, 2026

No description provided.

This plan outlines a comprehensive approach to collecting system information
(/proc/meminfo, /proc/cpuinfo, /proc/diskstats, etc.) from all cluster
nodes using MPI before benchmark execution.

Key components:
- New cluster_collector.py module with MPIClusterCollector class
- Extended data classes for disk and network stats
- Integration with Benchmark classes for pre-execution collection
- Metadata storage for post-execution verification
- Cluster consistency validation rules
- Backwards-compatible CLI interface

The implementation maintains backwards compatibility with existing
CLI args while adding MPI-based collection as the preferred method.

This commit introduces the core cluster information collection module
that uses MPI to gather system information from all nodes in a cluster.

New module mlpstorage/cluster_collector.py includes:

Data Classes:
- HostDiskInfo: Disk statistics from /proc/diskstats
- HostNetworkInfo: Network interface stats from /proc/net/dev
- HostSystemInfo: Extended system info (kernel, OS, uptime, load)

/proc File Parsers:
- parse_proc_meminfo: Parse /proc/meminfo
- parse_proc_cpuinfo: Parse /proc/cpuinfo
- parse_proc_diskstats: Parse /proc/diskstats
- parse_proc_net_dev: Parse /proc/net/dev
- parse_proc_version: Parse /proc/version
- parse_proc_loadavg: Parse /proc/loadavg
- parse_proc_uptime: Parse /proc/uptime
- parse_os_release: Parse /etc/os-release

Collection Functions:
- collect_local_system_info(): Collect all system info from local node
- summarize_cpuinfo(): Summarize CPU info (cores, sockets, model)

MPI Collector:
- MPIClusterCollector: Class that generates and executes MPI collection
- MPI_COLLECTOR_SCRIPT: Embedded Python script for MPI execution
- collect_cluster_info(): High-level function with fallback support

Features:
- Supports both mpirun and mpiexec
- Automatic fallback to local-only collection when MPI fails
- Handles kernels 4.18+ (discard fields) and 5.5+ (flush fields); see the parser sketch below
- JSON-serializable output for metadata storage
- Comprehensive error handling

Tests:
- 58 unit tests covering all parsers, data classes, and collectors
- Integration tests for JSON serialization roundtrips
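
Of these parsers, /proc/diskstats is the one whose layout varies by kernel version. A minimal sketch of the variable-width handling, assuming illustrative field names rather than the module's actual HostDiskInfo schema:

```python
# Sketch only: field names are illustrative, not the merged parser's schema.
def parse_proc_diskstats(content: str) -> dict:
    """Map device name -> stats, tolerating extra columns on newer kernels."""
    disks = {}
    for line in content.splitlines():
        fields = line.split()
        if len(fields) < 14:      # major, minor, name + 11 classic stat fields
            continue
        stats = {
            'reads_completed': int(fields[3]),
            'sectors_read': int(fields[5]),
            'writes_completed': int(fields[7]),
            'sectors_written': int(fields[9]),
        }
        if len(fields) >= 18:     # kernel 4.18+ appends 4 discard fields
            stats['discards_completed'] = int(fields[14])
        if len(fields) >= 20:     # kernel 5.5+ appends 2 flush fields
            stats['flushes_completed'] = int(fields[18])
        disks[fields[2]] = stats
    return disks
```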

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3165ebc9e5


Comment thread: mlpstorage/cluster_collector.py (Outdated)
Comment on lines +915 to +919:

```python
except ImportError:
    # No MPI available, just collect local info
    local_info = collect_local_info()
    output = {local_info['hostname']: local_info}
    with open(output_file, 'w') as f:
```


P2: Surface mpi4py absence instead of silent local-only output

If mpirun launches this script on multiple hosts but mpi4py is missing on any of them, each rank takes this ImportError branch and writes a local-only JSON to the shared output_file. The collector then reads a single file on the launcher and treats the run as a successful MPI collection, silently dropping remote hosts and misreporting cluster size. This can invalidate downstream consistency checks. Consider exiting non-zero or emitting an explicit failure marker so collect_cluster_info can fall back to local-only collection.


This commit extends the existing data classes in rules.py to support
MPI-collected cluster information.

HostInfo Extensions:
- Added optional fields: disks, network, system, collection_timestamp
- Added from_collected_data() classmethod for MPI collector output
- Added to_dict() method for JSON serialization
- Integrated with HostDiskInfo, HostNetworkInfo, HostSystemInfo
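
A minimal sketch of the extended HostInfo shape; the field names follow the commit message, while the types, defaults, and payload keys used in from_collected_data() are assumptions:

```python
# Sketch only: types, defaults, and payload keys are assumptions.
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class HostInfo:
    hostname: str
    disks: list = field(default_factory=list)    # HostDiskInfo payloads
    network: list = field(default_factory=list)  # HostNetworkInfo payloads
    system: Optional[dict] = None                # HostSystemInfo payload
    collection_timestamp: Optional[float] = None

    @classmethod
    def from_collected_data(cls, hostname: str, data: dict) -> "HostInfo":
        # One host's entry from the MPI collector's JSON output.
        return cls(hostname=hostname,
                   disks=data.get('disks', []),
                   network=data.get('network', []),
                   system=data.get('system'),
                   collection_timestamp=data.get('collection_timestamp'))

    def to_dict(self) -> dict:
        # JSON-serializable form for metadata storage.
        return asdict(self)
```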

ClusterInformation Extensions:
- Added new aggregated attributes:
  - num_hosts: Number of hosts in cluster
  - min_memory_bytes / max_memory_bytes: Memory range
  - collection_method: How data was collected (mpi, dlio_summary, args)
  - collection_timestamp: When data was collected
  - host_consistency_issues: List of detected inconsistencies
- Added from_mpi_collection() classmethod
- Added validate_cluster_consistency() method (sketched below) that checks:
  - Memory variance (>10% difference warning)
  - CPU core count consistency
  - OS version consistency
  - Kernel version consistency
- Extended as_dict() to include all new fields and host details
- Extended from_dict() to restore all extended fields

Bug Fixes:
- Fixed HostMemoryInfo.from_proc_meminfo_dict() to handle integer
  values from parse_proc_meminfo (was incorrectly calling .split()
  on integers)

Tests:
- Added 13 new tests for Phase 2 features
- All 90 tests pass (58 cluster_collector + 32 rules)
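
An illustrative sketch of those consistency rules; the shape of the hosts mapping is hypothetical, and only the 10% memory threshold comes from the commit message:

```python
# Sketch only: the hosts mapping shape and key names are assumptions.
def validate_cluster_consistency(hosts: dict) -> list:
    """Return human-readable warnings for cross-host inconsistencies."""
    issues = []
    mem = [h['mem_total_bytes'] for h in hosts.values()]
    if mem and min(mem) > 0 and (max(mem) - min(mem)) / min(mem) > 0.10:
        issues.append(f"Memory varies >10% across hosts: "
                      f"{min(mem)} to {max(mem)} bytes")
    for key, label in [('cpu_cores', 'CPU core count'),
                       ('os_version', 'OS version'),
                       ('kernel_version', 'kernel version')]:
        if len({h[key] for h in hosts.values()}) > 1:
            issues.append(f"Hosts disagree on {label}")
    return issues
```
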
This commit integrates the MPI-based cluster information collection
with the Benchmark classes, enabling automatic collection of system
information before benchmark execution.

Base Benchmark class (base.py):
- Added _should_collect_cluster_info() (sketched below): Determines if collection should run
  - Returns False for datagen/configview commands
  - Returns False if skip_cluster_collection is set
  - Returns False if no hosts specified
- Added _collect_cluster_information(): Collects via MPI when available
  - Uses collect_cluster_info() from cluster_collector module
  - Creates ClusterInformation from collected data
  - Logs consistency warnings if detected
  - Returns None on failure (allows fallback)
- Added write_cluster_info(): Writes detailed cluster info to JSON file

DLIOBenchmark class (dlio.py):
- Updated accumulate_host_info() to use MPI collection when available
- Falls back to CLI args (client_host_memory_in_gb) if MPI fails
- Sets collection_method to 'args' for fallback collection

Tests:
- Added 13 new tests in test_benchmarks.py covering:
  - _should_collect_cluster_info() conditions
  - _collect_cluster_information() MPI integration
  - accumulate_host_info() MPI/fallback behavior
  - write_cluster_info() file output
- All 103 tests pass (58 cluster_collector + 32 rules + 13 benchmarks)
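
A sketch of that gating logic; the args attribute names (command, skip_cluster_collection, hosts) are assumptions based on the description above, not the merged code:

```python
# Sketch only: args attribute names are assumptions.
def _should_collect_cluster_info(self) -> bool:
    """Gate MPI collection on command type, opt-out flag, and host list."""
    if self.args.command in ('datagen', 'configview'):
        return False                                  # nothing to verify yet
    if getattr(self.args, 'skip_cluster_collection', False):
        return False                                  # user opted out
    if not getattr(self.args, 'hosts', None):
        return False                                  # no hosts specified
    return True
```
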
Address PR review comment: When mpi4py is not available on remote hosts,
the script was silently falling back to local-only collection without
any indication of failure. This could corrupt multi-host collection
output and misreport cluster size.

Changes:
- MPI collector script now exits with code 1 when mpi4py import fails
- Writes error marker '_mpi_import_error' to output file for detection
- Collector now detects error marker and raises RuntimeError with
  informative message about installing mpi4py
- Added 4 new tests for error handling behavior

This ensures failures are clearly reported rather than silently producing
incorrect cluster information. A sketch of the handshake follows.
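
Only the '_mpi_import_error' marker name comes from the commit message; the function names and file handling here are illustrative:

```python
import json
import sys

def guard_mpi_import(output_file: str) -> None:
    """What the embedded MPI script would do before collecting (sketch)."""
    try:
        from mpi4py import MPI  # noqa: F401
    except ImportError as exc:
        # Write an explicit failure marker instead of local-only data,
        # then exit non-zero so the launcher sees the failure.
        with open(output_file, 'w') as f:
            json.dump({'_mpi_import_error': str(exc)}, f)
        sys.exit(1)

def read_collector_output(output_file: str) -> dict:
    """What the collector would do after mpirun/mpiexec returns (sketch)."""
    with open(output_file) as f:
        results = json.load(f)
    if '_mpi_import_error' in results:
        raise RuntimeError(
            "MPI collection failed: mpi4py missing on at least one host "
            f"({results['_mpi_import_error']}); install mpi4py or fall "
            "back to local-only collection."
        )
    return results
```
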
Create detailed 7-phase plan for improving mlpstorage codebase:

- Phase 1: Interfaces and abstractions for dependency injection
- Phase 2: Modular rules engine (split 1,434 LOC rules.py)
- Phase 3: CLI refactoring with benchmark registry
- Phase 4: Test infrastructure with shared fixtures and mocks
- Phase 5: Documentation and type annotations
- Phase 6: KV Cache benchmark integration
- Phase 7: Plugin architecture for extensibility

The plan prioritizes testability, modularity for new benchmarks
(VectorDB, KV Cache), and code organization improvements.

@wvaske merged commit 768feba into main on Jan 15, 2026
4 of 5 checks passed
wvaske added a commit that referenced this pull request on Feb 26, 2026: …-xCXTK Claude/cluster info collection mpi x cxtk
wvaske added a commit that referenced this pull request on Feb 26, 2026: …-xCXTK Claude/cluster info collection mpi x cxtk
wvaske added a commit that referenced this pull request on Mar 2, 2026: …-xCXTK Claude/cluster info collection mpi x cxtk