Feat/telemetry export clean #89

Open
theap06 wants to merge 3 commits into facebookresearch:main from theap06:feat/telemetry-export-clean

theap06 commented Mar 4, 2026

Closes #87

Summary

Adds a new telemetry exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New

  • New sink: --sink=telemetry
  • Output formats: JSON (NDJSON) or CSV
  • Options:
    • file_path (required): Path to the output file
    • format (optional): json (default) or csv

Usage

# JSON (NDJSON, one object per line)
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/var/log/gcm/telemetry.json --once

# CSV
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/var/log/gcm/telemetry.csv --sink-opt format=csv --once

Example Output (JSON)

{"timestamp": "2026-03-04T21:31:22", "hostname": "node-42", "gpu_id": 3, "job_id": 91283, "job_user": "research_team", "gpu_util": 88, "mem_used_percent": 71, "temperature": 78, "power_draw": 310, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}
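
For offline analysis, a file in this format can be read back one line at a time. The sketch below is illustrative only and not part of this PR: the helper names load_ndjson and mean_by_gpu are hypothetical, and it assumes every non-empty line is a JSON object carrying the hostname, gpu_id, and gpu_util fields shown in the example above.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_ndjson(path):
    """Return one dict per non-empty line of an NDJSON file."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records


def mean_by_gpu(records, field="gpu_util"):
    """Average a numeric field per (hostname, gpu_id) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for rec in records:
        key = (rec["hostname"], rec["gpu_id"])
        totals[key][0] += rec[field]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

Reading line by line keeps a partially written final record from invalidating the rest of the file, which is one reason an append-only log tends to use NDJSON rather than a single JSON array.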

Implementation

  • ~60 lines of code in gcm/exporters/telemetry.py
  • Follows existing exporter conventions (@register, write(Log, SinkAdditionalParams))
  • Works with gcm nvml_monitor, gcm slurm_monitor, and health check commands
  • Auto-creates parent directories if needed
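
The append behavior described above can be sketched roughly as follows. This is a standalone approximation, not the code in gcm/exporters/telemetry.py: the class name TelemetryFileSink and its write(rows) signature are hypothetical, it does not use the project's @register decorator or Log/SinkAdditionalParams types, and it assumes each snapshot is a flat dict whose keys match the first row of the batch.

```python
import csv
import json
from pathlib import Path


class TelemetryFileSink:
    """Hypothetical stand-in for the exporter; not the gcm API."""

    def __init__(self, file_path, format="json"):
        if format not in ("json", "csv"):
            raise ValueError(f"unsupported format: {format!r}")
        self.path = Path(file_path)
        self.format = format

    def write(self, rows):
        """Append an iterable of flat dicts to the output file."""
        rows = list(rows)
        if not rows:
            return
        # Auto-create parent directories if needed.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        if self.format == "json":
            with self.path.open("a", encoding="utf-8") as fh:
                for row in rows:
                    # NDJSON: one JSON object per line.
                    fh.write(json.dumps(row) + "\n")
        else:
            # Emit the CSV header only when the file is new or empty.
            need_header = not self.path.exists() or self.path.stat().st_size == 0
            with self.path.open("a", newline="", encoding="utf-8") as fh:
                writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
                if need_header:
                    writer.writeheader()
                writer.writerows(rows)
```

Checking the file size before appending is what keeps the CSV header from repeating across runs. A sketch like this does not handle a column set that changes between runs.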

Testing

pytest gcm/tests/test_telemetry_exporter.py -v

theap06 added 2 commits March 4, 2026 14:42
…research#87)

Add a new 'telemetry' sink that periodically appends telemetry snapshots
to a local file in JSON or CSV format for offline analysis.

- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87

Made-with: Cursor

github-actions bot commented Mar 4, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

Development

Successfully merging this pull request may close these issues.

🚀[Feature Request]: Structured Telemetry Export (CSV/JSON) for Offline Analysis
