Add Kubernetes monitor collector for SUNK cluster observability [3/3] #66

Open
gustcol wants to merge 6 commits into facebookresearch:main from gustcol:feature/k8s-monitor

@gustcol (Contributor) commented Feb 27, 2026

Summary

  • Full kubernetes_monitor collector following the established CLI pattern (CliObject Protocol, collect_* generators, Click decorators, run_data_collection_loop)
  • Collects pod metrics (phase, restart count, Slurm job correlation) and node conditions (Ready, MemoryPressure, etc.)
  • Adds --namespace, --in-cluster/--no-in-cluster, and --label-selector options
  • Adds K8S_POD and K8S_NODE to DataIdentifier enum
  • Includes collector documentation
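
For readers who want to see how the three options behave, here is a minimal sketch. It uses the standard library's `argparse` as a stand-in for the Click decorators the collector actually uses. The defaults shown (all namespaces, in-cluster auth on) are assumptions, not values taken from this PR. `BooleanOptionalAction` needs Python 3.9 or later.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kubernetes_monitor")
    # Restrict collection to one namespace; None is assumed to mean "all namespaces".
    parser.add_argument("--namespace", default=None)
    # Paired on/off flag, analogous to Click's "--in-cluster/--no-in-cluster".
    # The default of True is an assumption made for this sketch.
    parser.add_argument(
        "--in-cluster",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    # Kubernetes label selector string, for example "app=slurmd".
    parser.add_argument("--label-selector", default=None)
    return parser


args = build_parser().parse_args(
    ["--namespace", "slurm", "--no-in-cluster", "--label-selector", "app=slurmd"]
)
print(args.namespace, args.in_cluster, args.label_selector)
```

Passing `--no-in-cluster` sets `in_cluster` to `False`, which in the real collector would correspond to kubeconfig-based auth.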

Stacked PR series: [1/3] #64 ← [2/3] #65 ← [3/3] (this PR)

Note: This PR is stacked on #65 and #64. Please merge those first.

Ref: #63

Test plan

  • All 7 collector unit tests pass (test_kubernetes_monitor.py)
  • All 28 K8s tests pass across all 3 test files
  • test_cli[kubernetes_monitor] passes in existing CLI test suite (gcm CLI registration works)
  • ufmt formatting clean
  • flake8 linting clean
  • No existing tests broken (42 passed, 1 skipped)

Introduce KubernetesPodRow/Payload and KubernetesNodeConditionRow/Payload
schemas following the existing Row+Payload(DerivedCluster) pattern, plus
a KubernetesClient Protocol mirroring SlurmClient for pluggable K8s data
sources. This is the foundation for Kubernetes-layer observability in
SUNK (Slurm-on-K8s) clusters.

Ref: facebookresearch#63
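
The Row+Payload pattern and the client Protocol described in this commit could look roughly like the sketch below. The field names, the `KubernetesPodPayload` and `list_pods` names, and the `DerivedCluster` stand-in are all assumptions made for illustration. The actual schema lives in the diff.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass
class DerivedCluster:
    """Stand-in for the project's base class; the real definition may differ."""
    cluster: str


@dataclass
class KubernetesPodRow:
    """One row per container. Field names are illustrative guesses."""
    name: str
    namespace: str
    container: str
    phase: str
    restart_count: int
    slurm_job_id: Optional[str] = None


@dataclass
class KubernetesPodPayload(DerivedCluster):
    """Wraps a row together with the cluster field inherited from the base."""
    pod: KubernetesPodRow


class KubernetesClient(Protocol):
    """Structural interface: any object with a matching list_pods() satisfies it."""

    def list_pods(
        self, namespace: Optional[str] = None, label_selector: Optional[str] = None
    ) -> Iterable[KubernetesPodRow]: ...


class StaticClient:
    """In-memory data source; it does not inherit from KubernetesClient."""

    def __init__(self, rows: List[KubernetesPodRow]) -> None:
        self._rows = rows

    def list_pods(self, namespace=None, label_selector=None):
        # label_selector is accepted but ignored in this sketch.
        return [r for r in self._rows if namespace in (None, r.namespace)]


def to_payloads(client: KubernetesClient, cluster: str) -> List[KubernetesPodPayload]:
    """Wrap every row from the client in a payload tagged with the cluster name."""
    return [KubernetesPodPayload(cluster=cluster, pod=row) for row in client.list_pods()]
```

Because the interface is a Protocol, a real API-backed client and a test double can be swapped without sharing a base class.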

Implement KubernetesApiClient using the official kubernetes Python
library as an optional dependency. Supports both in-cluster and
kubeconfig auth, extracts slurm.coreweave.com/job-id annotations for
Slurm-K8s correlation, and emits one KubernetesPodRow per container.
Includes KubernetesFakeClient for testing with injectable data.

Ref: facebookresearch#63
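
As a rough illustration of the per-container flattening and the Slurm correlation, the sketch below works on a plain dict shaped like a Kubernetes Pod JSON object. It does not use the `kubernetes` library objects that the real client handles. The function name and the output keys are hypothetical. Only the annotation key comes from the commit message.

```python
from typing import Any, Dict, List, Optional

SLURM_JOB_ID_ANNOTATION = "slurm.coreweave.com/job-id"


def pod_to_rows(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten one pod object into one row per container status."""
    metadata = pod.get("metadata", {})
    status = pod.get("status", {})
    # Annotations may be absent or null on pods not launched by Slurm.
    annotations = metadata.get("annotations") or {}
    job_id: Optional[str] = annotations.get(SLURM_JOB_ID_ANNOTATION)
    rows = []
    for container in status.get("containerStatuses") or []:
        rows.append(
            {
                "pod": metadata.get("name"),
                "namespace": metadata.get("namespace"),
                "container": container.get("name"),
                "phase": status.get("phase"),
                "restart_count": container.get("restartCount", 0),
                "slurm_job_id": job_id,
            }
        )
    return rows


example_pod = {
    "metadata": {
        "name": "slurmd-0",
        "namespace": "slurm",
        "annotations": {SLURM_JOB_ID_ANNOTATION: "4242"},
    },
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "slurmd", "restartCount": 1},
            {"name": "sidecar", "restartCount": 0},
        ],
    },
}
example_rows = pod_to_rows(example_pod)
```

A pod with two containers yields two rows that share the same pod name, phase, and Slurm job ID.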

Implement the full kubernetes_monitor collector following the established
CLI pattern (CliObject Protocol, collect_* generators, Click decorators,
run_data_collection_loop). Collects pod metrics and node conditions from
the Kubernetes API with namespace/label filtering. Adds K8S_POD and
K8S_NODE identifiers to DataIdentifier enum and registers the new
command in the gcm CLI entrypoint.

Ref: facebookresearch#63
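
The generator-based collection pattern can be sketched as follows. `run_collection_pass` is a hypothetical, single-pass simplification and is not the project's `run_data_collection_loop`. The enum values and the `NodeConditionRow` fields are placeholders. The node dicts follow the Kubernetes Node JSON shape (`status.conditions[].type` and `.status`).

```python
import enum
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


class DataIdentifier(enum.Enum):
    # Only the two members named in this PR; the values are placeholders.
    K8S_POD = enum.auto()
    K8S_NODE = enum.auto()


@dataclass
class NodeConditionRow:
    node: str
    condition: str  # e.g. "Ready", "MemoryPressure"
    status: str  # "True", "False", or "Unknown"


def collect_node_conditions(
    nodes: Iterable[Dict[str, Any]],
) -> Iterator[NodeConditionRow]:
    """Yield one row per (node, condition) pair."""
    for node in nodes:
        name = node["metadata"]["name"]
        for cond in node["status"].get("conditions", []):
            yield NodeConditionRow(name, cond["type"], cond["status"])


def run_collection_pass(
    collectors: Dict[DataIdentifier, Callable[[], Iterable[Any]]],
    sink: Callable[[DataIdentifier, Any], None],
) -> None:
    """Drain every collector once, tagging each row with its identifier."""
    for identifier, collect in collectors.items():
        for row in collect():
            sink(identifier, row)


nodes = [
    {
        "metadata": {"name": "node-a"},
        "status": {
            "conditions": [
                {"type": "Ready", "status": "True"},
                {"type": "MemoryPressure", "status": "False"},
            ]
        },
    }
]
emitted: List[Tuple[DataIdentifier, Any]] = []
run_collection_pass(
    {DataIdentifier.K8S_NODE: lambda: collect_node_conditions(nodes)},
    lambda identifier, row: emitted.append((identifier, row)),
)
```

Keeping each `collect_*` function a generator lets the loop stream rows to the sink without holding a full cluster snapshot in memory.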
@github-actions

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@meta-cla meta-cla bot added the cla signed label Feb 27, 2026

Add type: ignore[import-not-found] comments to kubernetes imports
since the library lacks type stubs and is an optional dependency.
Rewrite FakeClock as standalone dataclass instead of inheriting from
Clock Protocol to avoid abstract method instantiation errors. Cast
generator results to concrete payload types before accessing attributes
to resolve DataclassInstance attribute access errors.
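
A minimal sketch of the two typing fixes described above. The `Clock` method names (`unixtime`, `sleep`) and the `PodPayload` class are hypothetical stand-ins, since the real Protocol and payload types are not shown in this conversation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Protocol, cast


class Clock(Protocol):
    """Hypothetical clock interface; the project's real Protocol may differ."""

    def unixtime(self) -> int: ...

    def sleep(self, seconds: float) -> None: ...


@dataclass
class FakeClock:
    """Standalone test clock: satisfies Clock structurally, without inheriting."""

    now: int = 0
    slept: List[float] = field(default_factory=list)

    def unixtime(self) -> int:
        return self.now

    def sleep(self, seconds: float) -> None:
        # Record the request and advance time instead of blocking.
        self.slept.append(seconds)
        self.now += int(seconds)


def wait_one_interval(clock: Clock, interval: float) -> int:
    """Accepts anything that matches the Clock Protocol, including FakeClock."""
    clock.sleep(interval)
    return clock.unixtime()


@dataclass
class PodPayload:
    phase: str


def collect() -> Iterator[object]:
    # Loosely typed generator, standing in for a collector's return type.
    yield PodPayload(phase="Running")


# cast() has no runtime effect; it only tells the type checker the concrete type.
first = cast(PodPayload, next(collect()))
fake = FakeClock(now=100)
later = wait_one_interval(fake, 30)
```

Since `FakeClock` no longer subclasses the Protocol, it has no abstract members to leave unimplemented, and it can still be passed wherever a `Clock` is expected.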