Add Kubernetes monitor collector for SUNK cluster observability [3/3] #66
Open
gustcol wants to merge 6 commits into facebookresearch:main
Introduce KubernetesPodRow/Payload and KubernetesNodeConditionRow/Payload schemas following the existing Row+Payload(DerivedCluster) pattern, plus a KubernetesClient Protocol mirroring SlurmClient for pluggable K8s data sources. This is the foundation for Kubernetes-layer observability in SUNK (Slurm-on-K8s) clusters. Ref: facebookresearch#63
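A minimal sketch of the Row+Payload shape and a pluggable client Protocol, to make the pattern concrete. The field names, the `list_pods` method, the `StaticClient` class, and the `build_payloads` helper are all illustrative assumptions; the PR's actual schemas and the real `KubernetesClient` signature may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass
class KubernetesPodRow:
    # Hypothetical fields; the real schema in the PR may differ.
    name: str
    namespace: str
    container_name: str
    phase: str
    slurm_job_id: Optional[str] = None


@dataclass
class KubernetesPodPayload:
    # A plain `cluster` field stands in for the DerivedCluster pattern here.
    cluster: str
    pod: KubernetesPodRow


class KubernetesClient(Protocol):
    # Hypothetical method name, chosen only for this sketch.
    def list_pods(self, namespace: Optional[str] = None) -> Iterable[KubernetesPodRow]:
        ...


class StaticClient:
    """Satisfies KubernetesClient structurally, without inheriting from it."""

    def __init__(self, rows: Iterable[KubernetesPodRow]) -> None:
        self._rows = list(rows)

    def list_pods(self, namespace: Optional[str] = None) -> Iterable[KubernetesPodRow]:
        return [r for r in self._rows if namespace is None or r.namespace == namespace]


def build_payloads(
    client: KubernetesClient, cluster: str, namespace: Optional[str] = None
) -> List[KubernetesPodPayload]:
    # Wrap every row returned by the client in a payload tagged with the cluster.
    return [KubernetesPodPayload(cluster=cluster, pod=row) for row in client.list_pods(namespace)]
```

Because the Protocol is structural, any object with a compatible `list_pods` method can be passed to `build_payloads`, which is what makes the data source swappable.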
Implement KubernetesApiClient using the official kubernetes Python library as an optional dependency. Supports both in-cluster and kubeconfig auth, extracts slurm.coreweave.com/job-id annotations for Slurm-K8s correlation, and emits one KubernetesPodRow per container. Includes KubernetesFakeClient for testing with injectable data. Ref: facebookresearch#63
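The annotation lookup and the one-row-per-container fan-out might look roughly like the sketch below. It uses `SimpleNamespace` objects that mimic the `metadata`, `spec`, and `status` attributes of a pod object from the kubernetes library, so it runs without the library installed. The `pod_to_rows` helper and the dictionary keys are assumptions made for illustration, not the PR's code.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterator, Optional

SLURM_JOB_ID_ANNOTATION = "slurm.coreweave.com/job-id"


def pod_to_rows(pod: Any) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one row per container, each carrying the pod's Slurm job id."""
    # Annotations may be None on a pod that has none set.
    annotations = pod.metadata.annotations or {}
    job_id = annotations.get(SLURM_JOB_ID_ANNOTATION)
    for container in pod.spec.containers:
        yield {
            "pod": pod.metadata.name,
            "namespace": pod.metadata.namespace,
            "container": container.name,
            "phase": pod.status.phase,
            "slurm_job_id": job_id,
        }


# A stand-in pod with two containers and a Slurm job annotation.
example_pod = SimpleNamespace(
    metadata=SimpleNamespace(
        name="slurmd-0",
        namespace="slurm",
        annotations={SLURM_JOB_ID_ANNOTATION: "1234"},
    ),
    spec=SimpleNamespace(
        containers=[SimpleNamespace(name="slurmd"), SimpleNamespace(name="sidecar")]
    ),
    status=SimpleNamespace(phase="Running"),
)
```

Pods without the annotation still produce rows; their `slurm_job_id` is simply `None`, so non-Slurm workloads remain visible.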
Implement the full kubernetes_monitor collector following the established CLI pattern (CliObject Protocol, collect_* generators, Click decorators, run_data_collection_loop). Collects pod metrics and node conditions from the Kubernetes API with namespace/label filtering. Adds K8S_POD and K8S_NODE identifiers to DataIdentifier enum and registers the new command in the gcm CLI entrypoint. Ref: facebookresearch#63
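As a rough illustration of the `collect_*` generator idea, the sketch below tags pod rows and node-condition rows with the two new identifiers. The `DataIdentifier` class here is a local stand-in with assumed string values, and `PodRow`, `NodeConditionRow`, and `run_once` are hypothetical names; the real collector is driven by `run_data_collection_loop`, which is not reproduced here.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Iterable, Iterator, List, Optional, Tuple


class DataIdentifier(Enum):
    # Stand-in enum: only the two members named in the PR, with assumed values.
    K8S_POD = "k8s_pod"
    K8S_NODE = "k8s_node"


@dataclass
class PodRow:
    name: str
    namespace: str


@dataclass
class NodeConditionRow:
    node: str
    condition: str
    status: str


def collect_pods(pods: Iterable[PodRow], namespace: Optional[str] = None) -> Iterator[PodRow]:
    # Lazily yield pods, applying the optional namespace filter.
    for pod in pods:
        if namespace is None or pod.namespace == namespace:
            yield pod


def collect_node_conditions(rows: Iterable[NodeConditionRow]) -> Iterator[NodeConditionRow]:
    yield from rows


def run_once(
    pods: Iterable[PodRow],
    conditions: Iterable[NodeConditionRow],
    namespace: Optional[str] = None,
) -> List[Tuple[DataIdentifier, dict]]:
    # One pass of a collection loop: drain each generator and tag its rows.
    out: List[Tuple[DataIdentifier, dict]] = []
    for pod in collect_pods(pods, namespace):
        out.append((DataIdentifier.K8S_POD, asdict(pod)))
    for cond in collect_node_conditions(conditions):
        out.append((DataIdentifier.K8S_NODE, asdict(cond)))
    return out
```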
Add type: ignore[import-not-found] comments to kubernetes imports since the library lacks type stubs and is an optional dependency.
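One common way to combine the `type: ignore` comment with an optional dependency is a guarded import, sketched below. `HAS_KUBERNETES` and `make_core_api` are hypothetical names for this example; `load_incluster_config`, `load_kube_config`, and `CoreV1Api` are the kubernetes library's own entry points for the two auth modes mentioned earlier.

```python
try:
    # The ignore silences mypy when the library (and its stubs) are not installed.
    from kubernetes import client, config  # type: ignore[import-not-found]

    HAS_KUBERNETES = True
except ImportError:
    client = None
    config = None
    HAS_KUBERNETES = False


def make_core_api(in_cluster: bool):
    """Return a CoreV1Api, failing with a clear message if the extra is missing."""
    if not HAS_KUBERNETES:
        raise RuntimeError("the optional 'kubernetes' package is not installed")
    if in_cluster:
        # Uses the service-account credentials mounted into the pod.
        config.load_incluster_config()
    else:
        # Reads the local kubeconfig file.
        config.load_kube_config()
    return client.CoreV1Api()
```

Importing the module never fails; the error surfaces only when a caller actually asks for a real API client.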
Rewrite FakeClock as standalone dataclass instead of inheriting from Clock Protocol to avoid abstract method instantiation errors. Cast generator results to concrete payload types before accessing attributes to resolve DataclassInstance attribute access errors.
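The two fixes described above can be sketched as follows. The `Clock` methods shown (`unixtime`, `sleep`) and the `PodPayload` type are assumptions for illustration; the project's actual Protocol and payload classes may differ.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol, cast


class Clock(Protocol):
    # Hypothetical method set for this sketch.
    def unixtime(self) -> int: ...

    def sleep(self, seconds: float) -> None: ...


@dataclass
class FakeClock:
    """Standalone dataclass: matches Clock structurally, does not inherit from it."""

    now: float = 0.0

    def unixtime(self) -> int:
        return int(self.now)

    def sleep(self, seconds: float) -> None:
        # Advance simulated time instead of blocking.
        self.now += seconds


@dataclass
class PodPayload:
    cluster: str


def collect() -> Iterator[object]:
    # Declared with a loose element type, like a generic dataclass-instance generator.
    yield PodPayload(cluster="test")


def wait(clock: Clock, seconds: float) -> None:
    clock.sleep(seconds)


def first_cluster() -> str:
    # cast() tells the type checker the concrete payload type before attribute access.
    payload = cast(PodPayload, next(collect()))
    return payload.cluster
```

`cast` has no runtime effect; it only narrows the type for the checker, so it is safe only when the generator is known to yield that payload type.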
Summary
- Implement kubernetes_monitor collector following the established CLI pattern (CliObject Protocol, collect_* generators, Click decorators, run_data_collection_loop)
- --namespace, --in-cluster/--no-in-cluster, and --label-selector options
- Add K8S_POD and K8S_NODE to DataIdentifier enum

Stacked PR series: [1/3] #64 ← [2/3] #65 ← [3/3]
Ref: #63
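For a sense of how the three options behave together, here is a dependency-free sketch that uses the standard library's `argparse` in place of the Click decorators the PR actually uses. The defaults and help strings are assumptions; only the option names come from the summary.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kubernetes_monitor")
    # Assumed default: no namespace filter.
    parser.add_argument("--namespace", default=None, help="restrict collection to one namespace")
    # BooleanOptionalAction (Python 3.9+) generates both --in-cluster and --no-in-cluster.
    parser.add_argument(
        "--in-cluster",
        action=argparse.BooleanOptionalAction,
        default=False,  # assumed default for this sketch
        help="use in-cluster auth instead of a kubeconfig file",
    )
    parser.add_argument("--label-selector", default=None, help="Kubernetes label selector")
    return parser


args = build_parser().parse_args(
    ["--namespace", "slurm", "--no-in-cluster", "--label-selector", "app=slurmd"]
)
```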
Test plan
- Unit tests (test_kubernetes_monitor.py)
- test_cli[kubernetes_monitor] passes in existing CLI test suite (gcm CLI registration works)
- ufmt formatting clean
- flake8 linting clean