Add HealthCheckRuntime context manager for shared boilerplate [1/2] #77

Open
gustcol wants to merge 2 commits into facebookresearch:main from gustcol:feature/health-check-runtime

gustcol commented Mar 1, 2026

Summary

Ref: #75

  • Introduce HealthCheckRuntime, a @dataclass context manager that encapsulates the ~30 lines of repeated setup code every health check subcommand duplicates: logger initialization, GPU node ID detection, derived cluster resolution, ExitStack with TelemetryContext + OutputContext, and killswitch checking
  • Reduces per-subcommand boilerplate from ~30 lines to ~5 lines (a with HealthCheckRuntime(...) as rt: block)
  • Purely additive — existing checks continue to work unchanged; no migration in this PR

Stacked PR series: [1/2] Runtime helper → [2/2] Scaffold tool (depends on this PR)

Before (~30 lines per subcommand)

node = socket.gethostname()
logger, _ = init_logger(...)
try: gpu_node_id = gni_lib.get_gpu_node_id()
except: gpu_node_id = None; ...
derived_cluster = get_derived_cluster(...)
exit_code = ExitCode.UNKNOWN
msg = ""
with ExitStack() as s:
    s.enter_context(TelemetryContext(...))
    s.enter_context(OutputContext(...))
    ff = FeatureValueHealthChecksFeatures()
    if ff.get_healthchecksfeatures_disable_check_sensors():
        exit_code = ExitCode.OK; ...
    # actual check logic
    sys.exit(exit_code.value)

After (~5 lines per subcommand)

with HealthCheckRuntime(
    cluster=cluster, type=type, log_level=log_level,
    log_folder=log_folder, sink=sink, sink_opts=sink_opts,
    verbose_out=verbose_out,
    heterogeneous_cluster_v1=heterogeneous_cluster_v1,
    health_check_name=HealthCheckName.CHECK_SENSORS,
    killswitch_getter=lambda: FeatureValueHealthChecksFeatures()
        .get_healthchecksfeatures_disable_check_sensors(),
) as rt:
    # actual check logic using rt.logger, rt.node, etc.
    rt.finish(exit_code, msg)
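
For readers unfamiliar with the pattern, the following is a minimal sketch of how a dataclass can double as a context manager that wires up nested contexts and a killswitch check. It is not the code in this PR: `RuntimeSketch`, `recording_context`, the `killswitched` flag, the stand-in `ExitCode` values, and the choice to have `finish()` only record the result (rather than exit the process) are all assumptions made for illustration.

```python
import logging
import socket
from contextlib import ExitStack, contextmanager
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Iterator, List, Optional


class ExitCode(Enum):
    """Stand-in for the project's exit-code enum (values are assumed)."""
    OK = 0
    UNKNOWN = 3


@contextmanager
def recording_context(name: str, events: List[str]) -> Iterator[None]:
    """Hypothetical placeholder for TelemetryContext / OutputContext."""
    events.append(f"enter {name}")
    try:
        yield
    finally:
        events.append(f"exit {name}")


@dataclass
class RuntimeSketch:
    health_check_name: str
    killswitch_getter: Callable[[], bool]
    events: List[str] = field(default_factory=list)
    # The fields below are derived in __enter__, not passed by the caller.
    node: str = field(init=False, default="")
    logger: Optional[logging.Logger] = field(init=False, default=None)
    killswitched: bool = field(init=False, default=False)
    exit_code: ExitCode = field(init=False, default=ExitCode.UNKNOWN)
    msg: str = field(init=False, default="")
    _stack: Optional[ExitStack] = field(init=False, default=None, repr=False)

    def __enter__(self) -> "RuntimeSketch":
        self.node = socket.gethostname()
        self.logger = logging.getLogger(self.health_check_name)
        # ExitStack unwinds the nested contexts in reverse order on exit.
        self._stack = ExitStack()
        self._stack.enter_context(recording_context("telemetry", self.events))
        self._stack.enter_context(recording_context("output", self.events))
        # Assumption: a disabled check reports OK and lets the body skip work.
        if self.killswitch_getter():
            self.killswitched = True
            self.finish(ExitCode.OK, "check disabled by killswitch")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        assert self._stack is not None
        return bool(self._stack.__exit__(exc_type, exc, tb))

    def finish(self, exit_code: ExitCode, msg: str = "") -> None:
        # Only records the result; this sketch never calls sys.exit().
        self.exit_code = exit_code
        self.msg = msg


# Usage: the body runs the check only when the killswitch is off.
with RuntimeSketch("check_sensors", killswitch_getter=lambda: False) as rt:
    if not rt.killswitched:
        rt.finish(ExitCode.OK, "sensors nominal")
```

Keeping the killswitch lookup behind a callable, as the `killswitch_getter` parameter in the PR does, means the feature flag is read when the context is entered rather than when the runtime object is constructed.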

Test plan

  • nox -s tests -- gcm/tests/health_checks_tests/test_runtime.py — 6 tests covering initialization, killswitch behavior, finish(), context nesting, GPU node ID failure
  • nox -s lint
  • nox -s format
  • nox -s typecheck
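
As an illustration of the "GPU node ID failure" case in the test plan, here is a small pytest-style sketch of how such a fallback could be exercised with `unittest.mock`. The helper `safe_gpu_node_id` and the value `"node-0042"` are hypothetical and exist only for this example; the actual tests in `test_runtime.py` may be structured differently.

```python
from typing import Callable, Optional
from unittest import mock


def safe_gpu_node_id(getter: Callable[[], str]) -> Optional[str]:
    """Return the GPU node ID, or None if detection raises (hypothetical helper)."""
    try:
        return getter()
    except Exception:
        return None


def test_gpu_node_id_failure_returns_none() -> None:
    # Simulate a detection failure by making the getter raise.
    getter = mock.Mock(side_effect=RuntimeError("no GPU node ID available"))
    assert safe_gpu_node_id(getter) is None
    getter.assert_called_once_with()


def test_gpu_node_id_success_passes_value_through() -> None:
    getter = mock.Mock(return_value="node-0042")
    assert safe_gpu_node_id(getter) == "node-0042"
```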

…plate

Extract the ~30 lines of repeated setup code (logger init, GPU node ID
detection, derived cluster resolution, TelemetryContext + OutputContext
nesting, killswitch check) into a reusable HealthCheckRuntime dataclass
context manager. This reduces per-subcommand boilerplate from ~30 lines
to ~5 lines.

The helper is purely additive — existing checks continue to work
unchanged. New checks can use `with HealthCheckRuntime(...) as rt:`
instead of manually wiring up the setup ceremony.

Includes comprehensive tests covering field initialization, killswitch
behavior, context manager nesting, GPU node ID failure handling, and
the finish() convenience method.

Refs: facebookresearch#75

github-actions bot commented Mar 1, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow | What it runs
GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI | shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command | Description | Requires approval?
/metaci tests | Runs Meta internal integration tests (pytest) | Yes: a maintainer must trigger the command and approve the deployment request
/metaci integration tests | Same as above (alias) | Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

Apply ufmt formatting and fix mypy errors in test helper
by using explicit typed parameters instead of **kwargs dict
unpacking.
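
The kind of change this commit describes can be sketched as follows. The names `Runtime` and `make_runtime` are hypothetical stand-ins, not the helper from the PR; the point is only that a test helper with explicit typed parameters lets mypy check each argument, whereas unpacking a loosely typed dict generally does not.

```python
from dataclasses import dataclass


@dataclass
class Runtime:
    """Hypothetical stand-in for the class under test."""
    cluster: str
    log_level: str = "INFO"
    verbose_out: bool = False


# A helper that builds `defaults: dict[str, object]` and calls
# Runtime(**defaults) gives mypy only `object` for every value, which it
# typically rejects as incompatible with the typed constructor parameters.
def make_runtime(
    cluster: str = "test_cluster",
    log_level: str = "INFO",
    verbose_out: bool = False,
) -> Runtime:
    """Build a Runtime with per-test overrides that mypy can check."""
    return Runtime(cluster=cluster, log_level=log_level, verbose_out=verbose_out)
```

A test would then override only what it needs, for example `make_runtime(verbose_out=True)`.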