Enhance resource monitoring guide with practical scenarios and best practices by ChenYi015 · Pull Request #1417 · kubeflow/arena

ChenYi015 · 2026-01-28T12:46:49Z

Purpose of this PR

This PR significantly enhances the resource monitoring guide with comprehensive, practical documentation that helps users effectively monitor cluster resources and diagnose performance issues.

Proposed changes:

- Add comprehensive overview explaining the importance of monitoring for planning, debugging, optimization, and troubleshooting
- Add quick start section with essential monitoring commands
- Add key metrics section documenting node-level and job-level metrics
- Include detailed common monitoring scenarios (GPU availability, debugging slow training, multi-GPU optimization, serving deployment)
- Add monitoring best practices (regular monitoring, metric correlation, baseline establishment, alerting)
- Include diagnostic troubleshooting steps for common monitoring issues
- Add performance tuning recommendations based on metrics (high memory, low utilization, uneven GPU usage)
- Improve overall structure with better organization and clear guidance

Change Category

Documentation update

Rationale

The original guide was minimal and didn't provide enough practical guidance for users to effectively monitor and troubleshoot their workloads. This enhanced version provides step-by-step scenarios, best practices, and performance tuning recommendations that help users identify and resolve performance bottlenecks.

…best practices

- Add comprehensive overview explaining importance of monitoring
- Add quick start section with common monitoring commands
- Add key metrics section for node and job level monitoring
- Include detailed common monitoring scenarios with diagnostic steps
- Add monitoring best practices including correlation and baselining
- Expand troubleshooting section for monitoring issues
- Add performance tuning recommendations based on metrics
- Improve overall structure and readability

Signed-off-by: Yi Chen <github@chenyicn.net>

google-oss-prow · 2026-01-28T12:46:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from chenyi015. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot requested review from wsxiaozhang and xiaozhouX January 28, 2026 12:46

google-oss-prow bot added the size/L label Jan 28, 2026

ChenYi015 marked this pull request as draft January 28, 2026 12:50

google-oss-prow bot added the do-not-merge/work-in-progress label Jan 28, 2026

ChenYi015 wants to merge 1 commit into kubeflow:master from ChenYi015:doc/improve-monitoring-docs