
Enhance resource monitoring guide with practical scenarios and best practices #1417

Draft
ChenYi015 wants to merge 1 commit into kubeflow:master from ChenYi015:doc/improve-monitoring-docs

@ChenYi015
Member

Purpose of this PR

This PR expands the resource monitoring guide with practical documentation that helps users monitor cluster resources and diagnose performance issues.

Proposed changes:

  • Add comprehensive overview explaining the importance of monitoring for planning, debugging, optimization, and troubleshooting
  • Add quick start section with essential monitoring commands
  • Add key metrics section documenting node-level and job-level metrics
  • Include detailed common monitoring scenarios (GPU availability, debugging slow training, multi-GPU optimization, serving deployment)
  • Add monitoring best practices (regular monitoring, metric correlation, baseline establishment, alerting)
  • Include diagnostic troubleshooting steps for common monitoring issues
  • Add performance tuning recommendations based on metrics (high memory, low utilization, uneven GPU usage)
  • Improve overall structure with better organization and clear guidance

Change Category

  • Documentation update

Rationale

The original guide was minimal and didn't provide enough practical guidance for users to effectively monitor and troubleshoot their workloads. This enhanced version provides step-by-step scenarios, best practices, and performance tuning recommendations that help users identify and resolve performance bottlenecks.

…best practices

- Add comprehensive overview explaining importance of monitoring
- Add quick start section with common monitoring commands
- Add key metrics section for node and job level monitoring
- Include detailed common monitoring scenarios with diagnostic steps
- Add monitoring best practices including correlation and baselining
- Expand troubleshooting section for monitoring issues
- Add performance tuning recommendations based on metrics
- Improve overall structure and readability

Signed-off-by: Yi Chen <github@chenyicn.net>
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from chenyi015. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
