fix(autoscaler/metric_collector): exit decode loop on non-EOF error #952
Kernel-9 wants to merge 1 commit into volcano-sh:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @Kernel-9! It looks like this is your first PR to volcano-sh/kthena 🎉
Code Review
This pull request modifies the metric collector to break the decoding loop when encountering malformed Prometheus metrics, preventing potential infinite loops. The reviewer pointed out that while this fix prevents the loop, it may result in partial metric data being used, which could lead to incorrect autoscaling decisions. It is recommended to refactor the function to return an error so the caller can mark the instance as failed instead of processing incomplete data.
```diff
-			continue
+			// Stop decoding on malformed input to avoid spinning forever
+			// if decoder keeps returning the same non-EOF error.
+			break
```
Using `break` here correctly prevents the infinite loop, but it allows the collector to proceed with partial metrics for the current pod. Since `instanceMetricMap` is an accumulator across all pods and missing metrics are filled with 0 at the end of this function (lines 238-242), this can lead to an artificially low aggregate metric value. This might cause the autoscaler to make incorrect decisions, such as scaling down during a period of high load if metrics parsing fails.

Consider refactoring `processPrometheusString` to return an error, and in the caller (`fetchMetricsFromPods`), handle this error by marking the instance as failed (`instanceInfo.IsFailed = true`). This would ensure that corrupted or partial data doesn't impact autoscaling logic.
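A minimal sketch of that refactor, assuming the collector decodes Prometheus text format with `expfmt` from `github.com/prometheus/common`; the parameter list, the `handle` callback, and the caller shape are hypothetical stand-ins, not the actual kthena code:

```go
package collector

import (
	"fmt"
	"io"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// processPrometheusString returns an error on malformed input so the
// caller can discard the whole scrape instead of using partial data.
func processPrometheusString(body string, handle func(*dto.MetricFamily)) error {
	decoder := expfmt.NewDecoder(strings.NewReader(body), expfmt.FmtText)
	for {
		mf := &dto.MetricFamily{}
		if err := decoder.Decode(mf); err != nil {
			if err == io.EOF {
				return nil // clean end of input
			}
			// Surface the failure instead of silently accumulating a
			// partial, artificially low aggregate.
			return fmt.Errorf("decoding metrics: %w", err)
		}
		handle(mf) // filtering and accumulation as before (elided)
	}
}
```

In `fetchMetricsFromPods`, a non-nil return would then set `instanceInfo.IsFailed = true` for that pod's instance and skip its partial data, as suggested above.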
Pull request overview
Fixes a potential tight-loop log flood in the autoscaler’s Prometheus text parsing by stopping the decode loop on non-EOF decode errors, preventing repeated logging of the same decode failure when the decoder/reader enters a bad state.
Changes:
- Change `processPrometheusString` to `break` (instead of `continue`) on non-`io.EOF` decode errors.
- Add an inline comment documenting why the loop exits on malformed input.
Signed-off-by: Kernel-9 <617084524@qq.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
In `processPrometheusString`, when `decoder.Decode` returned a non-EOF error, the loop used `continue`, which caused the same decoding error to be logged repeatedly in a tight loop. Once the underlying reader/decoder enters a bad state (e.g. negative metrics cause parsing anomalies), `Decode` keeps returning the same error, so the loop spins and floods the log with identical `error decoding metric: ...` messages.

This PR changes that branch from `continue` to `break`, so the decoder loop exits on the first non-EOF decoding error instead of spinning forever.

The other two `continue` statements in the same loop are intentionally kept (see the sketch below):

- `len(mf.Metric) < 1`: a successfully decoded but empty `MetricFamily`; the next `Decode` reads a new, independent family, so skipping only this one is correct.
- `WatchMetricList` miss: normal filtering of metrics we don't care about (e.g. `go_*`/`process_*`); breaking here would drop every watched metric that happens to be emitted after an unwatched one.
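A minimal sketch of the loop after this change, again assuming `expfmt` decoding and klog logging; `watchMetricNames` and `addMetric` are hypothetical stand-ins for the real `WatchMetricList` lookup and accumulator:

```go
package collector

import (
	"io"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
	"k8s.io/klog/v2"
)

// Sketch of the decode loop: break on malformed input, continue on
// empty families and unwatched metrics.
func processPrometheusString(body string, watchMetricNames map[string]bool,
	addMetric func(*dto.MetricFamily)) {
	decoder := expfmt.NewDecoder(strings.NewReader(body), expfmt.FmtText)
	for {
		mf := &dto.MetricFamily{}
		if err := decoder.Decode(mf); err != nil {
			if err == io.EOF {
				break // normal end of input
			}
			klog.Errorf("error decoding metric: %v", err)
			// Stop decoding on malformed input to avoid spinning forever
			// if decoder keeps returning the same non-EOF error.
			break // previously: continue
		}
		if len(mf.Metric) < 1 {
			continue // empty but valid family; next Decode reads a new one
		}
		if !watchMetricNames[mf.GetName()] {
			continue // unwatched metric (e.g. go_*, process_*); keep going
		}
		addMetric(mf)
	}
}
```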
Special notes for your reviewer:

Metrics decoded before the error have already been recorded via `addMetric`. The next scrape cycle will try again normally.

Does this PR introduce a user-facing change?: