
fix(autoscaler/metric_collector): exit decode loop on non-EOF error#952

Open
Kernel-9 wants to merge 1 commit into volcano-sh:main from Kernel-9:main

Conversation

@Kernel-9 commented May 6, 2026

What type of PR is this?
/kind bug

What this PR does / why we need it:
In processPrometheusString, when decoder.Decode returned a non-EOF error, the loop used continue, so the same decoding error was logged repeatedly in a tight loop. Once the underlying reader/decoder enters a bad state (e.g. negative metric values triggering parsing anomalies), Decode keeps returning the same error, so the loop spins and floods the log with identical "error decoding metric: ..." messages.

This PR changes that branch from continue to break, so the decoder loop exits on the first non-EOF decoding error instead of spinning forever.

The other two continue statements in the same loop are intentionally kept:

  • len(mf.Metric) < 1: a successfully decoded but empty MetricFamily — the next Decode reads a new, independent family, so skipping only this one is correct.
  • WatchMetricList miss: normal filtering of metrics we don't care about (e.g. go_* / process_*); breaking here would drop every watched metric that happens to be emitted after an unwatched one.
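The three branches above can be sketched as a minimal, self-contained Go loop. The decoder here is a fake that gets stuck returning the same non-EOF error, which is the failure mode this PR fixes; names like processPrometheusString, addMetric, and the watch list are taken from the PR description, while everything else (the fake decoder, metricFamily shape) is illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

type metricFamily struct {
	name   string
	values []float64
}

// fakeDecoder yields one good family, then the same non-EOF error forever,
// simulating a reader/decoder stuck in a bad state.
type fakeDecoder struct{ calls int }

func (d *fakeDecoder) Decode(mf *metricFamily) error {
	d.calls++
	if d.calls == 1 {
		*mf = metricFamily{name: "vllm_requests", values: []float64{3}}
		return nil
	}
	return errors.New("text format parsing error")
}

// collect mirrors the decode loop's control flow and returns the number
// of watched families applied (a stand-in for addMetric).
func collect(dec *fakeDecoder, watch map[string]bool) int {
	collected := 0
	for {
		var mf metricFamily
		if err := dec.Decode(&mf); err != nil {
			if errors.Is(err, io.EOF) {
				break // clean end of the payload
			}
			fmt.Println("error decoding metric:", err)
			// Was `continue` before the fix: the stuck decoder then
			// emitted this log line in a tight loop forever.
			break
		}
		if len(mf.values) < 1 {
			continue // empty but valid family: skip only this one
		}
		if !watch[mf.name] {
			continue // normal filtering of unwatched metrics (e.g. go_*)
		}
		collected++
	}
	return collected
}

func main() {
	dec := &fakeDecoder{}
	fmt.Println("collected:", collect(dec, map[string]bool{"vllm_requests": true}))
	fmt.Println("decode calls:", dec.calls)
}
```

With the fix, the loop applies the one good family and stops after the second Decode call; with the old continue, the call count would grow without bound.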

Special notes for your reviewer:

  • Scope is intentionally minimal: only the decoder-error branch is changed; filtering / per-sample skip logic is untouched.
  • Behavior change on malformed input: the collector now aborts parsing the current pod's metric payload on the first decode error instead of retrying in-place. Metrics already collected before the error in the same payload are still applied via addMetric. The next scrape cycle will try again normally.
  • No new dependencies, no API / CRD / config changes.

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings May 6, 2026 06:52
@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot (Contributor)

Welcome @Kernel-9! It looks like this is your first PR to volcano-sh/kthena 🎉


@gemini-code-assist (Bot) left a comment


Code Review

This pull request modifies the metric collector to break the decoding loop when encountering malformed Prometheus metrics, preventing potential infinite loops. The reviewer pointed out that while this fix prevents the loop, it may result in partial metric data being used, which could lead to incorrect autoscaling decisions. It is recommended to refactor the function to return an error so the caller can mark the instance as failed instead of processing incomplete data.

-    continue
+    // Stop decoding on malformed input to avoid spinning forever
+    // if decoder keeps returning the same non-EOF error.
+    break

Severity: medium

Using break here correctly prevents the infinite loop, but it allows the collector to proceed with partial metrics for the current pod. Since instanceMetricMap is an accumulator across all pods and missing metrics are filled with 0 at the end of this function (lines 238-242), this can lead to an artificially low aggregate metric value. This might cause the autoscaler to make incorrect decisions, such as scaling down during a period of high load if metrics parsing fails.

Consider refactoring processPrometheusString to return an error, and in the caller (fetchMetricsFromPods), handle this error by marking the instance as failed (instanceInfo.IsFailed = true). This would ensure that corrupted or partial data doesn't impact autoscaling logic.
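The refactor suggested above could look roughly like the following sketch. Only the names processPrometheusString, fetchMetricsFromPods, and instanceInfo.IsFailed come from the review comment; the toy "parser" (treating any payload containing "garbage" as a decode error) and the surrounding types are hypothetical stand-ins, not the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

type instanceInfo struct {
	Name     string
	IsFailed bool
}

// processPrometheusString is a toy stand-in: a payload containing
// "garbage" simulates a non-EOF decode error, and the error is now
// returned to the caller instead of being swallowed.
func processPrometheusString(payload string) error {
	if strings.Contains(payload, "garbage") {
		return fmt.Errorf("error decoding metric: malformed input")
	}
	return nil
}

// fetchMetricsFromPods marks instances with unparsable payloads as
// failed, so partial metrics never feed the zero-filled aggregate.
func fetchMetricsFromPods(payloads map[*instanceInfo]string) {
	for inst, payload := range payloads {
		if err := processPrometheusString(payload); err != nil {
			inst.IsFailed = true
			fmt.Printf("instance %s marked failed: %v\n", inst.Name, err)
		}
	}
}

func main() {
	healthy := &instanceInfo{Name: "pod-a"}
	broken := &instanceInfo{Name: "pod-b"}
	fetchMetricsFromPods(map[*instanceInfo]string{
		healthy: "vllm_requests 3",
		broken:  "garbage",
	})
	fmt.Println(healthy.IsFailed, broken.IsFailed)
}
```

The design point is that a failed instance is excluded explicitly rather than contributing an artificially low value, which addresses the scale-down risk the reviewer describes.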

Copilot AI (Contributor) left a comment


Pull request overview

Fixes a potential tight-loop log flood in the autoscaler’s Prometheus text parsing by stopping the decode loop on non-EOF decode errors, preventing repeated logging of the same decode failure when the decoder/reader enters a bad state.

Changes:

  • Change processPrometheusString to break (instead of continue) on non-io.EOF decode errors.
  • Add an inline comment documenting why the loop exits on malformed input.



3 participants