fix(autoscaler/metric_collector): exit decode loop on non-EOF error #952
Kernel-9 wants to merge 1 commit into volcano-sh:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @Kernel-9! It looks like this is your first PR to volcano-sh/kthena 🎉
Code Review
This pull request modifies the metric collector to break the decoding loop when encountering malformed Prometheus metrics, preventing potential infinite loops. The reviewer pointed out that while this fix prevents the loop, it may result in partial metric data being used, which could lead to incorrect autoscaling decisions. It is recommended to refactor the function to return an error so the caller can mark the instance as failed instead of processing incomplete data.
```diff
-			continue
+			// Stop decoding on malformed input to avoid spinning forever
+			// if decoder keeps returning the same non-EOF error.
+			break
```
Using `break` here correctly prevents the infinite loop, but it allows the collector to proceed with partial metrics for the current pod. Since `instanceMetricMap` is an accumulator across all pods and missing metrics are filled with 0 at the end of this function (lines 238-242), this can lead to an artificially low aggregate metric value. This might cause the autoscaler to make incorrect decisions, such as scaling down during a period of high load if metrics parsing fails.

Consider refactoring `processPrometheusString` to return an error, and in the caller (`fetchMetricsFromPods`), handle this error by marking the instance as failed (`instanceInfo.IsFailed = true`). This would ensure that corrupted or partial data doesn't impact autoscaling logic.
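A minimal sketch of that refactor, assuming the collector decodes Prometheus text format with `expfmt` from `github.com/prometheus/common`; the parameter list, the `handle` callback, and the caller shape are hypothetical stand-ins, not the actual kthena code:

```go
package collector

import (
	"fmt"
	"io"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// processPrometheusString returns an error on malformed input so the
// caller can discard the whole scrape instead of using partial data.
func processPrometheusString(body string, handle func(*dto.MetricFamily)) error {
	decoder := expfmt.NewDecoder(strings.NewReader(body), expfmt.FmtText)
	for {
		mf := &dto.MetricFamily{}
		if err := decoder.Decode(mf); err != nil {
			if err == io.EOF {
				return nil // clean end of input
			}
			// Surface the failure instead of silently accumulating a
			// partial, artificially low aggregate.
			return fmt.Errorf("decoding metrics: %w", err)
		}
		handle(mf) // filtering and accumulation as before (elided)
	}
}
```

In `fetchMetricsFromPods`, a non-nil return would then set `instanceInfo.IsFailed = true` for that pod's instance and skip its partial data, as suggested above.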
Pull request overview
Fixes a potential tight-loop log flood in the autoscaler’s Prometheus text parsing by stopping the decode loop on non-EOF decode errors, preventing repeated logging of the same decode failure when the decoder/reader enters a bad state.
Changes:
- Change `processPrometheusString` to `break` (instead of `continue`) on non-`io.EOF` decode errors.
- Add an inline comment documenting why the loop exits on malformed input.
Signed-off-by: Kernel-9 <617084524@qq.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
In `processPrometheusString`, when `decoder.Decode` returned a non-EOF error, the loop used `continue`, which caused the same decoding error to be logged repeatedly in a tight loop. Once the underlying reader/decoder enters a bad state (e.g. negative metrics cause parsing anomalies), `Decode` keeps returning the same error, so the loop spins and floods the log with identical `error decoding metric: ...` messages.

This PR changes that branch from `continue` to `break`, so the decoder loop exits on the first non-EOF decoding error instead of spinning forever.

The other two `continue` statements in the same loop are intentionally kept (see the sketch below):

- `len(mf.Metric) < 1`: a successfully decoded but empty `MetricFamily`; the next `Decode` reads a new, independent family, so skipping only this one is correct.
- `WatchMetricList` miss: normal filtering of metrics we don't care about (e.g. `go_*`/`process_*`); breaking here would drop every watched metric that happens to be emitted after an unwatched one.
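A minimal sketch of the loop after this change, again assuming `expfmt` decoding and klog logging; `watchMetricNames` and `addMetric` are hypothetical stand-ins for the real `WatchMetricList` lookup and accumulator:

```go
package collector

import (
	"io"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
	"k8s.io/klog/v2"
)

// Sketch of the decode loop: break on malformed input, continue on
// empty families and unwatched metrics.
func processPrometheusString(body string, watchMetricNames map[string]bool,
	addMetric func(*dto.MetricFamily)) {
	decoder := expfmt.NewDecoder(strings.NewReader(body), expfmt.FmtText)
	for {
		mf := &dto.MetricFamily{}
		if err := decoder.Decode(mf); err != nil {
			if err == io.EOF {
				break // normal end of input
			}
			klog.Errorf("error decoding metric: %v", err)
			// Stop decoding on malformed input to avoid spinning forever
			// if decoder keeps returning the same non-EOF error.
			break // previously: continue
		}
		if len(mf.Metric) < 1 {
			continue // empty but valid family; next Decode reads a new one
		}
		if !watchMetricNames[mf.GetName()] {
			continue // unwatched metric (e.g. go_*, process_*); keep going
		}
		addMetric(mf)
	}
}
```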
Special notes for your reviewer:

Metrics decoded before the error have already been recorded via `addMetric`. The next scrape cycle will try again normally.

Does this PR introduce a user-facing change?: