DPTP-2938: Allow pod scaler to reduce resources based on past 3 weeks of usage data #4886
base: main
Conversation
Pipeline controller notification: For optional jobs, comment. This repository is configured in: automatic mode.
Walkthrough: Adds time-windowed (3-week) recommendation filtering with minimum-sample guards and minimum CPU/memory floors; replaces the recommendation logic with authoritative-aware, gradual reductions (changes under 5% are ignored, reductions are capped at 25% per cycle); extends admission/mutation APIs and flags; updates tests, fixtures, the e2e environment, and logging.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: deepsm007. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@deepsm007: This pull request references DPTP-2938 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission.go (1)
212-218: Do not apply recommendation values to Limits, since digestData only populates Requests. The loop processes both Requests and Limits (lines 216-217), but digestData only populates Requests (see resources.go line 115: s.byMetaData[meta].Requests[request] = *q). The Limits field is initialized as empty and never filled with values. When processing Limits:
- our (the recommendation) is a zero-value Quantity
- their (the configured limit) may have an actual value
- When cmp < 0 (zero < configured), line 247 sets the limit to zero
This unintentionally removes memory limits that were explicitly configured. The reconcileLimits function cannot protect against this because it skips processing when limits are already zero (line 270). Either populate Limits in digestData or exclude Limits from processing in applyRecommendationsBasedOnRecentData, since recommendations for Limits do not exist.
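The guard this comment asks for could look roughly like the sketch below (illustrative only, not the PR's code; the helper name applyIfRecommended is made up): a recommendation is applied only when a non-zero quantity actually exists for that resource, so configured Limits are never replaced by empty values.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyIfRecommended copies the recommended quantity over the configured one
// only when a non-zero recommendation exists; otherwise the configured value
// is left untouched.
func applyIfRecommended(recommended, configured corev1.ResourceList, field corev1.ResourceName) {
	our, ok := recommended[field]
	if !ok || our.IsZero() {
		return // no recommendation for this resource (e.g. Limits): keep what was configured
	}
	configured[field] = our
}

func main() {
	recommended := corev1.ResourceList{} // Limits are never populated by digestData
	configured := corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	applyIfRecommended(recommended, configured, corev1.ResourceMemory)
	kept := configured[corev1.ResourceMemory]
	fmt.Println(kept.String()) // still "2Gi"
}
```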
🤖 Fix all issues with AI Agents
In @cmd/pod-scaler/resources_test.go:
- Around line 100-104: The comment and error message are misleading: digestData
returns the raw quantile value (no 1.2 multiplier— that's applied in
applyRecommendationsBasedOnRecentData), and the test range is 70000-100000
(i.e., 70,000–100,000 millicores, or 70–100 cores worth of millicores). Update
the inline comment to state that digestData returns the raw quantile (no
multiplier) and adjust the t.Errorf text to reflect the actual checked range
(e.g., "Expected CPU request to be based on recent data (around 70,000-100,000
millicores), got %d millicores") and/or use human-readable units (70-100 cores)
so the message matches the check on cpuRequestMilli.
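A minimal sketch of the corrected assertion the prompt describes (the variable name cpuRequestMilli is taken from the prompt; the surrounding test setup is omitted):

```go
package main

import "testing"

// checkCPURequest mirrors the assertion described above: digestData returns the
// raw 0.8-quantile value (no 1.2x multiplier), so both the accepted range and
// the error text speak in raw millicores.
func checkCPURequest(t *testing.T, cpuRequestMilli int64) {
	if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
		t.Errorf("Expected CPU request to be based on recent data (around 70,000-100,000 millicores, i.e. 70-100 cores), got %d millicores", cpuRequestMilli)
	}
}
```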
🧹 Nitpick comments (1)
cmd/pod-scaler/admission.go (1)
255-260: Consider removing the deprecated wrapper sooner. The TODO to remove useOursIfLarger is noted. Since this is a private function and the only caller has been updated (line 318), this wrapper could be removed in this PR to avoid dead code.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
cmd/pod-scaler/admission.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/admission.go
🧬 Code graph analysis (2)
cmd/pod-scaler/resources.go (1)
test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
ResourceRequirements (246-253)
pkg/results/report.go (1)
PodScalerReporter(174-176)
🔇 Additional comments (8)
cmd/pod-scaler/resources.go (2)
43-48: LGTM - Clear constant definition with good documentation. The 21-day window is well-documented, and the rationale for using recent data to enable both increases and decreases is clearly explained.
83-103: Verify behavior when data straddles the cutoff boundary. The filtering logic is sound. One consideration: if a workload stops running for more than 3 weeks and then restarts, it will have no recommendations until new data is collected. This seems intentional, but it is worth confirming that this is the desired behavior for workloads with intermittent usage patterns.
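As a rough illustration of the cutoff behavior discussed here (a sketch under assumed names, not the PR's code), fingerprints whose timestamps fall before the 21-day window are simply dropped before any merging happens:

```go
package main

import (
	"fmt"
	"time"
)

// resourceRecommendationWindow mirrors the 21-day window described in the review.
const resourceRecommendationWindow = 21 * 24 * time.Hour

type fingerprintTime struct {
	Added time.Time
}

// recentOnly keeps only fingerprints newer than the cutoff.
func recentOnly(all []fingerprintTime, now time.Time) []fingerprintTime {
	cutoff := now.Add(-resourceRecommendationWindow)
	var recent []fingerprintTime
	for _, ft := range all {
		if ft.Added.After(cutoff) {
			recent = append(recent, ft)
		}
	}
	return recent
}

func main() {
	now := time.Now()
	all := []fingerprintTime{
		{Added: now.Add(-4 * 7 * 24 * time.Hour)}, // 4 weeks old: filtered out
		{Added: now.Add(-7 * 24 * time.Hour)},     // 1 week old: kept
	}
	fmt.Println(len(recentOnly(all, now))) // 1
}
```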
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)
14-54: Test fixture updates align with the new resource reduction behavior. The updated values correctly reflect that the pod scaler can now reduce resources based on recent usage data. The uniform values across containers (6 CPU, ~20Gi memory) indicate the recommendation is being applied consistently.
cmd/pod-scaler/resources_test.go (2)
18-91: Good test coverage for the filtering behavior. The test correctly validates that old data (4 weeks ago) is excluded while recent data (1 week ago) is included. The use of distinct histogram values (100 for old, 50/75 for recent) makes it easy to verify which data was used.
112-170: LGTM - Solid negative test case. This test correctly verifies that no recommendation is produced when all data points are outside the 3-week window.
cmd/pod-scaler/admission.go (1)
221-224: The 20% boost truncation is acceptable but worth noting. The int64() cast truncates fractional values. For memory this means losing sub-byte precision (negligible), and for CPU it means losing sub-millicore precision (also negligible in practice).
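A tiny illustration of the truncation in question (simplified relative to the quantity handling in admission.go):

```go
package main

import "fmt"

func main() {
	// 333 millicores of determined usage, boosted by 20% and cast back to int64.
	boosted := float64(333) * 1.2
	fmt.Println(boosted)        // 399.59999999999997
	fmt.Println(int64(boosted)) // 399: the fractional millicore is dropped
}
```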
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (2)
13-24: Test fixture correctly reflects the combined changes.
- CPU limit removed (handled by reconcileLimits)
- Memory limit increased to 480000000 (200% of the request per reconcileLimits logic, since request 240000000 * 2 = 480000000)
- Requests show the resource reduction behavior in action
55-77: Verify the intentional behavior of setting all containers to the same values. Containers with previously different resource configurations (CPU: 2, 10; Memory: 100) are now all set to the same recommendation (6 CPU, 240000000 memory). This implies all containers share the same metadata lookup key, resulting in uniform recommendations.
Scheduling required tests: Scheduling tests matching the
@deepsm007: This pull request references DPTP-2938 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
3d13a7f to
0b355c7
Compare
@deepsm007: This pull request references DPTP-2938 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission.go (1)
227-229: Reconsider applying the 1.2x safety margin before authoritative reductions. The 1.2x multiplier is applied to all recommendations unconditionally (lines 227-229), including when authoritative mode will subsequently reduce resources. This creates an inconsistency:
Example scenario:
- Historical usage data: 100 units
- After 1.2x multiplier: 120 units (recommendation)
- Current configured: 200 units
- Reduction from 200 → 120 = 40%
- Capped at 25% max reduction → final: 150 units
However, the actual usage was only 100 units. The safety margin makes sense when increasing resources, but when reducing in authoritative mode, the 25% per-cycle cap and 5% minimum threshold already provide substantial safety. Applying the 1.2x multiplier before calculating reductions may slow down necessary right-sizing.
Consider:
- Skip the 1.2x multiplier when authoritative mode will reduce (i.e., when cmp < 0 and authoritative is true)
- Or apply the multiplier conditionally, only for increases
Potential approach
  increased := our.AsApproximateFloat64() * 1.2
- our.Set(int64(increased))
  their := (*pair.theirs)[field]
+ authoritative := (field == corev1.ResourceCPU && authoritativeCPU) || (field == corev1.ResourceMemory && authoritativeMemory)
+
+ // Apply 1.2x safety margin only when increasing or when not in authoritative mode
+ tempOur := our.DeepCopy()
+ if !authoritative || their.IsZero() {
+ 	tempOur.Set(int64(increased))
+ }
+
  fieldLogger := logger.WithFields(logrus.Fields{
  	"workloadName": workloadName,
  	"workloadType": workloadType,
  	"field":        field,
  	"resource":     pair.resource,
- 	"determined":   our.String(),
+ 	"determined":   tempOur.String(),
  	"configured":   their.String(),
  })
- cmp := our.Cmp(their)
+ cmp := tempOur.Cmp(their)
  if cmp == 1 {
+ 	// Apply safety margin when increasing
+ 	tempOur.Set(int64(increased))
  	fieldLogger.Debug("determined amount larger than configured, increasing resources")
- 	(*pair.theirs)[field] = our
+ 	(*pair.theirs)[field] = tempOur
  	if their.Value() > 0 && our.Value() > (their.Value()*10) {
  		reporter.ReportResourceConfigurationWarning(workloadName, workloadType, their.String(), our.String(), field.String())
  	}
  } else if cmp < 0 {
- 	authoritative := (field == corev1.ResourceCPU && authoritativeCPU) || (field == corev1.ResourceMemory && authoritativeMemory)
  	if authoritative {
  		// Apply gradual reduction with safety limits: max 25% reduction per cycle, minimum 5% difference
- 		ourValue := our.AsApproximateFloat64()
+ 		ourValue := tempOur.AsApproximateFloat64()
🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)
105-116: Consider the performance implications of the recency weighting approach. The current implementation applies recency weights by calling overall.Merge(hist) multiple times in a loop (lines 114-116). For data with many recent samples and high weights (3x), this could result in a significant number of merge operations. Consider whether there's a more efficient approach, such as:
- Checking if the histogram library supports weighted merges directly
- Sampling or aggregating data before weighting
- Profiling to determine if this is actually a bottleneck in practice
Note: This is flagged as a recommended refactor since the current approach is functionally correct, but may have performance implications at scale.
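A toy sketch of the weighting loop being discussed (written against a stand-in histogram type, not the library the pod-scaler actually uses): each recent fingerprint's histogram is merged weight-many times, which is exactly the repetition a weighted-merge primitive would collapse into a single call.

```go
package main

import (
	"fmt"
	"time"
)

type histogram struct{ samples []float64 }

func (h *histogram) Merge(other *histogram) {
	h.samples = append(h.samples, other.samples...)
}

// weightFor returns the recency weight described in the review: 3x for the past
// week, 2x for data 1-2 weeks old, 1x for data 2-3 weeks old.
func weightFor(age time.Duration) int {
	switch {
	case age < 7*24*time.Hour:
		return 3
	case age < 14*24*time.Hour:
		return 2
	default:
		return 1
	}
}

func main() {
	now := time.Now()
	overall := &histogram{}
	recent := map[time.Time]*histogram{
		now.Add(-2 * 24 * time.Hour):  {samples: []float64{50, 60}},
		now.Add(-10 * 24 * time.Hour): {samples: []float64{75}},
	}
	for added, hist := range recent {
		for i := 0; i < weightFor(now.Sub(added)); i++ {
			overall.Merge(hist)
		}
	}
	fmt.Println(len(overall.samples)) // 2*3 + 1*2 = 8 weighted samples
}
```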
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/main.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/pod-scaler/resources_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/main.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/admission_test.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
PodScalerReporter (174-176)
pkg/api/types.go (1)
ResourceRequirements(246-253)
cmd/pod-scaler/admission_test.go (1)
pkg/api/types.go (2)
ResourceList (261-261)
ResourceRequirements (246-253)
🔇 Additional comments (5)
cmd/pod-scaler/main.go (1)
64-71: LGTM! The authoritative flags are properly added to the options structure, bound to CLI flags with clear descriptions, and correctly passed through to the admission handler. The implementation follows existing patterns in the codebase.
Also applies to: 94-95, 275-275
cmd/pod-scaler/admission.go (2)
36-36: LGTM! The authoritative flags are correctly wired through the admission webhook handler and podMutator struct, following existing patterns.
Also applies to: 47-47, 54-65
247-276: Well-structured reduction logic with appropriate safety constraints. The authoritative mode reduction logic includes good safeguards:
- Maximum 25% reduction per cycle (line 255) prevents aggressive changes
- Minimum 5% threshold (line 257) avoids unnecessary micro-adjustments
- Proper handling of edge cases (zero values at lines 269-272)
- Clear logging for each decision path
These constraints ensure gradual, controlled resource optimization rather than sudden changes that could impact workload stability.
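A standalone sketch of that rule (inferred from the review summary, not copied from admission.go): differences under 5% are ignored, and no single cycle cuts more than 25% of the currently configured value.

```go
package main

import "fmt"

// reduceGradually applies the gradual-reduction rule to a single resource value.
func reduceGradually(configured, recommended float64) float64 {
	if recommended >= configured {
		return configured // not a reduction; increases are handled elsewhere
	}
	reduction := (configured - recommended) / configured
	if reduction < 0.05 {
		return configured // below the 5% threshold: leave as-is
	}
	if reduction > 0.25 {
		return configured * 0.75 // cap at a 25% cut per cycle
	}
	return recommended
}

func main() {
	fmt.Println(reduceGradually(200, 120)) // 40% reduction requested -> capped at 150
	fmt.Println(reduceGradually(200, 196)) // 2% reduction requested -> stays 200
	fmt.Println(reduceGradually(200, 170)) // 15% reduction requested -> 170
}
```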
cmd/pod-scaler/admission_test.go (2)
557-557: Test updates are consistent with the new function signature.The test updates correctly:
- Add the new authoritative boolean flags to function calls
- Use
false, falsefor basic mutation tests (line 557)- Use
true, truefor authoritative behavior tests (lines 774, 859)The tests are well-structured and document the expected behavior clearly.
Also applies to: 774-774, 859-859
687-697: Test expectations validate current behavior including the 1.2x multiplier. The test expectations are mathematically correct for the current implementation. The inline comments clearly document that the 1.2x safety margin is applied before reduction calculations (e.g., "Ours: 10 * 1.2 = 12, Theirs: 200...").
Note: If the 1.2x multiplier application is changed as suggested in the admission.go review, these test expectations will need to be updated accordingly to reflect:
- Raw usage values (without 1.2x) when calculating authoritative reductions
- Only applying 1.2x when increasing resources
Also applies to: 758-768
/pipeline required
Scheduling required tests: Scheduling tests matching the
/test images
Actionable comments posted: 0
🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)
126-135: Minor logging inconsistency. Line 135 logs "merged all fingerprints" but now only recent fingerprints are merged. Consider updating this log message for accuracy.
📝 Suggested fix
  metaLogger.Tracef("merged %d recent fingerprints (out of %d total)", recentCount, len(fingerprintTimes))
- metaLogger.Trace("merged all fingerprints")
+ metaLogger.Trace("merged recent fingerprints")
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
.gitignore
cmd/pod-scaler/resources.go
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/resources.go
test/e2e/pod-scaler/run/producer.go
test/e2e/pod-scaler/run/consumer.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
🔇 Additional comments (6)
test/e2e/pod-scaler/run/producer.go (1)
30-32: LGTM! The environment variable setup correctly inherits existing environment variables and appends the test-specific override for
POD_SCALER_MIN_SAMPLES. This aligns with the consumer.go pattern and thegetMinSamplesForRecommendation()function in resources.go.test/e2e/pod-scaler/run/consumer.go (1)
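The pattern described here is roughly the following (a sketch; the real tests hand the environment to the e2e test accessory rather than exec-ing a command directly):

```go
package main

import (
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("pod-scaler", "consumer")
	// Inherit the current environment and append the test-only override.
	cmd.Env = append(os.Environ(), "POD_SCALER_MIN_SAMPLES=1")
	_ = cmd // in the e2e suite this command is run by the test harness
}
```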
66-72: LGTM!The environment variables are correctly passed to the admission accessory, enabling the test to work with the reduced sample count requirement.
cmd/pod-scaler/resources.go (4)
42-66: LGTM! The constants are well-documented with clear rationale. The 3-week window, minimum thresholds, and configurable sample count provide a solid foundation for the recency-based recommendations.
101-125: LGTM! The recency filtering and weighting logic is well-structured. Computing cutoffTime and now outside the loop ensures consistency across all iterations. The tiered weighting approach (3x/2x/1x) appropriately prioritizes more recent data.
146-156: LGTM! The minimum threshold clamping correctly prevents recommending zero or extremely small values that could cause scheduling issues. The format specifications (DecimalSI for CPU, BinarySI for memory) match the original quantity creation functions.
170-180: LGTM! The function correctly parses the environment variable with appropriate validation (must be a positive integer) and falls back to a sensible default of 3 samples. The silent fallback on parse error is acceptable for configuration overrides.
2b0465d to
69cde76
Compare
Prucek
left a comment
Left some small comments, otherwise LGTM.
|
Scheduling required tests: Scheduling tests matching the |
1 similar comment
|
Scheduling required tests: Scheduling tests matching the |
69cde76 to
b0b3eec
Compare
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured_test.go:
- Around line 24-36: The test constructs a FullMetadata manually causing
potential key mismatch; replace the manual construction with the canonical
builder MetadataFor from the podscaler package so Target, Step and Pod match
production semantics (e.g., use
podscaler.MetadataFor("test-org","test-repo","main","test-target","test-step","test-pod")
or the appropriate overload) and then call bqClient.ShouldBeMeasured(meta) so
the test uses the same key shape as production.
In @cmd/pod-scaler/measured.go:
- Around line 89-95: The Update method on MeasuredPodCache currently replaces
c.data wholesale which can drop previous entries on partial BigQuery responses;
modify MeasuredPodCache.Update to lock c.mu, iterate the incoming map parameter
and upsert each entry into the existing c.data map (initialize c.data if nil)
instead of assigning c.data = data, and after merging compare sizes to detect
large decreases and call c.logger.Warnf (or Infof) with context when the new
batch is significantly smaller than the previous cache; keep the lock during the
merge to ensure thread-safety and update the existing log message to report
merged count.
- Around line 145-252: The BigQuery query in BigQueryClient.Refresh is a
placeholder referencing `pod_metrics` and columns like `pod_scaler_label`,
`cpu_utilization`, `memory_utilization`, and `container_durations` that may not
match the real ci-metrics schema; update Refresh (and the Query construction
using MeasuredPodDataRetentionDays) to either (a) use the actual table/column
names or make the table/column names configurable via config/env so it doesn’t
assume `pod_metrics`, (b) add schema validation/parsing guards for
`container_durations` and other columns before using them, (c) add
unit/integration tests that mock the BigQuery client
(`BigQueryClient.Query`/iterator.Read behavior) to cover Refresh’s parsing and
error paths, and (d) update the pod-scaler README to document the BigQuery
integration and configuration or gate this code behind a feature flag so
production isn’t using an unvalidated query.
🧹 Nitpick comments (6)
cmd/pod-scaler/resources.go (3)
60-66: Environment variable read at init time may cause issues with testing. minSamplesForRecommendation is initialized once when the package loads. This means tests cannot easily override this value without setting the env var before the package initializes. Consider making this a parameter or using a lazy initialization pattern if testability is a concern.
101-102: Potential time drift between cutoff and weighting calculations. cutoffTime and now are computed at different instants. While the drift is negligible in practice, using a single now value for both would be more consistent.
♻️ Suggested fix
- cutoffTime := time.Now().Add(-resourceRecommendationWindow)
- now := time.Now()
+ now := time.Now()
+ cutoffTime := now.Add(-resourceRecommendationWindow)
173-180: Missing validation for negative values in the environment variable. The check minSamples > 0 handles zero and negative values, but the function silently falls back to the default when parsing fails. Consider logging a warning when the env var is set but invalid.
♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default 3", val)
 		}
 	}
 	return 3
 }
cmd/pod-scaler/measured.go (3)
128-138: Missing error propagation from the initial refresh. The initial refresh error on startup is only logged as a warning (line 130), but execution continues. This is acceptable for resilience. However, the daily refresh scheduled via TickLiteral will silently fail without any retry mechanism. Consider implementing exponential backoff for transient failures.
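One possible shape for that retry (a sketch with placeholder names; the real code would wrap BigQueryClient.Refresh):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// refreshWithBackoff retries the refresh with exponential backoff between attempts.
func refreshWithBackoff(refresh func() error, attempts int, base time.Duration) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = refresh(); err == nil {
			return nil
		}
		fmt.Printf("refresh attempt %d failed: %v; retrying in %s\n", i+1, err, delay)
		time.Sleep(delay)
		delay *= 2 // double the wait after each failure
	}
	return err
}

func main() {
	calls := 0
	err := refreshWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient BigQuery error")
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(err, calls) // <nil> 3
}
```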
201-207: Row read errors are logged but iteration continues. When it.Next(&row) fails with an error other than iterator.Done, the error is logged and iteration continues. This is reasonable for resilience, but consider tracking error counts and failing the refresh if too many rows fail to parse.
410-413: Fallback to the first container may not reflect the actual longest-running workload. When longestContainer is nil due to missing duration data, the code falls back to the first container. This fallback may apply resource increases to a sidecar if it happens to be listed first. Consider logging this as a warning or using container naming heuristics (e.g., prefer "test" or non-sidecar names).
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (9)
.gitignore
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/measured.go
cmd/pod-scaler/measured_test.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (5)
- .gitignore
- cmd/pod-scaler/resources_test.go
- cmd/pod-scaler/admission_test.go
- test/e2e/pod-scaler/run/consumer.go
- test/e2e/pod-scaler/run/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/measured_test.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
cmd/pod-scaler/measured.go (1)
pkg/api/types.go (2)
Metadata(161-166)ResourceList(261-261)
🔇 Additional comments (11)
cmd/pod-scaler/resources.go (2)
107-134: LGTM on recency filtering and weighting logic.The time-windowed filtering with progressive weighting (3x/2x/1x based on age) and minimum sample enforcement are well-implemented. The skip logic for insufficient recent data prevents unreliable recommendations.
146-155: LGTM on minimum threshold enforcement.Clamping CPU to 10 millicores and memory to 10 MiB prevents zero or near-zero recommendations that could cause scheduling or OOM issues.
cmd/pod-scaler/admission.go (4)
199-206: LGTM on function rename and documentation.The new name
applyRecommendationsBasedOnRecentDatabetter reflects the bidirectional nature of the function (can both increase and decrease resources).
224-227: Good addition: skip resources with no recommendation.This prevents overwriting user-configured values when the pod-scaler has no data for a particular resource.
249-273: Reduction logic applies to the already-inflated value.The 1.2x multiplier (line 231) is applied to
ourbefore the comparison. Whencmp < 0, the reduction percentage is calculated against the inflatedourvalue, not the original recommendation. This may result in slightly different reduction behavior than intended (reducing from an already-buffered value).Verify this is the intended behavior. If reductions should be based on raw recommendations, move the inflation after the comparison logic.
265-269: Integer truncation may cause inconsistent capping.
our.Set(int64(maxAllowed))truncates the float result. For memory values this is fine, but for CPU this loses precision sinceSetexpects the base unit (cores), not millicores. However, since the quantity was already created as a milli-quantity or byte-quantity, this should still work correctly via the existing format.cmd/pod-scaler/measured_test.go (3)
63-131: LGTM on TestClassifyPod.Good coverage of three scenarios: no data (measured), recent data (normal), and nil BigQuery client (defaults to measured). The use of
MetadataForin test case 2 ensures metadata matches what the production code generates.
151-210: LGTM on TestAddPodAntiAffinity.Tests correctly verify anti-affinity is added for measured and normal pods, and not added for unlabeled pods.
212-253: LGTM on TestMeasuredPodCache.Tests verify Get returns false for missing data and returns correct values after Update.
cmd/pod-scaler/measured.go (2)
360-370: LGTM on anti-affinity rule merging.The code correctly preserves existing anti-affinity rules by appending rather than overwriting.
434-448: ApplyMeasuredPodResources is not called in the admission path; no double-buffering issue. ApplyMeasuredPodResources is not invoked anywhere in the codebase (it is only defined at line 376, with zero call sites). The active function in the admission webhook is applyRecommendationsBasedOnRecentData (called at line 337 in admission.go), which applies the 1.2x buffer during pod mutation. Since only one function operates on pods during admission, there is no double-buffering concern. Likely an incorrect or invalid review comment.
b0b3eec to
454b866
Compare
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @.gitignore:
- Line 13: The .gitignore entry "/job-run-aggregatorpod-scaler" is concatenated
and won't ignore either binary; split it into two separate ignore patterns by
replacing that entry with "/job-run-aggregator" and "/pod-scaler" so both
binaries are matched and not accidentally committed.
In @cmd/pod-scaler/admission_test.go:
- Around line 734-735: The test case name "ours are smaller - should reduce
resources based on recent usage" is duplicated; update the name string in the
test case struct (the `name:` field inside the test table entry that begins with
that text) to a unique, descriptive value (e.g., include the specific scenario
or line reference such as "ours are smaller - should reduce resources based on
recent usage (recent-window case)") so test output can unambiguously identify
which case failed. Ensure only the `name` field is changed and keep the rest of
the test case data and assertions intact.
In @cmd/pod-scaler/measured.go:
- Around line 148-152: The TODO claiming the BigQuery string is a "placeholder
query" is outdated; update the comment around the query := bq.client.Query(...)
to either remove the TODO entirely or replace it with a clarifying note that the
query is intentionally implemented and depends on the external pod_metrics
BigQuery table schema, and mention the corresponding row struct (the struct with
bigquery field tags used to map results) so readers know the schema dependency;
no code changes required beyond editing the comment near query and the row
struct remark.
🧹 Nitpick comments (4)
cmd/pod-scaler/resources.go (1)
173-180: Consider logging a warning on invalid environment variable values. Currently, if POD_SCALER_MIN_SAMPLES contains an invalid value (non-numeric or <= 0), the function silently falls back to 3. This could mask configuration errors in production. A debug/warning log would help operators identify misconfigurations.
♻️ Optional: Add logging for invalid values
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.WithField("POD_SCALER_MIN_SAMPLES", val).Warn("Invalid value for POD_SCALER_MIN_SAMPLES, using default of 3")
 		}
 	}
 	return 3
 }
cmd/pod-scaler/resources_test.go (1)
90-93: Consider test isolation from environment.The tests implicitly rely on
minSamplesForRecommendationbeing 3 (the default). IfPOD_SCALER_MIN_SAMPLESis set in the test environment, behavior could differ. SinceminSamplesForRecommendationis initialized at package load, consider usingt.Setenv("POD_SCALER_MIN_SAMPLES", "3")at the start of each test to ensure isolation, or document this dependency.cmd/pod-scaler/admission_test.go (1)
572-572: Consider renaming the test function to match the renamed function. The test function is still named TestUseOursIfLarger but it now tests applyRecommendationsBasedOnRecentData. For consistency and discoverability, consider renaming it to TestApplyRecommendationsBasedOnRecentData.
Also applies to: 774-774
cmd/pod-scaler/admission.go (1)
199-206: Doc comment references a non-existent test function name. The comment mentions TestApplyRecommendationsBasedOnRecentData_ReducesResources but the actual test is TestUseOursIfLarger. Update the comment to match the actual test name.
Suggested fix
 // applyRecommendationsBasedOnRecentData applies resource recommendations based on recent usage data
 // (see resourceRecommendationWindow). If they used more, we increase resources. If they used less,
 // we decrease them (pod-scaler is always authoritative).
 //
-// TestApplyRecommendationsBasedOnRecentData_ReducesResources is tested in admission_test.go
-// as part of TestUseOursIfLarger. The reduction functionality is verified there with proper
-// test cases that handle ResourceQuantity comparison correctly.
+// The reduction functionality is tested in admission_test.go as part of TestUseOursIfLarger
+// with test cases that verify ResourceQuantity comparison and the 25% reduction cap.
 func applyRecommendationsBasedOnRecentData(allOfOurs, allOfTheirs *corev1.ResourceRequirements, workloadName, workloadType string, reporter results.PodScalerReporter, logger *logrus.Entry) {
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (11)
.gitignore
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/measured.go
cmd/pod-scaler/measured_test.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e/pod-scaler/run/producer.go
- cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
test/e2e/pod-scaler/run/consumer.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/measured.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧬 Code graph analysis (5)
cmd/pod-scaler/resources_test.go (2)
pkg/pod-scaler/types.go (3)
FullMetadata(374-385)CachedQuery(53-68)FingerprintTime(71-76)pkg/api/types.go (2)
Metadata(161-166)ResourceRequirements(246-253)
cmd/pod-scaler/admission_test.go (2)
pkg/api/types.go (2)
ResourceList(261-261)ResourceRequirements(246-253)pkg/testhelper/accessory.go (1)
T(51-57)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
ResourceRequirements(246-253)pkg/results/report.go (1)
PodScalerReporter(174-176)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata(161-166)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
🔇 Additional comments (19)
test/e2e/pod-scaler/run/consumer.go (1)
71-71: LGTM!Setting
POD_SCALER_MIN_SAMPLES=1for e2e tests is appropriate since test data typically has fewer samples than production. This correctly aligns with the new environment-configurable threshold introduced inresources.go.cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)
15-18: LGTM!Test fixture updates correctly reflect the new resource recommendation values with the updated structured representation including string annotations.
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)
16-24: LGTM!Fixture values correctly updated to match the new resource recommendation behavior with recency-weighted calculations.
cmd/pod-scaler/resources.go (3)
42-66: LGTM! Well-documented constants and configurable threshold.The 3-week recommendation window, minimum resource thresholds, and environment-configurable sample count are well-designed. The inline documentation clearly explains the rationale for each constant.
107-134: LGTM! Solid recency-based filtering with tiered weighting.The filtering logic correctly excludes data outside the 3-week window and applies sensible weights (3x/2x/1x) based on data age. The minimum sample threshold guard prevents recommendations based on insufficient data.
146-155: LGTM! Good defensive clamping.Applying minimum thresholds after quantile calculation prevents recommending zero or extremely small values that could cause scheduling failures.
cmd/pod-scaler/resources_test.go (2)
18-127: LGTM! Good test coverage for recency-based filtering.The test correctly validates that old data (4 weeks ago) is excluded while recent data (1 week ago) is used for recommendations. Providing exactly 3 recent samples meets the default
minSamplesForRecommendationthreshold.
129-191: LGTM! Good edge case test.Correctly validates that no recommendation is created when all data falls outside the 3-week window.
cmd/pod-scaler/admission_test.go (2)
664-699: LGTM!The test case correctly verifies the 25% reduction cap behavior. The inline comments accurately explain the math, and the expected values are correct.
722-732: LGTM!The mixed scenario test correctly verifies that larger values get the 1.2x increase while smaller values get the 25% reduction cap applied.
cmd/pod-scaler/admission.go (2)
249-273: LGTM!The reduction logic correctly implements gradual resource reduction with safety limits:
- Skips trivial changes (<5%)
- Caps aggressive reductions at 25% per cycle
- Properly handles the zero case
The 25% cap per cycle is a sensible safeguard against sudden resource starvation.
224-227: LGTM!Good defensive check to skip resources where no recommendation exists, preventing accidental clearing of configured values.
cmd/pod-scaler/measured.go (7)
65-95: LGTM!The cache implementation is thread-safe with appropriate read/write lock usage. The atomic map replacement in
Updateis clean and avoids partial update issues.
128-131: Verify: Startup failure leaves cache empty, causing all pods to be measured.When the initial BigQuery refresh fails, the service continues with an empty cache. Per line 294-296, this causes all pods to default to "measured" classification. While this is a safe default (pods get isolated), it may cause scheduling pressure if BigQuery is temporarily unavailable. Consider whether this behavior is intentional.
201-207: LGTM!Continuing on individual row errors is resilient and appropriate. Consider adding a counter for failed rows if debugging becomes needed in the future.
308-371: LGTM! The anti-affinity logic correctly:
- Isolates measured pods from all other pod-scaler workloads
- Allows normal pods to share nodes while avoiding measured pods
- Preserves existing affinity rules by merging (see the sketch after this list)
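A minimal sketch of that merge pattern (the label key and values here are illustrative, not the constants used in measured.go):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addAntiAffinity appends a required pod anti-affinity term, preserving any
// terms that were already present on the pod.
func addAntiAffinity(pod *corev1.Pod, labelKey string, values []string) {
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}
	if pod.Spec.Affinity.PodAntiAffinity == nil {
		pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
	}
	term := corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      labelKey,
				Operator: metav1.LabelSelectorOpIn,
				Values:   values,
			}},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
	// Append rather than overwrite, so existing rules survive.
	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(
		pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution, term)
}

func main() {
	pod := &corev1.Pod{}
	addAntiAffinity(pod, "example.com/pod-scaler", []string{"measured", "normal"})
	fmt.Println(len(pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution)) // 1
}
```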
376-386: Clarify: this applies resources to "measured"-labeled pods, but those are the pods being measured, not pods with prior data. The function applies resource recommendations to pods labeled PodScalerLabelValueMeasured. However, per the classification logic in ClassifyPod (lines 284-297), pods are labeled "measured" when they lack recent measured data and need to be measured. This seems contradictory: measured data is being applied to pods that don't have measured data yet. Should this be checking for PodScalerLabelValueNormal instead (pods that have recent measured data)?
410-418: LGTM!The fallback to the first container when duration data is unavailable is reasonable. The subsequent check for measured data (lines 423-427) handles the case where no data exists.
274-306: LGTM!The classification logic correctly marks pods as "measured" when any container needs fresh measurement data, and defaults safely when BigQuery is unavailable.
454b866 to
7ef32ad
Compare
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @.gitignore:
- Line 13: The .gitignore line 'job-run-aggregatorpod-scaler' is a
concatenated/merge-messed entry; replace it with two separate ignore entries
(i.e., split the concatenated token into the two intended paths) by locating the
'job-run-aggregatorpod-scaler' entry and adding each intended path as its own
line (verify the correct directory names in the repo—e.g., job-run-aggregator
and aggregator-pod-scaler or the actual folder names) so each ignore pattern is
on a separate line.
In @cmd/pod-scaler/admission_test.go:
- Around line 734-770: The two table-driven test entries share the identical
name "ours are smaller - should reduce resources based on recent usage", causing
one to be shadowed; locate the second test case in admission_test.go that
defines the `ours`, `theirs`, and `expected` ResourceRequirements (the block
with Limits/Requests showing CPU 50->150 and memory 1e9->2.25e10 etc.) and give
its `name` field a unique string (e.g. append a qualifier like "(memory-heavy)"
or "(reduction cap)"), so each test row is distinct in output.
🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)
169-180: Consider logging when the env var override is applied. The implementation correctly handles invalid values by falling back to the default. However, it might be helpful to log when the override is active, especially since this affects production behavior.
Optional: Add logging for env var override
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
+			logrus.Infof("Using POD_SCALER_MIN_SAMPLES override: %d", minSamples)
 			return minSamples
 		}
+		logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value: %q, using default", val)
 	}
 	return 3
 }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (9)
.gitignore
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/pod-scaler/resources_test.go
- cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/resources.go
test/e2e/pod-scaler/run/producer.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
test/e2e/pod-scaler/run/consumer.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
🧬 Code graph analysis (3)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
ResourceRequirements(246-253)pkg/results/report.go (1)
PodScalerReporter(174-176)
cmd/pod-scaler/admission_test.go (1)
pkg/api/types.go (2)
ResourceList(261-261)ResourceRequirements(246-253)
🔇 Additional comments (11)
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)
1-90: Test fixture updates align with the new resource reduction logic.The fixture changes correctly reflect the new
applyRecommendationsBasedOnRecentDatabehavior including:
- CPU limits being removed (per
reconcileLimits)- Resource values adjusted according to the 25% max reduction cap
- Minimum threshold enforcement
test/e2e/pod-scaler/run/consumer.go (1)
71-71: LGTM!Setting
POD_SCALER_MIN_SAMPLES=1for e2e tests appropriately bypasses the production minimum sample requirement, allowing tests to run with limited test data.test/e2e/pod-scaler/run/producer.go (1)
31-31: LGTM!Consistent with the consumer setup, correctly setting the minimum samples override for e2e tests while preserving the existing environment.
cmd/pod-scaler/admission_test.go (1)
772-779: LGTM!Test cases comprehensively cover the new reduction logic including:
- 25% max reduction cap per cycle
- The 1.2x multiplier applied to recommendations
- Both increase and decrease scenarios
The function rename from
useOursIfLargertoapplyRecommendationsBasedOnRecentDatacorrectly reflects the expanded bidirectional behavior.cmd/pod-scaler/admission.go (4)
199-206: LGTM!Good function rename reflecting the new bidirectional behavior. The doc comments clearly explain that the scaler is now authoritative for both increases and decreases.
224-227: LGTM!The zero-value check correctly prevents applying empty recommendations, which could happen when there's no recent data for this resource type.
249-273: Sound gradual reduction implementation.The logic correctly implements:
- 5% minimum threshold to avoid micro-adjustments
- 25% maximum reduction cap per cycle for safety
- Proper handling of zero-value edge case
One consideration: using
AsApproximateFloat64()is acceptable here since you're computing percentages for thresholds, not exact equality comparisons. The potential precision loss is negligible for these safety checks.
337-337: LGTM!Call site correctly updated to use the renamed function.
cmd/pod-scaler/resources.go (3)
45-66: Well-designed constants and configurable minimum samples.The 3-week window, minimum thresholds, and configurable sample requirement provide good defaults with appropriate flexibility for testing. The documentation comments clearly explain the rationale.
107-133: Effective time-windowing with recency weighting.The weighting scheme (3x for past week, 2x for 1-2 weeks, 1x for 2-3 weeks) appropriately emphasizes recent data while still considering slightly older patterns. The minimum sample requirement prevents unreliable recommendations from sparse data.
One note: the Merge operation with weighted repetition is a valid approach for histogram aggregation - each Merge adds the histogram's counts, effectively giving more weight to recent samples.
146-155: LGTM! Minimum threshold enforcement correctly prevents dangerously low recommendations. Applying it after the quantile calculation ensures the safety floor is always respected.
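A sketch of that clamping, with the floors taken from the review text (10 millicores of CPU, 10 MiB of memory); the function names are illustrative:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// clampCPU enforces a 10-millicore floor on CPU recommendations.
func clampCPU(milli int64) *resource.Quantity {
	if milli < 10 {
		milli = 10
	}
	return resource.NewMilliQuantity(milli, resource.DecimalSI)
}

// clampMemory enforces a 10 MiB floor on memory recommendations.
func clampMemory(bytes int64) *resource.Quantity {
	const minBytes = 10 * 1024 * 1024
	if bytes < minBytes {
		bytes = minBytes
	}
	return resource.NewQuantity(bytes, resource.BinarySI)
}

func main() {
	fmt.Println(clampCPU(2).String())       // "10m"
	fmt.Println(clampMemory(4096).String()) // "10Mi"
}
```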
7ef32ad to
445193c
Compare
@deepsm007: This pull request references DPTP-2938 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.
Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission_test.go (1)
664-699: Duplicate test case names will cause confusion. Both test cases at lines 664-699 and 734-770 have the identical name "ours are smaller - should reduce resources based on recent usage". This makes test output ambiguous and hinders debugging. Rename one to distinguish the scenarios (e.g., include the specific values being tested).
Also applies to: 734-770
🧹 Nitpick comments (2)
cmd/pod-scaler/resources.go (2)
60-66: Package-level initialization may cause test isolation issues. minSamplesForRecommendation is initialized once at package load time via getMinSamplesForRecommendation(). If tests need different values, they cannot easily override this after import. Consider making it a function call at usage sites or providing a setter for test configurability.
173-180: Error handling for invalid env var values is silent. If POD_SCALER_MIN_SAMPLES is set to a non-integer or negative value, the function silently falls back to 3. Consider logging a warning when the env var is present but invalid to aid debugging configuration issues.
♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
 		}
+		logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default of 3", val)
 	}
 	return 3
 }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (10)
.gitignore
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/main.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (5)
- cmd/pod-scaler/resources_test.go
- .gitignore
- test/e2e/pod-scaler/run/producer.go
- test/e2e/pod-scaler/run/consumer.go
- cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/main.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
🧬 Code graph analysis (3)
cmd/pod-scaler/admission_test.go (2)
pkg/api/types.go (2)
ResourceList(261-261)ResourceRequirements(246-253)pkg/testhelper/accessory.go (1)
T(51-57)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
PodScalerReporter(174-176)pkg/api/types.go (1)
ResourceRequirements(246-253)
🔇 Additional comments (9)
cmd/pod-scaler/admission.go (3)
36-47: LGTM on admission wiring updates.The new authoritative flags are properly threaded through the admit function and podMutator struct initialization.
265-288: Gradual reduction logic is sound. The 25% max reduction per cycle and 5% minimum difference threshold provide safe guardrails. The logic correctly handles the edge case where theirValue is zero.
343-362: LGTM on mutatePodResources signature update.The function properly propagates the new authoritative flags to the resource application logic.
cmd/pod-scaler/main.go (2)
94-95: Verify that defaulting authoritative mode to true is intentional. Both --authoritative-cpu and --authoritative-memory default to true, meaning resource reductions will be applied immediately upon deployment. Confirm this is the desired rollout strategy, as it changes behavior for all existing workloads.
275-276: LGTM on admit call site update.The new flags are correctly passed through to the admission handler.
cmd/pod-scaler/admission_test.go (1)
557-557: LGTM on mutatePodResources test call update.The test correctly passes the new authoritative flags.
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)
1-90: LGTM on fixture update.The updated expected values correctly reflect the new resource recommendation logic with the 120% experimental bump and minimum thresholds.
cmd/pod-scaler/resources.go (2)
101-134: Time-based filtering and weighting logic is well-implemented.The recency-based weighting (3x for past week, 2x for 1-2 weeks, 1x for 2-3 weeks) appropriately prioritizes recent usage patterns. The early exit when insufficient samples exist prevents unreliable recommendations.
146-155: LGTM on minimum threshold enforcement.Clamping CPU to 10 millicores and memory to 10 MiB prevents scheduling issues from extremely small recommendations.
445193c to
059588d
Compare
Actionable comments posted: 0
🧹 Nitpick comments (2)
cmd/pod-scaler/resources_test.go (1)
119-126: Consider tightening the expected value range. The assertion bounds (70000-100000 millicores) are quite wide for histogram values of 50, 60, and 75. At the 0.8 quantile, the expected value should be closer to 70-75 cores. The current upper bound of 100000 may not catch regressions where old data inadvertently influences the result. Additionally, the assertion at line 124 (> 120000) may never trigger, since it's already covered by the condition at line 119 (> 100000).
-	if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
-		t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
-	}
-
-	// Verify it's not based on the old high value (100.0)
-	if cpuRequestMilli > 120000 {
-		t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
-	}
+	// With values 50, 60, 75 at 0.8 quantile with weighting, expect approximately 70-80 cores
+	if cpuRequestMilli < 60000 || cpuRequestMilli > 85000 {
+		t.Errorf("Expected CPU request to be based on recent data (around 60000-85000 millicores), got %d millicores", cpuRequestMilli)
+	}
cmd/pod-scaler/resources.go (1)
173-180: Consider logging a warning for invalid POD_SCALER_MIN_SAMPLES values. When POD_SCALER_MIN_SAMPLES is set to an invalid value (non-integer or ≤0), the function silently falls back to the default. This could mask configuration errors.
♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default of 3", val)
 		}
 	}
 	return 3
 }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (10)
.gitignore
cmd/pod-scaler/admission.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/main.go
cmd/pod-scaler/resources.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
test/e2e/pod-scaler/run/consumer.go
test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
test/e2e/pod-scaler/run/consumer.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
cmd/pod-scaler/resources.go
cmd/pod-scaler/admission_test.go
cmd/pod-scaler/admission.go
cmd/pod-scaler/resources_test.go
cmd/pod-scaler/main.go
test/e2e/pod-scaler/run/producer.go
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧬 Code graph analysis (1)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
PodScalerReporter(174-176)pkg/api/types.go (1)
ResourceRequirements(246-253)
🔇 Additional comments (19)
test/e2e/pod-scaler/run/consumer.go (1)
71-71: LGTM!The addition of
POD_SCALER_MIN_SAMPLES=1is appropriate for e2e testing, allowing the pod-scaler to make recommendations with minimal sample data rather than requiring the default threshold.test/e2e/pod-scaler/run/producer.go (1)
8-8: LGTM!The environment variable setup is consistent with the consumer configuration and correctly inherits the existing environment before appending the test-specific setting.
Also applies to: 31-31
cmd/pod-scaler/resources_test.go (2)
18-127: Test coverage for time-window filtering looks good. The test effectively validates that old data (4 weeks ago) is filtered out while recent data (1 week ago) is used for recommendations.
129-191: LGTM! A clear and effective test validating that no recommendations are produced when all data is outside the 3-week time window; a small standalone sketch of the window check follows.
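For orientation, a minimal, self-contained sketch of the cutoff check these tests exercise. The 21-day constant mirrors the window described in this PR; the names are illustrative, not the identifiers actually used in resources.go.

package main

import (
	"fmt"
	"time"
)

// recommendationWindow mirrors the 3-week window this PR introduces.
const recommendationWindow = 21 * 24 * time.Hour

// withinWindow reports whether a sample added at t should still influence
// a recommendation computed at now.
func withinWindow(t, now time.Time) bool {
	return t.After(now.Add(-recommendationWindow))
}

func main() {
	now := time.Now()
	fmt.Println(withinWindow(now.Add(-7*24*time.Hour), now))  // true: one week old
	fmt.Println(withinWindow(now.Add(-28*24*time.Hour), now)) // false: four weeks old
}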
cmd/pod-scaler/admission_test.go (4)
557-557: LGTM! The call site correctly passes the new authoritative flags (true, true) to enable the gradual reduction behavior in tests.
664-699: Test case accurately reflects the 25% max reduction per cycle logic. The expected values correctly demonstrate (see the sketch after this list):
- CPU: 200 * 0.75 = 150 (limits), 100 * 0.75 = 75 (requests)
- Memory: 3e10 * 0.75 = 2.25e10 (limits), 2e10 * 0.75 = 1.5e10 (requests)
This validates the safety mechanism preventing drastic reductions in a single cycle.
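A back-of-the-envelope check of those numbers, in plain Go arithmetic rather than the production code; the 25% figure is the per-cycle cap described in this PR.

package main

import "fmt"

// cap25 applies the "at most 25% reduction per cycle" rule to a configured value.
func cap25(configured float64) float64 {
	return configured * (1.0 - 0.25)
}

func main() {
	fmt.Println(cap25(200))  // 150: CPU limit from the test case above
	fmt.Println(cap25(100))  // 75: CPU request
	fmt.Println(cap25(3e10)) // 2.25e+10: memory limit
	fmt.Println(cap25(2e10)) // 1.5e+10: memory request
}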
734-770: New test case provides good coverage for medium-value scenarios. It correctly validates that even moderate differences between recommended and current resources are capped at the 25% reduction limit when the calculated reduction would exceed that threshold.
772-779: LGTM! The function rename from useOursIfLarger to applyRecommendationsBasedOnRecentData better reflects the new bidirectional behavior (it can both increase and decrease resources).
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)
1-90: Fixture correctly updated to reflect new resource recommendation behavior. The changes demonstrate:
- Gradual CPU reductions (respecting the 25% cap)
- Memory normalization consistent with minimum thresholds
- Proper format string representations for resources
These expectations align with the applyRecommendationsBasedOnRecentData implementation and the authoritative mode flags.
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)
1-68: LGTM! The test fixture updates correctly reflect the new resource recommendation logic with typed resource.int64Amount values and their string representations. The adjusted CPU and memory values align with the implementation changes in the admission and resources modules.
cmd/pod-scaler/main.go (3)
70-72: LGTM! The new authoritativeCPU and authoritativeMemory fields are properly added to consumerOptions with clear naming that reflects their purpose.
94-95: LGTM! The CLI flags are well-defined with sensible defaults (true) and clear descriptions explaining that authoritative mode allows decreasing resources based on recent usage. The default-on behavior aligns with the PR objective to enable gradual reductions.
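For readers unfamiliar with the wiring, a minimal sketch of flags of this shape using the standard flag package. The flag names match the CLI flags discussed later in this review; the defaults shown are the ones from this revision (a later revision turns the CPU default off), and the real binary registers its options in main.go, possibly on a different flag set.

package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical registration mirroring the options added in this PR.
	authoritativeCPU := flag.Bool("authoritative-cpu", true,
		"Allow the pod-scaler to decrease CPU requests based on recent usage, not only increase them.")
	authoritativeMemory := flag.Bool("authoritative-memory", true,
		"Allow the pod-scaler to decrease memory requests based on recent usage, not only increase them.")
	flag.Parse()
	fmt.Println(*authoritativeCPU, *authoritativeMemory)
}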
275-276: LGTM! The new flags are correctly passed through to the admit() function.
45-66: LGTM! The constants for the 3-week recommendation window and minimum resource floors are well-documented and provide sensible safeguards. The environment variable override for minSamplesForRecommendation is useful for testing flexibility.
107-134: LGTM! The time-based filtering and weighted merging logic is well-structured. The weighting scheme (3x for data under 7 days old, 2x for 7-14 days, 1x for 14-21 days) appropriately prioritizes recent data while still considering older samples within the window, and the early-exit conditions for insufficient data prevent unreliable recommendations. A condensed sketch of the scheme follows.
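A condensed sketch of that weighting, assuming the openhistogram circonusllhist package (adjust the import path to whatever this repository vendors); the real code derives the age from FingerprintTime.Added and applies the 3-week cutoff first.

package main

import (
	"fmt"
	"time"

	"github.com/openhistogram/circonusllhist" // import path assumed
)

// weightFor returns how many times a sample's histogram is merged into the
// aggregate: 3x if under a week old, 2x if under two weeks, otherwise 1x.
func weightFor(age time.Duration) int {
	switch {
	case age < 7*24*time.Hour:
		return 3
	case age < 14*24*time.Hour:
		return 2
	default:
		return 1
	}
}

func main() {
	overall := circonusllhist.New()
	sample := circonusllhist.New()
	if err := sample.RecordValue(60.0); err != nil {
		panic(err)
	}

	// A 10-day-old sample lands in the 2x bucket and is merged twice.
	age := 10 * 24 * time.Hour
	for i := 0; i < weightFor(age); i++ {
		overall.Merge(sample)
	}
	fmt.Println(weightFor(2*24*time.Hour), weightFor(age), weightFor(20*24*time.Hour)) // 3 2 1
}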
cmd/pod-scaler/admission.go (4)
36-36: LGTM! The function signature is correctly extended to include the new authoritative mode flags.
63-65: LGTM! The podMutator struct properly includes the new fields for per-resource authoritative control.
201-294: Solid implementation of gradual resource reduction with appropriate safeguards. The logic correctly implements (a sketch follows this list):
- Per-resource authoritative mode checks
- Noise reduction by skipping <5% adjustments
- Safety limit capping reductions at 25% per cycle
- Proper handling of zero-value edge cases
The 20% increase factor (line 232-233) is noted as experimental via the existing TODO comment.
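A condensed sketch of that decision flow for a single resource, illustrative only: the production function works on resource.Quantity values, applies its increase factor separately, and takes the per-resource authoritative flag from the webhook configuration; the 5% and 25% thresholds are the ones named above.

package main

import "fmt"

// decide returns the value to keep, given the configured value and a
// recommendation derived from recent usage.
func decide(configured, recommended float64, authoritative bool) float64 {
	if recommended >= configured {
		return recommended // increases are always applied
	}
	if !authoritative {
		return configured // reductions require authoritative mode
	}
	if (configured-recommended)/configured < 0.05 {
		return configured // ignore <5% changes as noise
	}
	if floor := configured * 0.75; recommended < floor {
		return floor // cap reductions at 25% per cycle
	}
	return recommended
}

func main() {
	fmt.Println(decide(200, 50, true))  // 150: capped at a 25% cut
	fmt.Println(decide(100, 97, true))  // 100: inside the 5% noise band
	fmt.Println(decide(100, 60, false)) // 100: non-authoritative, keep configured
}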
343-362: LGTM! The mutatePodResources function correctly propagates the authoritative flags through to the recommendation application logic.
/hold
059588d to 7607680
Updated the default to false ("off").
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/pod-scaler/resources.go (1)
98-159: Avoid a potential panic: guard against missing or nil histograms before calling .Histogram() and merging. data.Data[fingerprintTime.Fingerprint] is assumed to be present; if it is missing (or nil), this will crash the scaler.
Proposed fix
 func (s *resourceServer) digestData(data *podscaler.CachedQuery, quantile float64, request corev1.ResourceName, quantity toQuantity) {
 	logger := s.logger.WithField("resource", request)
 	logger.Debugf("Digesting %d identifiers.", len(data.DataByMetaData))
-	cutoffTime := time.Now().Add(-resourceRecommendationWindow)
-	now := time.Now()
+	now := time.Now()
+	cutoffTime := now.Add(-resourceRecommendationWindow)
 	for meta, fingerprintTimes := range data.DataByMetaData {
 		overall := circonusllhist.New()
 		metaLogger := logger.WithField("meta", meta)
 		metaLogger.Tracef("digesting %d fingerprints", len(fingerprintTimes))
 		recentCount := 0
 		for _, fingerprintTime := range fingerprintTimes {
 			if fingerprintTime.Added.After(cutoffTime) {
-				hist := data.Data[fingerprintTime.Fingerprint].Histogram()
+				h, ok := data.Data[fingerprintTime.Fingerprint]
+				if !ok || h == nil {
+					metaLogger.Warnf("missing histogram for fingerprint %v, skipping", fingerprintTime.Fingerprint)
+					continue
+				}
+				hist := h.Histogram()
 				// Weight more recent data more heavily to make the scaler more sensitive to recent runs.
 				// Past week: 3x weight, 1-2 weeks ago: 2x weight, 2-3 weeks ago: 1x weight.
 				age := now.Sub(fingerprintTime.Added)
 				weight := 1
 				if age < 7*24*time.Hour {
 					weight = 3
 				} else if age < 14*24*time.Hour {
 					weight = 2
 				}
 				for i := 0; i < weight; i++ {
 					overall.Merge(hist)
 				}
 				recentCount++
 			}
 		}
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/resources_test.go:
- Around line 18-127: The test depends on the POD_SCALER_MIN_SAMPLES env and
current window, so make it deterministic by setting the server's minSamples and
resourceRecommendationWindow directly before calling digestData: set
resourceServer.minSamples = 3 and resourceServer.resourceRecommendationWindow =
14*24*time.Hour (so fourWeeksAgo is excluded and oneWeekAgo included), then call
server.digestData(...) and assert via server.recommendedRequestFor(meta) that
the recommendation reflects only the recent fingerprints; this ensures old data
is excluded regardless of environment.
🧹 Nitpick comments (4)
cmd/pod-scaler/resources.go (1)
60-66: Config via env is fine for e2e, but make it test/ops-friendly: log invalid values and allow a runtime override.
Right now an invalid POD_SCALER_MIN_SAMPLES silently falls back to 3, and the value is captured at init (harder to vary in unit tests and long-running processes).
Also applies to: 170-180
cmd/pod-scaler/resources_test.go (1)
129-191: Also de-flake this test: pin minSamplesForRecommendation so the ambient environment can't change its behavior.
Right now, an externally set POD_SCALER_MIN_SAMPLES could cause this test to fail or behave unexpectedly because the package-level variable is initialized from the environment.
cmd/pod-scaler/admission_test.go (2)
554-568: Add coverage for authoritative=false (especially CPU) to match real default behavior
mutatePodResources(..., true, true, ...) hard-codes authoritative mode on in all cases here, so the test suite won't catch regressions where reductions must be skipped when authoritativeCPU/authoritativeMemory are disabled. Consider adding at least one case for each flag being false (CPU-only false is particularly important given the CLI default).
664-779: Add a millicore CPU test case to guard Quantity rounding/truncation. The new reduction scenarios are good, but they all use whole-number CPU quantities. Please add a case where CPU is expressed in millicores (e.g., configured 500m, recommended 200m) so the tests fail if the implementation accidentally truncates fractional cores during the 1.2x headroom or the 25% reduction clamp; a short illustration follows.
Also applies to: 857-865
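A quick illustration of why millicore coverage matters, using the Kubernetes resource API; this is a standalone sketch, not the test harness, and the 25% cap mirrors the one discussed in this review.

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	configured := resource.MustParse("500m")  // request as configured on the pod
	recommended := resource.MustParse("200m") // recommendation from recent usage

	// The recommendation asks for a 60% cut, well past a 25%-per-cycle cap.
	reduction := 1.0 - float64(recommended.MilliValue())/float64(configured.MilliValue())
	fmt.Printf("requested reduction: %.0f%%\n", reduction*100) // 60%

	// Applying the cap in whole cores would truncate 0.375 cores to 0;
	// doing it in millicores keeps the result at 375m, a true 25% cut.
	capped := resource.NewMilliQuantity(configured.MilliValue()*3/4, resource.DecimalSI)
	fmt.Println(capped.String()) // 375m
}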
📒 Files selected for processing (10)
- .gitignore
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/admission_test.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/resources.go
- cmd/pod-scaler/resources_test.go
- cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
- cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
- test/e2e/pod-scaler/run/consumer.go
- test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e/pod-scaler/run/producer.go
- .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
- Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/resources_test.gocmd/pod-scaler/main.gocmd/pod-scaler/admission_test.gocmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.difftest/e2e/pod-scaler/run/consumer.gocmd/pod-scaler/admission.gocmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diffcmd/pod-scaler/resources.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
PodScalerReporter (174-176)
pkg/api/types.go (1)
ResourceRequirements (246-253)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
Data (43-45)
🔇 Additional comments (7)
test/e2e/pod-scaler/run/consumer.go (1)
66-72: Good: e2e explicitly sets POD_SCALER_MIN_SAMPLES=1 to avoid window/sample-count coupling.
This should make e2e less brittle given the 3-week filtering + min-sample guard.
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)
13-19: Sanity-check fixture expectations against the “max 25% reduction per cycle” rule (rounding can break caps).
Given these fixtures now encode new steady-state requests, please ensure the underlying capping logic can't be violated by quantity rounding (e.g., whole cores / Gi rounding).
Also applies to: 31-37, 50-55
cmd/pod-scaler/main.go (2)
259-276: Good: flags are correctly plumbed into admission path
mainAdmission passes both opts.authoritativeCPU and opts.authoritativeMemory into admit(...), keeping wiring consistent end-to-end.
60-97: Help text adequately documents the authoritative-cpu default; clarify the authoritative-memory default instead.
The --authoritative-cpu=false default is documented in the code via help text explaining that it defaults to false because CPU measurements are affected by node contention. However, the asymmetry is worth addressing: --authoritative-memory=true defaults to true without explaining why. If the memory default is intentionally permissive (or has different reasoning), document that reasoning in the help text alongside the CPU explanation for consistency.
cmd/pod-scaler/admission.go (3)
36-48: LGTM: authoritative flags are plumbed into the webhook handler
Signature/struct wiring looks consistent: main.go -> admit -> podMutator -> mutatePodResources.
Also applies to: 54-66, 88-104
343-362: Design fragility: recommendation limits are implicitly prevented, not explicitly enforced. The concern about
mutateResourceLimits=falsebeing bypassed doesn't currently exist in practice becausedigestDataonly populates theRequestsfield of recommendations—theLimitsfield is initialized but never written to. Consequently, whenapplyRecommendationsBasedOnRecentDataprocesses limits, theif our.IsZero()check skips them.However, this safety relies on implicit contract enforcement in
digestData. For maintainability, explicitly document that recommendations never include limits, or add an explicit guard inmutatePodResourcesthat skips the recommendations→limits path unlessmutateResourceLimitsis enabled.Likely an incorrect or invalid review comment.
201-294: [rewritten review comment]
[classification tag]
| func TestDigestData_FiltersOldData(t *testing.T) { | ||
| now := time.Now() | ||
| fourWeeksAgo := now.Add(-28 * 24 * time.Hour) | ||
| oneWeekAgo := now.Add(-7 * 24 * time.Hour) | ||
|
|
||
| // Create test data with fingerprints at different times | ||
| oldFingerprint := model.Fingerprint(1) | ||
| recentFingerprint1 := model.Fingerprint(2) | ||
| recentFingerprint2 := model.Fingerprint(3) | ||
| recentFingerprint3 := model.Fingerprint(4) | ||
|
|
||
| // Create histograms with different values so we can verify which ones are used | ||
| oldHist := circonusllhist.New() | ||
| if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored | ||
| t.Fatalf("failed to record value: %v", err) | ||
| } | ||
|
|
||
| recentHist1 := circonusllhist.New() | ||
| if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used | ||
| t.Fatalf("failed to record value: %v", err) | ||
| } | ||
|
|
||
| recentHist2 := circonusllhist.New() | ||
| if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used | ||
| t.Fatalf("failed to record value: %v", err) | ||
| } | ||
|
|
||
| recentHist3 := circonusllhist.New() | ||
| if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used | ||
| t.Fatalf("failed to record value: %v", err) | ||
| } | ||
|
|
||
| meta := podscaler.FullMetadata{ | ||
| Metadata: api.Metadata{ | ||
| Org: "test-org", | ||
| Repo: "test-repo", | ||
| Branch: "main", | ||
| }, | ||
| Container: "test-container", | ||
| } | ||
|
|
||
| data := &podscaler.CachedQuery{ | ||
| Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{ | ||
| oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist), | ||
| recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1), | ||
| recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2), | ||
| recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3), | ||
| }, | ||
| DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{ | ||
| meta: { | ||
| // Old data - should be filtered out | ||
| { | ||
| Fingerprint: oldFingerprint, | ||
| Added: fourWeeksAgo, | ||
| }, | ||
| // Recent data - should be included (need at least 3 samples) | ||
| { | ||
| Fingerprint: recentFingerprint1, | ||
| Added: oneWeekAgo, | ||
| }, | ||
| { | ||
| Fingerprint: recentFingerprint2, | ||
| Added: oneWeekAgo, | ||
| }, | ||
| { | ||
| Fingerprint: recentFingerprint3, | ||
| Added: oneWeekAgo, | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
|
|
||
| server := &resourceServer{ | ||
| logger: logrus.WithField("test", "TestDigestData_FiltersOldData"), | ||
| byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{}, | ||
| } | ||
|
|
||
| // The digestData function uses time.Now() internally to calculate the cutoff. | ||
| // Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function | ||
| // will correctly filter based on the current time and resourceRecommendationWindow. | ||
|
|
||
| // Digest the data | ||
| server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity { | ||
| return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI) | ||
| }) | ||
|
|
||
| // Verify that the recommendation was created | ||
| recommendation, exists := server.recommendedRequestFor(meta) | ||
| if !exists { | ||
| t.Fatal("Expected recommendation to exist, but it doesn't") | ||
| } | ||
|
|
||
| // The recommendation should be based on recent data only (50.0, 75.0, 60.0) | ||
| // At 0.8 quantile with weighted averaging (3x weight for past week), it should be | ||
| // closer to 75.0 than 100.0. Note: digestData returns the raw quantile value; | ||
| // the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData. | ||
| cpuRequest := recommendation.Requests[corev1.ResourceCPU] | ||
| cpuRequestMilli := cpuRequest.MilliValue() | ||
|
|
||
| // The value should be based on recent data (around 70-100 cores worth of millicores) | ||
| // We allow some variance due to histogram quantization | ||
| if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 { | ||
| t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli) | ||
| } | ||
|
|
||
| // Verify it's not based on the old high value (100.0) | ||
| if cpuRequestMilli > 120000 { | ||
| t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli) | ||
| } | ||
| } |
Tighten the test: make it independent of POD_SCALER_MIN_SAMPLES env and actually prove old data is excluded.
Proposed fix
package main
import (
"testing"
"time"
@@
func TestDigestData_FiltersOldData(t *testing.T) {
+ // Avoid test behavior depending on the ambient environment.
+ origMin := minSamplesForRecommendation
+ minSamplesForRecommendation = 3
+ t.Cleanup(func() { minSamplesForRecommendation = origMin })
+
now := time.Now()
fourWeeksAgo := now.Add(-28 * 24 * time.Hour)
- oneWeekAgo := now.Add(-7 * 24 * time.Hour)
+ // Keep this comfortably within the "past week" bucket to avoid boundary effects.
+ oneWeekAgo := now.Add(-6 * 24 * time.Hour)
@@
- oldHist := circonusllhist.New()
- if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored
+ oldHist := circonusllhist.New()
+ if err := oldHist.RecordValue(1000.0); err != nil { // Much higher value - must be ignored
t.Fatalf("failed to record value: %v", err)
}
@@
- // The value should be based on recent data (around 70-100 cores worth of millicores)
- // We allow some variance due to histogram quantization
- if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
- t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
- }
-
- // Verify it's not based on the old high value (100.0)
- if cpuRequestMilli > 120000 {
- t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
- }
+ // Should be driven by recent values (~75 cores => ~75000m), not the old 1000-core spike.
+ if cpuRequestMilli < 60000 || cpuRequestMilli > 120000 {
+ t.Errorf("Expected CPU request to be based on recent data (~60000-120000m), got %d millicores", cpuRequestMilli)
+ }
+ if cpuRequestMilli > 200000 {
+ t.Errorf("CPU request appears to include old data (1000.0), got %d millicores", cpuRequestMilli)
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func TestDigestData_FiltersOldData(t *testing.T) { | |
| now := time.Now() | |
| fourWeeksAgo := now.Add(-28 * 24 * time.Hour) | |
| oneWeekAgo := now.Add(-7 * 24 * time.Hour) | |
| // Create test data with fingerprints at different times | |
| oldFingerprint := model.Fingerprint(1) | |
| recentFingerprint1 := model.Fingerprint(2) | |
| recentFingerprint2 := model.Fingerprint(3) | |
| recentFingerprint3 := model.Fingerprint(4) | |
| // Create histograms with different values so we can verify which ones are used | |
| oldHist := circonusllhist.New() | |
| if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist1 := circonusllhist.New() | |
| if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist2 := circonusllhist.New() | |
| if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist3 := circonusllhist.New() | |
| if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| meta := podscaler.FullMetadata{ | |
| Metadata: api.Metadata{ | |
| Org: "test-org", | |
| Repo: "test-repo", | |
| Branch: "main", | |
| }, | |
| Container: "test-container", | |
| } | |
| data := &podscaler.CachedQuery{ | |
| Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{ | |
| oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist), | |
| recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1), | |
| recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2), | |
| recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3), | |
| }, | |
| DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{ | |
| meta: { | |
| // Old data - should be filtered out | |
| { | |
| Fingerprint: oldFingerprint, | |
| Added: fourWeeksAgo, | |
| }, | |
| // Recent data - should be included (need at least 3 samples) | |
| { | |
| Fingerprint: recentFingerprint1, | |
| Added: oneWeekAgo, | |
| }, | |
| { | |
| Fingerprint: recentFingerprint2, | |
| Added: oneWeekAgo, | |
| }, | |
| { | |
| Fingerprint: recentFingerprint3, | |
| Added: oneWeekAgo, | |
| }, | |
| }, | |
| }, | |
| } | |
| server := &resourceServer{ | |
| logger: logrus.WithField("test", "TestDigestData_FiltersOldData"), | |
| byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{}, | |
| } | |
| // The digestData function uses time.Now() internally to calculate the cutoff. | |
| // Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function | |
| // will correctly filter based on the current time and resourceRecommendationWindow. | |
| // Digest the data | |
| server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity { | |
| return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI) | |
| }) | |
| // Verify that the recommendation was created | |
| recommendation, exists := server.recommendedRequestFor(meta) | |
| if !exists { | |
| t.Fatal("Expected recommendation to exist, but it doesn't") | |
| } | |
| // The recommendation should be based on recent data only (50.0, 75.0, 60.0) | |
| // At 0.8 quantile with weighted averaging (3x weight for past week), it should be | |
| // closer to 75.0 than 100.0. Note: digestData returns the raw quantile value; | |
| // the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData. | |
| cpuRequest := recommendation.Requests[corev1.ResourceCPU] | |
| cpuRequestMilli := cpuRequest.MilliValue() | |
| // The value should be based on recent data (around 70-100 cores worth of millicores) | |
| // We allow some variance due to histogram quantization | |
| if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 { | |
| t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli) | |
| } | |
| // Verify it's not based on the old high value (100.0) | |
| if cpuRequestMilli > 120000 { | |
| t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli) | |
| } | |
| } | |
| func TestDigestData_FiltersOldData(t *testing.T) { | |
| // Avoid test behavior depending on the ambient environment. | |
| origMin := minSamplesForRecommendation | |
| minSamplesForRecommendation = 3 | |
| t.Cleanup(func() { minSamplesForRecommendation = origMin }) | |
| now := time.Now() | |
| fourWeeksAgo := now.Add(-28 * 24 * time.Hour) | |
| // Keep this comfortably within the "past week" bucket to avoid boundary effects. | |
| oneWeekAgo := now.Add(-6 * 24 * time.Hour) | |
| // Create test data with fingerprints at different times | |
| oldFingerprint := model.Fingerprint(1) | |
| recentFingerprint1 := model.Fingerprint(2) | |
| recentFingerprint2 := model.Fingerprint(3) | |
| recentFingerprint3 := model.Fingerprint(4) | |
| // Create histograms with different values so we can verify which ones are used | |
| oldHist := circonusllhist.New() | |
| if err := oldHist.RecordValue(1000.0); err != nil { // Much higher value - must be ignored | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist1 := circonusllhist.New() | |
| if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist2 := circonusllhist.New() | |
| if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| recentHist3 := circonusllhist.New() | |
| if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used | |
| t.Fatalf("failed to record value: %v", err) | |
| } | |
| meta := podscaler.FullMetadata{ | |
| Metadata: api.Metadata{ | |
| Org: "test-org", | |
| Repo: "test-repo", | |
| Branch: "main", | |
| }, | |
| Container: "test-container", | |
| } | |
| data := &podscaler.CachedQuery{ | |
| Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{ | |
| oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist), | |
| recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1), | |
| recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2), | |
| recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3), | |
| }, | |
| DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{ | |
| meta: { | |
| // Old data - should be filtered out | |
| { | |
| Fingerprint: oldFingerprint, | |
| Added: fourWeeksAgo, | |
| }, | |
| // Recent data - should be included (need at least 3 samples) | |
| { | |
| Fingerprint: recentFingerprint1, | |
| Added: oneWeekAgo, | |
| }, | |
| { | |
| Fingerprint: recentFingerprint2, | |
| Added: oneWeekAgo, | |
| }, | |
| { | |
| Fingerprint: recentFingerprint3, | |
| Added: oneWeekAgo, | |
| }, | |
| }, | |
| }, | |
| } | |
| server := &resourceServer{ | |
| logger: logrus.WithField("test", "TestDigestData_FiltersOldData"), | |
| byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{}, | |
| } | |
| // The digestData function uses time.Now() internally to calculate the cutoff. | |
| // Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function | |
| // will correctly filter based on the current time and resourceRecommendationWindow. | |
| // Digest the data | |
| server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity { | |
| return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI) | |
| }) | |
| // Verify that the recommendation was created | |
| recommendation, exists := server.recommendedRequestFor(meta) | |
| if !exists { | |
| t.Fatal("Expected recommendation to exist, but it doesn't") | |
| } | |
| // The recommendation should be based on recent data only (50.0, 75.0, 60.0) | |
| // At 0.8 quantile with weighted averaging (3x weight for past week), it should be | |
| // closer to 75.0 than 1000.0. Note: digestData returns the raw quantile value; | |
| // the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData. | |
| cpuRequest := recommendation.Requests[corev1.ResourceCPU] | |
| cpuRequestMilli := cpuRequest.MilliValue() | |
| // Should be driven by recent values (~75 cores => ~75000m), not the old 1000-core spike. | |
| if cpuRequestMilli < 60000 || cpuRequestMilli > 120000 { | |
| t.Errorf("Expected CPU request to be based on recent data (~60000-120000m), got %d millicores", cpuRequestMilli) | |
| } | |
| if cpuRequestMilli > 200000 { | |
| t.Errorf("CPU request appears to include old data (1000.0), got %d millicores", cpuRequestMilli) | |
| } | |
| } |
🤖 Prompt for AI Agents
In @cmd/pod-scaler/resources_test.go around lines 18 - 127, The test depends on
the POD_SCALER_MIN_SAMPLES env and current window, so make it deterministic by
setting the server's minSamples and resourceRecommendationWindow directly before
calling digestData: set resourceServer.minSamples = 3 and
resourceServer.resourceRecommendationWindow = 14*24*time.Hour (so fourWeeksAgo
is excluded and oneWeekAgo included), then call server.digestData(...) and
assert via server.recommendedRequestFor(meta) that the recommendation reflects
only the recent fingerprints; this ensures old data is excluded regardless of
environment.
| Limits: v1.ResourceList{ | ||
| - s"cpu": {i: resource.int64Amount{value: 16}, Format: "DecimalSI"}, | ||
| - s"memory": {i: resource.int64Amount{value: 400000000}, Format: "BinarySI"}, | ||
| + s"memory": {i: resource.int64Amount{value: 600000000}, Format: "BinarySI"}, | ||
| + s"memory": {i: resource.int64Amount{value: 480000000}, Format: "BinarySI"}, | ||
| }, |
🧩 Analysis chain
🏁 Script executed:
find cmd/pod-scaler -name "*.diff" | head -5Repository: openshift/ci-tools
Length of output: 321
🏁 Script executed:
cat -n cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff | head -100Repository: openshift/ci-tools
Length of output: 4103
🏁 Script executed:
# Find the mutation implementation
find cmd -name "*.go" | xargs grep -l "MutatePodResources" | head -5Repository: openshift/ci-tools
Length of output: 94
🏁 Script executed:
cat -n cmd/pod-scaler/admission_test.go | head -150Repository: openshift/ci-tools
Length of output: 6015
🏁 Script executed:
# Find the mutation implementation
find cmd/pod-scaler -name "*.go" -type f | grep -v test | head -10Repository: openshift/ci-tools
Length of output: 247
🏁 Script executed:
# Find the TestMutatePodResources test case
grep -n "TestMutatePodResources" cmd/pod-scaler/admission_test.goRepository: openshift/ci-tools
Length of output: 109
🏁 Script executed:
# Look at the admission.go file for mutation logic
cat -n cmd/pod-scaler/admission.go | head -200Repository: openshift/ci-tools
Length of output: 8972
🏁 Script executed:
# Find the mutatePodResources function
grep -n "func mutatePodResources" cmd/pod-scaler/admission.goRepository: openshift/ci-tools
Length of output: 293
🏁 Script executed:
# Read the mutatePodResources function
sed -n '200,400p' cmd/pod-scaler/admission.go | cat -nRepository: openshift/ci-tools
Length of output: 9955
🏁 Script executed:
# Get the TestMutatePodResources test case starting at line 339
sed -n '339,600p' cmd/pod-scaler/admission_test.go | cat -nRepository: openshift/ci-tools
Length of output: 11214
🏁 Script executed:
# Check if the fixture file is auto-generated or manually written
head -20 cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diffRepository: openshift/ci-tools
Length of output: 1102
🏁 Script executed:
# Let me trace through the logic manually for the 10->7 case
# Check the exact math in applyRecommendationsBasedOnRecentData
cat -n cmd/pod-scaler/admission.go | sed -n '75,90p'Repository: openshift/ci-tools
Length of output: 706
🏁 Script executed:
# Get the exact reduction calculation code
sed -n '75,89p' cmd/pod-scaler/admission.go | cat -nRepository: openshift/ci-tools
Length of output: 637
🏁 Script executed:
# Now trace the 10->7 case manually:
# ours (recommendation) = 5 (from test line ~66-68), theirs = 10
# reductionPercent = 1 - (5/10) = 0.5 = 50%
# maxAllowed = 10 * (1 - 0.25) = 7.5
# int64(7.5) = 7
# This is a 30% reduction (10-7=3, 3/10=0.3), not 25%
python3 << 'EOF'
# Verify the math
theirs = 10
ours = 5
reduction_percent = 1.0 - (ours / theirs)
print(f"Reduction percent: {reduction_percent:.1%}")
max_allowed = theirs * (1.0 - 0.25)
print(f"Max allowed (float): {max_allowed}")
print(f"Max allowed (int64): {int(max_allowed)}")
actual_reduction = (theirs - int(max_allowed)) / theirs
print(f"Actual reduction after int truncation: {actual_reduction:.1%}")
EOFRepository: openshift/ci-tools
Length of output: 148
🏁 Script executed:
# Verify that the test expectations confirm this is happening
# Check the fixture to see if 10->7 is the expected result
grep -A2 "value: 10" cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff | head -5Repository: openshift/ci-tools
Length of output: 335
🏁 Script executed:
# Verify CPU limits removal is intentional by checking reconcileLimits
sed -n '97,118p' cmd/pod-scaler/admission.go | cat -nRepository: openshift/ci-tools
Length of output: 1176
🏁 Script executed:
# Check if there are any tests verifying the 25% cap behavior specifically
grep -n "25" cmd/pod-scaler/admission_test.goRepository: openshift/ci-tools
Length of output: 1458
🏁 Script executed:
# Check the comments in the reduction logic more carefully
sed -n '66,95p' cmd/pod-scaler/admission.go | cat -nRepository: openshift/ci-tools
Length of output: 1342
🏁 Script executed:
# Check the reconcileLimits function comment more carefully
sed -n '97,104p' cmd/pod-scaler/admission.goRepository: openshift/ci-tools
Length of output: 434
🏁 Script executed:
# Verify the test setup and what values are being used
sed -n '89,131p' cmd/pod-scaler/admission_test.go | cat -nRepository: openshift/ci-tools
Length of output: 1788
🏁 Script executed:
# Let me check what CPU value the test setup provides for "overcap" container
sed -n '66,71p' cmd/pod-scaler/admission_test.go | cat -nRepository: openshift/ci-tools
Length of output: 361
🏁 Script executed:
# And verify the test container that has CPU=10
sed -n '122,131p' cmd/pod-scaler/admission_test.go | cat -nRepository: openshift/ci-tools
Length of output: 568
🏁 Script executed:
# Let me verify by looking at the exact test input and output more carefully
# Check what container names are being used in the test
sed -n '88,132p' cmd/pod-scaler/admission_test.go | grep -E 'Name:|NewQuantity.*value: (10|5)' | head -20Repository: openshift/ci-tools
Length of output: 44
🏁 Script executed:
# Confirm the resource recommendations in the test
sed -n '47,72p' cmd/pod-scaler/admission_test.go | grep -E 'baseWithContainer|NewQuantity'Repository: openshift/ci-tools
Length of output: 44
🏁 Script executed:
# Let me directly verify the test container configuration
sed -n '88,132p' cmd/pod-scaler/admission_test.goRepository: openshift/ci-tools
Length of output: 1610
🏁 Script executed:
# Let me get the complete test container definitions from TestMutatePodResources
sed -n '339,500p' cmd/pod-scaler/admission_test.go | grep -A 15 "Name: \"small\"" | tail -30Repository: openshift/ci-tools
Length of output: 972
🏁 Script executed:
# Double-check the reconcileLimits function
sed -n '97,118p' cmd/pod-scaler/admission.goRepository: openshift/ci-tools
Length of output: 1022
🏁 Script executed:
# Confirm the exact reduction calculation by examining the code logic
sed -n '75,95p' cmd/pod-scaler/admission.goRepository: openshift/ci-tools
Length of output: 837
🏁 Script executed:
# Search for any related issue or TODO comments about rounding/truncation
grep -n "round\|truncat\|int64" cmd/pod-scaler/admission.go | head -20Repository: openshift/ci-tools
Length of output: 850
🏁 Script executed:
# Verify the exact test setup and expected behavior
cat -n cmd/pod-scaler/admission.go | sed -n '66,90p'Repository: openshift/ci-tools
Length of output: 1064
🏁 Script executed:
# Check if there are any recent fixes or TODOs related to this rounding issue
grep -C 3 "Set(int64" cmd/pod-scaler/admission.goRepository: openshift/ci-tools
Length of output: 762
Fix the truncation bug in the 25% reduction cap: maxAllowed = theirValue * (1.0 - maxReductionPercent) followed by our.Set(int64(maxAllowed)) truncates 7.5 to 7, allowing 30% reduction instead of 25%.
The CPU limits removal is intentional (see reconcileLimits() which explicitly deletes CPU limits, with a comment that it's not useful). However, the 10 -> 7 CPU reduction in lines 73-77 violates the "max 25% reduction per cycle" guarantee. When the cap calculation yields 7.5, truncating to int64(7) results in a 30% reduction. Round up or apply the cap before converting to int64.
Also applies to: 73-77
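One way to make the cap rounding-safe, sketched in plain Go under the assumption that the cap is applied to the raw int64 value as described above; the production code would do the same arithmetic on the resource.Quantity it already holds.

package main

import (
	"fmt"
	"math"
)

// cappedValue returns the lowest value the scaler may set this cycle,
// rounding up so truncation can never push the cut past the cap.
func cappedValue(current int64, maxReduction float64) int64 {
	return int64(math.Ceil(float64(current) * (1.0 - maxReduction)))
}

func main() {
	// With current=10 and a 25% cap, 10*0.75 = 7.5; rounding up keeps the
	// floor at 8 (a 20% cut) instead of truncating to 7 (a 30% cut).
	fmt.Println(cappedValue(10, 0.25)) // 8
}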
@deepsm007: The following tests failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Add flags to enable/disable authoritative mode for CPU and memory resource requests separately.
The pod-scaler can now decrease resource requests when authoritative mode is enabled (memory defaults to on; the CPU default was switched off in a later revision of this PR), allowing gradual resource reduction based on recent usage data with safety limits (at most 25% per cycle).
/cc @openshift/test-platform