
Conversation

@deepsm007
Contributor

@deepsm007 deepsm007 commented Jan 6, 2026

Add flags to enable/disable authoritative mode for CPU and memory resource requests separately.
The pod-scaler can now decrease resource requests when authoritative mode is enabled (default: true for both), allowing gradual resource reduction based on recent usage data with safety limits (max 25% per cycle).
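
To make the reduction rule concrete, here is a minimal Go sketch of the safety limits described in this PR (ignore differences under 5%, cap cuts at 25% per cycle); the function name and plain float inputs are illustrative, not the PR's actual helper:

package main

import "fmt"

// applyReduction returns the value to configure, given a recommendation based on
// recent usage and the currently configured value. Illustrative thresholds only:
// skip differences under 5% and never cut more than 25% in a single cycle.
func applyReduction(recommended, configured float64) float64 {
	if recommended >= configured {
		return recommended // increases are applied directly
	}
	reduction := (configured - recommended) / configured
	if reduction < 0.05 {
		return configured // change too small to be worth a mutation
	}
	if reduction > 0.25 {
		return configured * 0.75 // cap the per-cycle reduction at 25%
	}
	return recommended
}

func main() {
	// 200m configured, 100m recommended: a 50% cut is capped to 25%, i.e. 150m.
	fmt.Println(applyReduction(100, 200))
}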

/cc @openshift/test-platform

@openshift-ci-robot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci bot requested a review from a team January 6, 2026 13:23
@coderabbitai

coderabbitai bot commented Jan 6, 2026

Walkthrough

Adds time-windowed (3-week) recommendation filtering with minimum-sample guards and minimum CPU/memory floors; replaces recommendation logic with authoritative-aware, gradual reductions (ignore <5% changes, cap 25% per cycle); extends admission/mutation APIs and flags; updates tests, fixtures, e2e env and logging.

Changes

Cohort / File(s) Summary
Admission & Mutation
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/testdata/...
Added authoritativeCPU and authoritativeMemory flags/fields and propagated through admit, podMutator, and mutatePodResources; renamed useOursIfLarger to applyRecommendationsBasedOnRecentData; implemented authoritative-aware reduction rules, adjusted logging, and updated tests/fixtures.
Resource Aggregation & Rules
cmd/pod-scaler/resources.go, cmd/pod-scaler/resources_test.go
Added resourceRecommendationWindow (3 weeks), minimums (minCPURequestMilli, minMemoryRequestBytes) and env-driven minSamplesForRecommendation; digestData now filters/weights recent fingerprints, skips when insufficient recent data, applies minima, and logs decisions.
Tests & Fixtures
cmd/pod-scaler/resources_test.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/testdata/...
New tests for recency filtering and skipping when no recent data; updated admission tests for new API and behavior; fixtures updated to typed quantities and revised expected resource values.
E2E & Env
test/e2e/pod-scaler/run/consumer.go, test/e2e/pod-scaler/run/producer.go
Injected POD_SCALER_MIN_SAMPLES=1 into pod-scaler process environment for e2e runs.
Misc
.gitignore
Minor gitignore update to include /pod-scaler.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci bot commented Jan 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 6, 2026
@deepsm007 deepsm007 changed the title Allow pod scaler to reduce resources based on past 3 weeks of usage data WIP: Allow pod scaler to reduce resources based on past 3 weeks of usage data Jan 6, 2026
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@deepsm007 deepsm007 changed the title WIP: Allow pod scaler to reduce resources based on past 3 weeks of usage data WIP: DPTP-2938: Allow pod scaler to reduce resources based on past 3 weeks of usage data Jan 6, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 6, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 6, 2026

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission.go (1)

212-218: Do not apply recommendation values to Limits since digestData only populates Requests.

The loop processes both Requests and Limits (lines 216-217), but digestData only populates Requests (see resources.go line 115: s.byMetaData[meta].Requests[request] = *q). The Limits field is initialized as empty and never filled with values.

When processing Limits:

  • our (the recommendation) is a zero-value Quantity
  • their (the configured limit) may have an actual value
  • When cmp < 0 (zero < configured), line 247 sets the limit to zero

This unintentionally removes memory limits that were explicitly configured. The reconcileLimits function cannot protect against this because it skips processing when limits are already zero (line 270).

Either populate Limits in digestData or exclude Limits from processing in applyRecommendationsBasedOnRecentData since recommendations for Limits do not exist.
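
One possible shape for the second option, sketched with a hypothetical guard rather than the PR's actual code: skip any resource whose recommendation is zero before comparing, so an empty Limits recommendation can never overwrite a configured limit.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// applyIfRecommended overwrites configured values only for resources that
// actually have a non-zero recommendation; zero-valued (i.e. missing)
// recommendations are skipped so configured limits survive untouched.
// The "use ours if larger" comparison here is only for illustration.
func applyIfRecommended(recommended, configured corev1.ResourceList) {
	for name, our := range recommended {
		if our.IsZero() {
			continue // no data for this resource; leave the configured value alone
		}
		if their, ok := configured[name]; !ok || our.Cmp(their) > 0 {
			configured[name] = our
		}
	}
}

func main() {
	configured := corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("2Gi")}
	recommended := corev1.ResourceList{corev1.ResourceMemory: *resource.NewQuantity(0, resource.BinarySI)}
	applyIfRecommended(recommended, configured)
	q := configured[corev1.ResourceMemory]
	fmt.Println(q.String()) // still 2Gi: the zero recommendation was ignored
}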

🤖 Fix all issues with AI Agents
In @cmd/pod-scaler/resources_test.go:
- Around line 100-104: The comment and error message are misleading: digestData returns the raw quantile value (the 1.2 multiplier is applied later, in applyRecommendationsBasedOnRecentData), and the test range is 70000-100000 millicores (i.e., 70-100 cores). Update the inline comment to state that digestData returns the raw quantile (no multiplier) and adjust the t.Errorf text to reflect the actual checked range (e.g., "Expected CPU request to be based on recent data (around 70,000-100,000 millicores), got %d millicores") and/or use human-readable units (70-100 cores) so the message matches the check on cpuRequestMilli.
🧹 Nitpick comments (1)
cmd/pod-scaler/admission.go (1)

255-260: Consider removing the deprecated wrapper sooner.

The TODO to remove useOursIfLarger is noted. Since this is a private function and the only caller has been updated (line 318), this wrapper could be removed in this PR to avoid dead code.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7a39967 and d0343ac.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/admission.go
🧬 Code graph analysis (2)
cmd/pod-scaler/resources.go (1)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
🔇 Additional comments (8)
cmd/pod-scaler/resources.go (2)

43-48: LGTM - Clear constant definition with good documentation.

The 21-day window is well-documented and the rationale for using recent data to enable both increases and decreases is clearly explained.


83-103: Verify behavior when data straddles the cutoff boundary.

The filtering logic is sound. One consideration: if a workload stops running for >3 weeks and then restarts, it will have no recommendations until new data is collected. This seems intentional but worth confirming this is the desired behavior for workloads with intermittent usage patterns.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)

14-54: Test fixture updates align with the new resource reduction behavior.

The updated values correctly reflect that the pod scaler can now reduce resources based on recent usage data. The uniform values across containers (6 CPU, ~20Gi memory) indicate the recommendation is being applied consistently.

cmd/pod-scaler/resources_test.go (2)

18-91: Good test coverage for the filtering behavior.

The test correctly validates that old data (4 weeks ago) is excluded while recent data (1 week ago) is included. The use of distinct histogram values (100 for old, 50/75 for recent) makes it easy to verify which data was used.


112-170: LGTM - Solid negative test case.

This test correctly verifies that no recommendation is produced when all data points are outside the 3-week window.

cmd/pod-scaler/admission.go (1)

221-224: The 20% boost truncation is acceptable but worth noting.

The int64() cast truncates fractional values. For memory this means losing sub-byte precision (negligible), and for CPU it means losing sub-millicore precision (also negligible in practice).

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (2)

13-24: Test fixture correctly reflects the combined changes.

  • CPU limit removed (handled by reconcileLimits)
  • Memory limit increased to 480000000 (200% of request per reconcileLimits logic, since request 240000000 * 2 = 480000000)
  • Requests show the resource reduction behavior in action

55-77: Verify the intentional behavior of setting all containers to the same values.

Containers with previously different resource configurations (CPU: 2, 10; Memory: 100) are now all set to the same recommendation (6 CPU, 240000000 memory). This implies all containers share the same metadata lookup key, resulting in uniform recommendations.

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007 deepsm007 changed the title WIP: DPTP-2938: Allow pod scaler to reduce resources based on past 3 weeks of usage data DPTP-2938: Allow pod scaler to reduce resources based on past 3 weeks of usage data Jan 6, 2026
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 6, 2026

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage
  • Adds pod classification system that labels pods as "normal" or "measured" based on whether they need fresh resource measurement data (measured if last measurement >10 days ago or never measured)
  • Implements podAntiAffinity rules to ensure measured pods run on isolated nodes with no other CI workloads, allowing accurate CPU/memory utilization measurement without node contention
  • Integrates BigQuery client to query and cache max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
  • Applies measured resource recommendations only to the longest-running container in each pod, using actual utilization data instead of Prometheus metrics that may be skewed by node contention

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 3d13a7f to 0b355c7 on January 6, 2026 16:08
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 6, 2026

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

  • Time-based filtering to use only the past 3 weeks of data
  • Logic to allow resource reductions based on recent usage

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission.go (1)

227-229: Reconsider applying 1.2x safety margin before authoritative reductions.

The 1.2x multiplier is applied to all recommendations unconditionally (lines 227-229), including when authoritative mode will subsequently reduce resources. This creates an inconsistency:

Example scenario:

  • Historical usage data: 100 units
  • After 1.2x multiplier: 120 units (recommendation)
  • Current configured: 200 units
  • Reduction from 200 → 120 = 40%
  • Capped at 25% max reduction → final: 150 units

However, the actual usage was only 100 units. The safety margin makes sense when increasing resources, but when reducing in authoritative mode, the 25% per-cycle cap and 5% minimum threshold already provide substantial safety. Applying the 1.2x multiplier before calculating reductions may slow down necessary right-sizing.

Consider:

  1. Skip the 1.2x multiplier when authoritative mode will reduce (i.e., when cmp < 0 and authoritative is true)
  2. Or apply the multiplier conditionally only for increases
Potential approach
 		increased := our.AsApproximateFloat64() * 1.2
-		our.Set(int64(increased))
 
 		their := (*pair.theirs)[field]
+		authoritative := (field == corev1.ResourceCPU && authoritativeCPU) || (field == corev1.ResourceMemory && authoritativeMemory)
+		
+		// Apply 1.2x safety margin only when increasing or when not in authoritative mode
+		tempOur := our.DeepCopy()
+		if !authoritative || their.IsZero() {
+			tempOur.Set(int64(increased))
+		}
+		
 		fieldLogger := logger.WithFields(logrus.Fields{
 			"workloadName": workloadName,
 			"workloadType": workloadType,
 			"field":        field,
 			"resource":     pair.resource,
-			"determined":   our.String(),
+			"determined":   tempOur.String(),
 			"configured":   their.String(),
 		})
-		cmp := our.Cmp(their)
+		cmp := tempOur.Cmp(their)
 		if cmp == 1 {
+			// Apply safety margin when increasing
+			tempOur.Set(int64(increased))
 			fieldLogger.Debug("determined amount larger than configured, increasing resources")
-			(*pair.theirs)[field] = our
+			(*pair.theirs)[field] = tempOur
 			if their.Value() > 0 && our.Value() > (their.Value()*10) {
 				reporter.ReportResourceConfigurationWarning(workloadName, workloadType, their.String(), our.String(), field.String())
 			}
 		} else if cmp < 0 {
-			authoritative := (field == corev1.ResourceCPU && authoritativeCPU) || (field == corev1.ResourceMemory && authoritativeMemory)
 			if authoritative {
 				// Apply gradual reduction with safety limits: max 25% reduction per cycle, minimum 5% difference
-				ourValue := our.AsApproximateFloat64()
+				ourValue := tempOur.AsApproximateFloat64()
🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)

105-116: Consider performance implications of recency weighting approach.

The current implementation applies recency weights by calling overall.Merge(hist) multiple times in a loop (lines 114-116). For data with many recent samples and high weights (3x), this could result in a significant number of merge operations.

Consider whether there's a more efficient approach, such as:

  • Checking if the histogram library supports weighted merges directly
  • Sampling or aggregating data before weighting
  • Profiling to determine if this is actually a bottleneck in practice

Note: This is flagged as a recommended refactor since the current approach is functionally correct, but may have performance implications at scale.
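
For reference, the repeated-merge weighting pattern being discussed looks roughly like the following; the histogram type here is a stand-in (the real code uses its own histogram library), and only the 3x/2x/1x tiers mirror what the review describes:

package main

import (
	"fmt"
	"time"
)

// histogram is a stand-in for the real histogram type; only Merge matters here.
type histogram struct{ samples []float64 }

func (h *histogram) Merge(other *histogram) { h.samples = append(h.samples, other.samples...) }

// weightFor picks the repetition count used when merging a fingerprint's
// histogram: newer data is merged more times, so it dominates the quantiles.
func weightFor(age time.Duration) int {
	switch {
	case age <= 7*24*time.Hour:
		return 3
	case age <= 14*24*time.Hour:
		return 2
	default:
		return 1
	}
}

func main() {
	now := time.Now()
	overall := &histogram{}
	data := map[time.Time]*histogram{
		now.Add(-2 * 24 * time.Hour):  {samples: []float64{50, 75}},
		now.Add(-16 * 24 * time.Hour): {samples: []float64{100}},
	}
	for added, hist := range data {
		// merge each fingerprint's histogram once per weight unit
		for i := 0; i < weightFor(now.Sub(added)); i++ {
			overall.Merge(hist)
		}
	}
	fmt.Println(len(overall.samples)) // recent samples counted 3x, older 1x -> 7
}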

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d0343ac and 0b355c7.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/pod-scaler/resources_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/admission_test.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
cmd/pod-scaler/admission_test.go (1)
pkg/api/types.go (2)
  • ResourceList (261-261)
  • ResourceRequirements (246-253)
🔇 Additional comments (5)
cmd/pod-scaler/main.go (1)

64-71: LGTM!

The authoritative flags are properly added to the options structure, bound to CLI flags with clear descriptions, and correctly passed through to the admission handler. The implementation follows existing patterns in the codebase.

Also applies to: 94-95, 275-275

cmd/pod-scaler/admission.go (2)

36-36: LGTM!

The authoritative flags are correctly wired through the admission webhook handler and podMutator struct, following existing patterns.

Also applies to: 47-47, 54-65


247-276: Well-structured reduction logic with appropriate safety constraints.

The authoritative mode reduction logic includes good safeguards:

  • Maximum 25% reduction per cycle (line 255) prevents aggressive changes
  • Minimum 5% threshold (line 257) avoids unnecessary micro-adjustments
  • Proper handling of edge cases (zero values at lines 269-272)
  • Clear logging for each decision path

These constraints ensure gradual, controlled resource optimization rather than sudden changes that could impact workload stability.

cmd/pod-scaler/admission_test.go (2)

557-557: Test updates are consistent with the new function signature.

The test updates correctly:

  • Add the new authoritative boolean flags to function calls
  • Use false, false for basic mutation tests (line 557)
  • Use true, true for authoritative behavior tests (lines 774, 859)

The tests are well-structured and document the expected behavior clearly.

Also applies to: 774-774, 859-859


687-697: Test expectations validate current behavior including 1.2x multiplier.

The test expectations are mathematically correct for the current implementation. The inline comments clearly document that the 1.2x safety margin is applied before reduction calculations (e.g., "Ours: 10 * 1.2 = 12, Theirs: 200...").

Note: If the 1.2x multiplier application is changed as suggested in the admission.go review, these test expectations will need to be updated accordingly to reflect:

  • Raw usage values (without 1.2x) when calculating authoritative reductions
  • Only applying 1.2x when increasing resources

Also applies to: 758-768

@deepsm007
Contributor Author

/pipeline required

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007
Contributor Author

/test images

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)

126-135: Minor logging inconsistency.

Line 135 logs "merged all fingerprints" but now only recent fingerprints are merged. Consider updating this log message for accuracy.

📝 Suggested fix
 		metaLogger.Tracef("merged %d recent fingerprints (out of %d total)", recentCount, len(fingerprintTimes))
-		metaLogger.Trace("merged all fingerprints")
+		metaLogger.Trace("merged recent fingerprints")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0b355c7 and 3adf9b7.

📒 Files selected for processing (4)
  • .gitignore
  • cmd/pod-scaler/resources.go
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/resources.go
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/run/consumer.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
🔇 Additional comments (6)
test/e2e/pod-scaler/run/producer.go (1)

30-32: LGTM!

The environment variable setup correctly inherits existing environment variables and appends the test-specific override for POD_SCALER_MIN_SAMPLES. This aligns with the consumer.go pattern and the getMinSamplesForRecommendation() function in resources.go.
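
A minimal sketch of that pattern, inheriting the parent environment and appending the override for a hypothetical command invocation (the actual producer wiring may differ):

package main

import (
	"os"
	"os/exec"
)

func main() {
	// Start from the parent environment so credentials, PATH, etc. are preserved,
	// then append the test-only override; for duplicate keys the later entry wins.
	cmd := exec.Command("pod-scaler", "producer")
	cmd.Env = append(os.Environ(), "POD_SCALER_MIN_SAMPLES=1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	_ = cmd.Run() // error handling elided in this sketch
}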

test/e2e/pod-scaler/run/consumer.go (1)

66-72: LGTM!

The environment variables are correctly passed to the admission accessory, enabling the test to work with the reduced sample count requirement.

cmd/pod-scaler/resources.go (4)

42-66: LGTM!

The constants are well-documented with clear rationale. The 3-week window, minimum thresholds, and configurable sample count provide a solid foundation for the recency-based recommendations.


101-125: LGTM!

The recency filtering and weighting logic is well-structured. Computing cutoffTime and now outside the loop ensures consistency across all iterations. The tiered weighting approach (3x/2x/1x) appropriately prioritizes more recent data.


146-156: LGTM!

The minimum threshold clamping correctly prevents recommending zero or extremely small values that could cause scheduling issues. The format specifications (DecimalSI for CPU, BinarySI for memory) match the original quantity creation functions.


170-180: LGTM!

The function correctly parses the environment variable with appropriate validation (must be a positive integer) and falls back to a sensible default of 3 samples. The silent fallback on parse error is acceptable for configuration overrides.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch 2 times, most recently from 2b0465d to 69cde76 on January 8, 2026 16:34
@Prucek Prucek (Member) left a comment

Left some small comments, otherwise LGTM.

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

1 similar comment
@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 69cde76 to b0b3eec on January 9, 2026 15:04
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured_test.go:
- Around line 24-36: The test constructs a FullMetadata manually causing
potential key mismatch; replace the manual construction with the canonical
builder MetadataFor from the podscaler package so Target, Step and Pod match
production semantics (e.g., use
podscaler.MetadataFor("test-org","test-repo","main","test-target","test-step","test-pod")
or the appropriate overload) and then call bqClient.ShouldBeMeasured(meta) so
the test uses the same key shape as production.

In @cmd/pod-scaler/measured.go:
- Around line 89-95: The Update method on MeasuredPodCache currently replaces c.data wholesale, which can drop previous entries on partial BigQuery responses. Modify MeasuredPodCache.Update to lock c.mu, iterate the incoming map parameter, and upsert each entry into the existing c.data map (initializing c.data if nil) instead of assigning c.data = data; after merging, compare sizes to detect large decreases and call c.logger.Warnf (or Infof) with context when the new batch is significantly smaller than the previous cache. Keep the lock held during the merge for thread safety and update the existing log message to report the merged count (a sketch of this merge appears after this list).
- Around line 145-252: The BigQuery query in BigQueryClient.Refresh is a
placeholder referencing `pod_metrics` and columns like `pod_scaler_label`,
`cpu_utilization`, `memory_utilization`, and `container_durations` that may not
match the real ci-metrics schema; update Refresh (and the Query construction
using MeasuredPodDataRetentionDays) to either (a) use the actual table/column
names or make the table/column names configurable via config/env so it doesn’t
assume `pod_metrics`, (b) add schema validation/parsing guards for
`container_durations` and other columns before using them, (c) add
unit/integration tests that mock the BigQuery client
(`BigQueryClient.Query`/iterator.Read behavior) to cover Refresh’s parsing and
error paths, and (d) update the pod-scaler README to document the BigQuery
integration and configuration or gate this code behind a feature flag so
production isn’t using an unvalidated query.
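
Picking up the MeasuredPodCache.Update merge suggested in the first item above, a rough sketch; the field names (data, mu) are assumptions since the real struct layout is not shown, and a plain Printf stands in for the logger:

package main

import (
	"fmt"
	"sync"
)

type measuredData struct{ MaxCPU, MaxMemory float64 }

type MeasuredPodCache struct {
	mu   sync.RWMutex
	data map[string]measuredData
}

// Update upserts the incoming rows into the existing cache instead of replacing
// it wholesale, so a partial BigQuery response cannot drop previously seen keys.
func (c *MeasuredPodCache) Update(incoming map[string]measuredData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.data == nil {
		c.data = map[string]measuredData{}
	}
	before := len(c.data)
	for key, value := range incoming {
		c.data[key] = value
	}
	if len(incoming) < before/2 {
		// a refresh much smaller than what we already hold is worth flagging
		fmt.Printf("warning: refresh returned %d rows, cache now holds %d\n", len(incoming), len(c.data))
	}
}

func main() {
	c := &MeasuredPodCache{}
	c.Update(map[string]measuredData{"step-a": {MaxCPU: 1.5}})
	c.Update(map[string]measuredData{"step-b": {MaxMemory: 2e9}})
	fmt.Println(len(c.data)) // 2: earlier entries survive later refreshes
}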
🧹 Nitpick comments (6)
cmd/pod-scaler/resources.go (3)

60-66: Environment variable read at init time may cause issues with testing.

minSamplesForRecommendation is initialized once when the package loads. This means tests cannot easily override this value without setting the env var before the package initializes. Consider making this a parameter or using a lazy initialization pattern if testability is a concern.
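
One testability-friendly shape for that suggestion, sketched with sync.Once so the value resolves on first use instead of at package load (re-reading the env var on every call would be simpler still); the names here are illustrative:

package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

var (
	minSamplesOnce sync.Once
	minSamples     int
)

// minSamplesForRecommendation resolves the threshold lazily, so tests that set
// the environment variable before first use see the override rather than a
// value frozen at package initialization.
func minSamplesForRecommendation() int {
	minSamplesOnce.Do(func() {
		minSamples = 3 // default
		if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
			if parsed, err := strconv.Atoi(val); err == nil && parsed > 0 {
				minSamples = parsed
			}
		}
	})
	return minSamples
}

func main() {
	fmt.Println(minSamplesForRecommendation())
}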


101-102: Potential time drift between cutoff and weighting calculations.

cutoffTime and now are computed at different instants. While the drift is negligible in practice, using a single now value for both would be more consistent.

♻️ Suggested fix
-	cutoffTime := time.Now().Add(-resourceRecommendationWindow)
-	now := time.Now()
+	now := time.Now()
+	cutoffTime := now.Add(-resourceRecommendationWindow)

173-180: Missing validation for negative values in environment variable.

The check minSamples > 0 handles zero and negative values, but the function silently falls back to the default when parsing fails. Consider logging a warning when the env var is set but invalid.

♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default 3", val)
 		}
 	}
 	return 3
 }
cmd/pod-scaler/measured.go (3)

128-138: Missing error propagation from initial refresh.

The initial refresh error on startup is only logged as a warning (line 130), but execution continues. This is acceptable for resilience. However, the daily refresh scheduled via TickLiteral will silently fail without any retry mechanism. Consider implementing exponential backoff for transient failures.
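
A sketch of the retry shape that suggestion points at, with a made-up refresh callback and arbitrary bounds; the daily TickLiteral wiring is not shown:

package main

import (
	"errors"
	"fmt"
	"time"
)

// refreshWithBackoff retries a failing refresh with exponential backoff so a
// transient BigQuery outage does not leave the cache stale until the next tick.
func refreshWithBackoff(refresh func() error, attempts int, base time.Duration) error {
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = refresh(); err == nil {
			return nil
		}
		fmt.Printf("refresh attempt %d failed: %v; retrying in %s\n", i+1, err, delay)
		time.Sleep(delay)
		delay *= 2 // double the wait between attempts
	}
	return err
}

func main() {
	calls := 0
	err := refreshWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errors.New("transient error")
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(err) // <nil> once the third attempt succeeds
}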


201-207: Row read errors are logged but iteration continues.

When it.Next(&row) fails with an error other than iterator.Done, the error is logged and iteration continues. This is reasonable for resilience, but consider tracking error counts and failing the refresh if too many rows fail to parse.


410-413: Fallback to first container may not reflect actual longest-running workload.

When longestContainer is nil due to missing duration data, the code falls back to the first container. This fallback may apply resource increases to a sidecar if it happens to be listed first. Consider logging this as a warning or using container naming heuristics (e.g., prefer "test" or non-sidecar names).

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3adf9b7 and b0b3eec.

📒 Files selected for processing (9)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • .gitignore
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/admission_test.go
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
cmd/pod-scaler/measured.go (1)
pkg/api/types.go (2)
  • Metadata (161-166)
  • ResourceList (261-261)
🔇 Additional comments (11)
cmd/pod-scaler/resources.go (2)

107-134: LGTM on recency filtering and weighting logic.

The time-windowed filtering with progressive weighting (3x/2x/1x based on age) and minimum sample enforcement are well-implemented. The skip logic for insufficient recent data prevents unreliable recommendations.


146-155: LGTM on minimum threshold enforcement.

Clamping CPU to 10 millicores and memory to 10 MiB prevents zero or near-zero recommendations that could cause scheduling or OOM issues.
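
A small sketch of that clamping using the floors stated here (10 millicores, 10 MiB); the helper names are made up, and the DecimalSI/BinarySI formats follow what the review notes elsewhere:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	minCPURequestMilli    = int64(10)               // 10m floor
	minMemoryRequestBytes = int64(10 * 1024 * 1024) // 10Mi floor
)

// clampedCPU and clampedMemory apply the floors after the quantile is computed,
// using DecimalSI for CPU and BinarySI for memory to match how the quantities
// are normally constructed.
func clampedCPU(milli int64) *resource.Quantity {
	if milli < minCPURequestMilli {
		milli = minCPURequestMilli
	}
	return resource.NewMilliQuantity(milli, resource.DecimalSI)
}

func clampedMemory(bytes int64) *resource.Quantity {
	if bytes < minMemoryRequestBytes {
		bytes = minMemoryRequestBytes
	}
	return resource.NewQuantity(bytes, resource.BinarySI)
}

func main() {
	fmt.Println(clampedCPU(2).String(), clampedMemory(1024).String()) // 10m 10Mi
}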

cmd/pod-scaler/admission.go (4)

199-206: LGTM on function rename and documentation.

The new name applyRecommendationsBasedOnRecentData better reflects the bidirectional nature of the function (can both increase and decrease resources).


224-227: Good addition: skip resources with no recommendation.

This prevents overwriting user-configured values when the pod-scaler has no data for a particular resource.


249-273: Reduction logic applies to the already-inflated value.

The 1.2x multiplier (line 231) is applied to our before the comparison. When cmp < 0, the reduction percentage is calculated against the inflated our value, not the original recommendation. This may result in slightly different reduction behavior than intended (reducing from an already-buffered value).

Verify this is the intended behavior. If reductions should be based on raw recommendations, move the inflation after the comparison logic.


265-269: Integer truncation may cause inconsistent capping.

our.Set(int64(maxAllowed)) truncates the float result. For memory values this is fine, but for CPU this loses precision since Set expects the base unit (cores), not millicores. However, since the quantity was already created as a milli-quantity or byte-quantity, this should still work correctly via the existing format.

cmd/pod-scaler/measured_test.go (3)

63-131: LGTM on TestClassifyPod.

Good coverage of three scenarios: no data (measured), recent data (normal), and nil BigQuery client (defaults to measured). The use of MetadataFor in test case 2 ensures metadata matches what the production code generates.


151-210: LGTM on TestAddPodAntiAffinity.

Tests correctly verify anti-affinity is added for measured and normal pods, and not added for unlabeled pods.


212-253: LGTM on TestMeasuredPodCache.

Tests verify Get returns false for missing data and returns correct values after Update.

cmd/pod-scaler/measured.go (2)

360-370: LGTM on anti-affinity rule merging.

The code correctly preserves existing anti-affinity rules by appending rather than overwriting.


434-448: ApplyMeasuredPodResources is not called in the admission path; no double-buffering issue.

ApplyMeasuredPodResources is not invoked anywhere in the codebase (only defined at line 376 but with zero call sites). The active function in the admission webhook is applyRecommendationsBasedOnRecentData (called at line 337 in admission.go), which applies the 1.2x buffer during pod mutation. Since only one function operates on pods during admission, there is no double-buffering concern.

Likely an incorrect or invalid review comment.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from b0b3eec to 454b866 on January 12, 2026 17:50
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.gitignore:
- Line 13: The .gitignore entry "/job-run-aggregatorpod-scaler" is concatenated
and won't ignore either binary; split it into two separate ignore patterns by
replacing that entry with "/job-run-aggregator" and "/pod-scaler" so both
binaries are matched and not accidentally committed.

In @cmd/pod-scaler/admission_test.go:
- Around line 734-735: The test case name "ours are smaller - should reduce
resources based on recent usage" is duplicated; update the name string in the
test case struct (the `name:` field inside the test table entry that begins with
that text) to a unique, descriptive value (e.g., include the specific scenario
or line reference such as "ours are smaller - should reduce resources based on
recent usage (recent-window case)") so test output can unambiguously identify
which case failed. Ensure only the `name` field is changed and keep the rest of
the test case data and assertions intact.

In @cmd/pod-scaler/measured.go:
- Around line 148-152: The TODO claiming the BigQuery string is a "placeholder
query" is outdated; update the comment around the query := bq.client.Query(...)
to either remove the TODO entirely or replace it with a clarifying note that the
query is intentionally implemented and depends on the external pod_metrics
BigQuery table schema, and mention the corresponding row struct (the struct with
bigquery field tags used to map results) so readers know the schema dependency;
no code changes required beyond editing the comment near query and the row
struct remark.
🧹 Nitpick comments (4)
cmd/pod-scaler/resources.go (1)

173-180: Consider logging a warning on invalid environment variable values.

Currently, if POD_SCALER_MIN_SAMPLES contains an invalid value (non-numeric or <= 0), the function silently falls back to 3. This could mask configuration errors in production. A debug/warning log would help operators identify misconfigurations.

♻️ Optional: Add logging for invalid values
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.WithField("POD_SCALER_MIN_SAMPLES", val).Warn("Invalid value for POD_SCALER_MIN_SAMPLES, using default of 3")
 		}
 	}
 	return 3
 }
cmd/pod-scaler/resources_test.go (1)

90-93: Consider test isolation from environment.

The tests implicitly rely on minSamplesForRecommendation being 3 (the default). If POD_SCALER_MIN_SAMPLES is set in the test environment, behavior could differ. Since minSamplesForRecommendation is initialized at package load, consider using t.Setenv("POD_SCALER_MIN_SAMPLES", "3") at the start of each test to ensure isolation, or document this dependency.
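
For example, a test could pin the threshold with t.Setenv; note this only helps if the value is read after the test starts (e.g., via a lazy or per-call lookup), since t.Setenv cannot rewind package-load initialization:

package main

import (
	"os"
	"strconv"
	"testing"
)

// currentMinSamples re-reads the env var so the override below is visible.
func currentMinSamples() int {
	if v, err := strconv.Atoi(os.Getenv("POD_SCALER_MIN_SAMPLES")); err == nil && v > 0 {
		return v
	}
	return 3
}

func TestRecommendationUsesPinnedSampleThreshold(t *testing.T) {
	// t.Setenv restores the previous value when the test finishes and also
	// marks the test as incompatible with t.Parallel.
	t.Setenv("POD_SCALER_MIN_SAMPLES", "3")
	if got := currentMinSamples(); got != 3 {
		t.Fatalf("expected 3, got %d", got)
	}
}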

cmd/pod-scaler/admission_test.go (1)

572-572: Consider renaming test function to match the renamed function.

The test function is still named TestUseOursIfLarger but it now tests applyRecommendationsBasedOnRecentData. For consistency and discoverability, consider renaming to TestApplyRecommendationsBasedOnRecentData.

Also applies to: 774-774

cmd/pod-scaler/admission.go (1)

199-206: Doc comment references non-existent test function name.

The comment mentions TestApplyRecommendationsBasedOnRecentData_ReducesResources but the actual test is TestUseOursIfLarger. Update the comment to match the actual test name.

Suggested fix
 // applyRecommendationsBasedOnRecentData applies resource recommendations based on recent usage data
 // (see resourceRecommendationWindow). If they used more, we increase resources. If they used less,
 // we decrease them (pod-scaler is always authoritative).
 //
-// TestApplyRecommendationsBasedOnRecentData_ReducesResources is tested in admission_test.go
-// as part of TestUseOursIfLarger. The reduction functionality is verified there with proper
-// test cases that handle ResourceQuantity comparison correctly.
+// The reduction functionality is tested in admission_test.go as part of TestUseOursIfLarger
+// with test cases that verify ResourceQuantity comparison and the 25% reduction cap.
 func applyRecommendationsBasedOnRecentData(allOfOurs, allOfTheirs *corev1.ResourceRequirements, workloadName, workloadType string, reporter results.PodScalerReporter, logger *logrus.Entry) {
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b0b3eec and 454b866.

📒 Files selected for processing (11)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/pod-scaler/run/producer.go
  • cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧬 Code graph analysis (5)
cmd/pod-scaler/resources_test.go (2)
pkg/pod-scaler/types.go (3)
  • FullMetadata (374-385)
  • CachedQuery (53-68)
  • FingerprintTime (71-76)
pkg/api/types.go (2)
  • Metadata (161-166)
  • ResourceRequirements (246-253)
cmd/pod-scaler/admission_test.go (2)
pkg/api/types.go (2)
  • ResourceList (261-261)
  • ResourceRequirements (246-253)
pkg/testhelper/accessory.go (1)
  • T (51-57)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (19)
test/e2e/pod-scaler/run/consumer.go (1)

71-71: LGTM!

Setting POD_SCALER_MIN_SAMPLES=1 for e2e tests is appropriate since test data typically has fewer samples than production. This correctly aligns with the new environment-configurable threshold introduced in resources.go.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)

15-18: LGTM!

Test fixture updates correctly reflect the new resource recommendation values with the updated structured representation including string annotations.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)

16-24: LGTM!

Fixture values correctly updated to match the new resource recommendation behavior with recency-weighted calculations.

cmd/pod-scaler/resources.go (3)

42-66: LGTM! Well-documented constants and configurable threshold.

The 3-week recommendation window, minimum resource thresholds, and environment-configurable sample count are well-designed. The inline documentation clearly explains the rationale for each constant.


107-134: LGTM! Solid recency-based filtering with tiered weighting.

The filtering logic correctly excludes data outside the 3-week window and applies sensible weights (3x/2x/1x) based on data age. The minimum sample threshold guard prevents recommendations based on insufficient data.


146-155: LGTM! Good defensive clamping.

Applying minimum thresholds after quantile calculation prevents recommending zero or extremely small values that could cause scheduling failures.

cmd/pod-scaler/resources_test.go (2)

18-127: LGTM! Good test coverage for recency-based filtering.

The test correctly validates that old data (4 weeks ago) is excluded while recent data (1 week ago) is used for recommendations. Providing exactly 3 recent samples meets the default minSamplesForRecommendation threshold.


129-191: LGTM! Good edge case test.

Correctly validates that no recommendation is created when all data falls outside the 3-week window.

cmd/pod-scaler/admission_test.go (2)

664-699: LGTM!

The test case correctly verifies the 25% reduction cap behavior. The inline comments accurately explain the math, and the expected values are correct.


722-732: LGTM!

The mixed scenario test correctly verifies that larger values get the 1.2x increase while smaller values get the 25% reduction cap applied.

cmd/pod-scaler/admission.go (2)

249-273: LGTM!

The reduction logic correctly implements gradual resource reduction with safety limits:

  • Skips trivial changes (<5%)
  • Caps aggressive reductions at 25% per cycle
  • Properly handles the zero case

The 25% cap per cycle is a sensible safeguard against sudden resource starvation.


224-227: LGTM!

Good defensive check to skip resources where no recommendation exists, preventing accidental clearing of configured values.

cmd/pod-scaler/measured.go (7)

65-95: LGTM!

The cache implementation is thread-safe with appropriate read/write lock usage. The atomic map replacement in Update is clean and avoids partial update issues.


128-131: Verify: Startup failure leaves cache empty, causing all pods to be measured.

When the initial BigQuery refresh fails, the service continues with an empty cache. Per line 294-296, this causes all pods to default to "measured" classification. While this is a safe default (pods get isolated), it may cause scheduling pressure if BigQuery is temporarily unavailable. Consider whether this behavior is intentional.


201-207: LGTM!

Continuing on individual row errors is resilient and appropriate. Consider adding a counter for failed rows if debugging becomes needed in the future.


308-371: LGTM!

The anti-affinity logic correctly:

  • Isolates measured pods from all other pod-scaler workloads
  • Allows normal pods to share nodes while avoiding measured pods
  • Preserves existing affinity rules by merging
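
As a rough illustration of that merge-don't-overwrite pattern (the label key, values, and topology key below are assumptions based on this PR's description, not the actual constants):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addAntiAffinity appends a required anti-affinity term keeping this pod away
// from pods carrying the given label values, preserving any rules already set.
func addAntiAffinity(pod *corev1.Pod, labelKey string, avoidValues []string) {
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}
	if pod.Spec.Affinity.PodAntiAffinity == nil {
		pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
	}
	term := corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      labelKey,
				Operator: metav1.LabelSelectorOpIn,
				Values:   avoidValues,
			}},
		},
		TopologyKey: "kubernetes.io/hostname", // keep them off the same node
	}
	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(
		pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution, term)
}

func main() {
	pod := &corev1.Pod{}
	// A measured pod avoids every other pod-scaler workload so its usage is not
	// skewed by node contention.
	addAntiAffinity(pod, "pod-scaler", []string{"measured", "normal"})
	fmt.Println(len(pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution))
}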

376-386: Clarify: This applies resources to "measured" labeled pods, but those are pods being measured, not pods with prior data.

The function applies resource recommendations to pods labeled PodScalerLabelValueMeasured. However, per the classification logic in ClassifyPod (lines 284-297), pods are labeled "measured" when they lack recent measured data and need to be measured. This seems contradictory - you're applying measured data to pods that don't have measured data yet.

Should this be checking for PodScalerLabelValueNormal instead (pods that have recent measured data)?


410-418: LGTM!

The fallback to the first container when duration data is unavailable is reasonable. The subsequent check for measured data (lines 423-427) handles the case where no data exists.


274-306: LGTM!

The classification logic correctly marks pods as "measured" when any container needs fresh measurement data, and defaults safely when BigQuery is unavailable.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 454b866 to 7ef32ad on January 12, 2026 18:03
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @.gitignore:
- Line 13: The .gitignore line 'job-run-aggregatorpod-scaler' is a concatenated entry from a bad merge; replace it with two separate ignore entries by splitting the concatenated token into the two intended paths, '/job-run-aggregator' and '/pod-scaler' (verify the binary names against the repo), so each ignore pattern is on its own line.

In @cmd/pod-scaler/admission_test.go:
- Around line 734-770: The two table-driven test entries share the identical
name "ours are smaller - should reduce resources based on recent usage", causing
one to be shadowed; locate the second test case in admission_test.go that
defines the `ours`, `theirs`, and `expected` ResourceRequirements (the block
with Limits/Requests showing CPU 50->150 and memory 1e9->2.25e10 etc.) and give
its `name` field a unique string (e.g. append a qualifier like "(memory-heavy)"
or "(reduction cap)"), so each test row is distinct in output.
🧹 Nitpick comments (1)
cmd/pod-scaler/resources.go (1)

169-180: Consider logging when env var override is applied.

The implementation correctly handles invalid values by falling back to the default. However, it might be helpful to log when the override is active, especially since this affects production behavior.

Optional: Add logging for env var override
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
+			logrus.Infof("Using POD_SCALER_MIN_SAMPLES override: %d", minSamples)
 			return minSamples
 		}
+		logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value: %q, using default", val)
 	}
 	return 3
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 454b866 and 7ef32ad.

📒 Files selected for processing (9)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/resources.go
  • test/e2e/pod-scaler/run/producer.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
🧬 Code graph analysis (3)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
cmd/pod-scaler/admission.go (2)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
cmd/pod-scaler/admission_test.go (1)
pkg/api/types.go (2)
  • ResourceList (261-261)
  • ResourceRequirements (246-253)
🔇 Additional comments (11)
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)

1-90: Test fixture updates align with the new resource reduction logic.

The fixture changes correctly reflect the new applyRecommendationsBasedOnRecentData behavior including:

  • CPU limits being removed (per reconcileLimits)
  • Resource values adjusted according to the 25% max reduction cap
  • Minimum threshold enforcement
test/e2e/pod-scaler/run/consumer.go (1)

71-71: LGTM!

Setting POD_SCALER_MIN_SAMPLES=1 for e2e tests appropriately bypasses the production minimum sample requirement, allowing tests to run with limited test data.

test/e2e/pod-scaler/run/producer.go (1)

31-31: LGTM!

Consistent with the consumer setup, correctly setting the minimum samples override for e2e tests while preserving the existing environment.

cmd/pod-scaler/admission_test.go (1)

772-779: LGTM!

Test cases comprehensively cover the new reduction logic including:

  • 25% max reduction cap per cycle
  • The 1.2x multiplier applied to recommendations
  • Both increase and decrease scenarios

The function rename from useOursIfLarger to applyRecommendationsBasedOnRecentData correctly reflects the expanded bidirectional behavior.

cmd/pod-scaler/admission.go (4)

199-206: LGTM!

Good function rename reflecting the new bidirectional behavior. The doc comments clearly explain that the scaler is now authoritative for both increases and decreases.


224-227: LGTM!

The zero-value check correctly prevents applying empty recommendations, which could happen when there's no recent data for this resource type.


249-273: Sound gradual reduction implementation.

The logic correctly implements:

  • 5% minimum threshold to avoid micro-adjustments
  • 25% maximum reduction cap per cycle for safety
  • Proper handling of zero-value edge case

One consideration: using AsApproximateFloat64() is acceptable here since you're computing percentages for thresholds, not exact equality comparisons. The potential precision loss is negligible for these safety checks.


337-337: LGTM!

Call site correctly updated to use the renamed function.

cmd/pod-scaler/resources.go (3)

45-66: Well-designed constants and configurable minimum samples.

The 3-week window, minimum thresholds, and configurable sample requirement provide good defaults with appropriate flexibility for testing. The documentation comments clearly explain the rationale.


107-133: Effective time-windowing with recency weighting.

The weighting scheme (3x for past week, 2x for 1-2 weeks, 1x for 2-3 weeks) appropriately emphasizes recent data while still considering slightly older patterns. The minimum sample requirement prevents unreliable recommendations from sparse data.

One note: the Merge operation with weighted repetition is a valid approach for histogram aggregation - each Merge adds the histogram's counts, effectively giving more weight to recent samples.


146-155: LGTM!

Minimum threshold enforcement correctly prevents dangerously low recommendations. Applied after quantile calculation ensures the safety floor is always respected.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 7ef32ad to 445193c on January 12, 2026 18:15
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 12, 2026

@deepsm007: This pull request references DPTP-2938 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Add flags to enable/disable authoritative mode for CPU and memory resource requests separately.
The pod-scaler can now decrease resource requests when authoritative mode is enabled (default: true for both), allowing gradual resource reduction based on recent usage data with safety limits (max 25% per cycle).

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/pod-scaler/admission_test.go (1)

664-699: Duplicate test case names will cause confusion.

Both test cases at lines 664-699 and 734-770 have the identical name "ours are smaller - should reduce resources based on recent usage". This makes test output ambiguous and hinders debugging. Rename one to distinguish the scenarios (e.g., include the specific values being tested).

Also applies to: 734-770

🧹 Nitpick comments (2)
cmd/pod-scaler/resources.go (2)

60-66: Package-level initialization may cause test isolation issues.

minSamplesForRecommendation is initialized once at package load time via getMinSamplesForRecommendation(). If tests need different values, they cannot easily override this after import. Consider making it a function call at usage sites or providing a setter for test configurability.
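
One possible shape for such an override point, sketched with hypothetical helper and variable names rather than the PR's exact code:

package main

import "testing"

// In the current code this value is captured from the environment once at
// package load; a small helper like this would let tests override it safely.
var minSamplesForRecommendation = 3

func setMinSamplesForTesting(t *testing.T, n int) {
	t.Helper()
	orig := minSamplesForRecommendation
	minSamplesForRecommendation = n
	t.Cleanup(func() { minSamplesForRecommendation = orig })
}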


173-180: Error handling for invalid env var values is silent.

If POD_SCALER_MIN_SAMPLES is set to a non-integer or negative value, the function silently falls back to 3. Consider logging a warning when the env var is present but invalid to aid debugging configuration issues.

♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
 		}
+		logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default of 3", val)
 	}
 	return 3
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7ef32ad and 445193c.

📒 Files selected for processing (10)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • cmd/pod-scaler/resources_test.go
  • .gitignore
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
🧬 Code graph analysis (3)
cmd/pod-scaler/admission_test.go (2)
pkg/api/types.go (2)
  • ResourceList (261-261)
  • ResourceRequirements (246-253)
pkg/testhelper/accessory.go (1)
  • T (51-57)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
🔇 Additional comments (9)
cmd/pod-scaler/admission.go (3)

36-47: LGTM on admission wiring updates.

The new authoritative flags are properly threaded through the admit function and podMutator struct initialization.


265-288: Gradual reduction logic is sound.

The 25% max reduction per cycle and 5% minimum difference threshold provide safe guardrails. The logic correctly handles the edge case where theirValue is zero.


343-362: LGTM on mutatePodResources signature update.

The function properly propagates the new authoritative flags to the resource application logic.

cmd/pod-scaler/main.go (2)

94-95: Verify that defaulting authoritative mode to true is intentional.

Both --authoritative-cpu and --authoritative-memory default to true, meaning resource reductions will be applied immediately upon deployment. Confirm this is the desired rollout strategy, as it changes behavior for all existing workloads.


275-276: LGTM on admit call site update.

The new flags are correctly passed through to the admission handler.

cmd/pod-scaler/admission_test.go (1)

557-557: LGTM on mutatePodResources test call update.

The test correctly passes the new authoritative flags.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)

1-90: LGTM on fixture update.

The updated expected values correctly reflect the new resource recommendation logic with the 120% experimental bump and minimum thresholds.

cmd/pod-scaler/resources.go (2)

101-134: Time-based filtering and weighting logic is well-implemented.

The recency-based weighting (3x for past week, 2x for 1-2 weeks, 1x for 2-3 weeks) appropriately prioritizes recent usage patterns. The early exit when insufficient samples exist prevents unreliable recommendations.


146-155: LGTM on minimum threshold enforcement.

Clamping CPU to 10 millicores and memory to 10 MiB prevents scheduling issues from extremely small recommendations.

@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 445193c to 059588d on January 12, 2026 18:30
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
cmd/pod-scaler/resources_test.go (1)

119-126: Consider tightening the expected value range.

The assertion bounds (70000-100000 millicores) are quite wide for histogram values of 50, 60, and 75. At the 0.8 quantile, the expected value should be closer to 70-75 cores. The current upper bound of 100000 may not catch regressions where old data inadvertently influences the result.

Additionally, the assertion at line 124 (> 120000) is redundant: any value above it already fails the bounds check at line 119 (> 100000).

♻️ Suggested tighter bounds
-	if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
-		t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
-	}
-
-	// Verify it's not based on the old high value (100.0)
-	if cpuRequestMilli > 120000 {
-		t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
-	}
+	// With values 50, 60, 75 at 0.8 quantile with weighting, expect approximately 70-80 cores
+	if cpuRequestMilli < 60000 || cpuRequestMilli > 85000 {
+		t.Errorf("Expected CPU request to be based on recent data (around 60000-85000 millicores), got %d millicores", cpuRequestMilli)
+	}
cmd/pod-scaler/resources.go (1)

173-180: Consider logging a warning for invalid POD_SCALER_MIN_SAMPLES values.

When POD_SCALER_MIN_SAMPLES is set to an invalid value (non-integer or ≤0), the function silently falls back to the default. This could mask configuration errors.

♻️ Suggested improvement
 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
 		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
 			return minSamples
+		} else {
+			logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q, using default of 3", val)
 		}
 	}
 	return 3
 }
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 445193c and 059588d.

📒 Files selected for processing (10)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/main.go
  • test/e2e/pod-scaler/run/producer.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
🧬 Code graph analysis (1)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
🔇 Additional comments (19)
test/e2e/pod-scaler/run/consumer.go (1)

71-71: LGTM!

The addition of POD_SCALER_MIN_SAMPLES=1 is appropriate for e2e testing, allowing the pod-scaler to make recommendations with minimal sample data rather than requiring the default threshold.
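
For context, the usual way such an override reaches the child process, sketched with a hypothetical invocation (the real consumer.go wires more flags and captures output):

package main

import (
	"os"
	"os/exec"
)

func main() {
	// Hypothetical arguments; the environment handling is the interesting part.
	cmd := exec.Command("pod-scaler", "--loglevel=debug")
	// Inherit the parent environment, then relax the sample threshold for e2e.
	cmd.Env = append(os.Environ(), "POD_SCALER_MIN_SAMPLES=1")
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}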

test/e2e/pod-scaler/run/producer.go (1)

8-8: LGTM!

The environment variable setup is consistent with the consumer configuration and correctly inherits the existing environment before appending the test-specific setting.

Also applies to: 31-31

cmd/pod-scaler/resources_test.go (2)

18-127: Test coverage for time-window filtering looks good.

The test effectively validates that old data (4 weeks ago) is filtered out while recent data (1 week ago) is used for recommendations.


129-191: LGTM!

Clear and effective test validating that no recommendations are produced when all data is outside the 3-week time window.

cmd/pod-scaler/admission_test.go (4)

557-557: LGTM!

The call site correctly passes the new authoritative flags (true, true) to enable the gradual reduction behavior in tests.


664-699: Test case accurately reflects the 25% max reduction per cycle logic.

The expected values correctly demonstrate:

  • CPU: 200 * 0.75 = 150 (limits), 100 * 0.75 = 75 (requests)
  • Memory: 3e10 * 0.75 = 2.25e10 (limits), 2e10 * 0.75 = 1.5e10 (requests)

This validates the safety mechanism preventing drastic reductions in a single cycle.


734-770: New test case provides good coverage for medium-value scenarios.

Correctly validates that even moderate differences between recommended and current resources are capped at the 25% reduction limit when the calculated reduction would exceed that threshold.


772-779: LGTM!

Function rename from useOursIfLarger to applyRecommendationsBasedOnRecentData better reflects the new bidirectional behavior (can both increase and decrease resources).

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)

1-90: Fixture correctly updated to reflect new resource recommendation behavior.

The changes demonstrate:

  • Gradual CPU reductions (respecting the 25% cap)
  • Memory normalization consistent with minimum thresholds
  • Proper format string representations for resources

These expectations align with the applyRecommendationsBasedOnRecentData implementation and authoritative mode flags.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)

1-68: LGTM!

The test fixture updates correctly reflect the new resource recommendation logic with typed resource.int64Amount values and their string representations. The adjusted CPU and memory values align with the implementation changes in the admission and resources modules.

cmd/pod-scaler/main.go (3)

70-72: LGTM!

The new authoritativeCPU and authoritativeMemory fields are properly added to consumerOptions with clear naming that reflects their purpose.


94-95: LGTM!

The CLI flags are well-defined with sensible defaults (true) and clear descriptions explaining that authoritative mode allows decreasing resources based on recent usage. The default-on behavior aligns with the PR objective to enable gradual reductions.


275-276: LGTM!

The new flags are correctly passed through to the admit() function.

cmd/pod-scaler/resources.go (2)

45-66: LGTM!

The constants for the 3-week recommendation window and minimum resource floors are well-documented and provide sensible safeguards. The environment variable override for minSamplesForRecommendation is useful for testing flexibility.


107-134: LGTM!

The time-based filtering and weighted merging logic is well-structured. The weighting scheme (3x for <7 days, 2x for 7-14 days, 1x for 14-21 days) appropriately prioritizes recent data while still considering older samples within the window. The early-exit conditions for insufficient data prevent unreliable recommendations.

cmd/pod-scaler/admission.go (4)

36-36: LGTM!

The function signature is correctly extended to include the new authoritative mode flags.


63-65: LGTM!

The podMutator struct properly includes the new fields for per-resource authoritative control.


201-294: Solid implementation of gradual resource reduction with appropriate safeguards.

The logic correctly implements:

  • Per-resource authoritative mode checks
  • Noise reduction by skipping <5% adjustments
  • Safety limit capping reductions at 25% per cycle
  • Proper handling of zero-value edge cases

The 20% increase factor (line 232-233) is noted as experimental via the existing TODO comment.


343-362: LGTM!

The mutatePodResources function correctly propagates the authoritative flags through to the recommendation application logic.

@jupierce
Contributor

/hold
This defaults to "on" right now for authoritativeCPU. We know there is a significant issue here until measurements do not pods running on CPU pressured worker nodes.

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jan 12, 2026
@deepsm007 deepsm007 force-pushed the pod-scaler-improvements branch from 059588d to 7607680 on January 12, 2026 19:19
@deepsm007
Contributor Author

/hold This defaults to "on" right now for authoritativeCPU. We know there is a significant issue here until measurements exclude pods running on CPU-pressured worker nodes.

Updated the default to false ("off").

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/pod-scaler/resources.go (1)

98-159: Avoid potential panic: guard missing/nil histograms before calling .Histogram() and merging.
data.Data[fingerprintTime.Fingerprint] is assumed present; if it’s missing (or nil), this will crash the scaler.

Proposed fix
 func (s *resourceServer) digestData(data *podscaler.CachedQuery, quantile float64, request corev1.ResourceName, quantity toQuantity) {
 	logger := s.logger.WithField("resource", request)
 	logger.Debugf("Digesting %d identifiers.", len(data.DataByMetaData))
-	cutoffTime := time.Now().Add(-resourceRecommendationWindow)
-	now := time.Now()
+	now := time.Now()
+	cutoffTime := now.Add(-resourceRecommendationWindow)
 	for meta, fingerprintTimes := range data.DataByMetaData {
 		overall := circonusllhist.New()
 		metaLogger := logger.WithField("meta", meta)
 		metaLogger.Tracef("digesting %d fingerprints", len(fingerprintTimes))
 		recentCount := 0
 		for _, fingerprintTime := range fingerprintTimes {
 			if fingerprintTime.Added.After(cutoffTime) {
-				hist := data.Data[fingerprintTime.Fingerprint].Histogram()
+				h, ok := data.Data[fingerprintTime.Fingerprint]
+				if !ok || h == nil {
+					metaLogger.Warnf("missing histogram for fingerprint %v, skipping", fingerprintTime.Fingerprint)
+					continue
+				}
+				hist := h.Histogram()
 				// Weight more recent data more heavily to make the scaler more sensitive to recent runs.
 				// Past week: 3x weight, 1-2 weeks ago: 2x weight, 2-3 weeks ago: 1x weight.
 				age := now.Sub(fingerprintTime.Added)
 				weight := 1
 				if age < 7*24*time.Hour {
 					weight = 3
 				} else if age < 14*24*time.Hour {
 					weight = 2
 				}
 				for i := 0; i < weight; i++ {
 					overall.Merge(hist)
 				}
 				recentCount++
 			}
 		}
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/resources_test.go:
- Around line 18-127: The test depends on the POD_SCALER_MIN_SAMPLES env and
current window, so make it deterministic by setting the server's minSamples and
resourceRecommendationWindow directly before calling digestData: set
resourceServer.minSamples = 3 and resourceServer.resourceRecommendationWindow =
14*24*time.Hour (so fourWeeksAgo is excluded and oneWeekAgo included), then call
server.digestData(...) and assert via server.recommendedRequestFor(meta) that
the recommendation reflects only the recent fingerprints; this ensures old data
is excluded regardless of environment.
🧹 Nitpick comments (4)
cmd/pod-scaler/resources.go (1)

60-66: Config via env is fine for e2e, but make it test/ops-friendly: log invalid values and allow runtime override.
Right now invalid POD_SCALER_MIN_SAMPLES silently falls back to 3, and the value is captured at init (harder to vary in unit tests / long-running processes).

Also applies to: 170-180

cmd/pod-scaler/resources_test.go (1)

129-191: Also de-flake this test: pin minSamplesForRecommendation so ambient env can’t change behavior.
Right now, an externally-set POD_SCALER_MIN_SAMPLES could cause this test to fail/behave unexpectedly because the package-level var is initialized from env.

cmd/pod-scaler/admission_test.go (2)

554-568: Add coverage for authoritative=false (especially CPU) to match real default behavior

mutatePodResources(..., true, true, ...) hard-codes authoritative mode on in all cases here, so the test suite won’t catch regressions where reductions must be skipped when authoritativeCPU/authoritativeMemory are disabled. Consider adding at least one case for each flag being false (CPU-only false is particularly important given the CLI default).
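
To illustrate the case being requested, a self-contained sketch against a stand-in gate function; a real test would exercise mutatePodResources or applyRecommendationsBasedOnRecentData with authoritativeCPU=false rather than the hypothetical maybeReduce below.

package main

import (
	"testing"

	"k8s.io/apimachinery/pkg/api/resource"
)

// maybeReduce stands in for the per-resource gate in the webhook: when the
// resource is not authoritative, a smaller recommendation must be ignored.
func maybeReduce(authoritative bool, configured, recommended resource.Quantity) resource.Quantity {
	if !authoritative && recommended.Cmp(configured) < 0 {
		return configured
	}
	return recommended
}

func TestNonAuthoritativeCPUKeepsConfiguredRequest(t *testing.T) {
	configured := resource.MustParse("200m")
	recommended := resource.MustParse("100m")
	if got := maybeReduce(false, configured, recommended); got.Cmp(configured) != 0 {
		t.Errorf("expected configured 200m to be kept, got %s", got.String())
	}
}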


664-779: Add a millicore CPU test case to guard Quantity rounding/truncation

The new reduction scenarios are good, but they all use whole-number CPU quantities. Please add a case where CPU is in millicores (e.g., configured 500m, recommended 200m) so the tests fail if the implementation accidentally truncates fractional cores during the 1.2x headroom or 25% reduction clamp.

Also applies to: 857-865
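
A self-contained sketch of the arithmetic such a millicore case should pin down; the values are chosen only for illustration and the snippet does not call the PR's code.

package main

import (
	"fmt"
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	configured := resource.MustParse("666m")
	recommended := resource.MustParse("200m")

	// Requested cut: 1 - 200/666 ≈ 70%, so clamp to the 25% cap: 666m * 0.75 = 499.5m.
	capped := float64(configured.MilliValue()) * 0.75
	// Truncation gives 499m (a 25.08% cut, past the cap); rounding up keeps the guarantee.
	truncated := resource.NewMilliQuantity(int64(capped), resource.DecimalSI)
	roundedUp := resource.NewMilliQuantity(int64(math.Ceil(capped)), resource.DecimalSI)

	fmt.Println(recommended.String(), truncated.String(), roundedUp.String()) // 200m 499m 500m
}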

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 059588d and 7607680.

📒 Files selected for processing (10)
  • .gitignore
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e/pod-scaler/run/producer.go
  • .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/resources_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • cmd/pod-scaler/resources.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
🔇 Additional comments (7)
test/e2e/pod-scaler/run/consumer.go (1)

66-72: Good: e2e explicitly sets POD_SCALER_MIN_SAMPLES=1 to avoid window/sample-count coupling.
This should make e2e less brittle given the 3-week filtering + min-sample guard.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)

13-19: Sanity-check fixture expectations against the “max 25% reduction per cycle” rule (rounding can break caps).
Given these fixtures now encode new steady-state requests, please ensure the underlying capping logic can’t be violated by quantity rounding (e.g., whole cores / Gi rounding).

Also applies to: 31-37, 50-55

cmd/pod-scaler/main.go (2)

259-276: Good: flags are correctly plumbed into admission path
mainAdmission passes both opts.authoritativeCPU and opts.authoritativeMemory into admit(...), keeping wiring consistent end-to-end.


60-97: Help text adequately documents the authoritative-cpu default; clarify authoritative-memory default instead

The --authoritative-cpu=false default is documented in the code via help text explaining that it defaults to false because CPU measurements are affected by node contention. However, the asymmetry is worth addressing: --authoritative-memory defaults to true without any explanation of why. If the memory default is intentionally permissive (or has different reasoning), document that reasoning in the help text alongside the CPU explanation for consistency.
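
A sketch of what symmetric help text could look like; the field and flag names follow the PR description, but the real main.go may register its options differently.

package main

import (
	"flag"
	"fmt"
	"os"
)

type consumerOptions struct {
	authoritativeCPU    bool
	authoritativeMemory bool
}

// bindFlags documents both defaults side by side.
func bindFlags(fs *flag.FlagSet, o *consumerOptions) {
	fs.BoolVar(&o.authoritativeCPU, "authoritative-cpu", false,
		"Allow the pod-scaler to lower CPU requests. Off by default: CPU measured on CPU-pressured nodes understates real demand.")
	fs.BoolVar(&o.authoritativeMemory, "authoritative-memory", true,
		"Allow the pod-scaler to lower memory requests. On by default: memory usage is not distorted by CPU contention in the same way.")
}

func main() {
	fs := flag.NewFlagSet("pod-scaler", flag.ExitOnError)
	var o consumerOptions
	bindFlags(fs, &o)
	_ = fs.Parse(os.Args[1:])
	fmt.Printf("authoritative: cpu=%t memory=%t\n", o.authoritativeCPU, o.authoritativeMemory)
}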

cmd/pod-scaler/admission.go (3)

36-48: LGTM: authoritative flags are plumbed into the webhook handler
Signature/struct wiring looks consistent: main.go -> admit -> podMutator -> mutatePodResources.

Also applies to: 54-66, 88-104


343-362: Design fragility: limit recommendations are prevented only implicitly, not enforced explicitly

The concern about mutateResourceLimits=false being bypassed doesn't currently exist in practice because digestData only populates the Requests field of recommendations—the Limits field is initialized but never written to. Consequently, when applyRecommendationsBasedOnRecentData processes limits, the if our.IsZero() check skips them.

However, this safety relies on implicit contract enforcement in digestData. For maintainability, explicitly document that recommendations never include limits, or add an explicit guard in mutatePodResources that skips the recommendations→limits path unless mutateResourceLimits is enabled.

Likely an incorrect or invalid review comment.



Comment on lines +18 to +127
func TestDigestData_FiltersOldData(t *testing.T) {
now := time.Now()
fourWeeksAgo := now.Add(-28 * 24 * time.Hour)
oneWeekAgo := now.Add(-7 * 24 * time.Hour)

// Create test data with fingerprints at different times
oldFingerprint := model.Fingerprint(1)
recentFingerprint1 := model.Fingerprint(2)
recentFingerprint2 := model.Fingerprint(3)
recentFingerprint3 := model.Fingerprint(4)

// Create histograms with different values so we can verify which ones are used
oldHist := circonusllhist.New()
if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored
t.Fatalf("failed to record value: %v", err)
}

recentHist1 := circonusllhist.New()
if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used
t.Fatalf("failed to record value: %v", err)
}

recentHist2 := circonusllhist.New()
if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used
t.Fatalf("failed to record value: %v", err)
}

recentHist3 := circonusllhist.New()
if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used
t.Fatalf("failed to record value: %v", err)
}

meta := podscaler.FullMetadata{
Metadata: api.Metadata{
Org: "test-org",
Repo: "test-repo",
Branch: "main",
},
Container: "test-container",
}

data := &podscaler.CachedQuery{
Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{
oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist),
recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1),
recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2),
recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3),
},
DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{
meta: {
// Old data - should be filtered out
{
Fingerprint: oldFingerprint,
Added: fourWeeksAgo,
},
// Recent data - should be included (need at least 3 samples)
{
Fingerprint: recentFingerprint1,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint2,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint3,
Added: oneWeekAgo,
},
},
},
}

server := &resourceServer{
logger: logrus.WithField("test", "TestDigestData_FiltersOldData"),
byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{},
}

// The digestData function uses time.Now() internally to calculate the cutoff.
// Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function
// will correctly filter based on the current time and resourceRecommendationWindow.

// Digest the data
server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity {
return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI)
})

// Verify that the recommendation was created
recommendation, exists := server.recommendedRequestFor(meta)
if !exists {
t.Fatal("Expected recommendation to exist, but it doesn't")
}

// The recommendation should be based on recent data only (50.0, 75.0, 60.0)
// At 0.8 quantile with weighted averaging (3x weight for past week), it should be
// closer to 75.0 than 100.0. Note: digestData returns the raw quantile value;
// the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData.
cpuRequest := recommendation.Requests[corev1.ResourceCPU]
cpuRequestMilli := cpuRequest.MilliValue()

// The value should be based on recent data (around 70-100 cores worth of millicores)
// We allow some variance due to histogram quantization
if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
}

// Verify it's not based on the old high value (100.0)
if cpuRequestMilli > 120000 {
t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
}
}

⚠️ Potential issue | 🟠 Major

Tighten the test: make it independent of the POD_SCALER_MIN_SAMPLES env var and actually prove that old data is excluded.

Proposed fix
 package main

 import (
 	"testing"
 	"time"
@@
 func TestDigestData_FiltersOldData(t *testing.T) {
+	// Avoid test behavior depending on the ambient environment.
+	origMin := minSamplesForRecommendation
+	minSamplesForRecommendation = 3
+	t.Cleanup(func() { minSamplesForRecommendation = origMin })
+
 	now := time.Now()
 	fourWeeksAgo := now.Add(-28 * 24 * time.Hour)
-	oneWeekAgo := now.Add(-7 * 24 * time.Hour)
+	// Keep this comfortably within the "past week" bucket to avoid boundary effects.
+	oneWeekAgo := now.Add(-6 * 24 * time.Hour)
@@
-	oldHist := circonusllhist.New()
-	if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored
+	oldHist := circonusllhist.New()
+	if err := oldHist.RecordValue(1000.0); err != nil { // Much higher value - must be ignored
 		t.Fatalf("failed to record value: %v", err)
 	}
@@
-	// The value should be based on recent data (around 70-100 cores worth of millicores)
-	// We allow some variance due to histogram quantization
-	if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
-		t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
-	}
-
-	// Verify it's not based on the old high value (100.0)
-	if cpuRequestMilli > 120000 {
-		t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
-	}
+	// Should be driven by recent values (~75 cores => ~75000m), not the old 1000-core spike.
+	if cpuRequestMilli < 60000 || cpuRequestMilli > 120000 {
+		t.Errorf("Expected CPU request to be based on recent data (~60000-120000m), got %d millicores", cpuRequestMilli)
+	}
+	if cpuRequestMilli > 200000 {
+		t.Errorf("CPU request appears to include old data (1000.0), got %d millicores", cpuRequestMilli)
+	}
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestDigestData_FiltersOldData(t *testing.T) {
now := time.Now()
fourWeeksAgo := now.Add(-28 * 24 * time.Hour)
oneWeekAgo := now.Add(-7 * 24 * time.Hour)
// Create test data with fingerprints at different times
oldFingerprint := model.Fingerprint(1)
recentFingerprint1 := model.Fingerprint(2)
recentFingerprint2 := model.Fingerprint(3)
recentFingerprint3 := model.Fingerprint(4)
// Create histograms with different values so we can verify which ones are used
oldHist := circonusllhist.New()
if err := oldHist.RecordValue(100.0); err != nil { // High value - should be ignored
t.Fatalf("failed to record value: %v", err)
}
recentHist1 := circonusllhist.New()
if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used
t.Fatalf("failed to record value: %v", err)
}
recentHist2 := circonusllhist.New()
if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used
t.Fatalf("failed to record value: %v", err)
}
recentHist3 := circonusllhist.New()
if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used
t.Fatalf("failed to record value: %v", err)
}
meta := podscaler.FullMetadata{
Metadata: api.Metadata{
Org: "test-org",
Repo: "test-repo",
Branch: "main",
},
Container: "test-container",
}
data := &podscaler.CachedQuery{
Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{
oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist),
recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1),
recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2),
recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3),
},
DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{
meta: {
// Old data - should be filtered out
{
Fingerprint: oldFingerprint,
Added: fourWeeksAgo,
},
// Recent data - should be included (need at least 3 samples)
{
Fingerprint: recentFingerprint1,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint2,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint3,
Added: oneWeekAgo,
},
},
},
}
server := &resourceServer{
logger: logrus.WithField("test", "TestDigestData_FiltersOldData"),
byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{},
}
// The digestData function uses time.Now() internally to calculate the cutoff.
// Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function
// will correctly filter based on the current time and resourceRecommendationWindow.
// Digest the data
server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity {
return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI)
})
// Verify that the recommendation was created
recommendation, exists := server.recommendedRequestFor(meta)
if !exists {
t.Fatal("Expected recommendation to exist, but it doesn't")
}
// The recommendation should be based on recent data only (50.0, 75.0, 60.0)
// At 0.8 quantile with weighted averaging (3x weight for past week), it should be
// closer to 75.0 than 100.0. Note: digestData returns the raw quantile value;
// the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData.
cpuRequest := recommendation.Requests[corev1.ResourceCPU]
cpuRequestMilli := cpuRequest.MilliValue()
// The value should be based on recent data (around 70-100 cores worth of millicores)
// We allow some variance due to histogram quantization
if cpuRequestMilli < 70000 || cpuRequestMilli > 100000 {
t.Errorf("Expected CPU request to be based on recent data (around 70000-100000 millicores, i.e., 70-100 cores), got %d millicores", cpuRequestMilli)
}
// Verify it's not based on the old high value (100.0)
if cpuRequestMilli > 120000 {
t.Errorf("CPU request appears to be based on old data (100.0), got %d millicores", cpuRequestMilli)
}
}
func TestDigestData_FiltersOldData(t *testing.T) {
// Avoid test behavior depending on the ambient environment.
origMin := minSamplesForRecommendation
minSamplesForRecommendation = 3
t.Cleanup(func() { minSamplesForRecommendation = origMin })
now := time.Now()
fourWeeksAgo := now.Add(-28 * 24 * time.Hour)
// Keep this comfortably within the "past week" bucket to avoid boundary effects.
oneWeekAgo := now.Add(-6 * 24 * time.Hour)
// Create test data with fingerprints at different times
oldFingerprint := model.Fingerprint(1)
recentFingerprint1 := model.Fingerprint(2)
recentFingerprint2 := model.Fingerprint(3)
recentFingerprint3 := model.Fingerprint(4)
// Create histograms with different values so we can verify which ones are used
oldHist := circonusllhist.New()
if err := oldHist.RecordValue(1000.0); err != nil { // Much higher value - must be ignored
t.Fatalf("failed to record value: %v", err)
}
recentHist1 := circonusllhist.New()
if err := recentHist1.RecordValue(50.0); err != nil { // Lower value - should be used
t.Fatalf("failed to record value: %v", err)
}
recentHist2 := circonusllhist.New()
if err := recentHist2.RecordValue(75.0); err != nil { // Medium value - should be used
t.Fatalf("failed to record value: %v", err)
}
recentHist3 := circonusllhist.New()
if err := recentHist3.RecordValue(60.0); err != nil { // Another recent value - should be used
t.Fatalf("failed to record value: %v", err)
}
meta := podscaler.FullMetadata{
Metadata: api.Metadata{
Org: "test-org",
Repo: "test-repo",
Branch: "main",
},
Container: "test-container",
}
data := &podscaler.CachedQuery{
Data: map[model.Fingerprint]*circonusllhist.HistogramWithoutLookups{
oldFingerprint: circonusllhist.NewHistogramWithoutLookups(oldHist),
recentFingerprint1: circonusllhist.NewHistogramWithoutLookups(recentHist1),
recentFingerprint2: circonusllhist.NewHistogramWithoutLookups(recentHist2),
recentFingerprint3: circonusllhist.NewHistogramWithoutLookups(recentHist3),
},
DataByMetaData: map[podscaler.FullMetadata][]podscaler.FingerprintTime{
meta: {
// Old data - should be filtered out
{
Fingerprint: oldFingerprint,
Added: fourWeeksAgo,
},
// Recent data - should be included (need at least 3 samples)
{
Fingerprint: recentFingerprint1,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint2,
Added: oneWeekAgo,
},
{
Fingerprint: recentFingerprint3,
Added: oneWeekAgo,
},
},
},
}
server := &resourceServer{
logger: logrus.WithField("test", "TestDigestData_FiltersOldData"),
byMetaData: map[podscaler.FullMetadata]corev1.ResourceRequirements{},
}
// The digestData function uses time.Now() internally to calculate the cutoff.
// Since we're using actual timestamps (fourWeeksAgo and oneWeekAgo), the function
// will correctly filter based on the current time and resourceRecommendationWindow.
// Digest the data
server.digestData(data, 0.8, corev1.ResourceCPU, func(valueAtQuantile float64) *resource.Quantity {
return resource.NewMilliQuantity(int64(valueAtQuantile*1000), resource.DecimalSI)
})
// Verify that the recommendation was created
recommendation, exists := server.recommendedRequestFor(meta)
if !exists {
t.Fatal("Expected recommendation to exist, but it doesn't")
}
// The recommendation should be based on recent data only (50.0, 75.0, 60.0)
// At 0.8 quantile with weighted averaging (3x weight for past week), it should be
// closer to 75.0 than 1000.0. Note: digestData returns the raw quantile value;
// the 1.2 multiplier is applied later in applyRecommendationsBasedOnRecentData.
cpuRequest := recommendation.Requests[corev1.ResourceCPU]
cpuRequestMilli := cpuRequest.MilliValue()
// Should be driven by recent values (~75 cores => ~75000m), not the old 1000-core spike.
if cpuRequestMilli < 60000 || cpuRequestMilli > 120000 {
t.Errorf("Expected CPU request to be based on recent data (~60000-120000m), got %d millicores", cpuRequestMilli)
}
if cpuRequestMilli > 200000 {
t.Errorf("CPU request appears to include old data (1000.0), got %d millicores", cpuRequestMilli)
}
}
🤖 Prompt for AI Agents
In @cmd/pod-scaler/resources_test.go around lines 18 - 127, The test depends on
the POD_SCALER_MIN_SAMPLES env and current window, so make it deterministic by
setting the server's minSamples and resourceRecommendationWindow directly before
calling digestData: set resourceServer.minSamples = 3 and
resourceServer.resourceRecommendationWindow = 14*24*time.Hour (so fourWeeksAgo
is excluded and oneWeekAgo included), then call server.digestData(...) and
assert via server.recommendedRequestFor(meta) that the recommendation reflects
only the recent fingerprints; this ensures old data is excluded regardless of
environment.

Comment on lines 13 to 17
Limits: v1.ResourceList{
- s"cpu": {i: resource.int64Amount{value: 16}, Format: "DecimalSI"},
- s"memory": {i: resource.int64Amount{value: 400000000}, Format: "BinarySI"},
+ s"memory": {i: resource.int64Amount{value: 600000000}, Format: "BinarySI"},
+ s"memory": {i: resource.int64Amount{value: 480000000}, Format: "BinarySI"},
},

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find cmd/pod-scaler -name "*.diff" | head -5

Repository: openshift/ci-tools

Length of output: 321


🏁 Script executed:

cat -n cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff | head -100

Repository: openshift/ci-tools

Length of output: 4103


🏁 Script executed:

# Find the mutation implementation
find cmd -name "*.go" | xargs grep -l "MutatePodResources" | head -5

Repository: openshift/ci-tools

Length of output: 94


🏁 Script executed:

cat -n cmd/pod-scaler/admission_test.go | head -150

Repository: openshift/ci-tools

Length of output: 6015


🏁 Script executed:

# Find the mutation implementation
find cmd/pod-scaler -name "*.go" -type f | grep -v test | head -10

Repository: openshift/ci-tools

Length of output: 247


🏁 Script executed:

# Find the TestMutatePodResources test case
grep -n "TestMutatePodResources" cmd/pod-scaler/admission_test.go

Repository: openshift/ci-tools

Length of output: 109


🏁 Script executed:

# Look at the admission.go file for mutation logic
cat -n cmd/pod-scaler/admission.go | head -200

Repository: openshift/ci-tools

Length of output: 8972


🏁 Script executed:

# Find the mutatePodResources function
grep -n "func mutatePodResources" cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 293


🏁 Script executed:

# Read the mutatePodResources function
sed -n '200,400p' cmd/pod-scaler/admission.go | cat -n

Repository: openshift/ci-tools

Length of output: 9955


🏁 Script executed:

# Get the TestMutatePodResources test case starting at line 339
sed -n '339,600p' cmd/pod-scaler/admission_test.go | cat -n

Repository: openshift/ci-tools

Length of output: 11214


🏁 Script executed:

# Check if the fixture file is auto-generated or manually written
head -20 cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff

Repository: openshift/ci-tools

Length of output: 1102


🏁 Script executed:

# Let me trace through the logic manually for the 10->7 case
# Check the exact math in applyRecommendationsBasedOnRecentData
cat -n cmd/pod-scaler/admission.go | sed -n '75,90p'

Repository: openshift/ci-tools

Length of output: 706


🏁 Script executed:

# Get the exact reduction calculation code
sed -n '75,89p' cmd/pod-scaler/admission.go | cat -n

Repository: openshift/ci-tools

Length of output: 637


🏁 Script executed:

# Now trace the 10->7 case manually: 
# ours (recommendation) = 5 (from test line ~66-68), theirs = 10
# reductionPercent = 1 - (5/10) = 0.5 = 50%
# maxAllowed = 10 * (1 - 0.25) = 7.5
# int64(7.5) = 7
# This is a 30% reduction (10-7=3, 3/10=0.3), not 25%

python3 << 'EOF'
# Verify the math
theirs = 10
ours = 5
reduction_percent = 1.0 - (ours / theirs)
print(f"Reduction percent: {reduction_percent:.1%}")

max_allowed = theirs * (1.0 - 0.25)
print(f"Max allowed (float): {max_allowed}")
print(f"Max allowed (int64): {int(max_allowed)}")

actual_reduction = (theirs - int(max_allowed)) / theirs
print(f"Actual reduction after int truncation: {actual_reduction:.1%}")
EOF

Repository: openshift/ci-tools

Length of output: 148


🏁 Script executed:

# Verify that the test expectations confirm this is happening
# Check the fixture to see if 10->7 is the expected result
grep -A2 "value: 10" cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff | head -5

Repository: openshift/ci-tools

Length of output: 335


🏁 Script executed:

# Verify CPU limits removal is intentional by checking reconcileLimits
sed -n '97,118p' cmd/pod-scaler/admission.go | cat -n

Repository: openshift/ci-tools

Length of output: 1176


🏁 Script executed:

# Check if there are any tests verifying the 25% cap behavior specifically
grep -n "25" cmd/pod-scaler/admission_test.go

Repository: openshift/ci-tools

Length of output: 1458


🏁 Script executed:

# Check the comments in the reduction logic more carefully
sed -n '66,95p' cmd/pod-scaler/admission.go | cat -n

Repository: openshift/ci-tools

Length of output: 1342


🏁 Script executed:

# Check the reconcileLimits function comment more carefully
sed -n '97,104p' cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 434


🏁 Script executed:

# Verify the test setup and what values are being used
sed -n '89,131p' cmd/pod-scaler/admission_test.go | cat -n

Repository: openshift/ci-tools

Length of output: 1788


🏁 Script executed:

# Let me check what CPU value the test setup provides for "overcap" container
sed -n '66,71p' cmd/pod-scaler/admission_test.go | cat -n

Repository: openshift/ci-tools

Length of output: 361


🏁 Script executed:

# And verify the test container that has CPU=10
sed -n '122,131p' cmd/pod-scaler/admission_test.go | cat -n

Repository: openshift/ci-tools

Length of output: 568


🏁 Script executed:

# Let me verify by looking at the exact test input and output more carefully
# Check what container names are being used in the test
sed -n '88,132p' cmd/pod-scaler/admission_test.go | grep -E 'Name:|NewQuantity.*value: (10|5)' | head -20

Repository: openshift/ci-tools

Length of output: 44


🏁 Script executed:

# Confirm the resource recommendations in the test
sed -n '47,72p' cmd/pod-scaler/admission_test.go | grep -E 'baseWithContainer|NewQuantity'

Repository: openshift/ci-tools

Length of output: 44


🏁 Script executed:

# Let me directly verify the test container configuration
sed -n '88,132p' cmd/pod-scaler/admission_test.go

Repository: openshift/ci-tools

Length of output: 1610


🏁 Script executed:

# Let me get the complete test container definitions from TestMutatePodResources
sed -n '339,500p' cmd/pod-scaler/admission_test.go | grep -A 15 "Name: \"small\"" | tail -30

Repository: openshift/ci-tools

Length of output: 972


🏁 Script executed:

# Double-check the reconcileLimits function
sed -n '97,118p' cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 1022


🏁 Script executed:

# Confirm the exact reduction calculation by examining the code logic
sed -n '75,95p' cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 837


🏁 Script executed:

# Search for any related issue or TODO comments about rounding/truncation
grep -n "round\|truncat\|int64" cmd/pod-scaler/admission.go | head -20

Repository: openshift/ci-tools

Length of output: 850


🏁 Script executed:

# Verify the exact test setup and expected behavior
cat -n cmd/pod-scaler/admission.go | sed -n '66,90p'

Repository: openshift/ci-tools

Length of output: 1064


🏁 Script executed:

# Check if there are any recent fixes or TODOs related to this rounding issue
grep -C 3 "Set(int64" cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 762


Fix the truncation bug in the 25% reduction cap: maxAllowed = theirValue * (1.0 - maxReductionPercent) followed by our.Set(int64(maxAllowed)) truncates 7.5 to 7, allowing 30% reduction instead of 25%.

The CPU limits removal is intentional (see reconcileLimits() which explicitly deletes CPU limits, with a comment that it's not useful). However, the 10 -> 7 CPU reduction in lines 73-77 violates the "max 25% reduction per cycle" guarantee. When the cap calculation yields 7.5, truncating to int64(7) results in a 30% reduction. Round up or apply the cap before converting to int64.

Also applies to: 73-77
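
A minimal sketch of the ceiling-based fix suggested here; the helper name and wiring are illustrative, not the PR's code.

package main

import (
	"fmt"
	"math"
)

// cappedValue rounds the reduction floor up so that integer conversion can
// never deepen the cut past maxReduction.
func cappedValue(current int64, maxReduction float64) int64 {
	return int64(math.Ceil(float64(current) * (1.0 - maxReduction)))
}

func main() {
	// current=10 with a 25% cap: the floor is 7.5; rounding up keeps the cut at
	// 20% (10 -> 8) instead of the 30% cut (10 -> 7) that truncation produces.
	fmt.Println(cappedValue(10, 0.25))
}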

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 12, 2026

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/breaking-changes 7607680 link false /test breaking-changes
ci/prow/images 7607680 link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
