Conversation

@deepsm007
Contributor

@deepsm007 deepsm007 commented Jan 6, 2026

Implements the measured pods feature to address CPU bottleneck issues caused by node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness (measured if the last measurement is more than 10 days old or missing), with measured pods running on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from the ci_operator_metrics table, and adds a data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to the longest-running container in each pod, using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.
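For illustration, a minimal sketch of the freshness check described above, using only the standard time package; the 10-day window is from this PR, while the helper name, its signature, and the use of a nil timestamp for never-measured pods are assumptions:

    // measurementRetention mirrors the 10-day freshness window described above.
    const measurementRetention = 10 * 24 * time.Hour

    // shouldBeMeasured is a hypothetical helper: a pod needs fresh measurement
    // when it has never been measured or its last measurement is stale.
    func shouldBeMeasured(lastMeasured *time.Time) bool {
        if lastMeasured == nil {
            return true
        }
        return time.Since(*lastMeasured) > measurementRetention
    }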

/cc @openshift/test-platform

@openshift-ci-robot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) on Jan 6, 2026
@openshift-ci openshift-ci bot requested a review from a team on January 6, 2026 16:06
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jan 6, 2026

@deepsm007: This pull request references DPTP-4613, which is a valid Jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.


In response to this:

• Adds pod classification system that labels pods as "normal" or "measured" based on whether they need fresh resource measurement data (measured if last measurement >10 days ago or never measured)
• Implements podAntiAffinity rules to ensure measured pods run on isolated nodes with no other CI workloads, allowing accurate CPU/memory utilization measurement without node contention
• Integrates BigQuery client to query and cache max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
• Applies measured resource recommendations only to the longest-running container in each pod, using actual utilization data instead of Prometheus metrics that may be skewed by node contention

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@deepsm007 deepsm007 changed the title from "DPTP-4613: pod-scaler: Implement measured pods for accurate resource measurement" to "DPTP-4613: pod-scaler implement measured pods for accurate resource measurement" on Jan 6, 2026
@coderabbitai

coderabbitai bot commented Jan 6, 2026

Walkthrough

This PR integrates measured pod support into the pod-scaler system by adding BigQuery-backed data collection and caching. It extends the admission and producer flows to classify pods, apply anti-affinity rules, and adjust resource constraints based on measured utilization. A retry mechanism is added to manifest pushing for reliability.

Changes

Cohort / File(s) Summary
Measured Pod Core Implementation
cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go
New module implementing measured pod data pipeline: defines MeasuredPodData, MeasuredPodCache, and BigQueryClient types; introduces ClassifyPod, AddPodAntiAffinity, and ApplyMeasuredPodResources functions; handles BigQuery querying, caching with thread-safety, pod classification by metadata, anti-affinity rule generation, and resource scaling with 20% safety margin. Test file covers ShouldBeMeasured, ClassifyPod, AddPodAntiAffinity, and cache operations.
Main Flow Integration
cmd/pod-scaler/main.go
Extends producerOptions struct with enableMeasuredPods, bigQueryProjectID, bigQueryDatasetID, bigQueryCredentialsFile fields. Updates mainProduce and mainAdmission to conditionally create BigQuery client, validate configuration, and pass client downstream.
Admission Integration
cmd/pod-scaler/admission.go
Adds bqClient parameter to admit function and podMutator struct. Augments handle flow to classify pods, apply anti-affinity, and apply measured resource constraints when BigQueryClient is present.
Producer Integration
cmd/pod-scaler/producer.go
Extends produce function to accept BigQueryClient parameter. Adds collectMeasuredPodMetrics, extractMeasuredPodData, queryPodMetrics, and writeMeasuredPodsToBigQuery functions. Introduces escapePromQLLabelValue helper and measuredPodData, bigQueryPodMetricsRow types. Collects per-container CPU/memory metrics and container durations, then persists to BigQuery.
Manifest Pushing Enhancement
pkg/manifestpusher/manifestpusher.go
Adds architecture extraction from build nodeSelector and per-build image reference logging. Replaces direct manifest-list push with retry mechanism using exponential backoff to handle race conditions where images may not be immediately available. Detects missing image errors to trigger retries; other errors fail immediately.
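As a rough illustration of the retry pattern this summary describes (not the PR's actual code), a sketch using k8s.io/apimachinery/pkg/util/wait; pushManifestList and isMissingImageErr are hypothetical stand-ins for the real calls in pkg/manifestpusher, and the backoff parameters are arbitrary:

    // Retry the manifest-list push with exponential backoff; only "image not
    // available yet" errors are retried, any other error fails immediately.
    backoff := wait.Backoff{Duration: 5 * time.Second, Factor: 2.0, Steps: 5}
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        if pushErr := pushManifestList(); pushErr != nil {
            if isMissingImageErr(pushErr) {
                return false, nil // not available yet, try again after backoff
            }
            return false, pushErr // non-retryable error, stop immediately
        }
        return true, nil // push succeeded
    })
    if err != nil {
        return fmt.Errorf("pushing manifest list failed after retries: %w", err)
    }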

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci bot commented Jan 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deepsm007

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI Agents
In @cmd/pod-scaler/measured.go:
- Around line 330-332: The current assignment to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
overwrites any existing anti-affinity rules; instead, merge by appending
requiredTerms to any existing slice: ensure pod.Spec.Affinity and
pod.Spec.Affinity.PodAntiAffinity are non-nil, read the existing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
slice (if non-nil), append the requiredTerms elements to it, and then set the
field to the combined slice so existing rules are preserved.
- Around line 197-215: The FullMetadata used to build MeasuredPodData is missing
Variant, Target, Pod, and Step which will cause cache misses in
MetadataFor()/ClassifyPod; update the BigQuery SQL to select variant, target,
pod, and step, add those fields to the row struct used in
cmd/pod-scaler/measured.go, and populate podscaler.FullMetadata (Variant,
Target, Pod, Step) from the row when constructing meta so MeasuredPodData
contains the same metadata keys as MetadataFor().
🧹 Nitpick comments (4)
cmd/pod-scaler/main.go (2)

112-161: Add validation for measured pods flags in options.validate().

When --enable-measured-pods is true, the BigQuery project and dataset IDs should be validated upfront in options.validate() rather than at runtime in admit(). This provides fail-fast behavior and clearer error messages during startup.

🔎 Proposed fix

Add the following validation within the case "consumer.admission": block (around line 138):

 		if err := o.resultsOptions.Validate(); err != nil {
 			return err
 		}
+		if o.enableMeasuredPods {
+			if o.bigQueryProjectID == "" {
+				return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
+			}
+			if o.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
+			}
+		}
 
 	default:

282-282: Consider grouping parameters into a configuration struct.

The admit() function now has 16 parameters, making it harder to read and maintain. Consider grouping related parameters into a configuration struct.

cmd/pod-scaler/admission.go (1)

68-81: Minor formatting inconsistency.

Line 75 has inconsistent spacing before the type compared to other fields.

🔎 Proposed fix
-	memoryCap                    string
+	memoryCap                   string
cmd/pod-scaler/measured_test.go (1)

1-253: Good test coverage, consider adding tests for ApplyMeasuredPodResources.

The tests cover ShouldBeMeasured, ClassifyPod, AddPodAntiAffinity, and MeasuredPodCache. However, ApplyMeasuredPodResources is not tested, even though it contains important logic for:

  • Finding the longest-running container
  • Applying the 20% buffer to CPU/memory requests
  • Fallback to first container when duration data is missing
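To make the scope of those tests concrete, here is a hypothetical sketch of the selection logic they would exercise; the helper name, its signature, and the durations map are assumptions, not the PR's implementation (only corev1.Container and time are standard types):

    // Pick the container with the longest measured duration, falling back to
    // the first container when no duration data is available.
    func longestRunningContainer(containers []corev1.Container, durations map[string]time.Duration) *corev1.Container {
        var longest *corev1.Container
        var longestDuration time.Duration
        for i := range containers {
            if d, ok := durations[containers[i].Name]; ok && d > longestDuration {
                longest = &containers[i]
                longestDuration = d
            }
        }
        if longest == nil && len(containers) > 0 {
            longest = &containers[0]
        }
        return longest
    }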
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 428df2b and 13781c2.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/admission.go
🧬 Code graph analysis (1)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (94-101)
  • NewMeasuredPodCache (70-75)
  • NewBigQueryClient (104-137)
  • ClassifyPod (247-276)
  • AddPodAntiAffinity (281-333)
  • ApplyMeasuredPodResources (338-411)
🔇 Additional comments (6)
cmd/pod-scaler/admission_test.go (1)

557-557: LGTM!

The test correctly updated to match the new mutatePodResources signature, passing false for both authoritativeCPURequests and authoritativeMemoryRequests to preserve existing test behavior.

cmd/pod-scaler/admission.go (1)

117-126: LGTM - correct ordering of measured pod operations.

The sequence is correct:

  1. Classify pod based on metadata
  2. Add anti-affinity rules
  3. Apply measured resources
  4. Then apply standard resource mutation (which uses useOursIfLarger to only increase resources)

This ensures measured data is applied first, and standard recommendations only increase values if they're higher.

cmd/pod-scaler/measured_test.go (1)

16-61: LGTM - well-structured unit tests.

Good coverage of edge cases for ShouldBeMeasured: no data, recent data (5 days), and stale data (15 days). The test correctly validates the 10-day threshold behavior.

cmd/pod-scaler/measured.go (3)

85-91: Cache update replaces entire dataset - verify this is intentional.

The Update method replaces the entire cache map rather than merging. This is appropriate for a daily full refresh from BigQuery, but be aware that any data not returned in the latest query will be lost.


396-410: LGTM - consistent 20% buffer application.

The 1.2x buffer matches the existing pattern in useOursIfLarger (120% increase). The CPU calculation correctly converts cores to millicores. For measured pods, this authoritative data appropriately replaces existing requests.
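For reference, a sketch of the buffered conversion this comment describes; the 1.2 factor and the cores-to-millicores conversion are from the review, while the variable names (assumed float64 measured values) and the use of k8s.io/apimachinery/pkg/api/resource are assumptions:

    // Apply the 20% safety buffer, then convert measured cores to millicores
    // and measured bytes to a binary-SI quantity before setting requests.
    if container.Resources.Requests == nil {
        container.Resources.Requests = corev1.ResourceList{}
    }
    cpuMillicores := int64(measuredCPUCores * 1.2 * 1000)
    memoryBytes := int64(measuredMemoryBytes * 1.2)
    container.Resources.Requests[corev1.ResourceCPU] = *resource.NewMilliQuantity(cpuMillicores, resource.DecimalSI)
    container.Resources.Requests[corev1.ResourceMemory] = *resource.NewQuantity(memoryBytes, resource.BinarySI)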


264-267: Defensive nil handling defaults to measured - reasonable behavior.

When bqClient is nil (which shouldn't happen in normal flow since ClassifyPod is only called when bqClient != nil in admission.go), defaulting to "measured" is conservative and safe.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Line 321: The function mutatePodResources has two unused boolean parameters
named as anonymous `_ bool, _ bool` (meant for authoritativeCPURequests and
authoritativeMemoryRequests); either implement their intended behavior to skip
mutation for pods with authoritative requests by adding explicit parameters
(e.g., authoritativeCPURequests, authoritativeMemoryRequests) to the signature
and branching early in mutatePodResources to return without mutating when those
flags indicate authority, or remove the anonymous bool parameters entirely from
the signature and all call sites to avoid dead parameters; update references to
the function (callers and any documentation/tests) to match the new signature to
keep the API consistent.

In @cmd/pod-scaler/measured.go:
- Around line 123-133: Startup currently calls bq.Refresh once and only warns on
error, leaving the cache empty so ShouldBeMeasured() returns true for every pod;
change the startup behavior to retry bq.Refresh until success (with a capped
backoff/retry limit) before scheduling the daily refresh via
interrupts.TickLiteral, or alternatively set a flag on the BigQuery client
(e.g., cachePopulated) that is only true after a successful bq.Refresh and
update ShouldBeMeasured to return false while that flag is false; ensure you log
each retry and ultimately fail fast (exit or return an error) if retries are
exhausted, and keep BigQueryRefreshInterval-based TickLiteral unchanged for the
periodic refresh.
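A minimal sketch of the startup behavior this prompt asks for, assuming bq.Refresh and the logrus logger from the PR; the retry count and sleep intervals below are arbitrary illustrations, not values from the code:

    // Retry the initial refresh with a simple capped backoff and fail fast if
    // the cache cannot be populated, so pods are not misclassified at startup.
    var populated bool
    for attempt := 1; attempt <= 5; attempt++ {
        if err := bq.Refresh(ctx); err != nil {
            logger.WithError(err).Warnf("BigQuery refresh failed on startup (attempt %d/5)", attempt)
            time.Sleep(time.Duration(attempt) * 10 * time.Second)
            continue
        }
        populated = true
        break
    }
    if !populated {
        logger.Fatal("could not populate the measured pod cache from BigQuery")
    }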
🧹 Nitpick comments (3)
cmd/pod-scaler/main.go (2)

99-102: Missing validation for required BigQuery flags.

The help text states these flags are "required if enable-measured-pods is true", but options.validate() doesn't enforce this. The validation happens in admit() instead, which is inconsistent with the validation pattern for other consumer.admission flags (lines 123-138).

♻️ Suggested fix in validate()

Add this check in the case "consumer.admission": block around line 138:

if o.enableMeasuredPods {
    if o.bigQueryProjectID == "" {
        return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
    }
    if o.bigQueryDatasetID == "" {
        return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
    }
}

282-283: Function call has many parameters - consider passing a config struct.

The admit() function now takes 16 parameters, making the call hard to read and maintain. Consider grouping related parameters into a configuration struct in a future refactor.

cmd/pod-scaler/measured_test.go (1)

151-210: Consider adding assertions for label selector content.

The tests verify that anti-affinity structures are created but don't validate the actual LabelSelectorRequirement values. This could miss bugs where the wrong operator or values are used.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 13781c2 and a5b8e74.

⛔ Files ignored due to path filters (1)
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (4)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/measured_test.go (3)
cmd/pod-scaler/measured.go (8)
  • NewMeasuredPodCache (69-74)
  • BigQueryClient (93-100)
  • MeasuredPodData (43-58)
  • ClassifyPod (246-275)
  • PodScalerLabelKey (26-26)
  • PodScalerLabelValueMeasured (32-32)
  • PodScalerLabelValueNormal (29-29)
  • AddPodAntiAffinity (280-332)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
pkg/api/types.go (1)
  • Metadata (161-166)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (2)
  • Metadata (161-166)
  • ResourceList (261-261)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (12)
cmd/pod-scaler/measured.go (6)

1-21: LGTM!

Imports are well-organized and appropriate for the functionality being implemented.


23-100: LGTM!

Constants are well-documented, and the type definitions are clean with appropriate thread-safety for the cache.


223-236: LGTM!

The measurement staleness logic is clear and correctly implements the 10-day retention policy.


297-311: Measured pod anti-affinity may be too restrictive.

The LabelSelectorOpExists operator makes measured pods avoid ALL pods with the pod-scaler label, including themselves. This means if two measured pods are created simultaneously, they cannot share the same node - which is the intended behavior. However, verify that the cluster has sufficient nodes to handle multiple concurrent measured pods.
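For context, roughly what a required anti-affinity term using LabelSelectorOpExists looks like; PodScalerLabelKey is from the PR, while the term variable and the hostname topology key are assumptions for illustration:

    // A measured pod repels every pod carrying the pod-scaler label (including
    // other measured pods), scoped to the node via the hostname topology key.
    term := corev1.PodAffinityTerm{
        LabelSelector: &metav1.LabelSelector{
            MatchExpressions: []metav1.LabelSelectorRequirement{{
                Key:      PodScalerLabelKey,
                Operator: metav1.LabelSelectorOpExists,
            }},
        },
        TopologyKey: "kubernetes.io/hostname",
    }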


390-409: LGTM!

Resource application logic correctly converts CPU to millicores and memory to bytes with appropriate 20% safety buffers.


143-164: The placeholder query issue is already explicitly flagged in the code via the TODO comments at lines 143-145, which clearly state this is a placeholder query requiring alignment with the actual BigQuery schema. No additional review comment is needed for this concern.

cmd/pod-scaler/main.go (1)

70-77: LGTM!

New measured pods configuration fields are well-documented and logically grouped.

cmd/pod-scaler/measured_test.go (3)

16-61: LGTM!

TestShouldBeMeasured covers the three key scenarios: no data, recent data, and stale data. The test correctly validates the measurement decision logic.


63-149: LGTM!

Test case 2 correctly uses MetadataFor to ensure metadata matching, demonstrating good awareness of the cache key requirements. Good coverage of the nil BigQuery client fallback behavior.


212-253: LGTM!

Cache tests cover the essential operations: empty lookup and data retrieval after update with field verification.

cmd/pod-scaler/admission.go (2)

36-66: LGTM!

Measured pods initialization is properly gated behind the feature flag with appropriate validation and logging.


68-81: LGTM!

The extended podMutator type cleanly integrates the new measured pods fields alongside existing configuration.

@deepsm007
Contributor Author

/hold and wait for #4886 to be reviewed/merged

@openshift-ci openshift-ci bot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Jan 7, 2026
@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch 2 times, most recently from 211c573 to 3ac074f on January 7, 2026 17:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Around line 117-126: Measured pod resources set by ApplyMeasuredPodResources
can be overwritten by mutatePodResources; update the flow so measured pods are
treated authoritative: after ClassifyPod (which adds the measured label), detect
the measured label on the pod (or call a helper like isMeasuredPod) and if
measured, either skip calling mutatePodResources entirely or call
mutatePodResources with a flag that prevents
applyRecommendationsBasedOnRecentData from increasing resources; make the change
near the existing call sites using the functions ApplyMeasuredPodResources,
mutatePodResources, ClassifyPod and applyRecommendationsBasedOnRecentData so
measured pods retain their measured resources.

In @cmd/pod-scaler/measured_test.go:
- Around line 24-31: Test constructs podscaler.FullMetadata manually missing
fields that MetadataFor() would set (Target, Pod, Step), causing divergence from
production behavior used by ClassifyPod; update the test to build metadata via
podscaler.MetadataFor(...) using the same pod labels/annotations the production
code uses (or explicitly populate Target, Pod, and Step on the FullMetadata
instance) so the test mirrors real metadata and cache lookup keys used by
ClassifyPod.

In @cmd/pod-scaler/measured.go:
- Around line 329-331: The current assignment overwrites any existing
anti-affinity rules; change it to append requiredTerms to the existing slice
instead: ensure pod.Spec.Affinity and pod.Spec.Affinity.PodAntiAffinity are
non-nil (initialize them if needed) and then do
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
=
append(pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
requiredTerms...) so existing rules are preserved.
- Around line 123-133: When initial bq.Refresh(ctx) fails at startup (leaving
the BigQuery cache empty and causing ShouldBeMeasured() to return true for all
pods), implement retry/backoff or set a "bigQueryReady" boolean on the BigQuery
client so that until Refresh() succeeds you treat the DB as not-ready; modify
the startup block calling bq.Refresh(ctx) to loop with limited retries and delay
(or use exponential backoff) and only log error and continue after marking
bigQueryReady=false, and update ShouldBeMeasured() to return false (or
fallback-safe behavior) when bigQueryReady is false; keep
BigQueryRefreshInterval scheduling but ensure the scheduled tick, when a
subsequent Refresh() succeeds, sets bigQueryReady=true so normal measurement
resumes.
- Around line 205-206: The TODO leaves containerDurations empty so
ApplyMeasuredPodResources cannot pick the longest container; parse the
container_durations JSON string into the containerDurations
map[string]time.Duration before use: locate where containerDurations is
declared, unmarshal the JSON (e.g., from the measured object or incoming field
named "container_durations") into a temporary map[string]string or
map[string]interface{}, convert each value to time.Duration using
time.ParseDuration (and handle numeric values by interpreting them as seconds if
needed), populate containerDurations accordingly, and on any parse error log it
and fall back to the existing default behavior so ApplyMeasuredPodResources can
correctly evaluate container runtimes.
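A sketch of the string-duration case of the parsing described above (the numeric-seconds fallback is omitted), using encoding/json and time; the column name container_durations is from the PR, while the helper name and its signature are assumptions:

    // Parse a JSON object of container name -> Go duration string (e.g.
    // {"test": "35m20s"}) into the map ApplyMeasuredPodResources consumes,
    // logging and skipping entries that fail to parse.
    func parseContainerDurations(raw string, logger *logrus.Entry) map[string]time.Duration {
        out := map[string]time.Duration{}
        if raw == "" {
            return out
        }
        var byName map[string]string
        if err := json.Unmarshal([]byte(raw), &byName); err != nil {
            logger.WithError(err).Warn("could not parse container_durations; falling back to default behavior")
            return out
        }
        for name, value := range byName {
            d, err := time.ParseDuration(value)
            if err != nil {
                logger.WithError(err).Warnf("invalid duration %q for container %s", value, name)
                continue
            }
            out[name] = d
        }
        return out
    }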
🧹 Nitpick comments (2)
cmd/pod-scaler/main.go (1)

116-165: Missing validation for BigQuery flags when measured pods are enabled.

When --enable-measured-pods is true, --bigquery-project-id and --bigquery-dataset-id are required, but this validation only happens later in admit() (which calls logrus.Fatal). Moving this validation to validate() would provide earlier, cleaner error reporting.

🔧 Suggested addition to validate()
 	case "consumer.admission":
 		// ... existing validation ...
+		if o.enableMeasuredPods {
+			if o.bigQueryProjectID == "" {
+				return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
+			}
+			if o.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
+			}
+		}
cmd/pod-scaler/admission_test.go (1)

730-737: Consider adding tests for authoritative mode.

The renamed function applyRecommendationsBasedOnRecentData now supports authoritative CPU/memory reduction, but all tests pass false, false. Consider adding test cases where authoritativeCPU or authoritativeMemory is true to verify the reduction logic.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between a5b8e74 and 3ac074f.

📒 Files selected for processing (6)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources.go
🧬 Code graph analysis (3)
cmd/pod-scaler/measured_test.go (2)
cmd/pod-scaler/measured.go (8)
  • NewMeasuredPodCache (69-74)
  • BigQueryClient (93-100)
  • MeasuredPodData (43-58)
  • ClassifyPod (246-275)
  • PodScalerLabelKey (26-26)
  • PodScalerLabelValueMeasured (32-32)
  • PodScalerLabelValueNormal (29-29)
  • AddPodAntiAffinity (280-332)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
cmd/pod-scaler/resources.go (1)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
🔇 Additional comments (12)
cmd/pod-scaler/measured.go (4)

62-90: LGTM - Thread-safe cache implementation.

The MeasuredPodCache correctly uses sync.RWMutex for concurrent access. The Update method replaces the entire map atomically which is appropriate for bulk updates from BigQuery.


243-275: LGTM - ClassifyPod logic is sound.

The function correctly initializes labels if nil, checks each container, and defaults to "measured" when BigQuery is unavailable. This fail-safe approach ensures new pods get measured when there's no data.


390-409: LGTM - Resource application with safety buffer.

The 20% buffer on measured data and proper nil-checks for Resources.Requests are appropriate. The logic correctly targets only the longest-running container.


196-214: Incomplete metadata construction - missing Variant, Target, Pod, and Step fields.

The FullMetadata struct (per pkg/pod-scaler/types.go) includes Variant, Target, Pod, and Step fields that are not populated here. Since MetadataFor() in ClassifyPod populates these from pod labels, cache lookups will fail to match.

Likely an incorrect or invalid review comment.

cmd/pod-scaler/resources.go (2)

101-127: LGTM - Recent data filtering and weighting implementation.

The time-window filtering with sample count validation prevents recommendations based on stale or insufficient data. The weighting scheme (3x/2x/1x) appropriately prioritizes recent usage patterns.


140-149: LGTM - Minimum threshold enforcement.

Clamping CPU to at least 10m and memory to at least 10Mi prevents scheduling issues from zero/tiny recommendations.

cmd/pod-scaler/measured_test.go (2)

151-210: LGTM - Anti-affinity tests cover key scenarios.

Tests verify measured pods avoid all pod-scaler pods, normal pods avoid measured pods, and unlabeled pods get no affinity rules.


212-253: LGTM - Cache tests verify basic Get/Update behavior.

The test correctly validates empty cache returns false, and populated cache returns correct values.

cmd/pod-scaler/admission_test.go (1)

557-557: LGTM - Test updated for new function signature.

The test correctly passes false, false for the new authoritative flags, preserving existing behavior verification.

cmd/pod-scaler/admission.go (2)

225-305: LGTM - Authoritative reduction logic with safety guards.

The gradual reduction approach (max 25% per cycle, minimum 5% threshold) prevents aggressive resource cuts that could cause OOMs. The early return for zero recommendations (lines 247-249) is a good addition.


43-55: LGTM - BigQuery client initialization properly guarded.

The initialization is correctly gated behind enableMeasuredPods with required field validation and proper error handling via logrus.Fatal.

cmd/pod-scaler/main.go (1)

286-287: Parameter order is correct—no action needed.

The 16 parameters in the admit() call match the function signature exactly. No misalignment detected.

Comment on lines +123 to +137
// Initial refresh on startup
if err := bq.Refresh(ctx); err != nil {
    logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
}

// Schedule daily refresh
interrupts.TickLiteral(func() {
    if err := bq.Refresh(interrupts.Context()); err != nil {
        logger.WithError(err).Error("Failed to refresh BigQuery data")
    }
}, BigQueryRefreshInterval)

⚠️ Potential issue | 🟠 Major

Initial refresh failure leaves cache empty, classifying all pods as "measured".

If BigQuery is unavailable at startup, Refresh() fails with only a warning. Since the cache remains empty, ShouldBeMeasured() returns true for all pods, triggering unnecessary isolation for the entire cluster. Consider retry logic or a "not ready" state.

🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 123 - 133, When initial
bq.Refresh(ctx) fails at startup (leaving the BigQuery cache empty and causing
ShouldBeMeasured() to return true for all pods), implement retry/backoff or set
a "bigQueryReady" boolean on the BigQuery client so that until Refresh()
succeeds you treat the DB as not-ready; modify the startup block calling
bq.Refresh(ctx) to loop with limited retries and delay (or use exponential
backoff) and only log error and continue after marking bigQueryReady=false, and
update ShouldBeMeasured() to return false (or fallback-safe behavior) when
bigQueryReady is false; keep BigQueryRefreshInterval scheduling but ensure the
scheduled tick, when a subsequent Refresh() succeeds, sets bigQueryReady=true so
normal measurement resumes.

@deepsm007
Copy link
Contributor Author

deepsm007 commented Jan 7, 2026

/test images
/pipeline required

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@openshift-ci
Contributor

openshift-ci bot commented Jan 7, 2026

@deepsm007: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test checkconfig
/test codegen
/test e2e
/test format
/test frontend-checks
/test images
/test integration
/test integration-optional-test
/test lint
/test unit
/test validate-vendor

The following commands are available to trigger optional jobs:

/test breaking-changes
/test e2e-oo
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-ci-tools-main-breaking-changes
pull-ci-openshift-ci-tools-main-checkconfig
pull-ci-openshift-ci-tools-main-codegen
pull-ci-openshift-ci-tools-main-format
pull-ci-openshift-ci-tools-main-frontend-checks
pull-ci-openshift-ci-tools-main-images
pull-ci-openshift-ci-tools-main-integration
pull-ci-openshift-ci-tools-main-lint
pull-ci-openshift-ci-tools-main-security
pull-ci-openshift-ci-tools-main-unit
pull-ci-openshift-ci-tools-main-validate-vendor

In response to this:

/test image
/pipeline required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007
Contributor Author

/test images

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 3ac074f to bb0253c on January 8, 2026 15:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cmd/pod-scaler/main.go (1)

127-142: Add validation for measured pods BigQuery requirements.

When --enable-measured-pods is enabled, the BigQuery parameters (--bigquery-project-id, --bigquery-dataset-id, and optionally --bigquery-credentials-file) should be validated as required. Currently, missing these parameters will cause runtime failures instead of failing fast at startup with a clear error message.

✅ Suggested validation logic
 	case "consumer.admission":
 		if o.port == 0 {
 			return errors.New("--port is required")
 		}
 		if o.certDir == "" {
 			return errors.New("--serving-cert-dir is required")
 		}
 		if cpuCap := resource.NewQuantity(o.cpuCap, resource.DecimalSI); cpuCap.Sign() <= 0 {
 			return errors.New("--cpu-cap must be greater than 0")
 		}
 		if memoryCap := resource.MustParse(o.memoryCap); memoryCap.Sign() <= 0 {
 			return errors.New("--memory-cap must be greater than 0")
 		}
+		if o.enableMeasuredPods {
+			if o.bigQueryProjectID == "" {
+				return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
+			}
+			if o.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
+			}
+			// Note: bigQueryCredentialsFile may use default application credentials if not specified
+		}
 		if err := o.resultsOptions.Validate(); err != nil {
 			return err
 		}
cmd/pod-scaler/admission.go (1)

250-253: Potential resource inflation from stacked 1.2x buffers.

The 1.2x safety buffer here is applied unconditionally to all recommendations. For measured pods, ApplyMeasuredPodResources already applies a 1.2x buffer (lines 396 and 405 in measured.go). If both are applied, resources would be inflated by ~44% (1.2 × 1.2) instead of the intended 20%.

Consider either:

  1. Skipping the 1.2x here for resources already set by measured pod logic, or
  2. Documenting that the stacking is intentional
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 329-331: The code overwrites existing anti-affinity rules by
assigning requiredTerms to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution;
instead, preserve and append: ensure PodAffinity/PodAntiAffinity and
RequiredDuringSchedulingIgnoredDuringExecution are non-nil, then append
requiredTerms to the existing slice (e.g., use append(existingSlice,
requiredTerms...)) so you merge new rules instead of replacing them, updating
the
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
field accordingly.
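Concretely, the append-instead-of-assign the prompt asks for might look like this sketch; pod and requiredTerms are from the PR's AddPodAntiAffinity, and the nil-guards are part of the suggestion rather than confirmed existing code:

    // Preserve any anti-affinity rules already on the pod and merge in the new
    // required terms instead of overwriting them.
    if pod.Spec.Affinity == nil {
        pod.Spec.Affinity = &corev1.Affinity{}
    }
    if pod.Spec.Affinity.PodAntiAffinity == nil {
        pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
    }
    existing := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
    pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(existing, requiredTerms...)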
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 3ac074f and bb0253c.

📒 Files selected for processing (10)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/admission.go
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/run/consumer.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (3)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (93-100)
  • NewMeasuredPodCache (69-74)
  • NewBigQueryClient (103-136)
  • ClassifyPod (246-275)
  • AddPodAntiAffinity (280-332)
  • ApplyMeasuredPodResources (337-410)
pkg/api/types.go (1)
  • ResourceRequirements (246-253)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (16)
test/e2e/pod-scaler/run/producer.go (1)

8-8: LGTM!

The addition of the POD_SCALER_MIN_SAMPLES=1 environment variable properly configures the test environment for minimum sample handling, consistent with the measured pods feature.

Also applies to: 31-31

test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)

1-1: LGTM!

The fixture correctly reflects the architectural shift away from patch-based mutations in the admission flow, aligning with the new measured-pod implementation.

test/e2e/pod-scaler/run/consumer.go (1)

71-71: LGTM!

The environment variable configuration mirrors the producer setup and properly configures the admission server for testing with minimum sample handling.

cmd/pod-scaler/admission_test.go (1)

557-557: LGTM!

The test updates correctly reflect the new function signatures with authoritative CPU/memory flags set to false, preserving the original non-authoritative behavior for existing test cases. The function rename from useOursIfLarger to applyRecommendationsBasedOnRecentData also improves clarity.

Also applies to: 732-732, 817-817

cmd/pod-scaler/main.go (2)

64-78: LGTM!

The new configuration options for measured pods and authoritative requests are well-documented with clear help text. The field naming and structure are consistent with existing patterns.

Also applies to: 101-106


286-286: LGTM with dependency on validation.

The admit() call correctly passes all new configuration parameters for measured pods and authoritative requests. Ensure the validation issue flagged above is addressed to prevent runtime failures.

cmd/pod-scaler/resources.go (4)

42-58: Constants look reasonable for the use case.

The 3-week window, 10m CPU minimum, and 10Mi memory minimum are sensible defaults to prevent stale data influence and avoid zero/near-zero recommendations that could cause scheduling issues.


99-131: Recency-weighted histogram merging is well-designed.

The tiered weighting (3x for <7 days, 2x for 7-14 days, 1x for older) sensibly prioritizes recent usage patterns. The minimum sample threshold prevents unreliable recommendations based on sparse data.
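As a small illustration of that tiered weighting, a hypothetical helper; the 7/14-day boundaries and the 3x/2x/1x weights are from this review, while the function itself and its name are not from the PR:

    // weightFor returns how many times a fingerprint's histogram is counted
    // when merging, so newer samples dominate the recommendation.
    func weightFor(added, now time.Time) int {
        switch age := now.Sub(added); {
        case age < 7*24*time.Hour:
            return 3
        case age < 14*24*time.Hour:
            return 2
        default:
            return 1
        }
    }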


144-153: Minimum threshold clamping prevents degenerate recommendations.

Ensuring CPU ≥ 10m and memory ≥ 10Mi avoids scheduling issues from near-zero resource requests.


168-178: Environment variable override for test flexibility is appropriate.

The default of 3 samples provides a reasonable minimum for production while allowing tests to use fewer samples.

cmd/pod-scaler/admission.go (2)

43-55: BigQuery client initialization with proper validation.

Correctly validates required parameters and fails fast on startup if configuration is incomplete.


271-299: Gradual reduction logic is well-designed.

The 25% cap per cycle and 5% threshold for skipping micro-adjustments provide a safe, conservative approach to reducing over-provisioned resources.
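In outline, the reduction described here could look like the following sketch; the 25% cap and 5% threshold are from this review, while the helper name and operating on plain int64 values are assumptions for illustration:

    // reduceTowards lowers the current request toward the recommendation, but
    // never by more than 25% per admission cycle, and skips changes under 5%.
    func reduceTowards(current, recommended int64) int64 {
        if current == 0 || recommended >= current {
            return current
        }
        reduction := current - recommended
        if float64(reduction) < 0.05*float64(current) {
            return current // too small to bother adjusting
        }
        if maxCut := int64(0.25 * float64(current)); reduction > maxCut {
            reduction = maxCut // cap the per-cycle cut
        }
        return current - reduction
    }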

cmd/pod-scaler/measured.go (4)

23-58: Well-structured constants and data types.

The pod classification labels, retention period, and MeasuredPodData structure are clearly designed for the measured pods feature.


62-90: Thread-safe cache implementation is correct.

Proper use of RWMutex for concurrent read access and exclusive write access. Full map replacement in Update is a clean approach that avoids partial update issues.


334-409: Resource application logic is sound with sensible fallbacks.

The approach of targeting the longest-running container and applying a 20% safety buffer is appropriate. The fallback to the first container when duration data is unavailable provides graceful degradation.

Note: As mentioned in the admission.go review, verify that the 1.2x buffer here doesn't stack with another 1.2x in applyRecommendationsBasedOnRecentData.


246-275: Classification logic handles edge cases well.

Defaulting to "measured" when BigQuery is unavailable is a safe, conservative approach. Marking the entire pod as measured if any container needs fresh data ensures complete coverage.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 40f54dd to 9f38903 on January 8, 2026 15:41

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Around line 252-265: The heuristic using alreadySetByMeasuredPods :=
!their.IsZero() && their.Cmp(our) >= 0 is unreliable—replace the
value-comparison detection with an explicit marker: set an annotation (or rely
on the existing PodScaler label) when resources are applied in
ApplyMeasuredPodResources, then in applyRecommendationsBasedOnRecentData check
that annotation/label (e.g.,
"pod-scaler.openshift.io/measured-resources-applied" == "true" or
PodScalerLabelKey == PodScalerLabelValueMeasured) instead of comparing their and
our so the 1.2x buffer is skipped only when we know measured-pod logic ran.

In @cmd/pod-scaler/measured.go:
- Around line 125-127: The startup currently swallows errors from
bq.Refresh(ctx) by logging a warning via logger.WithError(err).Warn, which
leaves the cache empty and misclassifies pods; change this to fail fast by
propagating or terminating on error: either return the error from the enclosing
initialization function (so callers can handle/fail) or replace the Warn call
with a fatal/exit path (e.g., logger.WithError(err).Fatal or equivalent) to stop
startup when bq.Refresh(ctx) fails; update the call site around bq.Refresh(ctx)
and any function signatures to return an error if choosing propagation.
🧹 Nitpick comments (6)
cmd/pod-scaler/resources.go (1)

96-159: Verify the weighting strategy and consider defensive timestamp validation.

The time-windowing and weighting logic correctly filters recent data and applies appropriate weights. However, consider the following:

  1. Weighting via histogram duplication (lines 118-120): The current approach duplicates histograms to apply weights (3x, 2x, 1x). While simple and effective, this means a single outlier in recent data gets amplified. This may be intentional to prioritize recent patterns, but verify this aligns with the desired sensitivity to recent changes.

  2. No defense against future timestamps (line 107): Consider adding a check that fingerprintTime.Added is not in the future to guard against clock skew or data corruption:

    if fingerprintTime.Added.After(now) {
        metaLogger.Warnf("fingerprint has future timestamp, skipping: %v", fingerprintTime.Added)
        continue
    }
  3. Minimum threshold enforcement (lines 144-153): The post-quantile minimums correctly prevent zero or tiny resource recommendations.

cmd/pod-scaler/measured.go (2)

388-409: Consider the fallback logic for selecting the container.

When container duration data is unavailable, the code falls back to the first container in pod.Spec.Containers (lines 406-409). While this excludes init containers, it may still select a sidecar or helper container if it's listed first.

This could result in applying measured resources to the wrong container if the pod spec doesn't list the main workload container first.

Consider adding a warning log when falling back, or requiring duration data to be present for proper resource application:

if longestContainer == nil && len(pod.Spec.Containers) > 0 {
    longestContainer = &pod.Spec.Containers[0]
    logger.Warnf("No container duration data available, falling back to first container: %s", longestContainer.Name)
}

430-444: Hardcoded 20% safety buffer on measured resources.

The code applies a fixed 20% buffer (lines 431, 440) on top of measured CPU and memory utilization. While this is a reasonable safety margin, consider making this configurable if different workloads require different safety margins.

cmd/pod-scaler/main.go (1)

295-295: Consider refactoring to reduce parameter count.

The admit() function now has 15 parameters, making it difficult to maintain and prone to parameter ordering errors. Consider passing a configuration struct instead.

♻️ Suggested refactoring approach

Create an admitConfig struct to group related parameters:

type admitConfig struct {
    port                        int
    healthPort                  int
    certDir                     string
    mutateResourceLimits        bool
    cpuCap                      int64
    memoryCap                   string
    cpuPriorityScheduling       int64
    authoritativeCPURequests    bool
    authoritativeMemoryRequests bool
    enableMeasuredPods          bool
    bigQueryProjectID           string
    bigQueryDatasetID           string
    bigQueryCredentialsFile     string
}

Then call it as:

go admit(admitConfig{
    port: opts.port,
    healthPort: opts.instrumentationOptions.HealthPort,
    // ... other fields
}, client, loaders(cache), reporter)
cmd/pod-scaler/admission.go (2)

43-56: Redundant validation check.

Lines 45-47 perform validation that's already done in main.go (lines 140-148). While defensive programming is good, this redundancy suggests unclear separation of concerns.

Consider removing this validation since it's already enforced at startup, or if keeping it, add a comment explaining why the defensive check is necessary:

 	var bqClient *BigQueryClient
 	if enableMeasuredPods {
-		if bigQueryProjectID == "" || bigQueryDatasetID == "" {
-			logrus.Fatal("bigquery-project-id and bigquery-dataset-id are required when enable-measured-pods is true")
-		}
+		// Defensive check - should already be validated in main.go
 		cache := NewMeasuredPodCache(logger)

367-393: Consider consolidating measured pod detection logic.

There's redundant checking between this code (lines 367-393) and the alreadySetByMeasuredPods heuristic in applyRecommendationsBasedOnRecentData (line 256). This code uses an explicit label check which is more reliable, but having two places checking for the same condition suggests unclear separation of concerns.

Consider:

  1. Removing the heuristic check in applyRecommendationsBasedOnRecentData and relying solely on this explicit check, OR
  2. Passing a flag from this function to applyRecommendationsBasedOnRecentData indicating whether the container is from a measured pod

This would make the logic clearer and more maintainable.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between bb0253c and 9f38903.

📒 Files selected for processing (10)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/resources.go
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/pod-scaler/measured_test.go
  • test/e2e/pod-scaler/run/consumer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • cmd/pod-scaler/admission.go
  • test/e2e/pod-scaler/run/producer.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/resources.go
🧬 Code graph analysis (3)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (7)
  • BigQueryClient (94-101)
  • NewBigQueryClient (104-137)
  • ClassifyPod (273-302)
  • AddPodAntiAffinity (307-367)
  • ApplyMeasuredPodResources (372-445)
  • PodScalerLabelKey (27-27)
  • PodScalerLabelValueMeasured (33-33)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
🔇 Additional comments (17)
cmd/pod-scaler/resources.go (3)

42-58: LGTM! Well-defined constants with clear documentation.

The time window of 3 weeks and minimum thresholds are appropriate for ensuring stable resource recommendations based on recent data.


60-64: LGTM! Environment-configurable minimum samples.

The default of 3 samples is reasonable, and the environment override provides flexibility for testing scenarios.


168-178: LGTM! Proper environment variable handling with safe defaults.

The function correctly validates the environment variable and falls back to a sensible default of 3 samples.

cmd/pod-scaler/measured.go (6)

24-40: LGTM! Clear constants with good documentation.

The retention period of 10 days and daily refresh cycle are appropriate for balancing data freshness with measurement overhead.


42-91: LGTM! Thread-safe cache with proper synchronization.

The use of sync.RWMutex is appropriate for this read-heavy workload, and the locking patterns are correct.


218-233: Silent handling of JSON parsing errors for container durations.

When parsing the container_durations JSON (lines 222, 229), errors are logged but execution continues with an empty map. This is reasonable for non-critical data, but be aware that:

  • Missing duration data will cause the fallback logic (line 406-409 in ApplyMeasuredPodResources) to apply resources to the first container
  • This may not always be the correct container if it's not the main workload

250-268: LGTM! Clear staleness logic.

The measurement staleness check correctly identifies pods that need fresh data (never measured or >10 days old).


270-302: LGTM! Safe fallback when BigQuery is unavailable.

The classification logic correctly defaults to measuring pods when BigQuery is unavailable (lines 291-293). This is a safe fallback that ensures data collection continues even if BigQuery integration is disabled or fails.


304-367: LGTM! Correct anti-affinity rules for pod isolation.

The anti-affinity logic correctly:

  • Isolates measured pods from ALL pod-scaler labeled pods (lines 324-338)
  • Prevents normal pods from colocating with measured pods (lines 339-354)
  • Preserves existing anti-affinity rules by merging (lines 358-365)

This ensures measured pods get clean isolation for accurate resource measurement.

test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)

1-1: LGTM! Test fixture updated to match new admission response format.

The simplified fixture (no patch payload) aligns with the updated admission flow that handles resource mutations directly rather than via JSON Patch operations.

test/e2e/pod-scaler/run/producer.go (1)

31-31: LGTM! Appropriate test configuration.

Setting POD_SCALER_MIN_SAMPLES=1 for e2e tests is appropriate, as it allows tests to work with limited sample data while still exercising the recommendation logic.

cmd/pod-scaler/admission_test.go (2)

557-557: Test updated for new function signature.

The test correctly passes the new authoritativeCPU and authoritativeMemory parameters as false. Consider adding test cases that exercise these parameters with true values to ensure authoritative mode behaves correctly.


732-732: Function renamed and tests updated.

The tests correctly use the renamed applyRecommendationsBasedOnRecentData function with the new boolean parameters. Consider adding test coverage for the authoritative modes (when either boolean is true) to validate that behavior.

Also applies to: 817-817

cmd/pod-scaler/main.go (2)

64-79: LGTM!

The new fields for measured pods and BigQuery integration are well-structured and documented. The documentation comment clearly explains the purpose of the feature.


101-106: LGTM!

The flag definitions and validation logic are well-structured:

  • Descriptive help text explains each flag's purpose
  • Validation correctly enforces that BigQuery project and dataset IDs are required when measured pods are enabled
  • Appropriately allows credentials file to be optional (falling back to default application credentials)

Also applies to: 140-148

cmd/pod-scaler/admission.go (2)

118-124: LGTM!

The measured pods handling flow is well-structured:

  1. Classify the pod based on measurement data freshness
  2. Add anti-affinity rules for isolated execution
  3. Apply measured resources before standard mutation

This ensures measured pods get accurate resource values from BigQuery data.


282-309: LGTM!

The authoritative mode reduction logic is well-designed with appropriate safety limits:

  • Maximum 25% reduction per cycle prevents sudden resource drops
  • Minimum 5% threshold prevents micro-adjustments
  • Proper edge case handling when current value is zero

This gradual approach ensures stability while still allowing optimization based on actual usage.

Comment on lines +125 to +130
if err := bq.Refresh(ctx); err != nil {
    logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
}

⚠️ Potential issue | 🟠 Major

Startup continues with failed BigQuery refresh.

If the initial refresh fails at startup (line 125), only a warning is logged and the service continues. This means:

  • The cache will be empty initially
  • All pods will be classified as "measured" (since no data exists)
  • This could lead to unnecessary isolation overhead until the next scheduled refresh

Consider if this is the desired behavior or if startup should fail when BigQuery is unavailable.

🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 125 - 127, The startup currently
swallows errors from bq.Refresh(ctx) by logging a warning via
logger.WithError(err).Warn, which leaves the cache empty and misclassifies pods;
change this to fail fast by propagating or terminating on error: either return
the error from the enclosing initialization function (so callers can
handle/fail) or replace the Warn call with a fatal/exit path (e.g.,
logger.WithError(err).Fatal or equivalent) to stop startup when bq.Refresh(ctx)
fails; update the call site around bq.Refresh(ctx) and any function signatures
to return an error if choosing propagation.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
cmd/pod-scaler/resources.go (2)

96-159: LGTM: Time-windowing and weighting logic is sound.

The implementation correctly:

  • Filters data to a 3-week recency window
  • Applies age-based weighting (3x/2x/1x) to prioritize recent patterns
  • Validates sample count before making recommendations
  • Enforces minimum thresholds to prevent scheduling issues
♻️ Minor optimization: Reuse time.Now() result

Lines 99-100 call time.Now() twice. Consider reusing a single value:

-cutoffTime := time.Now().Add(-resourceRecommendationWindow)
-now := time.Now()
+now := time.Now()
+cutoffTime := now.Add(-resourceRecommendationWindow)

The performance gain is negligible, but reusing a single value improves consistency.


171-178: LGTM: Environment variable handling is appropriate.

The function correctly reads and validates the POD_SCALER_MIN_SAMPLES configuration with a sensible default of 3.

♻️ Optional: Log warning for invalid POD_SCALER_MIN_SAMPLES values

Consider logging when an invalid value is provided to aid troubleshooting:

 func getMinSamplesForRecommendation() int {
 	if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
-		if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
+		minSamples, err := strconv.Atoi(val)
+		if err != nil {
+			logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q: %v, using default", val, err)
+		} else if minSamples <= 0 {
+			logrus.Warnf("POD_SCALER_MIN_SAMPLES must be positive, got %d, using default", minSamples)
+		} else {
 			return minSamples
 		}
 	}
 	return 3
 }

This helps users catch configuration mistakes early, though the silent fallback approach is also acceptable for this use case.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 9f38903 and 9fa9127.

📒 Files selected for processing (1)
  • cmd/pod-scaler/resources.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/resources.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (1)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
🔇 Additional comments (3)
cmd/pod-scaler/resources.go (3)

4-7: LGTM: Imports are appropriate.

The added standard library imports (os, strconv, time) are minimal and directly support the new time-windowing and configuration features.


45-58: LGTM: Constants are well-defined and documented.

The 3-week recommendation window strikes a good balance between recency and data sufficiency, while the minimum thresholds (10 millicores CPU, 10Mi memory) prevent scheduling issues from zero/near-zero recommendations.


60-64: LGTM: Package-level configuration is appropriate.

The environment variable approach for overriding minSamplesForRecommendation is suitable for e2e testing scenarios where sample counts may be limited within the time window.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch 4 times, most recently from 3c5851d to 196c83c Compare January 8, 2026 16:42
@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch 4 times, most recently from be58c3f to d70644b Compare January 9, 2026 15:10
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
cmd/pipeline-controller/config_watcher.go (1)

112-117: Comment is misleading: this returns a shallow copy, not a deep copy.

Returning a struct by value copies the struct, but the contained slices (Orgs and nested Repos) share the same underlying arrays with w.config. This works correctly for your use case because decoder.Decode() in reloadConfig() allocates fresh slices each time, so the old and new configs don't share memory.

Consider updating the comment to avoid confusion during future maintenance.

📝 Suggested comment fix
-// getConfigCopy returns a deep copy of the current config for comparison
+// getConfigCopy returns a copy of the current config for comparison.
+// Note: This is a shallow copy, but it's safe because reloadConfig() allocates
+// new slices via the YAML decoder, so old and new configs don't share memory.
 func (w *watcher) getConfigCopy() enabledConfig {
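As a quick illustration of the shallow-copy behaviour (types here are invented for the example, not the controller's real ones):

package main

import "fmt"

type conf struct {
	Orgs []string
}

func main() {
	a := conf{Orgs: []string{"openshift"}}
	b := a                 // the struct is copied by value...
	b.Orgs[0] = "mutated"  // ...but both copies share the slice's backing array
	fmt.Println(a.Orgs[0]) // prints "mutated"
}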
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2f989a1 and d70644b.

📒 Files selected for processing (12)
  • .gitignore
  • cmd/pipeline-controller/config_watcher.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • test/e2e/pod-scaler/run/consumer.go
  • test/e2e/pod-scaler/run/producer.go
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • cmd/pod-scaler/admission_test.go
  • .gitignore
  • test/e2e/pod-scaler/run/producer.go
  • cmd/pod-scaler/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml
  • cmd/pipeline-controller/config_watcher.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/resources.go
  • cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff
  • test/e2e/pod-scaler/run/consumer.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
  • CachedQuery (53-68)
test/e2e/pod-scaler/prometheus/backfill.go (1)
  • Data (43-45)
🔇 Additional comments (17)
cmd/pipeline-controller/config_watcher.go (2)

3-11: Imports are appropriate for the polling-based approach.

The additions of reflect and time support the new ticker-based config watching with deep equality comparison.


85-110: Polling approach is well-reasoned for git-sync compatibility.

The 3-minute polling interval is sensible, and the error handling correctly allows retries without crashing. The change detection logic properly preserves prevConfig when reload fails.

Consider adding a context or stop channel for graceful shutdown if this service needs orderly termination, but this is optional for a long-running daemon.
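A rough sketch of the suggested context-aware loop, assuming reloadConfig returns an error and getConfigCopy behaves as described (details are illustrative):

// watch polls the config every 3 minutes and exits cleanly when ctx is cancelled.
func (w *watcher) watch(ctx context.Context) {
	ticker := time.NewTicker(3 * time.Minute)
	defer ticker.Stop()
	prev := w.getConfigCopy()
	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case <-ticker.C:
			if err := w.reloadConfig(); err != nil {
				continue // keep prev and retry on the next tick
			}
			if next := w.getConfigCopy(); !reflect.DeepEqual(prev, next) {
				prev = next // config changed; react here
			}
		}
	}
}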

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)

15-54: The test fixture is correct. All three containers receive identical recommendations (5 CPU, 20GB memory) in this test case, which explains the uniform allocation. The 1.2x safety buffer and global resource caps (10 CPU, 20Gi memory) are intentionally applied to each container independently, resulting in the expected values in the fixture. No "longest-running container" selection logic exists—recommendations apply uniformly across all containers per their respective recommended values.

Likely an incorrect or invalid review comment.

test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)

1-1: LGTM!

The simplified AdmissionReview response fixture correctly reflects the new admission flow where patches are handled separately. The response structure with allowed: true and status code 200 is appropriate.

test/e2e/pod-scaler/run/consumer.go (1)

71-71: LGTM!

Adding POD_SCALER_MIN_SAMPLES=1 to the test environment correctly enables e2e tests to run with minimal sample data, aligning with the new configurable threshold in resources.go.

cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff (1)

1-90: LGTM!

The test fixture correctly reflects the new resource handling behavior:

  • Memory limit increase (400M → 480M) demonstrates the 1.2x safety buffer
  • CPU limit removal aligns with reconcileLimits() behavior
  • Updated resource values match the new recommendation logic
cmd/pod-scaler/resources.go (5)

42-58: LGTM!

The new constants are well-documented with clear rationale:

  • 3-week recommendation window provides recency without being too narrow
  • Minimum thresholds (10m CPU, 10Mi memory) prevent scheduling issues from near-zero recommendations

106-122: LGTM!

The recency weighting implementation (3x/2x/1x for past week/fortnight/rest) is a sensible approach for prioritizing recent usage patterns while maintaining historical context.
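A minimal sketch of that weighting function (thresholds follow the description above; the real code may differ):

// weightFor returns the 3x/2x/1x multiplier based on sample age within the
// 3-week window. Sketch only.
func weightFor(sampleTime, now time.Time) int {
	switch age := now.Sub(sampleTime); {
	case age <= 7*24*time.Hour:
		return 3 // past week: weighted most heavily
	case age <= 14*24*time.Hour:
		return 2 // past fortnight
	default:
		return 1 // remainder of the window
	}
}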


124-131: LGTM!

The minimum sample validation prevents unreliable recommendations based on insufficient data, with clear debug logging.


144-153: LGTM!

The minimum threshold enforcement correctly prevents near-zero recommendations while preserving appropriate quantity formats.


171-178: LGTM!

The environment variable parsing includes proper validation (positive integer check) with a sensible default fallback.

cmd/pod-scaler/admission.go (6)

43-55: LGTM!

BigQuery client initialization properly validates required parameters and fails fast on configuration errors during startup.


117-126: LGTM!

The measured pod handling flow is well-ordered: classify → add anti-affinity → apply measured resources → standard mutation. This ensures BigQuery-derived resources are applied first, with Prometheus recommendations filling any gaps.


289-317: LGTM!

The gradual reduction logic is well-designed:

  • Division-by-zero protection with theirValue > 0 check
  • 25% max reduction per cycle prevents drastic changes
  • 5% minimum threshold avoids noisy micro-adjustments
  • Clear logging for each decision path

372-401: LGTM!

The measured pod handling in mutatePodResources correctly:

  • Preserves BigQuery-derived resources by skipping Prometheus recommendations when already set
  • Still applies resource caps and limit reconciliation for safety
  • Falls through to normal flow for containers without pre-set resources

68-81: LGTM!

The podMutator struct extension cleanly adds the new fields for measured pod support and authoritative mode configuration.


269-270: Precision loss from float64→int64 conversion is acceptable.

The our.Set(int64(increased)) truncation on line 270 is safe because practical resource values (multiples of 5, 10, 100, etc.) when multiplied by 1.2 typically yield whole numbers. Test cases confirm this: 100 × 1.2 = 120, 200 × 1.2 = 240, 2e10 × 1.2 = 24e9, and so on. For memory (bytes), any minor truncation is negligible; for CPU quantities via resource.Quantity, the API handles these conversions appropriately.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from d70644b to 13781c2 Compare January 9, 2026 15:28
…ed nodes

Implements DPTP-4613 to address the issue where pod-scaler recommendations are
skewed by node contention. When multiple pods with poor CPU configurations are
scheduled on the same node, CPU is maxed out but pods finish eventually. The
pod-scaler observes low CPU utilization (due to node contention) and incorrectly
concludes requests should not be increased, leading to a cycle of reduced limits
and tighter packing.

This change introduces a measured pods system:
- Pods are classified as 'normal' or 'measured' based on whether they need fresh
  measurement data (measured if last measurement >10 days ago or never measured)
- Measured pods use podAntiAffinity rules to run on isolated nodes with no other
  CI workloads, ensuring accurate CPU/memory utilization measurement
- BigQuery integration queries and caches max CPU/memory utilization from measured
  pod runs, refreshing daily to keep data current
- Resource recommendations are applied only to the longest-running container
  in each pod, using actual measured utilization data instead of Prometheus
  metrics that may be skewed by node contention

The feature is opt-in via --enable-measured-pods flag and requires BigQuery
configuration (--bigquery-project-id and --bigquery-dataset-id).
@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch 2 times, most recently from c0b2961 to 2b3e606 Compare January 12, 2026 21:13
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jan 12, 2026

@deepsm007: This pull request references DPTP-4613 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Implements measured pods feature to address CPU bottleneck issues from node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness (>10 days), with measured pods running on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from ci_operator_metrics table, and adds data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to longest-running containers using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.

/cc @openshift/test-platform

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cmd/pod-scaler/producer.go (1)

76-109: Duplicate BigQuery client creation logic.

This code creates a BigQuery client similarly to NewBigQueryClient in measured.go but without the cache and refresh scheduling. Consider reusing NewBigQueryClient or extracting common client creation logic to avoid duplication and inconsistency.

Additionally, the defer bqClient.Close() at line 101 will execute when produce returns, but with execute using TickLiteral (line 84), the function may return before the periodic collection completes, potentially closing the client prematurely.

cmd/pod-scaler/admission.go (1)

321-340: Unused parameters authoritativeCPU and authoritativeMemory.

The authoritativeCPU and authoritativeMemory parameters are added to the function signature but are not used in the function body. This appears to be dead code or incomplete implementation.

#!/bin/bash
# Check if these parameters are used elsewhere or if there's additional implementation
rg -n 'authoritativeCPU|authoritativeMemory' --type=go -C 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 205-206: The containerDurations map is never populated; implement
JSON parsing of the incoming "container_durations" string into
map[string]time.Duration (populate the existing containerDurations variable)
before ApplyMeasuredPodResources is called. Decode the JSON into a temporary
map[string]interface{} or map[string]string, convert each entry to a
time.Duration (use time.ParseDuration for string values and treat numeric values
as seconds), handle and log parsing errors without panicking, and ensure entries
are inserted into containerDurations so ApplyMeasuredPodResources and its
ContainerDurations lookup use the real durations instead of falling back to the
first container.
- Around line 329-331: The code currently overwrites any existing anti-affinity
by assigning requiredTerms to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution;
instead, ensure Affinity and PodAntiAffinity are non-nil, then append
requiredTerms to the existing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
slice (e.g., create the Affinity/PodAntiAffinity objects if nil, then set
RequiredDuringSchedulingIgnoredDuringExecution = append(existingSlice,
requiredTerms...)) so you preserve preexisting rules while adding the pod-scaler
terms.
- Around line 102-136: Add a Close method on BigQueryClient that delegates to
the underlying *bigquery.Client (implement func (bq *BigQueryClient) Close()
error { return bq.client.Close() }) and ensure callers that create the client
via NewBigQueryClient (e.g., where admission.go constructs the client) invoke
bq.Close() during application shutdown or in the relevant cleanup path so the
underlying BigQuery resources are released.
- Around line 342-346: The current check uses pod.Labels[PodScalerLabelKey] and
returns when the label is missing or equals PodScalerLabelValueMeasured, which
incorrectly skips applying historical measurements to normal pods; change the
logic so that pods labeled PodScalerLabelValueMeasured are skipped (to preserve
isolation) and measured data is applied only to pods that are not labeled
measured—i.e., update the condition that uses podScalerLabel, hasLabel,
PodScalerLabelKey and PodScalerLabelValueMeasured to return early when hasLabel
&& podScalerLabel == PodScalerLabelValueMeasured, allowing historical resource
assignment for pods without that label.

In @cmd/pod-scaler/producer.go:
- Around line 411-415: The label selector is incorrect: replace the current
measuredPodSelector value that uses pod_scaler="measured" with the
Prometheus-style label key label_pod_scaler="measured" (underscores replace
hyphens) and keep label_created_by_ci="true"; update measuredPodSelector and any
uses such as podLabelsQuery (which builds kube_pod_labels) so the query becomes
kube_pod_labels with the corrected selector referencing label_pod_scaler and
label_created_by_ci.
- Around line 505-527: The PromQL strings are invalid because they use subquery
timestamp ranges; instead build expressions that QueryRange can sample over
time: replace cpuQuery with a rate expression like
rate(container_cpu_usage_seconds_total{namespace="%s",pod="%s",container!="POD",container!=""}[3m])
(no outer max_over_time or timestamp formatting) and replace memoryQuery with
the raw gauge
container_memory_working_set_bytes{namespace="%s",pod="%s",container!="POD",container!=""}
(again no [start:end] timestamps); keep using client.QueryRange and then compute
the max from the returned samples in Go (the code around cpuResult/memoryResult
processing).
🧹 Nitpick comments (3)
cmd/pod-scaler/measured.go (2)

41-58: Inconsistent JSON tags on struct fields.

MeasuredPodData has a JSON tag on Metadata but not on other exported fields. If this struct is intended for JSON serialization, the other fields should also have JSON tags. If not, consider removing the JSON tag from Metadata for consistency.


263-266: Consider the operational impact of defaulting to "measured" when BigQuery is unavailable.

When bqClient is nil, all pods are classified as "measured" and will require isolated nodes. This could cause scheduling issues if BigQuery connectivity is lost or misconfigured. Consider whether a "normal" default might be safer operationally, or add monitoring/alerting for this condition.

cmd/pod-scaler/measured_test.go (1)

151-210: Tests verify structure existence but not selector content.

The anti-affinity tests check that affinity structures are created but don't verify the actual LabelSelector content (operator type, values). Consider adding assertions to verify the selector correctly uses Exists for measured pods and In with the "measured" value for normal pods.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d70644b and 2b3e606.

📒 Files selected for processing (6)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/pod-scaler/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/admission_test.go
  • cmd/pod-scaler/admission.go
🧬 Code graph analysis (2)
cmd/pod-scaler/producer.go (2)
cmd/pod-scaler/storage.go (1)
  • Cache (24-28)
pkg/metrics/insights.go (1)
  • Context (38-38)
cmd/pod-scaler/admission.go (2)
pkg/results/report.go (1)
  • PodScalerReporter (174-176)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (93-100)
  • NewMeasuredPodCache (69-74)
  • NewBigQueryClient (103-136)
  • ClassifyPod (246-275)
  • AddPodAntiAffinity (280-332)
  • ApplyMeasuredPodResources (337-410)
🔇 Additional comments (10)
cmd/pod-scaler/measured.go (3)

23-39: LGTM - Well-defined constants with clear documentation.

The constants are appropriately named and the comments clearly explain the purpose of each value.


68-90: LGTM - Thread-safe cache implementation.

The cache correctly uses RLock for reads and Lock for writes, and the atomic map replacement in Update is a clean approach for cache refresh.


223-236: LGTM - Correct staleness check logic.

The function properly checks for missing or stale measurement data using the configured retention period.

cmd/pod-scaler/measured_test.go (3)

16-61: LGTM - Good test coverage for ShouldBeMeasured.

The test covers the three key scenarios: no existing data, recent data (within retention period), and stale data (outside retention period).


63-149: LGTM - Comprehensive test coverage for ClassifyPod.

Tests cover the key scenarios including the nil BigQuery client fallback behavior.


212-253: LGTM - Good cache operation tests.

Tests verify Get/Update operations and data retrieval correctness.

cmd/pod-scaler/producer.go (1)

648-682: LGTM - BigQuery write implementation.

The streaming insert approach is appropriate for the expected data volume. Error handling for JSON marshaling failures is reasonable.

cmd/pod-scaler/admission_test.go (1)

557-557: LGTM - Test updated to match new function signature.

The test correctly passes the new authoritativeCPU and authoritativeMemory parameters as false to maintain existing test behavior.

cmd/pod-scaler/admission.go (2)

36-55: LGTM - BigQuery client initialization with proper validation.

The code correctly validates that BigQuery project and dataset IDs are provided when measured pods are enabled, and creates the client with appropriate error handling.


117-126: LGTM - Measured pod handling integrated into admission flow.

The integration correctly orders the operations: classification, anti-affinity, measured resource application, then standard resource mutation.

Comment on lines +102 to +140
// NewBigQueryClient creates a new BigQuery client for measured pods
func NewBigQueryClient(projectID, datasetID, credentialsFile string, cache *MeasuredPodCache, logger *logrus.Entry) (*BigQueryClient, error) {
ctx := interrupts.Context()
var opts []option.ClientOption
if credentialsFile != "" {
opts = append(opts, option.WithCredentialsFile(credentialsFile))
}

client, err := bigquery.NewClient(ctx, projectID, opts...)
if err != nil {
return nil, fmt.Errorf("failed to create BigQuery client: %w", err)
}

bq := &BigQueryClient{
client: client,
projectID: projectID,
datasetID: datasetID,
logger: logger,
cache: cache,
}

// Initial refresh on startup
if err := bq.Refresh(ctx); err != nil {
logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
}

// Schedule daily refresh
interrupts.TickLiteral(func() {
if err := bq.Refresh(interrupts.Context()); err != nil {
logger.WithError(err).Error("Failed to refresh BigQuery data")
}
}, BigQueryRefreshInterval)

return bq, nil
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for BigQuery client Close calls
rg -n 'bqClient.*Close|BigQueryClient.*Close' --type=go

Repository: openshift/ci-tools

Length of output: 117


🏁 Script executed:

# Also search more broadly for Close patterns with bigquery
rg -n 'Close\(\)' cmd/pod-scaler/measured.go

Repository: openshift/ci-tools

Length of output: 44


🏁 Script executed:

# Check the full measured.go file structure
wc -l cmd/pod-scaler/measured.go

Repository: openshift/ci-tools

Length of output: 92


🏁 Script executed:

# Look for the BigQueryClient struct definition and its methods
ast-grep --pattern $'type BigQueryClient struct {
  $$$
}'

Repository: openshift/ci-tools

Length of output: 484


🏁 Script executed:

# Search for cleanup/defer patterns and Close methods in the codebase
rg -n 'defer.*Close|\.Close\(\)' cmd/pod-scaler/ --type=go

Repository: openshift/ci-tools

Length of output: 327


🏁 Script executed:

# Check the context around the Close call in producer.go
sed -n '85,115p' cmd/pod-scaler/producer.go

Repository: openshift/ci-tools

Length of output: 1152


🏁 Script executed:

# Look for NewBigQueryClient calls to understand the ownership flow
rg -n 'NewBigQueryClient' cmd/pod-scaler/ --type=go -B2 -A2

Repository: openshift/ci-tools

Length of output: 953


🏁 Script executed:

# Check the full context in admission.go around the bqClient variable
sed -n '1,60p' cmd/pod-scaler/admission.go

Repository: openshift/ci-tools

Length of output: 2338


🏁 Script executed:

# Search for any Close method on BigQueryClient
rg -n 'func.*BigQueryClient.*Close' --type=go

Repository: openshift/ci-tools

Length of output: 44


🏁 Script executed:

# Look for the bqClient variable scope in admission.go
rg -n 'var bqClient' cmd/pod-scaler/ --type=go -B2 -A5

Repository: openshift/ci-tools

Length of output: 1215


Add a Close method to BigQueryClient to properly clean up the underlying BigQuery client.

The BigQueryClient struct holds a *bigquery.Client but lacks a Close() method to clean it up. In admission.go, the client is created once and never closed, potentially leaving resources open for the application's lifetime. Add a Close() method that delegates to the underlying client's Close():

func (bq *BigQueryClient) Close() error {
	return bq.client.Close()
}

Then call it during shutdown or wrap the creation with proper cleanup.

🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 102 - 136, Add a Close method on
BigQueryClient that delegates to the underlying *bigquery.Client (implement func
(bq *BigQueryClient) Close() error { return bq.client.Close() }) and ensure
callers that create the client via NewBigQueryClient (e.g., where admission.go
constructs the client) invoke bq.Close() during application shutdown or in the
relevant cleanup path so the underlying BigQuery resources are released.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 2b3e606 to 8c9bdcd Compare January 12, 2026 21:35
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/pod-scaler/main.go (1)

120-168: Missing validation for BigQuery configuration when measured pods are enabled.

When enableMeasuredPods is true, the BigQuery fields (bigQueryProjectID, bigQueryDatasetID) should be validated as required. Currently, the feature can be enabled without proper BigQuery configuration, which would cause runtime failures or unexpected behavior (all pods defaulting to "measured" classification).

Proposed fix in validate()
 	case "producer":
+		if o.producerOptions.enableMeasuredPods {
+			if o.producerOptions.bigQueryProjectID == "" {
+				return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
+			}
+			if o.producerOptions.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
+			}
+		}
 		return o.kubernetesOptions.Validate(false)

Add similar validation for consumer.admission mode.

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 329-331: The current assignment to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
overwrites any existing rules; instead, ensure Affinity and PodAntiAffinity are
non-nil, then append requiredTerms to the existing slice (preserving prior
entries) rather than replacing it. Concretely, check/create pod.Spec.Affinity
and pod.Spec.Affinity.PodAntiAffinity if nil, then set
RequiredDuringSchedulingIgnoredDuringExecution = append(existingSlice,
requiredTerms...), handling the case where existingSlice may be nil.
- Around line 263-266: The current else branch defaults shouldBeMeasured = true
when bqClient is nil; change this to default to "normal" by setting
shouldBeMeasured = false, or if the feature flag for BigQuery-based measurement
is enabled, fail-fast by returning an error/denying admission when bqClient ==
nil; update the logic in the function that computes shouldBeMeasured (the block
referencing bqClient and the shouldBeMeasured variable in measured.go) so it
either sets shouldBeMeasured = false when no bqClient is available or returns an
explicit error when the BigQuery measurement feature is enabled but bqClient is
nil.
- Around line 205-214: containerDurations is created empty and never populated
from row.ContainerDurations, so MeasuredPodData.ContainerDurations remains blank
and ApplyMeasuredPodResources cannot find the longest-running container; parse
the JSON string in row.ContainerDurations into a map[string]time.Duration
(handle errors) and populate MeasuredPodData.ContainerDurations before assigning
to data[meta]; update the parsing logic near the containerDurations declaration
and ensure the key names match what ApplyMeasuredPodResources expects (use the
same container name keys) and convert parsed duration values into time.Duration.
- Around line 342-346: The condition currently checks for
PodScalerLabelValueMeasured and returns early, preventing measured data from
being applied; invert this logic to only skip pods labeled as "measured" and
proceed for "normal" pods by changing the check to verify
pod.Labels[PodScalerLabelKey] == PodScalerLabelValueMeasured -> return,
otherwise continue (or explicitly check for PodScalerLabelValueNormal to apply
measured resources). Update the podScalerLabel/hasLabel branch accordingly so
measured pods are skipped and normal-labelled pods receive the cached measured
recommendations.
🧹 Nitpick comments (1)
cmd/pod-scaler/main.go (1)

290-290: Consider using an options struct to reduce parameter count.

The admit() call has 16+ positional parameters, making it difficult to read and prone to ordering errors.

Consider grouping related parameters into a struct:

type admitOptions struct {
    port, healthPort int
    certDir          string
    // ... other fields
    measuredPodsConfig *MeasuredPodsConfig
}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2b3e606 and 8c9bdcd.

📒 Files selected for processing (4)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (1)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (3)
cmd/pod-scaler/measured.go (3)

76-90: LGTM: Thread-safe cache implementation.

The MeasuredPodCache correctly uses RWMutex for concurrent read access while protecting writes. The Update method atomically replaces the entire map, which is appropriate for the refresh pattern.


277-327: Anti-affinity logic for pod isolation is well-structured.

The differentiation between measured pods (avoid ALL pod-scaler labeled pods) and normal pods (only avoid measured pods) correctly implements the isolation requirements. The use of RequiredDuringSchedulingIgnoredDuringExecution ensures hard anti-affinity.

Note: The overwrite issue at line 330 is addressed in a separate comment.


143-164: Verify BigQuery schema before merging.

Confirm that the ci_operator_metrics table contains the columns referenced in the query: max_cpu, max_memory, container_durations, and pod_scaler_label. If the actual schema differs, update the query and any dependent logic accordingly.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 8c9bdcd to 03ff253 Compare January 12, 2026 23:23
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 348-350: The current assignment overwrites any existing
PodAntiAffinity terms; instead of replacing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
with requiredTerms, merge by appending requiredTerms onto the existing slice:
ensure PodAntiAffinity and RequiredDuringSchedulingIgnoredDuringExecution are
non-nil, then set RequiredDuringSchedulingIgnoredDuringExecution =
append(existingSlice, requiredTerms...), so existing rules are preserved and new
ones added.

In @cmd/pod-scaler/producer.go:
- Around line 512-525: The PromQL in cpuQuery and memoryQuery is invalid because
max_over_time was given RFC3339 timestamps instead of a range vector; remove the
max_over_time wrapper and the timestamp window from the selector and instead
query the raw series over time (e.g., use
`rate(container_cpu_usage_seconds_total{namespace="%s",pod="%s",container!="POD",container!=""}[3m])`
for CPU and `container_memory_working_set_bytes{...}` for memory) without
from.Format/until.Format, keep using client.QueryRange, and compute the max from
the returned samples in Go (the code around cpuResult/memoryResult processing);
see the sketch after this list.
- Around line 512-537: The PromQL queries (cpuQuery, memoryQuery,
containerInfoQuery) directly interpolate namespace/podName and are vulnerable to
query failures or injection from special characters; add a helper
escapePromQLValue(s string) that escapes backslashes, double quotes, and
newlines (e.g., replace `\` -> `\\`, `"` -> `\"`, `\n` -> `\\n`) and apply it to
namespace, podName, and any other label values before formatting the PromQL
strings so all fmt.Sprintf calls use escaped values.
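For illustration, a sketch of the suggested query shape, assuming the prometheus/client_golang v1 API (promv1, model) and illustrative variable names:

// Sample a plain rate() expression with QueryRange and compute the peak in Go,
// instead of embedding timestamps in the PromQL itself. Sketch only.
cpuQuery := fmt.Sprintf(
	`rate(container_cpu_usage_seconds_total{namespace=%q,pod=%q,container!="POD",container!=""}[3m])`,
	namespace, podName) // %q quotes and escapes the label values
result, _, err := client.QueryRange(ctx, cpuQuery, promv1.Range{Start: from, End: until, Step: time.Minute})
if err != nil {
	return err
}
var maxCPU float64
if matrix, ok := result.(model.Matrix); ok {
	for _, series := range matrix {
		for _, sample := range series.Values {
			if v := float64(sample.Value); v > maxCPU {
				maxCPU = v // peak CPU usage rate over the window
			}
		}
	}
}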
🧹 Nitpick comments (3)
cmd/pod-scaler/producer.go (1)

547-607: Consider extracting duplicated min/max calculation logic.

The min/max calculation pattern is repeated for CPU (lines 553-579), memory (lines 582-607), node CPU (lines 627-645), and node memory (lines 660-678). This could be simplified with a helper function.

♻️ Example helper
func computeMinMax[T float64 | int64](values []T) (min, max T) {
	if len(values) == 0 {
		return
	}
	min, max = values[0], values[0]
	for _, v := range values[1:] {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	return
}
cmd/pod-scaler/admission.go (2)

36-36: Function signature has 16 parameters; consider using an options struct.

The admit function has grown to 16 parameters, making it difficult to read and maintain. This is a good candidate for consolidation into a configuration struct.

♻️ Example refactor
type admitOptions struct {
	port, healthPort          int
	certDir                   string
	mutateResourceLimits      bool
	cpuCap                    int64
	memoryCap                 string
	cpuPriorityScheduling     int64
	authoritativeCPU          bool
	authoritativeMemory       bool
	enableMeasuredPods        bool
	bigQueryProjectID         string
	bigQueryDatasetID         string
	bigQueryCredentialsFile   string
}

func admit(opts admitOptions, client buildclientv1.BuildV1Interface, loaders map[string][]*cacheReloader, reporter results.PodScalerReporter) {
    // ...
}

321-321: Unused parameters authoritativeCPU and authoritativeMemory add complexity.

These parameters are passed through the entire call chain but explicitly ignored with blank identifiers. If they're planned for future use, consider adding a TODO comment. Otherwise, remove them to simplify the code.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8c9bdcd and 03ff253.

📒 Files selected for processing (4)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/producer.go (2)
cmd/pod-scaler/storage.go (1)
  • Cache (24-28)
pkg/metrics/insights.go (1)
  • Context (38-38)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (3)
  • BigQueryClient (94-101)
  • NewMeasuredPodCache (70-75)
  • NewBigQueryClient (104-137)
🔇 Additional comments (4)
cmd/pod-scaler/admission.go (1)

117-126: LGTM!

The measured pod handling flow is well-structured: classify the pod, add anti-affinity rules based on classification, then apply measured resources before the standard mutation. Each function handles its own early-return logic appropriately.

cmd/pod-scaler/measured.go (3)

103-137: LGTM!

The BigQuery client initialization is well-structured with proper error handling, initial refresh on startup, and scheduled daily refresh. The use of query labels for tracking is a good practice.


204-225: LGTM!

The two-phase JSON parsing for container_durations provides good backwards compatibility, first trying nanosecond integers and falling back to string duration format.
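A minimal sketch of that two-phase decode (helper name and error wrapping are illustrative):

// parseContainerDurations tries nanosecond integers first, then falls back to
// Go duration strings such as "5m30s".
func parseContainerDurations(raw string) (map[string]time.Duration, error) {
	durations := map[string]time.Duration{}
	var asInts map[string]int64
	if err := json.Unmarshal([]byte(raw), &asInts); err == nil {
		for name, ns := range asInts {
			durations[name] = time.Duration(ns)
		}
		return durations, nil
	}
	var asStrings map[string]string
	if err := json.Unmarshal([]byte(raw), &asStrings); err != nil {
		return nil, fmt.Errorf("could not parse container durations: %w", err)
	}
	for name, s := range asStrings {
		d, err := time.ParseDuration(s)
		if err != nil {
			return nil, fmt.Errorf("invalid duration %q for container %s: %w", s, name, err)
		}
		durations[name] = d
	}
	return durations, nil
}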


282-285: This behavior is intentional and tested—no issue here.

The code defaults to measuring all pods when BigQuery is unavailable by design. This is a conservative safety default: if BigQuery isn't configured, measuring everything ensures no pods slip through unmeasured. The comment at line 283 documents this intent, and the test suite explicitly verifies this behavior with "Pod with nil BigQuery client - should default to measured."

If this causes scheduling concerns in production, the issue is configuration: ensure enable-measured-pods is only used with valid BigQuery credentials, or disable the feature entirely.

Likely an incorrect or invalid review comment.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 03ff253 to 41d0e12 Compare January 13, 2026 02:47
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/producer.go:
- Around line 101-123: The BigQuery client bqClient is closed via defer inside
produce(), which returns immediately in non-once mode causing scheduled ticks
(execute/interrupts.TickLiteral) to use a closed client; instead register a
shutdown cleanup with interrupts.Run to close bqClient when the global interrupt
context is cancelled (use interrupts.Context()/interrupts.Run), or move bqClient
lifecycle to the caller so it outlives produce(); ensure the
collectMeasuredPodMetrics calls continue to use the live bqClient until the
registered shutdown handler closes it.
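One way to express the suggested lifecycle fix, assuming interrupts.Run cancels its context on shutdown as the review describes (sketch only):

// Close the BigQuery client when the global interrupt context is cancelled,
// rather than when produce() returns.
interrupts.Run(func(ctx context.Context) {
	<-ctx.Done()
	if err := bqClient.Close(); err != nil {
		logger.WithError(err).Warn("Failed to close BigQuery client")
	}
})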
🧹 Nitpick comments (7)
cmd/pod-scaler/main.go (3)

74-81: Unused fields in consumerOptions.

The consumerOptions struct defines enableMeasuredPods, bigQueryProjectID, bigQueryDatasetID, and bigQueryCredentialsFile, but mainAdmission on line 287 uses producerOptions fields instead. Either remove these fields from consumerOptions or update mainAdmission to use the consumer-specific options.


88-93: Flags only bind to producerOptions.

These flags only bind to producerOptions, meaning consumerOptions fields remain empty. If consumer mode should support different BigQuery settings, add separate flag bindings. Otherwise, remove the duplicate fields from consumerOptions.


286-288: Clarify why producerOptions is used for consumer mode.

The comment explains the intent, but it's inconsistent with having separate consumerOptions fields. Consider removing the duplicate fields from consumerOptions to avoid confusion.

cmd/pod-scaler/producer.go (2)

567-626: Consider extracting min/max computation into a helper function.

The min/max computation logic is duplicated for CPU (lines 572-598) and memory (lines 600-626). Extracting this into a generic helper would reduce duplication.


800-800: Unused parameter projectID.

The projectID parameter is named with a blank identifier _, indicating it's unused. Either remove it from the signature or use it in the function body.

Suggested fix
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
+func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {

And update the call site on line 480:

-if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil {
+if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {
cmd/pod-scaler/admission.go (2)

36-36: Consider using a config struct to reduce parameter count.

The admit function now has 16 parameters, making it hard to maintain. Consider grouping related parameters into configuration structs.


321-321: Unused authoritative parameters.

authoritativeCPU and authoritativeMemory parameters are declared with blank identifiers, indicating they're unused. If these are planned for future use, consider adding a TODO comment; otherwise, remove them from the signature.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 03ff253 and 41d0e12.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (3)
cmd/pod-scaler/producer.go (2)
cmd/pod-scaler/storage.go (1)
  • Cache (24-28)
pkg/metrics/insights.go (1)
  • Context (38-38)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (94-101)
  • NewMeasuredPodCache (70-75)
  • NewBigQueryClient (104-137)
  • ClassifyPod (265-295)
  • AddPodAntiAffinity (300-357)
  • ApplyMeasuredPodResources (363-437)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (2)
  • Metadata (161-166)
  • ResourceList (261-261)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (17)
cmd/pod-scaler/measured_test.go (4)

16-61: LGTM! Good coverage of ShouldBeMeasured edge cases.

The test correctly validates the three key scenarios: no data exists (should measure), recent data within 10 days (should not measure), and stale data beyond 10 days (should measure).


63-149: LGTM! ClassifyPod test cases are well-designed.

Good coverage of pod classification logic including the nil BigQuery client fallback to normal classification, which prevents overwhelming isolated nodes when BigQuery is unavailable.


151-210: LGTM! Anti-affinity tests verify scheduling isolation rules.

The test correctly validates that measured pods avoid all pod-scaler labeled pods, normal pods avoid only measured pods, and unlabeled pods receive no anti-affinity rules.


212-253: LGTM! Cache tests cover basic Get/Update operations.

The test validates cache miss, update, and subsequent cache hit with correct data retrieval.

cmd/pod-scaler/main.go (1)

262-263: LGTM! Producer options correctly passed to produce function.

The measured pods configuration is properly propagated.

cmd/pod-scaler/producer.go (3)

42-54: LGTM! PromQL label value escaping is thorough.

The function correctly escapes backslashes first (before quotes), then quotes, newlines, and carriage returns to prevent PromQL injection.
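A helper along these lines would implement the described escaping; the name escapePromQLValue follows the earlier review prompt, and the order (backslashes first) matters:

// escapePromQLValue escapes a value for safe interpolation into a PromQL
// label matcher. Sketch of the described approach.
func escapePromQLValue(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`) // backslashes first, before quotes
	s = strings.ReplaceAll(s, `"`, `\"`)
	s = strings.ReplaceAll(s, "\n", `\n`)
	s = strings.ReplaceAll(s, "\r", `\r`)
	return s
}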


830-832: Verify BigQuery column alignment.

Created and LastMeasured both use pod.Timestamp. Ensure the BigQuery table schema has both columns and they should have the same value for initial writes.


426-427: Prometheus selector requires a label that doesn't appear to be set anywhere.

The selector {pod_scaler="measured",label_created_by_ci="true"} requires the label_created_by_ci label, but this label is not referenced anywhere else in the codebase—it's not defined as a constant, set on pods, or used in the BigQuery query that sources measured pod data (which only checks pod_scaler_label = 'measured'). This suggests the label may never be applied to measured pods, making the selector ineffective or incomplete. Verify whether label_created_by_ci should be removed from the selector or if the label application logic is missing.

cmd/pod-scaler/admission.go (2)

43-55: LGTM! BigQuery client initialization with proper validation.

The code correctly validates that BigQuery project and dataset IDs are provided when measured pods are enabled, and handles initialization failures appropriately.


117-126: LGTM! Measured pod handling correctly integrated into admission flow.

The ordering is correct: classify first, then add anti-affinity, then apply measured resources before regular resource mutation.

cmd/pod-scaler/measured.go (7)

24-40: LGTM! Well-documented constants with clear purpose.

The constants clearly define the pod classification labels, data retention period (10 days), and refresh interval (24 hours).


61-91: LGTM! Thread-safe cache implementation.

The cache correctly uses RWMutex for read/write synchronization. The Update method replaces the entire map, which is appropriate for a full cache refresh from BigQuery.
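A minimal sketch of the Get/Update pattern being praised here (key type and field names are simplified):

type measuredPodCache struct {
	mu   sync.RWMutex
	data map[string]MeasuredPodData
}

// Get takes a read lock so concurrent admissions can look up entries safely.
func (c *measuredPodCache) Get(key string) (MeasuredPodData, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	d, ok := c.data[key]
	return d, ok
}

// Update swaps in the freshly built map atomically under the write lock.
func (c *measuredPodCache) Update(fresh map[string]MeasuredPodData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = fresh
}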


204-225: LGTM! Robust JSON parsing with fallback.

The code handles both int64 (nanoseconds) and string duration formats for backwards compatibility when parsing container durations from BigQuery.


265-295: LGTM! ClassifyPod correctly handles all scenarios.

The function properly initializes labels if nil, checks all containers for measurement needs, and defaults to normal when BigQuery is unavailable to prevent overwhelming isolated nodes.


300-357: LGTM! Anti-affinity rules correctly implemented.

Measured pods get complete isolation by avoiding all pod-scaler labeled pods, while normal pods only avoid measured pods. The code properly merges with existing anti-affinity terms rather than overwriting.
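The merge-rather-than-overwrite step looks roughly like this (requiredTerms stands in for the terms built earlier; sketch only):

// Ensure the affinity structs exist, then append so anti-affinity terms set by
// other components are preserved.
if pod.Spec.Affinity == nil {
	pod.Spec.Affinity = &corev1.Affinity{}
}
if pod.Spec.Affinity.PodAntiAffinity == nil {
	pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
}
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(
	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
	requiredTerms...,
)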


363-437: LGTM! Resource application logic is well-designed.

The function correctly:

  1. Only applies to "normal" pods (those with measured data, not those being measured)
  2. Finds the longest-running container as the main workload
  3. Falls back to first container if no duration data
  4. Applies 20% safety buffer to measured values
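A condensed sketch of points 2-4 (field names such as ContainerDurations and MaxCPUMilli are illustrative stand-ins for the cached data):

// Pick the container with the longest measured runtime, falling back to the
// first container, and give it the measured peak plus a 20% buffer.
var longest *corev1.Container
var longestDuration time.Duration
for i := range pod.Spec.Containers {
	if d, ok := data.ContainerDurations[pod.Spec.Containers[i].Name]; ok && d > longestDuration {
		longest, longestDuration = &pod.Spec.Containers[i], d
	}
}
if longest == nil && len(pod.Spec.Containers) > 0 {
	longest = &pod.Spec.Containers[0]
}
if longest != nil {
	if longest.Resources.Requests == nil {
		longest.Resources.Requests = corev1.ResourceList{}
	}
	buffered := int64(float64(data.MaxCPUMilli) * 1.2) // 20% safety buffer
	longest.Resources.Requests[corev1.ResourceCPU] = *resource.NewMilliQuantity(buffered, resource.DecimalSI)
}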

195-202: Potential metadata mismatch between cache key and lookup.

The FullMetadata reconstructed from BigQuery only populates Org, Repo, Branch, and Container. However, podscaler.MetadataFor (used in ClassifyPod and ApplyMeasuredPodResources) may also populate additional fields like Target, Step, and Pod. If the cache lookup mechanism compares complete metadata objects rather than just the key fields, this could result in cache misses when the same logical pod metric is queried multiple times.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch 2 times, most recently from 1fd5ab9 to 515ce35 Compare January 13, 2026 13:39
@deepsm007
Copy link
Contributor Author

/pipeline required

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 195-202: The FullMetadata constructed in measured.go (meta :=
podscaler.FullMetadata {...}) only sets Metadata and Container, causing cache
key mismatches with MetadataFor/ShouldBeMeasured which expect Pod and
Target/Step; update the BigQuery SELECT and the code that builds meta to include
and assign Target and Pod (and Step if available) from the query result (or
modify the query to join the pod label data) so the FullMetadata keys stored
from BigQuery match the fully-populated metadata used during admission (refer to
the meta variable construction and the MetadataFor/ShouldBeMeasured code paths).

In @cmd/pod-scaler/producer.go:
- Around line 654-656: The node-level CPU query currently includes a namespace
filter which causes it to miss workloads from other namespaces; update the
PromQL used to build nodeCPUQuery (and the similar query referenced around line
689) to remove the namespace selector so it only filters by node and container
(e.g., container!="POD",container!="" and node="%s") thereby capturing total CPU
usage on the node for isolation validation; locate the string construction that
sets nodeCPUQuery and the analogous variable near line 689 and remove the
namespace="%s" clause from the metric selector.
- Around line 442-446: The PromQL selector used for the kube_pod_labels query is
wrong: change the measuredPodSelector value from
`{pod_scaler="measured",label_created_by_ci="true"}` to use the label_ prefix
that kube-state-metrics exposes, e.g.
`{label_pod_scaler="measured",label_created_by_ci="true"}`, so the
podLabelsQuery (fmt.Sprintf(`kube_pod_labels%s`, measuredPodSelector)) will
match the actual kube_pod_labels metric labels.
🧹 Nitpick comments (5)
cmd/pod-scaler/measured_test.go (1)

151-210: Consider verifying the anti-affinity term details.

The test validates that anti-affinity is set, but doesn't assert on the specific label selector or operator used. This could miss regressions where the anti-affinity logic changes.

♻️ Optional: Add assertions for anti-affinity term details
 	// Test case 1: Measured pod should avoid all pod-scaler labeled pods
 	measuredPod := &corev1.Pod{
 		ObjectMeta: metav1.ObjectMeta{
 			Name: "measured-pod",
 			Labels: map[string]string{
 				PodScalerLabelKey: PodScalerLabelValueMeasured,
 			},
 		},
 		Spec: corev1.PodSpec{},
 	}

 	AddPodAntiAffinity(measuredPod, logger)
 	if measuredPod.Spec.Affinity == nil {
 		t.Fatal("Expected affinity to be set")
 	}
 	if measuredPod.Spec.Affinity.PodAntiAffinity == nil {
 		t.Fatal("Expected podAntiAffinity to be set")
 	}
-	if len(measuredPod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution) == 0 {
-		t.Fatal("Expected required anti-affinity terms to be set")
-	}
+	terms := measuredPod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
+	if len(terms) == 0 {
+		t.Fatal("Expected required anti-affinity terms to be set")
+	}
+	// Verify measured pods avoid ALL pod-scaler labeled pods (OpExists)
+	if terms[0].LabelSelector.MatchExpressions[0].Operator != metav1.LabelSelectorOpExists {
+		t.Errorf("Expected OpExists for measured pod, got %v", terms[0].LabelSelector.MatchExpressions[0].Operator)
+	}
cmd/pod-scaler/main.go (1)

81-86: Add validation for BigQuery flags when measured pods is enabled.

The flag definitions are correct, but validation that bigquery-project-id and bigquery-dataset-id are required when enable-measured-pods is true should be in validate(), not deferred to runtime in admission.go. Early validation provides better UX.

♻️ Add validation in options.validate()

Add this check in the validate() function around line 157:

if o.producerOptions.enableMeasuredPods {
	if o.producerOptions.bigQueryProjectID == "" {
		return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
	}
	if o.producerOptions.bigQueryDatasetID == "" {
		return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
	}
}
cmd/pod-scaler/producer.go (1)

816-856: Unused parameter projectID in writeMeasuredPodsToBigQuery.

The projectID parameter is explicitly ignored with _. If it's not needed (since bqClient already has the project context), consider removing it from the signature to reduce confusion.

♻️ Remove unused parameter
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
+func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {

And update the call site at line 496:

-		if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil {
+		if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {
cmd/pod-scaler/admission.go (1)

321-321: Unused parameters authoritativeCPU and authoritativeMemory.

These parameters are passed through but not used in the function body. The summary mentions they were added as new fields, but the implementation ignores them with _. If these are placeholders for future work, consider adding a TODO comment.

♻️ Add TODO comment or remove unused parameters

If these are intended for future use:

func mutatePodResources(pod *corev1.Pod, server *resourceServer, mutateResourceLimits bool, cpuCap int64, memoryCap string, authoritativeCPU, authoritativeMemory bool /* TODO: implement authoritative mode */, reporter results.PodScalerReporter, logger *logrus.Entry) {

Otherwise, consider removing them until they're needed.

cmd/pod-scaler/measured.go (1)

398-406: Fallback to first container may not be ideal.

When no duration data exists, the code falls back to the first container. However, if the first container happens to be a sidecar (e.g., istio-proxy), resources would be applied to the wrong container. Consider logging a warning when using the fallback.

♻️ Add warning log for fallback case
 	// If we don't have duration data, just use the first container as a fallback.
 	if longestContainer == nil && len(pod.Spec.Containers) > 0 {
 		longestContainer = &pod.Spec.Containers[0]
+		logger.Warnf("No duration data available, falling back to first container: %s", longestContainer.Name)
 	}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 41d0e12 and 515ce35.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (4)
cmd/pod-scaler/producer.go (2)
cmd/pod-scaler/storage.go (1)
  • Cache (24-28)
pkg/metrics/insights.go (1)
  • Context (38-38)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (94-101)
  • NewMeasuredPodCache (70-75)
  • NewBigQueryClient (104-137)
  • ClassifyPod (265-295)
  • AddPodAntiAffinity (300-357)
  • ApplyMeasuredPodResources (363-437)
cmd/pod-scaler/measured_test.go (2)
cmd/pod-scaler/measured.go (8)
  • NewMeasuredPodCache (70-75)
  • BigQueryClient (94-101)
  • MeasuredPodData (44-59)
  • ClassifyPod (265-295)
  • PodScalerLabelKey (27-27)
  • PodScalerLabelValueMeasured (33-33)
  • PodScalerLabelValueNormal (30-30)
  • AddPodAntiAffinity (300-357)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (15)
cmd/pod-scaler/measured_test.go (3)

16-61: LGTM - Good test coverage for ShouldBeMeasured.

The test covers the three key scenarios: no data (should measure), recent data (should not measure), and stale data (should measure). The time boundaries align with the 10-day retention constant.


63-149: LGTM - ClassifyPod test covers essential paths.

Test cases appropriately cover:

  1. Pod with no cached data → measured classification
  2. Pod with recent data → normal classification
  3. Nil BigQuery client → defaults to normal (graceful degradation)

The use of MetadataFor in test case 2 ensures the metadata construction matches what happens in production.


212-253: LGTM - Cache test covers basic Get/Update operations.

The test validates cache miss, update, and retrieval with correct field values.

cmd/pod-scaler/main.go (3)

54-62: LGTM - producerOptions struct extended appropriately.

The new fields for measured pods configuration are logically grouped with other producer settings.


255-256: LGTM - Producer invocation updated with new parameters.

The parameters are passed correctly to the produce function.


279-281: LGTM - Admission invocation updated with measured pods config.

The comment clarifies the shared use of producerOptions, and the parameters are passed correctly.

cmd/pod-scaler/producer.go (2)

42-54: LGTM - PromQL escaping prevents injection.

The escaping handles the critical characters (backslash, quote, newline, carriage return) in the correct order.


101-139: LGTM - BigQuery client lifecycle is well-managed.

Good use of sync.Once for cleanup and interrupts.OnInterrupt for graceful shutdown. The comment explains the rationale for closing in once mode vs non-once mode.
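
For readers unfamiliar with the pattern being praised here, a minimal, self-contained sketch of "close exactly once, whether on interrupt or after once-mode work" follows. It uses plain signal handling instead of the repo's interrupts helper, and the closer type and names are illustrative, not taken from the PR.

package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// closer stands in for the BigQuery client; only its Close method matters here.
type closer struct{ name string }

func (c *closer) Close() error {
	fmt.Println("closed", c.name)
	return nil
}

func main() {
	client := &closer{name: "bigquery"}

	// Ensure Close runs exactly once, whether we exit after once-mode work
	// or because an interrupt signal arrived first.
	var once sync.Once
	closeClient := func() { once.Do(func() { _ = client.Close() }) }

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigs
		closeClient()
		os.Exit(0)
	}()

	// ... once-mode work would happen here ...

	closeClient() // safe to call again even if the signal handler already ran
}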

cmd/pod-scaler/admission.go (2)

43-55: LGTM - BigQuery client initialization with proper validation.

The validation for required flags when enableMeasuredPods is true is appropriate here as a fatal error during admission startup.


117-126: LGTM - Measured pod logic integration is well-structured.

The order of operations is correct:

  1. Classify pod (set label)
  2. Add anti-affinity (based on label)
  3. Apply measured resources (for "normal" pods with data)
  4. Then apply regular resource mutation

This ensures measured data takes precedence before histogram-based recommendations are applied.

cmd/pod-scaler/measured.go (5)

24-40: LGTM - Well-documented constants.

The constants are appropriately named and documented with clear explanations of their purpose.


85-91: Cache.Update replaces the entire map atomically.

The current implementation replaces the entire cache map on update. This is safe but means any concurrent Get calls see either the old or new data entirely. If incremental updates are needed in the future, consider a merge strategy.
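
As an illustration of the swap-the-whole-map approach (not the PR's actual types), a minimal RWMutex-guarded cache might look like this:

package main

import (
	"fmt"
	"sync"
	"time"
)

// measurement is a stand-in for the cached per-pod data; fields are illustrative.
type measurement struct {
	MaxCPU       float64
	MaxMemory    float64
	LastMeasured time.Time
}

// cache guards its map with an RWMutex; Update swaps the whole map, so readers
// always see either the previous snapshot or the new one, never a partial merge.
type cache struct {
	mu   sync.RWMutex
	data map[string]measurement
}

func (c *cache) Get(key string) (measurement, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	m, ok := c.data[key]
	return m, ok
}

func (c *cache) Update(data map[string]measurement) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = data
}

func main() {
	c := &cache{data: map[string]measurement{}}
	c.Update(map[string]measurement{"org/repo@branch/target": {MaxCPU: 1.5, LastMeasured: time.Now()}})
	if m, ok := c.Get("org/repo@branch/target"); ok {
		fmt.Println(m.MaxCPU)
	}
}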


146-163: BigQuery query uses parameterized values safely.

The query uses fmt.Sprintf with values that come from struct fields (projectID, datasetID) and constants (MeasuredPodDataRetentionDays), not user input. This is safe from injection.


349-356: LGTM - Anti-affinity terms are merged correctly.

The implementation properly appends to existing terms rather than overwriting, preserving any pre-existing anti-affinity rules on the pod.


422-428: LGTM - Resource application with 20% buffer.

The 20% buffer on measured CPU/memory aligns with the existing useOursIfLarger logic in admission.go (line 246), providing consistency in safety margins.

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 515ce35 to 8fe3660 Compare January 13, 2026 15:19
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @cmd/pod-scaler/producer.go:
- Around line 42-54: The escapePromQLLabelValue function currently escapes
backslashes, double quotes, newlines and carriage returns but omits tab
characters; update escapePromQLLabelValue to also replace tab characters by
adding a replacement of "\t" with `\t` (ensure this replacement is applied
alongside the existing strings.ReplaceAll calls in the function so tabs are
properly escaped in PromQL label values).
🧹 Nitpick comments (6)
cmd/pod-scaler/main.go (1)

81-86: Consider adding validation for BigQuery flags when measured pods are enabled.

The flag bindings are correct, but there's no validation in validate() to ensure bigquery-project-id and bigquery-dataset-id are provided when enable-measured-pods is true for producer mode. The admission mode validates this (in admission.go line 45-47), but the producer mode relies on the condition check in producer.go line 104 which silently skips BigQuery setup if IDs are empty.

This asymmetry could lead to confusion where measured pods is "enabled" but silently does nothing in producer mode.

Suggested validation in validate()
 func (o *options) validate() error {
 	switch o.mode {
 	case "producer":
-		return o.kubernetesOptions.Validate(false)
+		if err := o.kubernetesOptions.Validate(false); err != nil {
+			return err
+		}
+		if o.producerOptions.enableMeasuredPods {
+			if o.producerOptions.bigQueryProjectID == "" || o.producerOptions.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-project-id and --bigquery-dataset-id are required when --enable-measured-pods is true")
+			}
+		}
 	case "consumer.ui":
cmd/pod-scaler/producer.go (2)

529-800: Consider extracting helper functions from this large function.

extractMeasuredPodData is ~270 lines and handles multiple responsibilities:

  1. Pod metadata extraction
  2. CPU metrics collection and min/max calculation
  3. Memory metrics collection and min/max calculation
  4. Container duration extraction
  5. Node-level validation metrics
  6. Record assembly

The min/max calculation loops (lines 591-617 and 619-645) are nearly identical and could be extracted.

Example helper for min/max extraction
// findMinMax extracts min and max values from a Prometheus matrix for a given value extractor
func findMinMax[T constraints.Ordered](matrix model.Matrix, containerKey string, getValue func(model.SampleValue) T) (map[string]T, map[string]T) {
	minByContainer := make(map[string]T)
	maxByContainer := make(map[string]T)
	for _, stream := range matrix {
		container := string(stream.Metric[containerKey])
		var minVal, maxVal T
		first := true
		for _, sample := range stream.Values {
			val := getValue(sample.Value)
			if first {
				minVal, maxVal = val, val
				first = false
			} else {
				if val < minVal {
					minVal = val
				}
				if val > maxVal {
					maxVal = val
				}
			}
		}
		// Only record if we have data
		if !first {
			minByContainer[container] = minVal
			maxByContainer[container] = maxVal
		}
	}
	return minByContainer, maxByContainer
}

824-866: Unused parameter should be removed.

The projectID parameter is explicitly ignored with _. If it's not needed, remove it from the function signature.

Suggested fix
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
+func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {

And update the call site at line 497:

-		if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil {
+		if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {
cmd/pod-scaler/admission.go (1)

36-66: Function has too many parameters - consider using an options struct.

The admit() function now has 16 parameters, which makes it difficult to maintain and call correctly. Consider grouping related parameters into configuration structs.

Suggested refactor using options struct
type admitOptions struct {
	port                        int
	healthPort                  int
	certDir                     string
	mutateResourceLimits        bool
	cpuCap                      int64
	memoryCap                   string
	cpuPriorityScheduling       int64
	authoritativeCPURequests    bool
	authoritativeMemoryRequests bool
	enableMeasuredPods          bool
	bigQueryProjectID           string
	bigQueryDatasetID           string
	bigQueryCredentialsFile     string
}

func admit(opts admitOptions, client buildclientv1.BuildV1Interface, loaders map[string][]*cacheReloader, reporter results.PodScalerReporter) {
    // ...
}
cmd/pod-scaler/measured.go (2)

146-165: Use parameterized queries to prevent SQL injection risks.

While the values being interpolated (projectID, datasetID, MeasuredPodDataRetentionDays) are from trusted configuration, using string interpolation in SQL queries is a risky pattern. BigQuery supports parameterized queries that would be safer.

Suggested safer query construction
-	query := bq.client.Query(fmt.Sprintf(`
+	queryStr := `
 		SELECT
 			org,
 			repo,
 			branch,
 			target,
 			container,
 			pod_name,
 			MAX(max_cpu) as max_cpu,
 			MAX(max_memory) as max_memory,
 			MAX(created) as last_measured,
 			ANY_VALUE(container_durations) as container_durations
 		FROM
-			`+"`%s.%s.ci_operator_metrics`"+`
+			` + fmt.Sprintf("`%s.%s.ci_operator_metrics`", bq.projectID, bq.datasetID) + `
 		WHERE
 			pod_scaler_label = 'measured'
-			AND created >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL %d DAY)
+			AND created >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @retention_days DAY)
 		GROUP BY
 			org, repo, branch, target, container, pod_name
-	`, bq.projectID, bq.datasetID, MeasuredPodDataRetentionDays))
+	`
+	query := bq.client.Query(queryStr)
+	query.Parameters = []bigquery.QueryParameter{
+		{Name: "retention_days", Value: MeasuredPodDataRetentionDays},
+	}

Note: Table names cannot be parameterized in BigQuery, so the project/dataset interpolation must remain as-is.


404-421: Fallback to first container when no duration data is reasonable.

If no container duration data exists, using the first container as fallback is a sensible default. However, this means the first container might get resource increases even if it's a sidecar.

Consider logging at info/warn level when falling back, so operators can identify pods with missing duration data.

Suggested improvement
 	// If we don't have duration data, just use the first container as a fallback.
 	if longestContainer == nil && len(pod.Spec.Containers) > 0 {
 		longestContainer = &pod.Spec.Containers[0]
+		logger.Infof("No container duration data available, falling back to first container: %s", longestContainer.Name)
 	}
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 515ce35 and 8fe3660.

📒 Files selected for processing (5)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
✅ Files skipped from review due to trivial changes (1)
  • cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
  • BigQueryClient (94-101)
  • NewMeasuredPodCache (70-75)
  • NewBigQueryClient (104-137)
  • ClassifyPod (271-301)
  • AddPodAntiAffinity (306-363)
  • ApplyMeasuredPodResources (369-443)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (15)
cmd/pod-scaler/main.go (3)

54-62: New BigQuery configuration fields look correct.

The producerOptions struct is cleanly extended with the measured pods configuration. The field naming is consistent and clear.


255-256: Function call updated correctly for new parameters.

The produce() invocation properly passes all new BigQuery configuration fields from producerOptions.


279-281: Admission wiring is correct.

The admit() call correctly passes the measured pods configuration including BigQuery credentials. The comment explaining shared use of producerOptions is helpful.

cmd/pod-scaler/producer.go (2)

101-139: BigQuery client lifecycle management is well-designed.

The use of sync.Once for client cleanup and registering with interrupts.OnInterrupt ensures proper resource management. The handling of both once-mode and continuous-mode is correct.


433-506: Measured pod collection logic is sound.

The function correctly:

  • Uses a 4-hour query window for recently completed pods
  • Deduplicates pods by namespace/name
  • Handles errors gracefully per-cluster without failing entirely
  • Batches all data before writing to BigQuery

One observation: the query window of 4 hours might miss pods that ran longer if the collection runs every 2 hours. This seems intentional for catching "recently completed" pods.
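
A small sketch of the namespace/name deduplication step mentioned above, using an illustrative podRef type rather than the PR's actual data structures:

package main

import "fmt"

// podRef is an illustrative key type; the real code works with richer pod data.
type podRef struct{ Namespace, Name string }

// dedupe keeps the first occurrence of each namespace/name pair, which is the
// shape of the deduplication step described above.
func dedupe(pods []podRef) []podRef {
	seen := map[string]bool{}
	var out []podRef
	for _, p := range pods {
		key := p.Namespace + "/" + p.Name
		if seen[key] {
			continue
		}
		seen[key] = true
		out = append(out, p)
	}
	return out
}

func main() {
	fmt.Println(dedupe([]podRef{{"ci-op-1", "src"}, {"ci-op-1", "src"}, {"ci-op-2", "unit"}}))
}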

cmd/pod-scaler/admission.go (3)

43-55: BigQuery client initialization with validation is correct.

The validation ensures BigQuery IDs are provided when measured pods is enabled, and properly creates the client with cache. Good error handling with logrus.Fatal for configuration errors.


117-127: Measured pod handling flow is well-structured.

The order of operations is correct:

  1. ClassifyPod - determine if pod needs measurement
  2. AddPodAntiAffinity - set scheduling constraints
  3. ApplyMeasuredPodResources - apply measured resource requests
  4. mutatePodResources - apply regular pod-scaler recommendations

This ensures measured data is applied before the regular mutation logic.


321-340: Remove unused parameters authoritativeCPU and authoritativeMemory from the function signature.

These parameters are passed to the function at line 126 but explicitly ignored with _ comments. Either remove them from the function signature and update the call site, or implement their intended functionality if needed.

cmd/pod-scaler/measured.go (7)

24-40: Constants are well-documented and appropriately scoped.

The constants clearly explain their purpose. The 10-day retention and 24-hour refresh interval seem reasonable for the use case.


61-91: Thread-safe cache implementation is correct.

Good use of sync.RWMutex for concurrent access. The Update() method atomically replaces the entire map, which is safe.


103-137: BigQuery client initialization starts refresh in constructor.

The constructor performs an initial refresh and schedules periodic refreshes. Note that if the initial refresh fails, it only logs a warning and continues - this is intentional to allow startup even if BigQuery is temporarily unavailable, but it means the cache will be empty until the next successful refresh.


248-261: ShouldBeMeasured logic is clear and correct.

Returns true if:

  1. No existing measurement data exists, OR
  2. The last measurement is older than 10 days

This aligns with the PR objective of re-measuring stale data.
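
Restated as a standalone sketch (the retention constant and function shape are assumptions based on the description above, not the PR's code):

package main

import (
	"fmt"
	"time"
)

// retentionDays mirrors the 10-day freshness window described above (assumed value).
const retentionDays = 10

// shouldBeMeasured restates the rule: measure when there is no prior data, or
// when the last measurement is older than the retention window.
func shouldBeMeasured(hasData bool, lastMeasured, now time.Time) bool {
	if !hasData {
		return true
	}
	cutoff := now.Add(-retentionDays * 24 * time.Hour)
	return lastMeasured.Before(cutoff)
}

func main() {
	now := time.Now()
	fmt.Println(shouldBeMeasured(false, time.Time{}, now))              // true: never measured
	fmt.Println(shouldBeMeasured(true, now.Add(-2*24*time.Hour), now))  // false: fresh data
	fmt.Println(shouldBeMeasured(true, now.Add(-11*24*time.Hour), now)) // true: stale data
}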


306-363: Anti-affinity rules correctly implement isolation requirements.

The logic is sound:

  • Measured pods avoid ALL pods with the pod-scaler label (complete isolation)
  • Normal pods avoid only measured pods (allows normal pods to share nodes)
  • Existing anti-affinity terms are preserved via merge

This ensures measured pods get dedicated nodes for accurate resource measurement.
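
A hedged sketch of the two selector shapes implied above, with Exists for measured pods and In("measured") for normal pods; the label key, values, and topology key are assumptions, not the PR's constants:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Assumed label key and value; the real constants live in measured.go.
const (
	labelKey      = "pod-scaler"
	valueMeasured = "measured"
)

// antiAffinityTerm builds the required term described above: measured pods repel
// every pod carrying the pod-scaler label (Exists), while normal pods repel only
// pods labeled "measured" (In).
func antiAffinityTerm(measured bool) corev1.PodAffinityTerm {
	expr := metav1.LabelSelectorRequirement{Key: labelKey, Operator: metav1.LabelSelectorOpIn, Values: []string{valueMeasured}}
	if measured {
		expr = metav1.LabelSelectorRequirement{Key: labelKey, Operator: metav1.LabelSelectorOpExists}
	}
	return corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{MatchExpressions: []metav1.LabelSelectorRequirement{expr}},
		TopologyKey:   "kubernetes.io/hostname",
	}
}

func main() {
	fmt.Println(antiAffinityTerm(true).LabelSelector.MatchExpressions[0].Operator)
	fmt.Println(antiAffinityTerm(false).LabelSelector.MatchExpressions[0].Operator)
}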


386-402: Loop continues after finding a container with longer duration.

The loop iterates through all containers even after finding a longer-running one. This is correct behavior since we need to check all containers to find the actual longest, but the early continue on line 392 when !exists is efficient.


428-442: Resource application with 20% safety buffer is consistent with existing code.

The 20% buffer (multiplier of 1.2) matches the existing pattern in useOursIfLarger (admission.go line 246). Good consistency in approach.
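
For illustration, applying a 20% buffer to measured values with resource.Quantity might look like the following; the inputs and the helper itself are hypothetical, not the PR's implementation:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// withBuffer applies the 20% safety margin before the values are set as requests.
// The inputs (milliCPU and bytes) and the helper itself are illustrative.
func withBuffer(measuredMilliCPU, measuredMemoryBytes int64) (cpu, mem resource.Quantity) {
	cpu = *resource.NewMilliQuantity(measuredMilliCPU*120/100, resource.DecimalSI)
	mem = *resource.NewQuantity(measuredMemoryBytes*120/100, resource.BinarySI)
	return cpu, mem
}

func main() {
	cpu, mem := withBuffer(2000, 4*1024*1024*1024)
	fmt.Println(cpu.String(), mem.String()) // 2400m and the buffered byte count
}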

Comment on lines +42 to +53
// escapePromQLLabelValue escapes special characters in PromQL label values.
// PromQL label values in selectors must be quoted and have backslashes and quotes escaped.
func escapePromQLLabelValue(value string) string {
// Replace backslashes first (before replacing quotes)
value = strings.ReplaceAll(value, `\`, `\\`)
// Escape double quotes
value = strings.ReplaceAll(value, `"`, `\"`)
// Escape newlines
value = strings.ReplaceAll(value, "\n", `\n`)
// Escape carriage returns
value = strings.ReplaceAll(value, "\r", `\r`)
return value
}

⚠️ Potential issue | 🟡 Minor

Missing tab character escape in PromQL label value sanitization.

The function escapes backslashes, quotes, newlines, and carriage returns, but misses tab characters (\t) which are also problematic in PromQL label values.

Suggested fix
 func escapePromQLLabelValue(value string) string {
 	// Replace backslashes first (before replacing quotes)
 	value = strings.ReplaceAll(value, `\`, `\\`)
 	// Escape double quotes
 	value = strings.ReplaceAll(value, `"`, `\"`)
 	// Escape newlines
 	value = strings.ReplaceAll(value, "\n", `\n`)
 	// Escape carriage returns
 	value = strings.ReplaceAll(value, "\r", `\r`)
+	// Escape tabs
+	value = strings.ReplaceAll(value, "\t", `\t`)
 	return value
 }
🤖 Prompt for AI Agents
In @cmd/pod-scaler/producer.go around lines 42 - 54, The escapePromQLLabelValue
function currently escapes backslashes, double quotes, newlines and carriage
returns but omits tab characters; update escapePromQLLabelValue to also replace
tab characters by adding a replacement of "\t" with `\t` (ensure this
replacement is applied alongside the existing strings.ReplaceAll calls in the
function so tabs are properly escaped in PromQL label values).

@openshift-ci-robot
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@deepsm007 deepsm007 force-pushed the pod-scaler-measured-pods branch from 8fe3660 to bfc4f68 Compare January 14, 2026 13:59
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cmd/pod-scaler/admission.go`:
- Around line 63-64: The AddPodAntiAffinity merge incorrectly appends
requiredTerms to itself; in the logic where existingTerms is read from
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
replace the buggy append(requiredTerms, requiredTerms...) with
append(requiredTerms, existingTerms...) so existing terms are merged into
requiredTerms (ensure you reference AddPodAntiAffinity and the local variables
requiredTerms and existingTerms when making the change).
♻️ Duplicate comments (3)
cmd/pod-scaler/producer.go (1)

41-53: Missing tab character escape in PromQL label value sanitization.

Per a previous review comment that wasn't fully addressed, the function escapes backslashes, quotes, newlines, and carriage returns, but misses tab characters (\t) which can also be problematic in PromQL label values.

Suggested fix
 func escapePromQLLabelValue(value string) string {
 	// Replace backslashes first (before replacing quotes)
 	value = strings.ReplaceAll(value, `\`, `\\`)
 	// Escape double quotes
 	value = strings.ReplaceAll(value, `"`, `\"`)
 	// Escape newlines
 	value = strings.ReplaceAll(value, "\n", `\n`)
 	// Escape carriage returns
 	value = strings.ReplaceAll(value, "\r", `\r`)
+	// Escape tabs
+	value = strings.ReplaceAll(value, "\t", `\t`)
 	return value
 }
cmd/pod-scaler/measured.go (2)

96-104: Add a Close() method to release the underlying BigQuery client.

The BigQueryClient struct holds a *bigquery.Client but doesn't expose a method to close it. This can leave resources open for the application's lifetime. Add a Close() method:

func (bq *BigQueryClient) Close() error {
    return bq.client.Close()
}

Then ensure callers invoke it during shutdown.


127-130: Consider retry logic or a "not ready" state for initial BigQuery refresh failure.

If BigQuery is unavailable at startup, Refresh() fails with only a warning. The cache remains empty, so all pods will initially be classified based on the absence of data. For ShouldBeMeasured(), this returns true for all pods (since no data exists), potentially overwhelming isolated nodes.

Consider one of the following (a minimal retry-with-backoff sketch follows the list):

  1. Implementing retry logic with backoff before proceeding
  2. Tracking a "ready" state and defaulting to a safe behavior until refresh succeeds
  3. Failing fast if the feature is enabled but BigQuery isn't functional
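
If option 1 were chosen, a retry with backoff using the k8s wait helpers could look like this; the stubbed refresh and the backoff values are assumptions, not the PR's code:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// refresh stands in for the initial BigQuery cache refresh; here it always fails
// to show the exhausted-retries path.
func refresh() error { return fmt.Errorf("bigquery unavailable") }

func main() {
	// Assumed backoff values; tune to taste.
	backoff := wait.Backoff{Duration: 100 * time.Millisecond, Factor: 2, Steps: 4}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := refresh(); err != nil {
			// Treat refresh failures as retryable until the steps are exhausted.
			return false, nil
		}
		return true, nil
	})
	if err != nil {
		// Option 3 from the list above: fail fast when the feature is enabled
		// but BigQuery never became reachable.
		fmt.Println("initial refresh never succeeded:", err)
	}
}
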
🧹 Nitpick comments (3)
cmd/pod-scaler/main.go (1)

255-269: Variable shadowing: cache shadows the outer Cache variable.

The variable cache on line 260 shadows the cache Cache declared on line 185. While this doesn't cause a bug (since the outer cache is already used before this block), it reduces readability. Consider renaming to measuredPodCache for clarity.

Suggested rename
-		cache := NewMeasuredPodCache(logrus.WithField("component", "measured-pods-producer"))
+		measuredPodCache := NewMeasuredPodCache(logrus.WithField("component", "measured-pods-producer"))
 		var err error
-		bqClient, err = NewBigQueryClient(opts.producerOptions.bigQueryProjectID, opts.producerOptions.bigQueryDatasetID, opts.producerOptions.bigQueryCredentialsFile, cache, logrus.WithField("component", "measured-pods-producer"))
+		bqClient, err = NewBigQueryClient(opts.producerOptions.bigQueryProjectID, opts.producerOptions.bigQueryDatasetID, opts.producerOptions.bigQueryCredentialsFile, measuredPodCache, logrus.WithField("component", "measured-pods-producer"))
cmd/pod-scaler/producer.go (1)

838-838: Unused parameter projectID in function signature.

The projectID parameter is declared but explicitly ignored with _. Since the BigQuery client already has the project context from initialization, this parameter is redundant. Consider removing it from the signature.

Suggested fix
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
+func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {

And update the call site at line 485:

-		if err := writeMeasuredPodsToBigQuery(ctx, bqClient.client, bqClient.projectID, bqClient.datasetID, allMeasuredPods, logger); err != nil {
+		if err := writeMeasuredPodsToBigQuery(ctx, bqClient.client, bqClient.datasetID, allMeasuredPods, logger); err != nil {
cmd/pod-scaler/measured.go (1)

405-416: Consider logging when falling back to first container.

The fallback path silently uses the first container with measured data when duration data is unavailable. Adding a debug log here would help with troubleshooting when resource recommendations aren't applied as expected.

💡 Suggested enhancement
 	// If we don't have duration data, just use the first container with measured data as a fallback.
 	if longestContainer == nil && len(pod.Spec.Containers) > 0 {
+		logger.Debug("No container duration data available, falling back to first container with measured data")
 		for i := range pod.Spec.Containers {
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 8fe3660 and bfc4f68.

📒 Files selected for processing (6)
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/main.go
  • cmd/pod-scaler/measured.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
  • pkg/manifestpusher/manifestpusher.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • cmd/pod-scaler/main.go
  • pkg/manifestpusher/manifestpusher.go
  • cmd/pod-scaler/measured_test.go
  • cmd/pod-scaler/producer.go
  • cmd/pod-scaler/admission.go
  • cmd/pod-scaler/measured.go
🧬 Code graph analysis (3)
cmd/pod-scaler/main.go (1)
cmd/pod-scaler/measured.go (3)
  • BigQueryClient (97-104)
  • NewMeasuredPodCache (73-78)
  • NewBigQueryClient (107-140)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (4)
  • BigQueryClient (97-104)
  • ClassifyPod (274-301)
  • AddPodAntiAffinity (306-362)
  • ApplyMeasuredPodResources (368-443)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
  • Metadata (161-166)
pkg/pod-scaler/types.go (2)
  • FullMetadata (374-385)
  • MetadataFor (203-208)
🔇 Additional comments (16)
pkg/manifestpusher/manifestpusher.go (1)

44-56: LGTM!

The architecture extraction from the nodeSelector includes proper validation, and the image reference construction logic is correct.
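
As a rough, self-contained sketch of that shape of logic (the nodeSelector key, the ref format, and the helper name are guesses, not taken from the PR):

package main

import "fmt"

// buildImageRef guesses at the shape of the logic being reviewed: pull the
// architecture off the build's node selector and suffix the target image with it.
func buildImageRef(target string, nodeSelector map[string]string) (string, error) {
	arch, ok := nodeSelector["kubernetes.io/arch"]
	if !ok || arch == "" {
		return "", fmt.Errorf("node selector is missing kubernetes.io/arch")
	}
	return fmt.Sprintf("%s-%s", target, arch), nil
}

func main() {
	ref, err := buildImageRef("image-registry.local/ci/pipeline:bin", map[string]string{"kubernetes.io/arch": "amd64"})
	fmt.Println(ref, err)
}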

cmd/pod-scaler/measured_test.go (4)

16-61: LGTM!

The test cases cover the key scenarios for ShouldBeMeasured: no data (should measure), recent data (should not measure), and stale data (should measure). The threshold logic is properly tested.


63-149: LGTM!

Test case 2 correctly uses MetadataFor to construct metadata that matches production code behavior (addressing the previous review concern). The three test cases cover: no data (measured), recent data (normal), and nil BigQuery client (normal fallback).


151-210: LGTM!

Anti-affinity tests verify the expected behavior for measured pods (avoid all pod-scaler pods), normal pods (avoid measured pods), and unlabeled pods (no anti-affinity).


212-253: LGTM!

Cache tests verify basic Get/Update operations and field values.

cmd/pod-scaler/main.go (1)

293-308: LGTM!

The admission path correctly initializes the BigQuery client when the measured pods feature is enabled, and passes it to the admit function.

cmd/pod-scaler/admission.go (2)

101-110: LGTM!

The measured pods integration flow is correctly ordered: classify the pod first, then add anti-affinity rules, then apply measured resources before the standard mutation path. This ensures measured pod resources are set before mutatePodResources runs.


36-47: LGTM!

The admit function signature and podMutator struct are correctly extended to include the BigQuery client for measured pods support.

cmd/pod-scaler/producer.go (4)

89-127: LGTM!

The BigQuery client lifecycle management is well implemented:

  • Uses sync.Once to ensure the client is closed exactly once
  • Registers cleanup via interrupts.OnInterrupt for graceful shutdown
  • In once mode, explicitly closes after work completes

This addresses the previous review concern about premature client closure.


421-494: LGTM!

The collectMeasuredPodMetrics function correctly:

  • Uses the proper label_pod_scaler prefix for kube_pod_labels queries
  • Deduplicates pods by namespace/name
  • Aggregates data from all clusters before writing to BigQuery

554-589: LGTM!

The PromQL queries are now correctly formed without the invalid max_over_time wrapper with timestamps. The code properly uses QueryRange and computes min/max from the returned samples in Go.
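
A compact sketch of the compute-min/max-in-Go approach over a QueryRange result; only the max side is shown, and the container label name and sample values are illustrative:

package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

// maxByContainer walks a QueryRange result and records the peak sample per
// container, mirroring the "compute min/max in Go" approach described above.
func maxByContainer(matrix model.Matrix) map[string]float64 {
	out := map[string]float64{}
	for _, stream := range matrix {
		name := string(stream.Metric["container"])
		for _, sample := range stream.Values {
			v := float64(sample.Value)
			if cur, ok := out[name]; !ok || v > cur {
				out[name] = v
			}
		}
	}
	return out
}

func main() {
	matrix := model.Matrix{
		&model.SampleStream{
			Metric: model.Metric{"container": "test"},
			Values: []model.SamplePair{{Value: 0.2}, {Value: 1.7}, {Value: 0.9}},
		},
	}
	fmt.Println(maxByContainer(matrix)) // map[test:1.7]
}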


661-735: LGTM!

The node-level metrics queries now correctly omit the namespace filter, capturing total node utilization for isolation validation as suggested in previous reviews.

cmd/pod-scaler/measured.go (4)

251-269: LGTM!

The ShouldBeMeasured logic correctly identifies pods needing measurement (no data or stale data beyond retention period). The implementation is clean and the cutoff calculation is correct.


271-301: LGTM!

The classification logic correctly defaults to "normal" when BigQuery is unavailable, preventing isolated nodes from being overwhelmed. The per-container check with early exit is efficient.


303-362: LGTM!

The anti-affinity logic now correctly merges with existing terms rather than overwriting them. The differentiated rules for measured vs normal pods ensure proper isolation.


386-403: Pod name in cache key may cause lookup failures for pods with instance-specific names.

The cache lookup includes pod.ObjectMeta.Name in the FullMetadata key (via MetadataFor). The code already recognizes this issue for certain pod types—RPM repo and ProwJob pods have their pod names cleared from the key since those names are dynamically generated or irrelevant.

However, regular measured pods retain the pod name in the cache key. If pod instance names change between measurement runs (common in CI environments where pod names include hashes or timestamps), lookups will fail to find historical data stored under previous pod instance names.

Consider whether regular measured pod metadata should also exclude the pod instance name to match measured data across pod lifecycle variations, consistent with how the code handles RPM repo and ProwJob pods.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +63 to 64
bqClient *BigQueryClient
reporter results.PodScalerReporter

⚠️ Potential issue | 🟠 Major

Bug in AddPodAntiAffinity referenced from this file: existing terms are duplicated instead of merged.

In cmd/pod-scaler/measured.go lines 355-357 (used by this admission flow), there's a bug in merging anti-affinity terms:

existingTerms := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
if existingTerms != nil {
    requiredTerms = append(requiredTerms, requiredTerms...)  // BUG: appends requiredTerms to itself
}

This should append existingTerms to requiredTerms, not requiredTerms to itself.

Proposed fix in measured.go
 	// Merge with existing anti-affinity terms instead of overwriting
 	existingTerms := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
 	if existingTerms != nil {
-		requiredTerms = append(requiredTerms, requiredTerms...)
+		requiredTerms = append(existingTerms, requiredTerms...)
 	}
 	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = requiredTerms
🤖 Prompt for AI Agents
In `@cmd/pod-scaler/admission.go` around lines 63 - 64, The AddPodAntiAffinity
merge incorrectly appends requiredTerms to itself; in the logic where
existingTerms is read from
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
replace the buggy append(requiredTerms, requiredTerms...) with
append(requiredTerms, existingTerms...) so existing terms are merged into
requiredTerms (ensure you reference AddPodAntiAffinity and the local variables
requiredTerms and existingTerms when making the change).

Comment on lines +79 to +108
err = wait.ExponentialBackoff(backoff, func() (bool, error) {
digest, length, err = registry.PushManifestList(
"", // username: we don't we use basic auth
"", // password: "
types.YAMLInput{Image: targetImage, Manifests: srcImages},
false, // --ignore-missing. We don't want to ignore missing images.
true, // --insecure to allow pushing to the local registry.
false, // --plain-http is false by default in manifest-tool. False for the OpenShift registry.
types.Docker, // we only need docker type manifest.
m.dockercfgPath,
)
if err != nil {
// Check if the error indicates missing images (common race condition)
errStr := err.Error()
if strings.Contains(errStr, "no image found in manifest list") ||
strings.Contains(errStr, "inspect of image") ||
strings.Contains(errStr, "failed to pull image") ||
strings.Contains(errStr, "choosing an image from manifest list") ||
strings.Contains(errStr, "PullBuilderImageFailed") {
m.logger.Warnf("Images not yet available in registry, retrying: %v", err)
return false, nil // Retry
}
// For other errors, fail immediately
return false, err
}
return true, nil
})

if err != nil {
-return err
+return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, err)

⚠️ Potential issue | 🟡 Minor

Retry error handling loses the underlying cause on timeout.

When wait.ExponentialBackoff exhausts all retries, it returns wait.ErrWaitTimeout instead of the last error from the condition function. This means the error message at line 108 will say "failed to push manifest list... context deadline exceeded" rather than the actual registry error that caused retries.

Proposed fix to preserve the last error
 	var digest string
 	var length int
 	var err error
+	var lastErr error

 	err = wait.ExponentialBackoff(backoff, func() (bool, error) {
-		digest, length, err = registry.PushManifestList(
+		digest, length, lastErr = registry.PushManifestList(
 			"", // username: we don't we use basic auth
 			"", // password:             "
 			types.YAMLInput{Image: targetImage, Manifests: srcImages},
 			false,        // --ignore-missing. We don't want to ignore missing images.
 			true,         // --insecure to allow pushing to the local registry.
 			false,        // --plain-http is false by default in manifest-tool. False for the OpenShift registry.
 			types.Docker, // we only need docker type manifest.
 			m.dockercfgPath,
 		)
-		if err != nil {
+		if lastErr != nil {
 			// Check if the error indicates missing images (common race condition)
-			errStr := err.Error()
+			errStr := lastErr.Error()
 			if strings.Contains(errStr, "no image found in manifest list") ||
 				strings.Contains(errStr, "inspect of image") ||
 				strings.Contains(errStr, "failed to pull image") ||
 				strings.Contains(errStr, "choosing an image from manifest list") ||
 				strings.Contains(errStr, "PullBuilderImageFailed") {
-				m.logger.Warnf("Images not yet available in registry, retrying: %v", err)
+				m.logger.Warnf("Images not yet available in registry, retrying: %v", lastErr)
 				return false, nil // Retry
 			}
 			// For other errors, fail immediately
-			return false, err
+			return false, lastErr
 		}
 		return true, nil
 	})

 	if err != nil {
-		return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, err)
+		if errors.Is(err, wait.ErrWaitTimeout) && lastErr != nil {
+			return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, lastErr)
+		}
+		return fmt.Errorf("failed to push manifest list for %s: %w", targetImageRef, err)
 	}

@openshift-ci
Contributor

openshift-ci bot commented Jan 14, 2026

@deepsm007: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                 Commit   Details  Required  Rerun command
ci/prow/images            bfc4f68  link     true      /test images
ci/prow/breaking-changes  bfc4f68  link     false     /test breaking-changes

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
