DPTP-4613: pod-scaler implement measured pods for accurate resource measurement #4887
base: main
Conversation
Pipeline controller notification: For optional jobs, comment to trigger them. This repository is configured in: automatic mode.
@deepsm007: This pull request references DPTP-4613 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Walkthrough: This PR integrates measured pod support into the pod-scaler system by adding BigQuery-backed data collection and caching. It extends the admission and producer flows to classify pods, apply anti-affinity rules, and adjust resource constraints based on measured utilization. A retry mechanism is added to manifest pushing for reliability.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: deepsm007. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Actionable comments posted: 3
🤖 Fix all issues with AI Agents
In @cmd/pod-scaler/measured.go:
- Around line 330-332: The current assignment to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
overwrites any existing anti-affinity rules; instead, merge by appending
requiredTerms to any existing slice: ensure pod.Spec.Affinity and
pod.Spec.Affinity.PodAntiAffinity are non-nil, read the existing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
slice (if non-nil), append the requiredTerms elements to it, and then set the
field to the combined slice so existing rules are preserved.
- Around line 197-215: The FullMetadata used to build MeasuredPodData is missing
Variant, Target, Pod, and Step which will cause cache misses in
MetadataFor()/ClassifyPod; update the BigQuery SQL to select variant, target,
pod, and step, add those fields to the row struct used in
cmd/pod-scaler/measured.go, and populate podscaler.FullMetadata (Variant,
Target, Pod, Step) from the row when constructing meta so MeasuredPodData
contains the same metadata keys as MetadataFor().
🧹 Nitpick comments (4)
cmd/pod-scaler/main.go (2)
112-161: Add validation for measured pods flags in options.validate(). When --enable-measured-pods is true, the BigQuery project and dataset IDs should be validated upfront in options.validate() rather than at runtime in admit(). This provides fail-fast behavior and clearer error messages during startup.
🔎 Proposed fix: add the following validation within the case "consumer.admission": block (around line 138):
      if err := o.resultsOptions.Validate(); err != nil {
          return err
      }
    + if o.enableMeasuredPods {
    +     if o.bigQueryProjectID == "" {
    +         return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
    +     }
    +     if o.bigQueryDatasetID == "" {
    +         return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
    +     }
    + }
      default:
282-282: Consider grouping parameters into a configuration struct. The admit() function now has 16 parameters, making it harder to read and maintain. Consider grouping related parameters into a configuration struct.
cmd/pod-scaler/admission.go (1)
68-81: Minor formatting inconsistency. Line 75 has inconsistent spacing before the type compared to other fields.
🔎 Proposed fix
    - memoryCap    string
    + memoryCap string
cmd/pod-scaler/measured_test.go (1)
1-253: Good test coverage, consider adding tests for ApplyMeasuredPodResources. The tests cover ShouldBeMeasured, ClassifyPod, AddPodAntiAffinity, and MeasuredPodCache. However, ApplyMeasuredPodResources is not tested, which contains important logic for:
- Finding the longest-running container
- Applying the 20% buffer to CPU/memory requests
- Fallback to first container when duration data is missing
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/admission.go
🧬 Code graph analysis (1)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
BigQueryClient(94-101)NewMeasuredPodCache(70-75)NewBigQueryClient(104-137)ClassifyPod(247-276)AddPodAntiAffinity(281-333)ApplyMeasuredPodResources(338-411)
🔇 Additional comments (6)
cmd/pod-scaler/admission_test.go (1)
557-557: LGTM! The test was correctly updated to match the new mutatePodResources signature, passing false for both authoritativeCPURequests and authoritativeMemoryRequests to preserve existing test behavior.
cmd/pod-scaler/admission.go (1)
117-126: LGTM - correct ordering of measured pod operations. The sequence is correct:
- Classify pod based on metadata
- Add anti-affinity rules
- Apply measured resources
- Then apply standard resource mutation (which uses useOursIfLarger to only increase resources)
This ensures measured data is applied first, and standard recommendations only increase values if they're higher.
cmd/pod-scaler/measured_test.go (1)
16-61: LGTM - well-structured unit tests. Good coverage of edge cases for ShouldBeMeasured: no data, recent data (5 days), and stale data (15 days). The test correctly validates the 10-day threshold behavior.
cmd/pod-scaler/measured.go (3)
85-91: Cache update replaces entire dataset - verify this is intentional. The Update method replaces the entire cache map rather than merging. This is appropriate for a daily full refresh from BigQuery, but be aware that any data not returned in the latest query will be lost.
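For illustration, a minimal sketch of this replace-on-update cache pattern; the type and field names are simplified assumptions, not the PR's actual MeasuredPodCache definition:

```go
package podscaler

import (
	"sync"
	"time"
)

// measuredPodData is a simplified stand-in for the PR's MeasuredPodData.
type measuredPodData struct {
	MaxCPU       float64
	MaxMemory    float64
	LastMeasured time.Time
}

// measuredPodCache guards a metadata->data map with an RWMutex so admission
// lookups can run concurrently while the daily BigQuery refresh swaps the map.
type measuredPodCache struct {
	mu   sync.RWMutex
	data map[string]measuredPodData
}

// Get returns the cached entry for a metadata key, if present.
func (c *measuredPodCache) Get(key string) (measuredPodData, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	d, ok := c.data[key]
	return d, ok
}

// Update atomically replaces the whole map; entries absent from the latest
// query are dropped, which is the behavior the review comment calls out.
func (c *measuredPodCache) Update(fresh map[string]measuredPodData) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = fresh
}
```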
396-410: LGTM - consistent 20% buffer application. The 1.2x buffer matches the existing pattern in useOursIfLarger (120% increase). The CPU calculation correctly converts cores to millicores. For measured pods, this authoritative data appropriately replaces existing requests.
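As a concrete illustration of the cores-to-millicores conversion and the 1.2x buffer discussed here, a sketch only; the buffer constant and parameter names are assumptions, not the PR's actual code:

```go
package podscaler

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// measuredSafetyBuffer mirrors the 20% headroom described in the review.
const measuredSafetyBuffer = 1.2

// bufferedRequests converts a measured maximum (CPU in cores, memory in bytes)
// into resource requests with a 20% safety margin applied.
func bufferedRequests(maxCPUCores float64, maxMemoryBytes int64) corev1.ResourceList {
	cpuMillicores := int64(maxCPUCores * 1000 * measuredSafetyBuffer)
	memoryBytes := int64(float64(maxMemoryBytes) * measuredSafetyBuffer)
	return corev1.ResourceList{
		corev1.ResourceCPU:    *resource.NewMilliQuantity(cpuMillicores, resource.DecimalSI),
		corev1.ResourceMemory: *resource.NewQuantity(memoryBytes, resource.BinarySI),
	}
}
```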
264-267: Defensive nil handling defaults to measured - reasonable behavior. When bqClient is nil (which shouldn't happen in normal flow since ClassifyPod is only called when bqClient != nil in admission.go), defaulting to "measured" is conservative and safe.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Line 321: The function mutatePodResources has two unused boolean parameters
named as anonymous `_ bool, _ bool` (meant for authoritativeCPURequests and
authoritativeMemoryRequests); either implement their intended behavior to skip
mutation for pods with authoritative requests by adding explicit parameters
(e.g., authoritativeCPURequests, authoritativeMemoryRequests) to the signature
and branching early in mutatePodResources to return without mutating when those
flags indicate authority, or remove the anonymous bool parameters entirely from
the signature and all call sites to avoid dead parameters; update references to
the function (callers and any documentation/tests) to match the new signature to
keep the API consistent.
In @cmd/pod-scaler/measured.go:
- Around line 123-133: Startup currently calls bq.Refresh once and only warns on
error, leaving the cache empty so ShouldBeMeasured() returns true for every pod;
change the startup behavior to retry bq.Refresh until success (with a capped
backoff/retry limit) before scheduling the daily refresh via
interrupts.TickLiteral, or alternatively set a flag on the BigQuery client
(e.g., cachePopulated) that is only true after a successful bq.Refresh and
update ShouldBeMeasured to return false while that flag is false; ensure you log
each retry and ultimately fail fast (exit or return an error) if retries are
exhausted, and keep BigQueryRefreshInterval-based TickLiteral unchanged for the
periodic refresh.
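One way to implement the retry-plus-ready-flag behavior suggested above; this is a sketch under assumptions: the Refresh method is abstracted behind an interface, and the flag and function names are illustrative, not the PR's API:

```go
package podscaler

import (
	"context"
	"sync/atomic"
	"time"
)

// refresher abstracts the BigQuery-backed cache refresh; the real client's
// Refresh method would satisfy it.
type refresher interface {
	Refresh(ctx context.Context) error
}

// readyFlag is set only after one successful refresh; a ShouldBeMeasured-style
// check can return false while it is unset to avoid isolating every pod.
type readyFlag struct{ ready atomic.Bool }

// refreshWithRetry retries the initial refresh with capped exponential backoff
// and reports whether the cache was ever populated.
func refreshWithRetry(ctx context.Context, r refresher, flag *readyFlag, attempts int) error {
	backoff := 10 * time.Second
	var err error
	for i := 0; i < attempts; i++ {
		if err = r.Refresh(ctx); err == nil {
			flag.ready.Store(true)
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 5*time.Minute {
			backoff *= 2 // capped exponential backoff between attempts
		}
	}
	return err // caller decides whether to fail fast or keep serving in "not ready" mode
}
```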
🧹 Nitpick comments (3)
cmd/pod-scaler/main.go (2)
99-102: Missing validation for required BigQuery flags. The help text states these flags are "required if enable-measured-pods is true", but options.validate() doesn't enforce this. The validation happens in admit() instead, which is inconsistent with the validation pattern for other consumer.admission flags (lines 123-138).
♻️ Suggested fix in validate(): add this check in the case "consumer.admission": block around line 138:
    if o.enableMeasuredPods {
        if o.bigQueryProjectID == "" {
            return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
        }
        if o.bigQueryDatasetID == "" {
            return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
        }
    }
282-283: Function call has many parameters - consider passing a config struct. The admit() function now takes 16 parameters, making the call hard to read and maintain. Consider grouping related parameters into a configuration struct in a future refactor.
cmd/pod-scaler/measured_test.go (1)
151-210: Consider adding assertions for label selector content. The tests verify that anti-affinity structures are created but don't validate the actual LabelSelectorRequirement values. This could miss bugs where the wrong operator or values are used.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (1)
vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (4)
cmd/pod-scaler/admission.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/main.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/admission.go, cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/measured_test.go (3)
cmd/pod-scaler/measured.go (8)
NewMeasuredPodCache(69-74)BigQueryClient(93-100)MeasuredPodData(43-58)ClassifyPod(246-275)PodScalerLabelKey(26-26)PodScalerLabelValueMeasured(32-32)PodScalerLabelValueNormal(29-29)AddPodAntiAffinity(280-332)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)pkg/api/types.go (1)
Metadata(161-166)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (2)
Metadata(161-166)ResourceList(261-261)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
🔇 Additional comments (12)
cmd/pod-scaler/measured.go (6)
1-21: LGTM! Imports are well-organized and appropriate for the functionality being implemented.
23-100: LGTM! Constants are well-documented, and the type definitions are clean with appropriate thread-safety for the cache.
223-236: LGTM! The measurement staleness logic is clear and correctly implements the 10-day retention policy.
297-311: Measured pod anti-affinity may be too restrictive. The LabelSelectorOpExists operator makes measured pods avoid ALL pods with the pod-scaler label, including themselves. This means if two measured pods are created simultaneously, they cannot share the same node, which is the intended behavior. However, verify that the cluster has sufficient nodes to handle multiple concurrent measured pods.
390-409: LGTM! Resource application logic correctly converts CPU to millicores and memory to bytes with appropriate 20% safety buffers.
143-164: The placeholder query issue is already explicitly flagged in the code via the TODO comments at lines 143-145, which clearly state this is a placeholder query requiring alignment with the actual BigQuery schema. No additional review comment is needed for this concern.
cmd/pod-scaler/main.go (1)
70-77: LGTM! New measured pods configuration fields are well-documented and logically grouped.
cmd/pod-scaler/measured_test.go (3)
16-61: LGTM! TestShouldBeMeasured covers the three key scenarios: no data, recent data, and stale data. The test correctly validates the measurement decision logic.
63-149: LGTM! Test case 2 correctly uses MetadataFor to ensure metadata matching, demonstrating good awareness of the cache key requirements. Good coverage of the nil BigQuery client fallback behavior.
212-253: LGTM! Cache tests cover the essential operations: empty lookup and data retrieval after update with field verification.
cmd/pod-scaler/admission.go (2)
36-66: LGTM! Measured pods initialization is properly gated behind the feature flag with appropriate validation and logging.
68-81: LGTM! The extended podMutator type cleanly integrates the new measured pods fields alongside existing configuration.
/hold and wait for #4886 to be reviewed/merged
Scheduling required tests: Scheduling tests matching the
211c573 to 3ac074f (Compare)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Around line 117-126: Measured pod resources set by ApplyMeasuredPodResources
can be overwritten by mutatePodResources; update the flow so measured pods are
treated authoritative: after ClassifyPod (which adds the measured label), detect
the measured label on the pod (or call a helper like isMeasuredPod) and if
measured, either skip calling mutatePodResources entirely or call
mutatePodResources with a flag that prevents
applyRecommendationsBasedOnRecentData from increasing resources; make the change
near the existing call sites using the functions ApplyMeasuredPodResources,
mutatePodResources, ClassifyPod and applyRecommendationsBasedOnRecentData so
measured pods retain their measured resources.
In @cmd/pod-scaler/measured_test.go:
- Around line 24-31: Test constructs podscaler.FullMetadata manually missing
fields that MetadataFor() would set (Target, Pod, Step), causing divergence from
production behavior used by ClassifyPod; update the test to build metadata via
podscaler.MetadataFor(...) using the same pod labels/annotations the production
code uses (or explicitly populate Target, Pod, and Step on the FullMetadata
instance) so the test mirrors real metadata and cache lookup keys used by
ClassifyPod.
In @cmd/pod-scaler/measured.go:
- Around line 329-331: The current assignment overwrites any existing
anti-affinity rules; change it to append requiredTerms to the existing slice
instead: ensure pod.Spec.Affinity and pod.Spec.Affinity.PodAntiAffinity are
non-nil (initialize them if needed) and then do
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
=
append(pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
requiredTerms...) so existing rules are preserved.
- Around line 123-133: When initial bq.Refresh(ctx) fails at startup (leaving
the BigQuery cache empty and causing ShouldBeMeasured() to return true for all
pods), implement retry/backoff or set a "bigQueryReady" boolean on the BigQuery
client so that until Refresh() succeeds you treat the DB as not-ready; modify
the startup block calling bq.Refresh(ctx) to loop with limited retries and delay
(or use exponential backoff) and only log error and continue after marking
bigQueryReady=false, and update ShouldBeMeasured() to return false (or
fallback-safe behavior) when bigQueryReady is false; keep
BigQueryRefreshInterval scheduling but ensure the scheduled tick, when a
subsequent Refresh() succeeds, sets bigQueryReady=true so normal measurement
resumes.
- Around line 205-206: The TODO leaves containerDurations empty so
ApplyMeasuredPodResources cannot pick the longest container; parse the
container_durations JSON string into the containerDurations
map[string]time.Duration before use: locate where containerDurations is
declared, unmarshal the JSON (e.g., from the measured object or incoming field
named "container_durations") into a temporary map[string]string or
map[string]interface{}, convert each value to time.Duration using
time.ParseDuration (and handle numeric values by interpreting them as seconds if
needed), populate containerDurations accordingly, and on any parse error log it
and fall back to the existing default behavior so ApplyMeasuredPodResources can
correctly evaluate container runtimes.
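A sketch of the JSON parsing the prompt above asks for; the column layout (container name mapped to seconds) is an assumption about the BigQuery schema, not something defined in this PR:

```go
package podscaler

import (
	"encoding/json"
	"fmt"
	"time"
)

// parseContainerDurations turns a JSON object such as
// {"test": 930.5, "sidecar": 12} (seconds per container, an assumed schema)
// into a map keyed by container name, suitable for picking the longest runner.
func parseContainerDurations(raw string) (map[string]time.Duration, error) {
	if raw == "" {
		return map[string]time.Duration{}, nil
	}
	seconds := map[string]float64{}
	if err := json.Unmarshal([]byte(raw), &seconds); err != nil {
		return nil, fmt.Errorf("could not parse container_durations: %w", err)
	}
	durations := make(map[string]time.Duration, len(seconds))
	for name, s := range seconds {
		durations[name] = time.Duration(s * float64(time.Second))
	}
	return durations, nil
}
```

On a parse error, the caller can log and fall back to the existing first-container behavior, as the prompt suggests.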
🧹 Nitpick comments (2)
cmd/pod-scaler/main.go (1)
116-165: Missing validation for BigQuery flags when measured pods are enabled. When --enable-measured-pods is true, --bigquery-project-id and --bigquery-dataset-id are required, but this validation only happens later in admit() (which calls logrus.Fatal). Moving this validation to validate() would provide earlier, cleaner error reporting.
🔧 Suggested addition to validate():
    case "consumer.admission":
        // ... existing validation ...
    +   if o.enableMeasuredPods {
    +       if o.bigQueryProjectID == "" {
    +           return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
    +       }
    +       if o.bigQueryDatasetID == "" {
    +           return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
    +       }
    +   }
cmd/pod-scaler/admission_test.go (1)
730-737: Consider adding tests for authoritative mode. The renamed function applyRecommendationsBasedOnRecentData now supports authoritative CPU/memory reduction, but all tests pass false, false. Consider adding test cases where authoritativeCPU or authoritativeMemory is true to verify the reduction logic.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (6)
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/resources.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/main.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/admission.go, cmd/pod-scaler/resources.go
🧬 Code graph analysis (3)
cmd/pod-scaler/measured_test.go (2)
cmd/pod-scaler/measured.go (8)
NewMeasuredPodCache(69-74)BigQueryClient(93-100)MeasuredPodData(43-58)ClassifyPod(246-275)PodScalerLabelKey(26-26)PodScalerLabelValueMeasured(32-32)PodScalerLabelValueNormal(29-29)AddPodAntiAffinity(280-332)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata(161-166)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
cmd/pod-scaler/resources.go (1)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)
🔇 Additional comments (12)
cmd/pod-scaler/measured.go (4)
62-90: LGTM - Thread-safe cache implementation. The MeasuredPodCache correctly uses sync.RWMutex for concurrent access. The Update method replaces the entire map atomically, which is appropriate for bulk updates from BigQuery.
243-275: LGTM - ClassifyPod logic is sound. The function correctly initializes labels if nil, checks each container, and defaults to "measured" when BigQuery is unavailable. This fail-safe approach ensures new pods get measured when there's no data.
390-409: LGTM - Resource application with safety buffer. The 20% buffer on measured data and proper nil-checks for Resources.Requests are appropriate. The logic correctly targets only the longest-running container.
196-214: Incomplete metadata construction - missing Variant, Target, Pod, and Step fields. The FullMetadata struct (per pkg/pod-scaler/types.go) includes Variant, Target, Pod, and Step fields that are not populated here. Since MetadataFor() in ClassifyPod populates these from pod labels, cache lookups will fail to match. Likely an incorrect or invalid review comment.
cmd/pod-scaler/resources.go (2)
101-127: LGTM - Recent data filtering and weighting implementation. The time-window filtering with sample count validation prevents recommendations based on stale or insufficient data. The weighting scheme (3x/2x/1x) appropriately prioritizes recent usage patterns.
140-149: LGTM - Minimum threshold enforcement. Clamping CPU to at least 10m and memory to at least 10Mi prevents scheduling issues from zero/tiny recommendations.
cmd/pod-scaler/measured_test.go (2)
151-210: LGTM - Anti-affinity tests cover key scenarios. Tests verify measured pods avoid all pod-scaler pods, normal pods avoid measured pods, and unlabeled pods get no affinity rules.
212-253: LGTM - Cache tests verify basic Get/Update behavior. The test correctly validates that an empty cache returns false and a populated cache returns correct values.
cmd/pod-scaler/admission_test.go (1)
557-557: LGTM - Test updated for new function signature. The test correctly passes false, false for the new authoritative flags, preserving existing behavior verification.
cmd/pod-scaler/admission.go (2)
225-305: LGTM - Authoritative reduction logic with safety guards. The gradual reduction approach (max 25% per cycle, minimum 5% threshold) prevents aggressive resource cuts that could cause OOMs. The early return for zero recommendations (lines 247-249) is a good addition.
43-55: LGTM - BigQuery client initialization properly guarded. The initialization is correctly gated behind enableMeasuredPods with required field validation and proper error handling via logrus.Fatal.
286-287: Parameter order is correct - no action needed. The 16 parameters in the admit() call match the function signature exactly. No misalignment detected.
    // Initial refresh on startup
    if err := bq.Refresh(ctx); err != nil {
        logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
    }

    // Schedule daily refresh
    interrupts.TickLiteral(func() {
        if err := bq.Refresh(interrupts.Context()); err != nil {
            logger.WithError(err).Error("Failed to refresh BigQuery data")
        }
    }, BigQueryRefreshInterval)
Initial refresh failure leaves cache empty, classifying all pods as "measured".
If BigQuery is unavailable at startup, Refresh() fails with only a warning. Since the cache remains empty, ShouldBeMeasured() returns true for all pods, triggering unnecessary isolation for the entire cluster. Consider retry logic or a "not ready" state.
🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 123 - 133, When initial
bq.Refresh(ctx) fails at startup (leaving the BigQuery cache empty and causing
ShouldBeMeasured() to return true for all pods), implement retry/backoff or set
a "bigQueryReady" boolean on the BigQuery client so that until Refresh()
succeeds you treat the DB as not-ready; modify the startup block calling
bq.Refresh(ctx) to loop with limited retries and delay (or use exponential
backoff) and only log error and continue after marking bigQueryReady=false, and
update ShouldBeMeasured() to return false (or fallback-safe behavior) when
bigQueryReady is false; keep BigQueryRefreshInterval scheduling but ensure the
scheduled tick, when a subsequent Refresh() succeeds, sets bigQueryReady=true so
normal measurement resumes.
/test images
Scheduling required tests: Scheduling tests matching the
@deepsm007: The specified target(s) for … The following commands are available to trigger optional jobs: … Use … Details: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Scheduling required tests: Scheduling tests matching the
/test images |
3ac074f to bb0253c (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cmd/pod-scaler/main.go (1)
127-142: Add validation for measured pods BigQuery requirements. When --enable-measured-pods is enabled, the BigQuery parameters (--bigquery-project-id, --bigquery-dataset-id, and optionally --bigquery-credentials-file) should be validated as required. Currently, missing these parameters will cause runtime failures instead of failing fast at startup with a clear error message.
✅ Suggested validation logic
    case "consumer.admission":
        if o.port == 0 {
            return errors.New("--port is required")
        }
        if o.certDir == "" {
            return errors.New("--serving-cert-dir is required")
        }
        if cpuCap := resource.NewQuantity(o.cpuCap, resource.DecimalSI); cpuCap.Sign() <= 0 {
            return errors.New("--cpu-cap must be greater than 0")
        }
        if memoryCap := resource.MustParse(o.memoryCap); memoryCap.Sign() <= 0 {
            return errors.New("--memory-cap must be greater than 0")
        }
    +   if o.enableMeasuredPods {
    +       if o.bigQueryProjectID == "" {
    +           return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
    +       }
    +       if o.bigQueryDatasetID == "" {
    +           return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
    +       }
    +       // Note: bigQueryCredentialsFile may use default application credentials if not specified
    +   }
        if err := o.resultsOptions.Validate(); err != nil {
            return err
        }
cmd/pod-scaler/admission.go (1)
250-253: Potential resource inflation from stacked 1.2x buffers. The 1.2x safety buffer here is applied unconditionally to all recommendations. For measured pods, ApplyMeasuredPodResources already applies a 1.2x buffer (lines 396 and 405 in measured.go). If both are applied, resources would be inflated by ~44% (1.2 × 1.2) instead of the intended 20%. Consider either:
- Skipping the 1.2x here for resources already set by measured pod logic, or
- Documenting that the stacking is intentional
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 329-331: The code overwrites existing anti-affinity rules by
assigning requiredTerms to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution;
instead, preserve and append: ensure PodAffinity/PodAntiAffinity and
RequiredDuringSchedulingIgnoredDuringExecution are non-nil, then append
requiredTerms to the existing slice (e.g., use append(existingSlice,
requiredTerms...)) so you merge new rules instead of replacing them, updating
the
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
field accordingly.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (10)
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/resources.go, test/e2e/pod-scaler/run/consumer.go, test/e2e/pod-scaler/run/producer.go, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
- test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/resources.go, cmd/pod-scaler/admission.go, test/e2e/pod-scaler/run/producer.go, test/e2e/pod-scaler/run/consumer.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (3)
pkg/results/report.go (1)
PodScalerReporter(174-176)cmd/pod-scaler/measured.go (6)
BigQueryClient(93-100)NewMeasuredPodCache(69-74)NewBigQueryClient(103-136)ClassifyPod(246-275)AddPodAntiAffinity(280-332)ApplyMeasuredPodResources(337-410)pkg/api/types.go (1)
ResourceRequirements(246-253)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata(161-166)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
🔇 Additional comments (16)
test/e2e/pod-scaler/run/producer.go (1)
8-8: LGTM! The addition of the POD_SCALER_MIN_SAMPLES=1 environment variable properly configures the test environment for minimum sample handling, consistent with the measured pods feature. Also applies to: 31-31
test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)
1-1: LGTM! The fixture correctly reflects the architectural shift away from patch-based mutations in the admission flow, aligning with the new measured-pod implementation.
test/e2e/pod-scaler/run/consumer.go (1)
71-71: LGTM! The environment variable configuration mirrors the producer setup and properly configures the admission server for testing with minimum sample handling.
cmd/pod-scaler/admission_test.go (1)
557-557: LGTM! The test updates correctly reflect the new function signatures with authoritative CPU/memory flags set to false, preserving the original non-authoritative behavior for existing test cases. The function rename from useOursIfLarger to applyRecommendationsBasedOnRecentData also improves clarity. Also applies to: 732-732, 817-817
cmd/pod-scaler/main.go (2)
64-78: LGTM! The new configuration options for measured pods and authoritative requests are well-documented with clear help text. The field naming and structure are consistent with existing patterns.
Also applies to: 101-106
286-286: LGTM with dependency on validation. The admit() call correctly passes all new configuration parameters for measured pods and authoritative requests. Ensure the validation issue flagged above is addressed to prevent runtime failures.
cmd/pod-scaler/resources.go (4)
42-58: Constants look reasonable for the use case. The 3-week window, 10m CPU minimum, and 10Mi memory minimum are sensible defaults to prevent stale data influence and avoid zero/near-zero recommendations that could cause scheduling issues.
99-131: Recency-weighted histogram merging is well-designed. The tiered weighting (3x for <7 days, 2x for 7-14 days, 1x for older) sensibly prioritizes recent usage patterns. The minimum sample threshold prevents unreliable recommendations based on sparse data.
144-153: Minimum threshold clamping prevents degenerate recommendations. Ensuring CPU ≥ 10m and memory ≥ 10Mi avoids scheduling issues from near-zero resource requests.
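A simplified sketch of the windowing, tiered weighting, and minimum clamping described in the two comments above. It uses a weighted mean over assumed sample values rather than the PR's histogram-merging-and-quantile approach, and all names are placeholders:

```go
package resources

import (
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
)

// sample is a placeholder for one fingerprint's observed CPU usage in cores.
type sample struct {
	added time.Time
	cpu   float64
}

// weightFor returns 3 for data under a week old, 2 under two weeks, 1 otherwise.
func weightFor(age time.Duration) int {
	switch {
	case age < 7*24*time.Hour:
		return 3
	case age < 14*24*time.Hour:
		return 2
	default:
		return 1
	}
}

// recommendCPU windows samples to the last three weeks, weights them by
// recency, and clamps the result to a 10m floor; it reports false when there
// are too few recent samples to recommend anything.
func recommendCPU(samples []sample, minSamples int) (resource.Quantity, bool) {
	now := time.Now()
	var weightedSum float64
	var weightTotal, used int
	for _, s := range samples {
		age := now.Sub(s.added)
		if age > 3*7*24*time.Hour {
			continue // outside the recommendation window
		}
		w := weightFor(age)
		weightedSum += float64(w) * s.cpu
		weightTotal += w
		used++
	}
	if used < minSamples || weightTotal == 0 {
		return resource.Quantity{}, false // not enough recent data
	}
	millicores := int64(weightedSum / float64(weightTotal) * 1000)
	if millicores < 10 {
		millicores = 10 // enforce the 10m floor
	}
	return *resource.NewMilliQuantity(millicores, resource.DecimalSI), true
}
```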
168-178: Environment variable override for test flexibility is appropriate. The default of 3 samples provides a reasonable minimum for production while allowing tests to use fewer samples.
cmd/pod-scaler/admission.go (2)
43-55: BigQuery client initialization with proper validation. Correctly validates required parameters and fails fast on startup if configuration is incomplete.
271-299: Gradual reduction logic is well-designed. The 25% cap per cycle and 5% threshold for skipping micro-adjustments provide a safe, conservative approach to reducing over-provisioned resources.
cmd/pod-scaler/measured.go (4)
23-58: Well-structured constants and data types. The pod classification labels, retention period, and MeasuredPodData structure are clearly designed for the measured pods feature.
62-90: Thread-safe cache implementation is correct. Proper use of RWMutex for concurrent read access and exclusive write access. Full map replacement in Update is a clean approach that avoids partial update issues.
334-409: Resource application logic is sound with sensible fallbacks. The approach of targeting the longest-running container and applying a 20% safety buffer is appropriate. The fallback to the first container when duration data is unavailable provides graceful degradation.
Note: As mentioned in the admission.go review, verify that the 1.2x buffer here doesn't stack with another 1.2x in applyRecommendationsBasedOnRecentData.
246-275: Classification logic handles edge cases well. Defaulting to "measured" when BigQuery is unavailable is a safe, conservative approach. Marking the entire pod as measured if any container needs fresh data ensures complete coverage.
40f54dd to 9f38903 (Compare)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/admission.go:
- Around line 252-265: The heuristic using alreadySetByMeasuredPods :=
!their.IsZero() && their.Cmp(our) >= 0 is unreliable—replace the
value-comparison detection with an explicit marker: set an annotation (or rely
on the existing PodScaler label) when resources are applied in
ApplyMeasuredPodResources, then in applyRecommendationsBasedOnRecentData check
that annotation/label (e.g.,
"pod-scaler.openshift.io/measured-resources-applied" == "true" or
PodScalerLabelKey == PodScalerLabelValueMeasured) instead of comparing their and
our so the 1.2x buffer is skipped only when we know measured-pod logic ran.
In @cmd/pod-scaler/measured.go:
- Around line 125-127: The startup currently swallows errors from
bq.Refresh(ctx) by logging a warning via logger.WithError(err).Warn, which
leaves the cache empty and misclassifies pods; change this to fail fast by
propagating or terminating on error: either return the error from the enclosing
initialization function (so callers can handle/fail) or replace the Warn call
with a fatal/exit path (e.g., logger.WithError(err).Fatal or equivalent) to stop
startup when bq.Refresh(ctx) fails; update the call site around bq.Refresh(ctx)
and any function signatures to return an error if choosing propagation.
🧹 Nitpick comments (6)
cmd/pod-scaler/resources.go (1)
96-159: Verify the weighting strategy and consider defensive timestamp validation. The time-windowing and weighting logic correctly filters recent data and applies appropriate weights. However, consider the following:
Weighting via histogram duplication (lines 118-120): The current approach duplicates histograms to apply weights (3x, 2x, 1x). While simple and effective, this means a single outlier in recent data gets amplified. This may be intentional to prioritize recent patterns, but verify this aligns with the desired sensitivity to recent changes.
No defense against future timestamps (line 107): Consider adding a check that fingerprintTime.Added is not in the future to guard against clock skew or data corruption:
    if fingerprintTime.Added.After(now) {
        metaLogger.Warnf("fingerprint has future timestamp, skipping: %v", fingerprintTime.Added)
        continue
    }
Minimum threshold enforcement (lines 144-153): The post-quantile minimums correctly prevent zero or tiny resource recommendations.
cmd/pod-scaler/measured.go (2)
388-409: Consider the fallback logic for selecting the container. When container duration data is unavailable, the code falls back to the first container in pod.Spec.Containers (lines 406-409). While this excludes init containers, it may still select a sidecar or helper container if it's listed first. This could result in applying measured resources to the wrong container if the pod spec doesn't list the main workload container first.
Consider adding a warning log when falling back, or requiring duration data to be present for proper resource application:
    if longestContainer == nil && len(pod.Spec.Containers) > 0 {
        longestContainer = &pod.Spec.Containers[0]
        logger.Warnf("No container duration data available, falling back to first container: %s", longestContainer.Name)
    }
430-444: Hardcoded 20% safety buffer on measured resources. The code applies a fixed 20% buffer (lines 431, 440) on top of measured CPU and memory utilization. While this is a reasonable safety margin, consider making this configurable if different workloads require different safety margins.
cmd/pod-scaler/main.go (1)
295-295: Consider refactoring to reduce parameter count. The admit() function now has 15 parameters, making it difficult to maintain and prone to parameter ordering errors. Consider passing a configuration struct instead.
♻️ Suggested refactoring approach: create an admitConfig struct to group related parameters:
    type admitConfig struct {
        port                        int
        healthPort                  int
        certDir                     string
        mutateResourceLimits        bool
        cpuCap                      int64
        memoryCap                   string
        cpuPriorityScheduling       int64
        authoritativeCPURequests    bool
        authoritativeMemoryRequests bool
        enableMeasuredPods          bool
        bigQueryProjectID           string
        bigQueryDatasetID           string
        bigQueryCredentialsFile     string
    }
Then call it as:
    admit(admitConfig{
        port:       opts.port,
        healthPort: opts.instrumentationOptions.HealthPort,
        // ... other fields
    }, client, loaders(cache), reporter)
43-56: Redundant validation check. Lines 45-47 perform validation that's already done in main.go (lines 140-148). While defensive programming is good, this redundancy suggests unclear separation of concerns. Consider removing this validation since it's already enforced at startup, or if keeping it, add a comment explaining why the defensive check is necessary:
    var bqClient *BigQueryClient
    if enableMeasuredPods {
    -   if bigQueryProjectID == "" || bigQueryDatasetID == "" {
    -       logrus.Fatal("bigquery-project-id and bigquery-dataset-id are required when enable-measured-pods is true")
    -   }
    +   // Defensive check - should already be validated in main.go
        cache := NewMeasuredPodCache(logger)
367-393: Consider consolidating measured pod detection logic. There's redundant checking between this code (lines 367-393) and the alreadySetByMeasuredPods heuristic in applyRecommendationsBasedOnRecentData (line 256). This code uses an explicit label check which is more reliable, but having two places checking for the same condition suggests unclear separation of concerns. Consider:
- Removing the heuristic check in applyRecommendationsBasedOnRecentData and relying solely on this explicit check, OR
- Passing a flag from this function to applyRecommendationsBasedOnRecentData indicating whether the container is from a measured pod
This would make the logic clearer and more maintainable.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (10)
cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/measured_test.go, cmd/pod-scaler/resources.go, test/e2e/pod-scaler/run/consumer.go, test/e2e/pod-scaler/run/producer.go, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
- test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/pod-scaler/measured_test.go
- test/e2e/pod-scaler/run/consumer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/main.go, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, cmd/pod-scaler/admission.go, test/e2e/pod-scaler/run/producer.go, cmd/pod-scaler/measured.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/resources.go
🧬 Code graph analysis (3)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (7)
BigQueryClient(94-101)NewBigQueryClient(104-137)ClassifyPod(273-302)AddPodAntiAffinity(307-367)ApplyMeasuredPodResources(372-445)PodScalerLabelKey(27-27)PodScalerLabelValueMeasured(33-33)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata(161-166)pkg/pod-scaler/types.go (2)
FullMetadata(374-385)MetadataFor(203-208)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
🔇 Additional comments (17)
cmd/pod-scaler/resources.go (3)
42-58: LGTM! Well-defined constants with clear documentation. The time window of 3 weeks and minimum thresholds are appropriate for ensuring stable resource recommendations based on recent data.
60-64: LGTM! Environment-configurable minimum samples. The default of 3 samples is reasonable, and the environment override provides flexibility for testing scenarios.
168-178: LGTM! Proper environment variable handling with safe defaults. The function correctly validates the environment variable and falls back to a sensible default of 3 samples.
cmd/pod-scaler/measured.go (6)
24-40: LGTM! Clear constants with good documentation. The retention period of 10 days and daily refresh cycle are appropriate for balancing data freshness with measurement overhead.
42-91: LGTM! Thread-safe cache with proper synchronization. The use of sync.RWMutex is appropriate for this read-heavy workload, and the locking patterns are correct.
218-233: Silent handling of JSON parsing errors for container durations. When parsing the container_durations JSON (lines 222, 229), errors are logged but execution continues with an empty map. This is reasonable for non-critical data, but be aware that:
- Missing duration data will cause the fallback logic (line 406-409 in ApplyMeasuredPodResources) to apply resources to the first container
- This may not always be the correct container if it's not the main workload
250-268: LGTM! Clear staleness logic. The measurement staleness check correctly identifies pods that need fresh data (never measured or >10 days old).
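The 10-day staleness rule reads roughly like this sketch; the constant and function names are assumptions mirroring the behavior described, not the PR's identifiers:

```go
package podscaler

import "time"

// measurementRetention is an assumed constant for the 10-day retention window.
const measurementRetention = 10 * 24 * time.Hour

// shouldBeMeasured reports whether a pod needs a fresh isolated measurement:
// either it has never been measured, or its last measurement is older than
// the retention window.
func shouldBeMeasured(lastMeasured time.Time, hasData bool) bool {
	if !hasData {
		return true
	}
	return time.Since(lastMeasured) > measurementRetention
}
```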
270-302: LGTM! Safe fallback when BigQuery is unavailable. The classification logic correctly defaults to measuring pods when BigQuery is unavailable (lines 291-293). This is a safe fallback that ensures data collection continues even if BigQuery integration is disabled or fails.
304-367: LGTM! Correct anti-affinity rules for pod isolation. The anti-affinity logic correctly:
- Isolates measured pods from ALL pod-scaler labeled pods (lines 324-338)
- Prevents normal pods from colocating with measured pods (lines 339-354)
- Preserves existing anti-affinity rules by merging (lines 358-365)
This ensures measured pods get clean isolation for accurate resource measurement.
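A sketch of the merge-instead-of-overwrite pattern the earlier review iterations asked for and this comment confirms: initialize the affinity structs if needed, then append the new required term so existing rules survive. The label key and topology key here are illustrative constants, not the PR's:

```go
package podscaler

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podScalerLabelKey is an assumed label key mirroring the PR's constant.
const podScalerLabelKey = "pod-scaler"

// addRequiredAntiAffinity appends a term that keeps a measured pod away from
// any other pod carrying the pod-scaler label, preserving pre-existing rules.
func addRequiredAntiAffinity(pod *corev1.Pod) {
	term := corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      podScalerLabelKey,
				Operator: metav1.LabelSelectorOpExists,
			}},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}
	if pod.Spec.Affinity.PodAntiAffinity == nil {
		pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
	}
	// Append rather than assign, so rules already on the pod are kept.
	existing := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(existing, term)
}
```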
test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)
1-1: LGTM! Test fixture updated to match new admission response format. The simplified fixture (no patch payload) aligns with the updated admission flow that handles resource mutations directly rather than via JSON Patch operations.
test/e2e/pod-scaler/run/producer.go (1)
31-31: LGTM! Appropriate test configuration. Setting POD_SCALER_MIN_SAMPLES=1 for e2e tests is appropriate, as it allows tests to work with limited sample data while still exercising the recommendation logic.
cmd/pod-scaler/admission_test.go (2)
557-557: Test updated for new function signature. The test correctly passes the new authoritativeCPU and authoritativeMemory parameters as false. Consider adding test cases that exercise these parameters with true values to ensure authoritative mode behaves correctly.
732-732: Function renamed and tests updated. The tests correctly use the renamed applyRecommendationsBasedOnRecentData function with the new boolean parameters. Consider adding test coverage for the authoritative modes (when either boolean is true) to validate that behavior. Also applies to: 817-817
cmd/pod-scaler/main.go (2)
64-79: LGTM! The new fields for measured pods and BigQuery integration are well-structured and documented. The documentation comment clearly explains the purpose of the feature.
101-106: LGTM! The flag definitions and validation logic are well-structured:
- Descriptive help text explains each flag's purpose
- Validation correctly enforces that BigQuery project and dataset IDs are required when measured pods are enabled
- Appropriately allows credentials file to be optional (falling back to default application credentials)
Also applies to: 140-148
cmd/pod-scaler/admission.go (2)
118-124: LGTM! The measured pods handling flow is well-structured:
- Classify the pod based on measurement data freshness
- Add anti-affinity rules for isolated execution
- Apply measured resources before standard mutation
This ensures measured pods get accurate resource values from BigQuery data.
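The ordering above, expressed as a sketch. Every function here is a stand-in for the PR's helpers (ClassifyPod, AddPodAntiAffinity, ApplyMeasuredPodResources, mutatePodResources), and the label check that skips the standard mutation reflects the reviewers' suggestion rather than a confirmed implementation detail:

```go
package admission

import corev1 "k8s.io/api/core/v1"

// measuredHelpers bundles stand-ins for the PR's helpers; signatures are
// assumptions for illustration only.
type measuredHelpers struct {
	classify        func(pod *corev1.Pod) // labels the pod "measured" or "normal"
	addAntiAffinity func(pod *corev1.Pod) // isolates measured pods on their own nodes
	applyMeasured   func(pod *corev1.Pod) // sets requests from BigQuery data
	mutateStandard  func(pod *corev1.Pod) // Prometheus-based recommendations
}

func (h measuredHelpers) mutate(pod *corev1.Pod, measuredEnabled bool) {
	if measuredEnabled {
		h.classify(pod)
		h.addAntiAffinity(pod)
		h.applyMeasured(pod)
	}
	// Per the review suggestion, skip the standard mutation when the pod was
	// just classified as measured so BigQuery-derived requests stay authoritative.
	if pod.Labels["pod-scaler"] == "measured" { // assumed label key/value
		return
	}
	h.mutateStandard(pod)
}
```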
282-309: LGTM! The authoritative mode reduction logic is well-designed with appropriate safety limits:
- Maximum 25% reduction per cycle prevents sudden resource drops
- Minimum 5% threshold prevents micro-adjustments
- Proper edge case handling when current value is zero
This gradual approach ensures stability while still allowing optimization based on actual usage.
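A sketch of the capped-reduction rule summarized above, under the assumption that reductions are computed per resource value; the constants mirror the 25% cap and 5% threshold, but the function name is illustrative:

```go
package admission

// reduceGradually moves the current request toward a lower recommendation,
// cutting at most 25% per admission cycle and ignoring changes under 5%.
func reduceGradually(current, recommended int64) int64 {
	if current <= 0 || recommended >= current {
		return current // nothing to reduce, or edge case with no existing value
	}
	target := recommended
	maxCut := current / 4 // cap the reduction at 25% per cycle
	if current-target > maxCut {
		target = current - maxCut
	}
	if current-target < current/20 {
		return current // skip micro-adjustments below the 5% threshold
	}
	return target
}
```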
    if err := bq.Refresh(ctx); err != nil {
        logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
    }
Startup continues with failed BigQuery refresh.
If the initial refresh fails at startup (line 125), only a warning is logged and the service continues. This means:
- The cache will be empty initially
- All pods will be classified as "measured" (since no data exists)
- This could lead to unnecessary isolation overhead until the next scheduled refresh
Consider if this is the desired behavior or if startup should fail when BigQuery is unavailable.
🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 125 - 127, The startup currently
swallows errors from bq.Refresh(ctx) by logging a warning via
logger.WithError(err).Warn, which leaves the cache empty and misclassifies pods;
change this to fail fast by propagating or terminating on error: either return
the error from the enclosing initialization function (so callers can
handle/fail) or replace the Warn call with a fatal/exit path (e.g.,
logger.WithError(err).Fatal or equivalent) to stop startup when bq.Refresh(ctx)
fails; update the call site around bq.Refresh(ctx) and any function signatures
to return an error if choosing propagation.
Actionable comments posted: 0
🧹 Nitpick comments (2)
cmd/pod-scaler/resources.go (2)
96-159: LGTM: Time-windowing and weighting logic is sound. The implementation correctly:
- Filters data to a 3-week recency window
- Applies age-based weighting (3x/2x/1x) to prioritize recent patterns
- Validates sample count before making recommendations
- Enforces minimum thresholds to prevent scheduling issues
♻️ Minor optimization: Reuse time.Now() result
Lines 99-100 call time.Now() twice. Consider reusing a single value:
    -cutoffTime := time.Now().Add(-resourceRecommendationWindow)
    -now := time.Now()
    +now := time.Now()
    +cutoffTime := now.Add(-resourceRecommendationWindow)
This is a negligible performance improvement but improves consistency.
171-178: LGTM: Environment variable handling is appropriate. The function correctly reads and validates the POD_SCALER_MIN_SAMPLES configuration with a sensible default of 3.
♻️ Optional: Log warning for invalid POD_SCALER_MIN_SAMPLES values. Consider logging when an invalid value is provided to aid troubleshooting:
    func getMinSamplesForRecommendation() int {
        if val := os.Getenv("POD_SCALER_MIN_SAMPLES"); val != "" {
    -       if minSamples, err := strconv.Atoi(val); err == nil && minSamples > 0 {
    +       minSamples, err := strconv.Atoi(val)
    +       if err != nil {
    +           logrus.Warnf("Invalid POD_SCALER_MIN_SAMPLES value %q: %v, using default", val, err)
    +       } else if minSamples <= 0 {
    +           logrus.Warnf("POD_SCALER_MIN_SAMPLES must be positive, got %d, using default", minSamples)
    +       } else {
                return minSamples
            }
        }
        return 3
    }
This helps users catch configuration mistakes early, though the silent fallback approach is also acceptable for this use case.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
cmd/pod-scaler/resources.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
cmd/pod-scaler/resources.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (1)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)
🔇 Additional comments (3)
cmd/pod-scaler/resources.go (3)
4-7: LGTM: Imports are appropriate. The added standard library imports (os, strconv, time) are minimal and directly support the new time-windowing and configuration features.
45-58: LGTM: Constants are well-defined and documented. The 3-week recommendation window strikes a good balance between recency and data sufficiency, while the minimum thresholds (10 millicores CPU, 10Mi memory) prevent scheduling issues from zero/near-zero recommendations.
60-64: LGTM: Package-level configuration is appropriate. The environment variable approach for overriding minSamplesForRecommendation is suitable for e2e testing scenarios where sample counts may be limited within the time window.
3c5851d to 196c83c (Compare)
be58c3f to d70644b (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (1)
cmd/pipeline-controller/config_watcher.go (1)
112-117: Comment is misleading: this returns a shallow copy, not a deep copy. Returning a struct by value copies the struct, but the contained slices (Orgs and nested Repos) share the same underlying arrays with w.config. This works correctly for your use case because decoder.Decode() in reloadConfig() allocates fresh slices each time, so the old and new configs don't share memory.
Consider updating the comment to avoid confusion during future maintenance.
📝 Suggested comment fix
    -// getConfigCopy returns a deep copy of the current config for comparison
    +// getConfigCopy returns a copy of the current config for comparison.
    +// Note: This is a shallow copy, but it's safe because reloadConfig() allocates
    +// new slices via the YAML decoder, so old and new configs don't share memory.
     func (w *watcher) getConfigCopy() enabledConfig {
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (12)
.gitignore, cmd/pipeline-controller/config_watcher.go, cmd/pod-scaler/admission.go, cmd/pod-scaler/admission_test.go, cmd/pod-scaler/main.go, cmd/pod-scaler/resources.go, cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff, cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff, test/e2e/pod-scaler/run/consumer.go, test/e2e/pod-scaler/run/producer.go, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
💤 Files with no reviewable changes (1)
- test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data_patch.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- cmd/pod-scaler/admission_test.go
- .gitignore
- test/e2e/pod-scaler/run/producer.go
- cmd/pod-scaler/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml, cmd/pipeline-controller/config_watcher.go, cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff, cmd/pod-scaler/admission.go, cmd/pod-scaler/resources.go, cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add.diff, test/e2e/pod-scaler/run/consumer.go
🧬 Code graph analysis (1)
cmd/pod-scaler/resources.go (2)
pkg/pod-scaler/types.go (1)
CachedQuery(53-68)test/e2e/pod-scaler/prometheus/backfill.go (1)
Data(43-45)
🔇 Additional comments (17)
cmd/pipeline-controller/config_watcher.go (2)
3-11: Imports are appropriate for the polling-based approach. The additions of reflect and time support the new ticker-based config watching with deep equality comparison.
85-110: Polling approach is well-reasoned for git-sync compatibility. The 3-minute polling interval is sensible, and the error handling correctly allows retries without crashing. The change detection logic properly preserves prevConfig when reload fails. Consider adding a context or stop channel for graceful shutdown if this service needs orderly termination, but this is optional for a long-running daemon.
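A sketch of the polling loop with the optional graceful-shutdown hook the comment mentions; reloadConfig-style reloading, deep-equality change detection, and the 3-minute interval come from the review's description, while the context plumbing, type, and function names are assumptions:

```go
package pipelinecontroller

import (
	"context"
	"log"
	"reflect"
	"time"
)

// enabledConfig is a simplified stand-in for the real config type.
type enabledConfig struct{ Orgs []string }

// pollConfig reloads the config every interval, logs (and retries) on error,
// and reacts only when the decoded config actually changed. The ctx parameter
// is the suggested graceful-shutdown addition.
func pollConfig(ctx context.Context, interval time.Duration, reload func() (enabledConfig, error), onChange func(enabledConfig)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var prev enabledConfig
	for {
		select {
		case <-ctx.Done():
			return // orderly termination
		case <-ticker.C:
			next, err := reload()
			if err != nil {
				log.Printf("config reload failed, keeping previous config: %v", err)
				continue
			}
			if !reflect.DeepEqual(prev, next) {
				prev = next
				onChange(next)
			}
		}
	}
}
```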
cmd/pod-scaler/testdata/zz_fixture_TestMutatePodResources_resources_to_add__limits_disabled.diff (1)
15-54: The test fixture is correct. All three containers receive identical recommendations (5 CPU, 20GB memory) in this test case, which explains the uniform allocation. The 1.2x safety buffer and global resource caps (10 CPU, 20Gi memory) are intentionally applied to each container independently, resulting in the expected values in the fixture. No "longest-running container" selection logic exists; recommendations apply uniformly across all containers per their respective recommended values. Likely an incorrect or invalid review comment.
test/e2e/pod-scaler/testdata/zz_fixture_TestAdmission_pod_for_which_we_have_data.yaml (1)
1-1: LGTM! The simplified AdmissionReview response fixture correctly reflects the new admission flow where patches are handled separately. The response structure with allowed: true and status code 200 is appropriate.
test/e2e/pod-scaler/run/consumer.go (1)
71-71: LGTM! Adding POD_SCALER_MIN_SAMPLES=1 to the test environment correctly enables e2e tests to run with minimal sample data, aligning with the new configurable threshold in resources.go.
1-90: LGTM! The test fixture correctly reflects the new resource handling behavior:
- Memory limit increase (400M → 480M) demonstrates the 1.2x safety buffer
- CPU limit removal aligns with reconcileLimits() behavior
- Updated resource values match the new recommendation logic
cmd/pod-scaler/resources.go (5)
42-58: LGTM!The new constants are well-documented with clear rationale:
- 3-week recommendation window provides recency without being too narrow
- Minimum thresholds (10m CPU, 10Mi memory) prevent scheduling issues from near-zero recommendations
106-122: LGTM!The recency weighting implementation (3x/2x/1x for past week/fortnight/rest) is a sensible approach for prioritizing recent usage patterns while maintaining historical context.
124-131: LGTM! The minimum sample validation prevents unreliable recommendations based on insufficient data, with clear debug logging.
144-153: LGTM! The minimum threshold enforcement correctly prevents near-zero recommendations while preserving appropriate quantity formats.
171-178: LGTM! The environment variable parsing includes proper validation (positive integer check) with a sensible default fallback.
cmd/pod-scaler/admission.go (6)
43-55: LGTM! BigQuery client initialization properly validates required parameters and fails fast on configuration errors during startup.
117-126: LGTM! The measured pod handling flow is well-ordered: classify → add anti-affinity → apply measured resources → standard mutation. This ensures BigQuery-derived resources are applied first, with Prometheus recommendations filling any gaps.
289-317: LGTM! The gradual reduction logic is well-designed:
- Division-by-zero protection with the theirValue > 0 check
- 25% max reduction per cycle prevents drastic changes
- 5% minimum threshold avoids noisy micro-adjustments
- Clear logging for each decision path
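To make those guards concrete, a minimal sketch of such a reduction rule over plain int64 values (the real code operates on Kubernetes resource quantities); names and thresholds mirror the description above, not the exact implementation.

```go
package reduce

// graduallyReduce moves the current value toward a lower recommendation
// without ever dropping more than 25% in a single cycle, and skips changes
// smaller than 5% to avoid noisy micro-adjustments. A non-positive
// recommendation is ignored, which doubles as the division-by-zero guard.
func graduallyReduce(ourValue, theirValue int64) int64 {
	if theirValue <= 0 || theirValue >= ourValue {
		return ourValue
	}
	reduction := ourValue - theirValue
	// Skip tiny adjustments (< 5% of the current value).
	if reduction*20 < ourValue {
		return ourValue
	}
	// Cap the per-cycle reduction at 25% of the current value.
	if max := ourValue / 4; reduction > max {
		reduction = max
	}
	return ourValue - reduction
}
```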
372-401: LGTM! The measured pod handling in mutatePodResources correctly:
- Preserves BigQuery-derived resources by skipping Prometheus recommendations when already set
- Still applies resource caps and limit reconciliation for safety
- Falls through to normal flow for containers without pre-set resources
68-81: LGTM! The podMutator struct extension cleanly adds the new fields for measured pod support and authoritative mode configuration.
269-270: Precision loss from float64→int64 conversion is acceptable. The our.Set(int64(increased)) truncation on line 270 is safe because practical resource values (multiples of 5, 10, 100, etc.) when multiplied by 1.2 typically yield whole numbers. Test cases confirm this: 100 × 1.2 = 120, 200 × 1.2 = 240, 2e10 × 1.2 = 2.4e10, and so on. For memory (bytes), any minor truncation is negligible; for CPU quantities via resource.Quantity, the API handles these conversions appropriately.
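For illustration, the same 1.2x buffer expressed against k8s.io/apimachinery's resource.Quantity; this is a sketch of the arithmetic under discussion, not the PR's helper.

```go
package buffer

import "k8s.io/apimachinery/pkg/api/resource"

// withSafetyBuffer returns a copy of q scaled by 20%. The float64 -> int64
// truncation matches the pattern discussed above and is harmless at the
// magnitudes involved (millicores, bytes).
func withSafetyBuffer(q resource.Quantity) resource.Quantity {
	increased := float64(q.Value()) * 1.2
	out := q.DeepCopy()
	out.Set(int64(increased))
	return out
}
```

For example, a quantity parsed from "400M" comes back as 480M, which lines up with the 400M → 480M change in the fixture diff above.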
Force-pushed from d70644b to 13781c2
…ed nodes

Implements DPTP-4613 to address the issue where pod-scaler recommendations are skewed by node contention. When multiple pods with poor CPU configurations are scheduled on the same node, CPU is maxed out but pods finish eventually. The pod-scaler observes low CPU utilization (due to node contention) and incorrectly concludes requests should not be increased, leading to a cycle of reduced limits and tighter packing.

This change introduces a measured pods system:
- Pods are classified as 'normal' or 'measured' based on whether they need fresh measurement data (measured if last measurement >10 days ago or never measured)
- Measured pods use podAntiAffinity rules to run on isolated nodes with no other CI workloads, ensuring accurate CPU/memory utilization measurement
- BigQuery integration queries and caches max CPU/memory utilization from measured pod runs, refreshing daily to keep data current
- Resource recommendations are applied only to the longest-running container in each pod, using actual measured utilization data instead of Prometheus metrics that may be skewed by node contention

The feature is opt-in via --enable-measured-pods flag and requires BigQuery configuration (--bigquery-project-id and --bigquery-dataset-id).
Force-pushed from c0b2961 to 2b3e606
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cmd/pod-scaler/producer.go (1)
76-109: Duplicate BigQuery client creation logic. This code creates a BigQuery client similarly to NewBigQueryClient in measured.go but without the cache and refresh scheduling. Consider reusing NewBigQueryClient or extracting common client creation logic to avoid duplication and inconsistency. Additionally, the defer bqClient.Close() at line 101 will execute when produce returns, but with execute using TickLiteral (line 84), the function may return before the periodic collection completes, potentially closing the client prematurely.
cmd/pod-scaler/admission.go (1)
321-340: Unused parameters authoritativeCPU and authoritativeMemory. The authoritativeCPU and authoritativeMemory parameters are added to the function signature but are not used in the function body. This appears to be dead code or incomplete implementation.
#!/bin/bash
# Check if these parameters are used elsewhere or if there's additional implementation
rg -n 'authoritativeCPU|authoritativeMemory' --type=go -C 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 205-206: The containerDurations map is never populated; implement
JSON parsing of the incoming "container_durations" string into
map[string]time.Duration (populate the existing containerDurations variable)
before ApplyMeasuredPodResources is called. Decode the JSON into a temporary
map[string]interface{} or map[string]string, convert each entry to a
time.Duration (use time.ParseDuration for string values and treat numeric values
as seconds), handle and log parsing errors without panicking, and ensure entries
are inserted into containerDurations so ApplyMeasuredPodResources and its
ContainerDurations lookup use the real durations instead of falling back to the
first container.
- Around line 329-331: The code currently overwrites any existing anti-affinity
by assigning requiredTerms to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution;
instead, ensure Affinity and PodAntiAffinity are non-nil, then append
requiredTerms to the existing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
slice (e.g., create the Affinity/PodAntiAffinity objects if nil, then set
RequiredDuringSchedulingIgnoredDuringExecution = append(existingSlice,
requiredTerms...)) so you preserve preexisting rules while adding the pod-scaler
terms.
- Around line 102-136: Add a Close method on BigQueryClient that delegates to
the underlying *bigquery.Client (implement func (bq *BigQueryClient) Close()
error { return bq.client.Close() }) and ensure callers that create the client
via NewBigQueryClient (e.g., where admission.go constructs the client) invoke
bq.Close() during application shutdown or in the relevant cleanup path so the
underlying BigQuery resources are released.
- Around line 342-346: The current check uses pod.Labels[PodScalerLabelKey] and
returns when the label is missing or equals PodScalerLabelValueMeasured, which
incorrectly skips applying historical measurements to normal pods; change the
logic so that pods labeled PodScalerLabelValueMeasured are skipped (to preserve
isolation) and measured data is applied only to pods that are not labeled
measured—i.e., update the condition that uses podScalerLabel, hasLabel,
PodScalerLabelKey and PodScalerLabelValueMeasured to return early when hasLabel
&& podScalerLabel == PodScalerLabelValueMeasured, allowing historical resource
assignment for pods without that label.
In @cmd/pod-scaler/producer.go:
- Around line 411-415: The label selector is incorrect: replace the current
measuredPodSelector value that uses pod_scaler="measured" with the
Prometheus-style label key label_pod_scaler="measured" (underscores replace
hyphens) and keep label_created_by_ci="true"; update measuredPodSelector and any
uses such as podLabelsQuery (which builds kube_pod_labels) so the query becomes
kube_pod_labels with the corrected selector referencing label_pod_scaler and
label_created_by_ci.
- Around line 505-527: The PromQL strings are invalid because they use subquery
timestamp ranges; instead build expressions that QueryRange can sample over
time: replace cpuQuery with a rate expression like
rate(container_cpu_usage_seconds_total{namespace="%s",pod="%s",container!="POD",container!=""}[3m])
(no outer max_over_time or timestamp formatting) and replace memoryQuery with
the raw gauge
container_memory_working_set_bytes{namespace="%s",pod="%s",container!="POD",container!=""}
(again no [start:end] timestamps); keep using client.QueryRange and then compute
the max from the returned samples in Go (the code around cpuResult/memoryResult
processing).
🧹 Nitpick comments (3)
cmd/pod-scaler/measured.go (2)
41-58: Inconsistent JSON tags on struct fields.
MeasuredPodData has a JSON tag on Metadata but not on other exported fields. If this struct is intended for JSON serialization, the other fields should also have JSON tags. If not, consider removing the JSON tag from Metadata for consistency.
263-266: Consider the operational impact of defaulting to "measured" when BigQuery is unavailable. When bqClient is nil, all pods are classified as "measured" and will require isolated nodes. This could cause scheduling issues if BigQuery connectivity is lost or misconfigured. Consider whether a "normal" default might be safer operationally, or add monitoring/alerting for this condition.
cmd/pod-scaler/measured_test.go (1)
151-210: Tests verify structure existence but not selector content. The anti-affinity tests check that affinity structures are created but don't verify the actual LabelSelector content (operator type, values). Consider adding assertions to verify the selector correctly uses Exists for measured pods and In with the "measured" value for normal pods.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (6)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/admission_test.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
🚧 Files skipped from review as they are similar to previous changes (1)
- cmd/pod-scaler/main.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/admission_test.go
- cmd/pod-scaler/admission.go
🧬 Code graph analysis (2)
cmd/pod-scaler/producer.go (2)
- cmd/pod-scaler/storage.go (1): Cache (24-28)
- pkg/metrics/insights.go (1): Context (38-38)
cmd/pod-scaler/admission.go (2)
- pkg/results/report.go (1): PodScalerReporter (174-176)
- cmd/pod-scaler/measured.go (6): BigQueryClient (93-100), NewMeasuredPodCache (69-74), NewBigQueryClient (103-136), ClassifyPod (246-275), AddPodAntiAffinity (280-332), ApplyMeasuredPodResources (337-410)
🔇 Additional comments (10)
cmd/pod-scaler/measured.go (3)
23-39: LGTM - Well-defined constants with clear documentation. The constants are appropriately named and the comments clearly explain the purpose of each value.
68-90: LGTM - Thread-safe cache implementation. The cache correctly uses RLock for reads and Lock for writes, and the atomic map replacement in Update is a clean approach for cache refresh.
223-236: LGTM - Correct staleness check logic. The function properly checks for missing or stale measurement data using the configured retention period.
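A compact sketch of that staleness rule with the 10-day window written out explicitly; the constant and function names here are illustrative, the real ones live in cmd/pod-scaler/measured.go.

```go
package staleness

import "time"

// measuredPodDataRetention mirrors the 10-day retention window described in
// the review; treat the exact value as an assumption of this sketch.
const measuredPodDataRetention = 10 * 24 * time.Hour

// shouldBeMeasured reports whether a workload needs a fresh isolated-node
// measurement: either no measurement exists, or the last one is older than
// the retention window.
func shouldBeMeasured(lastMeasured time.Time, hasData bool, now time.Time) bool {
	if !hasData {
		return true
	}
	return now.Sub(lastMeasured) > measuredPodDataRetention
}
```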
cmd/pod-scaler/measured_test.go (3)
16-61: LGTM - Good test coverage for ShouldBeMeasured. The test covers the three key scenarios: no existing data, recent data (within retention period), and stale data (outside retention period).
63-149: LGTM - Comprehensive test coverage for ClassifyPod. Tests cover the key scenarios including the nil BigQuery client fallback behavior.
212-253: LGTM - Good cache operation tests. Tests verify Get/Update operations and data retrieval correctness.
cmd/pod-scaler/producer.go (1)
648-682: LGTM - BigQuery write implementation. The streaming insert approach is appropriate for the expected data volume. Error handling for JSON marshaling failures is reasonable.
cmd/pod-scaler/admission_test.go (1)
557-557: LGTM - Test updated to match new function signature. The test correctly passes the new authoritativeCPU and authoritativeMemory parameters as false to maintain existing test behavior.
cmd/pod-scaler/admission.go (2)
36-55: LGTM - BigQuery client initialization with proper validation. The code correctly validates that BigQuery project and dataset IDs are provided when measured pods are enabled, and creates the client with appropriate error handling.
117-126: LGTM - Measured pod handling integrated into admission flow. The integration correctly orders the operations: classification, anti-affinity, measured resource application, then standard resource mutation.
// NewBigQueryClient creates a new BigQuery client for measured pods
func NewBigQueryClient(projectID, datasetID, credentialsFile string, cache *MeasuredPodCache, logger *logrus.Entry) (*BigQueryClient, error) {
	ctx := interrupts.Context()
	var opts []option.ClientOption
	if credentialsFile != "" {
		opts = append(opts, option.WithCredentialsFile(credentialsFile))
	}

	client, err := bigquery.NewClient(ctx, projectID, opts...)
	if err != nil {
		return nil, fmt.Errorf("failed to create BigQuery client: %w", err)
	}

	bq := &BigQueryClient{
		client:    client,
		projectID: projectID,
		datasetID: datasetID,
		logger:    logger,
		cache:     cache,
	}

	// Initial refresh on startup
	if err := bq.Refresh(ctx); err != nil {
		logger.WithError(err).Warn("Failed to refresh BigQuery data on startup")
	}

	// Schedule daily refresh
	interrupts.TickLiteral(func() {
		if err := bq.Refresh(interrupts.Context()); err != nil {
			logger.WithError(err).Error("Failed to refresh BigQuery data")
		}
	}, BigQueryRefreshInterval)

	return bq, nil
}
🧩 Analysis chain
🏁 Script executed:
# Search for BigQuery client Close calls
rg -n 'bqClient.*Close|BigQueryClient.*Close' --type=go
Repository: openshift/ci-tools
Length of output: 117
🏁 Script executed:
# Also search more broadly for Close patterns with bigquery
rg -n 'Close\(\)' cmd/pod-scaler/measured.go
Repository: openshift/ci-tools
Length of output: 44
🏁 Script executed:
# Check the full measured.go file structure
wc -l cmd/pod-scaler/measured.go
Repository: openshift/ci-tools
Length of output: 92
🏁 Script executed:
# Look for the BigQueryClient struct definition and its methods
ast-grep --pattern $'type BigQueryClient struct {
$$$
}'
Repository: openshift/ci-tools
Length of output: 484
🏁 Script executed:
# Search for cleanup/defer patterns and Close methods in the codebase
rg -n 'defer.*Close|\.Close\(\)' cmd/pod-scaler/ --type=go
Repository: openshift/ci-tools
Length of output: 327
🏁 Script executed:
# Check the context around the Close call in producer.go
sed -n '85,115p' cmd/pod-scaler/producer.go
Repository: openshift/ci-tools
Length of output: 1152
🏁 Script executed:
# Look for NewBigQueryClient calls to understand the ownership flow
rg -n 'NewBigQueryClient' cmd/pod-scaler/ --type=go -B2 -A2
Repository: openshift/ci-tools
Length of output: 953
🏁 Script executed:
# Check the full context in admission.go around the bqClient variable
sed -n '1,60p' cmd/pod-scaler/admission.go
Repository: openshift/ci-tools
Length of output: 2338
🏁 Script executed:
# Search for any Close method on BigQueryClient
rg -n 'func.*BigQueryClient.*Close' --type=go
Repository: openshift/ci-tools
Length of output: 44
🏁 Script executed:
# Look for the bqClient variable scope in admission.go
rg -n 'var bqClient' cmd/pod-scaler/ --type=go -B2 -A5
Repository: openshift/ci-tools
Length of output: 1215
Add a Close method to BigQueryClient to properly clean up the underlying BigQuery client.
The BigQueryClient struct holds a *bigquery.Client but lacks a Close() method to clean it up. In admission.go, the client is created once and never closed, potentially leaving resources open for the application's lifetime. Add a Close() method that delegates to the underlying client's Close():
func (bq *BigQueryClient) Close() error {
	return bq.client.Close()
}
Then call it during shutdown or wrap the creation with proper cleanup.
🤖 Prompt for AI Agents
In @cmd/pod-scaler/measured.go around lines 102 - 136, Add a Close method on
BigQueryClient that delegates to the underlying *bigquery.Client (implement func
(bq *BigQueryClient) Close() error { return bq.client.Close() }) and ensure
callers that create the client via NewBigQueryClient (e.g., where admission.go
constructs the client) invoke bq.Close() during application shutdown or in the
relevant cleanup path so the underlying BigQuery resources are released.
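One way to wire that cleanup in, sketched generically against io.Closer; it assumes the prow interrupts helper already used elsewhere in this binary (the exact import path may differ by prow version), so treat it as a shape rather than a drop-in.

```go
package shutdown

import (
	"io"
	"log"

	// Import path assumed; older prow releases expose the same package under
	// k8s.io/test-infra/prow/interrupts.
	"sigs.k8s.io/prow/pkg/interrupts"
)

// registerCloser closes c when the process is interrupted, so long-lived
// clients (such as the BigQuery client above) are released at shutdown
// instead of leaking for the binary's lifetime.
func registerCloser(c io.Closer) {
	interrupts.OnInterrupt(func() {
		if err := c.Close(); err != nil {
			log.Printf("failed to close client: %v", err)
		}
	})
}
```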
Force-pushed from 2b3e606 to 8c9bdcd
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/pod-scaler/main.go (1)
120-168: Missing validation for BigQuery configuration when measured pods are enabled. When enableMeasuredPods is true, the BigQuery fields (bigQueryProjectID, bigQueryDatasetID) should be validated as required. Currently, the feature can be enabled without proper BigQuery configuration, which would cause runtime failures or unexpected behavior (all pods defaulting to "measured" classification).
Proposed fix in validate():
	case "producer":
+		if o.producerOptions.enableMeasuredPods {
+			if o.producerOptions.bigQueryProjectID == "" {
+				return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
+			}
+			if o.producerOptions.bigQueryDatasetID == "" {
+				return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
+			}
+		}
		return o.kubernetesOptions.Validate(false)
Add similar validation for consumer.admission mode.
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 329-331: The current assignment to
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
overwrites any existing rules; instead, ensure Affinity and PodAntiAffinity are
non-nil, then append requiredTerms to the existing slice (preserving prior
entries) rather than replacing it. Concretely, check/create pod.Spec.Affinity
and pod.Spec.Affinity.PodAntiAffinity if nil, then set
RequiredDuringSchedulingIgnoredDuringExecution = append(existingSlice,
requiredTerms...), handling the case where existingSlice may be nil.
- Around line 263-266: The current else branch defaults shouldBeMeasured = true
when bqClient is nil; change this to default to "normal" by setting
shouldBeMeasured = false, or if the feature flag for BigQuery-based measurement
is enabled, fail-fast by returning an error/denying admission when bqClient ==
nil; update the logic in the function that computes shouldBeMeasured (the block
referencing bqClient and the shouldBeMeasured variable in measured.go) so it
either sets shouldBeMeasured = false when no bqClient is available or returns an
explicit error when the BigQuery measurement feature is enabled but bqClient is
nil.
- Around line 205-214: containerDurations is created empty and never populated
from row.ContainerDurations, so MeasuredPodData.ContainerDurations remains blank
and ApplyMeasuredPodResources cannot find the longest-running container; parse
the JSON string in row.ContainerDurations into a map[string]time.Duration
(handle errors) and populate MeasuredPodData.ContainerDurations before assigning
to data[meta]; update the parsing logic near the containerDurations declaration
and ensure the key names match what ApplyMeasuredPodResources expects (use the
same container name keys) and convert parsed duration values into time.Duration.
- Around line 342-346: The condition currently checks for
PodScalerLabelValueMeasured and returns early, preventing measured data from
being applied; invert this logic to only skip pods labeled as "measured" and
proceed for "normal" pods by changing the check to verify
pod.Labels[PodScalerLabelKey] == PodScalerLabelValueMeasured -> return,
otherwise continue (or explicitly check for PodScalerLabelValueNormal to apply
measured resources). Update the podScalerLabel/hasLabel branch accordingly so
measured pods are skipped and normal-labelled pods receive the cached measured
recommendations.
🧹 Nitpick comments (1)
cmd/pod-scaler/main.go (1)
290-290: Consider using an options struct to reduce parameter count. The admit() call has 16+ positional parameters, making it difficult to read and prone to ordering errors. Consider grouping related parameters into a struct:
type admitOptions struct {
	port, healthPort int
	certDir          string
	// ... other fields
	measuredPodsConfig *MeasuredPodsConfig
}
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/producer.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (1)
cmd/pod-scaler/measured.go (2)
- pkg/api/types.go (1): Metadata (161-166)
- pkg/pod-scaler/types.go (2): FullMetadata (374-385), MetadataFor (203-208)
🔇 Additional comments (3)
cmd/pod-scaler/measured.go (3)
76-90: LGTM: Thread-safe cache implementation. The MeasuredPodCache correctly uses RWMutex for concurrent read access while protecting writes. The Update method atomically replaces the entire map, which is appropriate for the refresh pattern.
277-327: Anti-affinity logic for pod isolation is well-structured. The differentiation between measured pods (avoid ALL pod-scaler labeled pods) and normal pods (only avoid measured pods) correctly implements the isolation requirements. The use of RequiredDuringSchedulingIgnoredDuringExecution ensures hard anti-affinity. Note: the overwrite issue at line 330 is addressed in a separate comment.
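To make the append-don't-overwrite point concrete, a self-contained sketch using the upstream core/v1 types; the helper and the example term are illustrative, not the PR's AddPodAntiAffinity.

```go
package antiaffinity

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// addRequiredAntiAffinity appends terms to any anti-affinity the pod already
// carries instead of overwriting it, which is the merge behavior the review
// asks for.
func addRequiredAntiAffinity(pod *corev1.Pod, terms []corev1.PodAffinityTerm) {
	if pod.Spec.Affinity == nil {
		pod.Spec.Affinity = &corev1.Affinity{}
	}
	if pod.Spec.Affinity.PodAntiAffinity == nil {
		pod.Spec.Affinity.PodAntiAffinity = &corev1.PodAntiAffinity{}
	}
	existing := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
	pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = append(existing, terms...)
}

// measuredPodTerm is an example hard anti-affinity term keeping a measured pod
// off nodes that already run any pod carrying the given pod-scaler label key.
func measuredPodTerm(labelKey string) corev1.PodAffinityTerm {
	return corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      labelKey,
				Operator: metav1.LabelSelectorOpExists,
			}},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
}
```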
143-164: Verify BigQuery schema before merging. Confirm that the ci_operator_metrics table contains the columns referenced in the query: max_cpu, max_memory, container_durations, and pod_scaler_label. If the actual schema differs, update the query and any dependent logic accordingly.
Force-pushed from 8c9bdcd to 03ff253
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 348-350: The current assignment overwrites any existing
PodAntiAffinity terms; instead of replacing
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
with requiredTerms, merge by appending requiredTerms onto the existing slice:
ensure PodAntiAffinity and RequiredDuringSchedulingIgnoredDuringExecution are
non-nil, then set RequiredDuringSchedulingIgnoredDuringExecution =
append(existingSlice, requiredTerms...), so existing rules are preserved and new
ones added.
In @cmd/pod-scaler/producer.go:
- Around line 512-525: The PromQL in cpuQuery and memoryQuery is invalid because
max_over_time was given RFC3339 timestamps instead of a range vector; remove the
max_over_time wrapper and the timestamp window from the selector and instead
query the raw series over time (e.g., use
`rate(container_cpu_usage_seconds_total{namespace="%s",pod="%s",container!="POD",container!=""}[3m])`
for CPU and `container_memory_working_set_bytes{...}` for memory) by formatting
without from.Format/until.Format, keep using client.QueryRange, and compute the
peak values in Go (the existing post-query min/max logic) rather than embedding
timestamps into PromQL.
- Around line 512-537: The PromQL queries (cpuQuery, memoryQuery,
containerInfoQuery) directly interpolate namespace/podName and are vulnerable to
query failures or injection from special characters; add a helper
escapePromQLValue(s string) that escapes backslashes, double quotes, and
newlines (e.g., replace `\` -> `\\`, `"` -> `\"`, `\n` -> `\\n`) and apply it to
namespace, podName, and any other label values before formatting the PromQL
strings so all fmt.Sprintf calls use escaped values.
🧹 Nitpick comments (3)
cmd/pod-scaler/producer.go (1)
547-607: Consider extracting duplicated min/max calculation logic. The min/max calculation pattern is repeated for CPU (lines 553-579), memory (lines 582-607), node CPU (lines 627-645), and node memory (lines 660-678). This could be simplified with a helper function.
♻️ Example helper
func computeMinMax[T float64 | int64](values []T) (min, max T) {
	if len(values) == 0 {
		return
	}
	min, max = values[0], values[0]
	for _, v := range values[1:] {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	return
}
cmd/pod-scaler/admission.go (2)
36-36: Function signature has 16 parameters; consider using an options struct. The admit function has grown to 16 parameters, making it difficult to read and maintain. This is a good candidate for consolidation into a configuration struct.
♻️ Example refactor
type admitOptions struct {
	port, healthPort        int
	certDir                 string
	mutateResourceLimits    bool
	cpuCap                  int64
	memoryCap               string
	cpuPriorityScheduling   int64
	authoritativeCPU        bool
	authoritativeMemory     bool
	enableMeasuredPods      bool
	bigQueryProjectID       string
	bigQueryDatasetID       string
	bigQueryCredentialsFile string
}

func admit(opts admitOptions, client buildclientv1.BuildV1Interface, loaders map[string][]*cacheReloader, reporter results.PodScalerReporter) {
	// ...
}
321-321: Unused parameters authoritativeCPU and authoritativeMemory add complexity. These parameters are passed through the entire call chain but explicitly ignored with blank identifiers. If they're planned for future use, consider adding a TODO comment. Otherwise, remove them to simplify the code.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/main.go
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/producer.go (2)
- cmd/pod-scaler/storage.go (1): Cache (24-28)
- pkg/metrics/insights.go (1): Context (38-38)
cmd/pod-scaler/admission.go (1)
- cmd/pod-scaler/measured.go (3): BigQueryClient (94-101), NewMeasuredPodCache (70-75), NewBigQueryClient (104-137)
🔇 Additional comments (4)
cmd/pod-scaler/admission.go (1)
117-126: LGTM! The measured pod handling flow is well-structured: classify the pod, add anti-affinity rules based on classification, then apply measured resources before the standard mutation. Each function handles its own early-return logic appropriately.
cmd/pod-scaler/measured.go (3)
103-137: LGTM! The BigQuery client initialization is well-structured with proper error handling, initial refresh on startup, and scheduled daily refresh. The use of query labels for tracking is a good practice.
204-225: LGTM! The two-phase JSON parsing for container_durations provides good backwards compatibility, first trying nanosecond integers and falling back to string duration format.
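A standalone sketch of that two-phase decode; the column name and the two accepted formats are taken from the review text, and error handling is simplified.

```go
package durations

import (
	"encoding/json"
	"time"
)

// parseContainerDurations decodes a container_durations payload that may hold
// either integer nanoseconds or Go duration strings per container. The first
// decode attempt handles the integer form; on failure it falls back to the
// string form, mirroring the backwards-compatible parsing described above.
func parseContainerDurations(raw string) (map[string]time.Duration, error) {
	out := map[string]time.Duration{}
	var asInts map[string]int64
	if err := json.Unmarshal([]byte(raw), &asInts); err == nil {
		for name, ns := range asInts {
			out[name] = time.Duration(ns)
		}
		return out, nil
	}
	var asStrings map[string]string
	if err := json.Unmarshal([]byte(raw), &asStrings); err != nil {
		return nil, err
	}
	for name, s := range asStrings {
		d, err := time.ParseDuration(s)
		if err != nil {
			return nil, err
		}
		out[name] = d
	}
	return out, nil
}
```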
282-285: This behavior is intentional and tested; no issue here. The code defaults to measuring all pods when BigQuery is unavailable by design. This is a conservative safety default: if BigQuery isn't configured, measuring everything ensures no pods slip through unmeasured. The comment at line 283 documents this intent, and the test suite explicitly verifies this behavior with "Pod with nil BigQuery client - should default to measured."
If this causes scheduling concerns in production, the issue is configuration: ensure enable-measured-pods is only used with valid BigQuery credentials, or disable the feature entirely.
Likely an incorrect or invalid review comment.
Force-pushed from 03ff253 to 41d0e12
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/producer.go:
- Around line 101-123: The BigQuery client bqClient is closed via defer inside
produce(), which returns immediately in non-once mode causing scheduled ticks
(execute/interrupts.TickLiteral) to use a closed client; instead register a
shutdown cleanup with interrupts.Run to close bqClient when the global interrupt
context is cancelled (use interrupts.Context()/interrupts.Run), or move bqClient
lifecycle to the caller so it outlives produce(); ensure the
collectMeasuredPodMetrics calls continue to use the live bqClient until the
registered shutdown handler closes it.
🧹 Nitpick comments (7)
cmd/pod-scaler/main.go (3)
74-81: Unused fields in consumerOptions. The consumerOptions struct defines enableMeasuredPods, bigQueryProjectID, bigQueryDatasetID, and bigQueryCredentialsFile, but mainAdmission on line 287 uses producerOptions fields instead. Either remove these fields from consumerOptions or update mainAdmission to use the consumer-specific options.
88-93: Flags only bind to producerOptions. These flags only bind to producerOptions, meaning consumerOptions fields remain empty. If consumer mode should support different BigQuery settings, add separate flag bindings. Otherwise, remove the duplicate fields from consumerOptions.
286-288: Clarify why producerOptions is used for consumer mode. The comment explains the intent, but it's inconsistent with having separate consumerOptions fields. Consider removing the duplicate fields from consumerOptions to avoid confusion.
cmd/pod-scaler/producer.go (2)
567-626: Consider extracting min/max computation into a helper function. The min/max computation logic is duplicated for CPU (lines 572-598) and memory (lines 600-626). Extracting this into a generic helper would reduce duplication.
800-800: Unused parameter projectID. The projectID parameter is named with a blank identifier _, indicating it's unused. Either remove it from the signature or use it in the function body.
Suggested fix:
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
+func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {
And update the call site on line 480:
-if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil {
+if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {
cmd/pod-scaler/admission.go (2)
36-36: Consider using a config struct to reduce parameter count. The admit function now has 16 parameters, making it hard to maintain. Consider grouping related parameters into configuration structs.
321-321: Unused authoritative parameters. authoritativeCPU and authoritativeMemory parameters are declared with blank identifiers, indicating they're unused. If these are planned for future use, consider adding a TODO comment; otherwise, remove them from the signature.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (3)
cmd/pod-scaler/producer.go (2)
- cmd/pod-scaler/storage.go (1): Cache (24-28)
- pkg/metrics/insights.go (1): Context (38-38)
cmd/pod-scaler/admission.go (1)
- cmd/pod-scaler/measured.go (6): BigQueryClient (94-101), NewMeasuredPodCache (70-75), NewBigQueryClient (104-137), ClassifyPod (265-295), AddPodAntiAffinity (300-357), ApplyMeasuredPodResources (363-437)
cmd/pod-scaler/measured.go (2)
- pkg/api/types.go (2): Metadata (161-166), ResourceList (261-261)
- pkg/pod-scaler/types.go (2): FullMetadata (374-385), MetadataFor (203-208)
🔇 Additional comments (17)
cmd/pod-scaler/measured_test.go (4)
16-61: LGTM! Good coverage of ShouldBeMeasured edge cases. The test correctly validates the three key scenarios: no data exists (should measure), recent data within 10 days (should not measure), and stale data beyond 10 days (should measure).
63-149: LGTM! ClassifyPod test cases are well-designed. Good coverage of pod classification logic including the nil BigQuery client fallback to normal classification, which prevents overwhelming isolated nodes when BigQuery is unavailable.
151-210: LGTM! Anti-affinity tests verify scheduling isolation rules. The test correctly validates that measured pods avoid all pod-scaler labeled pods, normal pods avoid only measured pods, and unlabeled pods receive no anti-affinity rules.
212-253: LGTM! Cache tests cover basic Get/Update operations. The test validates cache miss, update, and subsequent cache hit with correct data retrieval.
cmd/pod-scaler/main.go (1)
262-263: LGTM! Producer options correctly passed to produce function.The measured pods configuration is properly propagated.
cmd/pod-scaler/producer.go (3)
42-54: LGTM! PromQL label value escaping is thorough. The function correctly escapes backslashes first (before quotes), then quotes, newlines, and carriage returns to prevent PromQL injection.
830-832: Verify BigQuery column alignment.
Created and LastMeasured both use pod.Timestamp. Ensure the BigQuery table schema has both columns and they should have the same value for initial writes.
426-427: Prometheus selector requires a label that doesn't appear to be set anywhere. The selector {pod_scaler="measured",label_created_by_ci="true"} requires the label_created_by_ci label, but this label is not referenced anywhere else in the codebase; it's not defined as a constant, set on pods, or used in the BigQuery query that sources measured pod data (which only checks pod_scaler_label = 'measured'). This suggests the label may never be applied to measured pods, making the selector ineffective or incomplete. Verify whether label_created_by_ci should be removed from the selector or if the label application logic is missing.
cmd/pod-scaler/admission.go (2)
43-55: LGTM! BigQuery client initialization with proper validation. The code correctly validates that BigQuery project and dataset IDs are provided when measured pods are enabled, and handles initialization failures appropriately.
117-126: LGTM! Measured pod handling correctly integrated into admission flow. The ordering is correct: classify first, then add anti-affinity, then apply measured resources before regular resource mutation.
cmd/pod-scaler/measured.go (7)
cmd/pod-scaler/measured.go (7)
24-40: LGTM! Well-documented constants with clear purpose.The constants clearly define the pod classification labels, data retention period (10 days), and refresh interval (24 hours).
61-91: LGTM! Thread-safe cache implementation.The cache correctly uses RWMutex for read/write synchronization. The Update method replaces the entire map, which is appropriate for a full cache refresh from BigQuery.
204-225: LGTM! Robust JSON parsing with fallback.The code handles both int64 (nanoseconds) and string duration formats for backwards compatibility when parsing container durations from BigQuery.
265-295: LGTM! ClassifyPod correctly handles all scenarios.The function properly initializes labels if nil, checks all containers for measurement needs, and defaults to normal when BigQuery is unavailable to prevent overwhelming isolated nodes.
300-357: LGTM! Anti-affinity rules correctly implemented.Measured pods get complete isolation by avoiding all pod-scaler labeled pods, while normal pods only avoid measured pods. The code properly merges with existing anti-affinity terms rather than overwriting.
363-437: LGTM! Resource application logic is well-designed.The function correctly:
- Only applies to "normal" pods (those with measured data, not those being measured)
- Finds the longest-running container as the main workload
- Falls back to first container if no duration data
- Applies 20% safety buffer to measured values
195-202: Potential metadata mismatch between cache key and lookup. The FullMetadata reconstructed from BigQuery only populates Org, Repo, Branch, and Container. However, podscaler.MetadataFor (used in ClassifyPod and ApplyMeasuredPodResources) may also populate additional fields like Target, Step, and Pod. If the cache lookup mechanism compares complete metadata objects rather than just the key fields, this could result in cache misses when the same logical pod metric is queried multiple times.
1fd5ab9 to
515ce35
Compare
/pipeline required
Scheduling required tests: Scheduling tests matching the
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/measured.go:
- Around line 195-202: The FullMetadata constructed in measured.go (meta :=
podscaler.FullMetadata {...}) only sets Metadata and Container, causing cache
key mismatches with MetadataFor/ShouldBeMeasured which expect Pod and
Target/Step; update the BigQuery SELECT and the code that builds meta to include
and assign Target and Pod (and Step if available) from the query result (or
modify the query to join the pod label data) so the FullMetadata keys stored
from BigQuery match the fully-populated metadata used during admission (refer to
the meta variable construction and the MetadataFor/ShouldBeMeasured code paths).
In @cmd/pod-scaler/producer.go:
- Around line 654-656: The node-level CPU query currently includes a namespace
filter which causes it to miss workloads from other namespaces; update the
PromQL used to build nodeCPUQuery (and the similar query referenced around line
689) to remove the namespace selector so it only filters by node and container
(e.g., container!="POD",container!="" and node="%s") thereby capturing total CPU
usage on the node for isolation validation; locate the string construction that
sets nodeCPUQuery and the analogous variable near line 689 and remove the
namespace="%s" clause from the metric selector.
- Around line 442-446: The PromQL selector used for the kube_pod_labels query is
wrong: change the measuredPodSelector value from
`{pod_scaler="measured",label_created_by_ci="true"}` to use the label_ prefix
that kube-state-metrics exposes, e.g.
`{label_pod_scaler="measured",label_created_by_ci="true"}`, so the
podLabelsQuery (fmt.Sprintf(`kube_pod_labels%s`, measuredPodSelector)) will
match the actual kube_pod_labels metric labels.
🧹 Nitpick comments (5)
cmd/pod-scaler/measured_test.go (1)
151-210: Consider verifying the anti-affinity term details.The test validates that anti-affinity is set, but doesn't assert on the specific label selector or operator used. This could miss regressions where the anti-affinity logic changes.
♻️ Optional: Add assertions for anti-affinity term details
// Test case 1: Measured pod should avoid all pod-scaler labeled pods measuredPod := &corev1.Pod{ ObjectMeta: metav1.ObjectMeta{ Name: "measured-pod", Labels: map[string]string{ PodScalerLabelKey: PodScalerLabelValueMeasured, }, }, Spec: corev1.PodSpec{}, } AddPodAntiAffinity(measuredPod, logger) if measuredPod.Spec.Affinity == nil { t.Fatal("Expected affinity to be set") } if measuredPod.Spec.Affinity.PodAntiAffinity == nil { t.Fatal("Expected podAntiAffinity to be set") } - if len(measuredPod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution) == 0 { - t.Fatal("Expected required anti-affinity terms to be set") - } + terms := measuredPod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution + if len(terms) == 0 { + t.Fatal("Expected required anti-affinity terms to be set") + } + // Verify measured pods avoid ALL pod-scaler labeled pods (OpExists) + if terms[0].LabelSelector.MatchExpressions[0].Operator != metav1.LabelSelectorOpExists { + t.Errorf("Expected OpExists for measured pod, got %v", terms[0].LabelSelector.MatchExpressions[0].Operator) + }cmd/pod-scaler/main.go (1)
81-86: Add validation for BigQuery flags when measured pods are enabled. The flag definitions are correct, but validation that bigquery-project-id and bigquery-dataset-id are required when enable-measured-pods is true should be in validate(), not deferred to runtime in admission.go. Early validation provides better UX.
♻️ Add validation in options.validate()
Add this check in the validate() function around line 157:
if o.producerOptions.enableMeasuredPods {
	if o.producerOptions.bigQueryProjectID == "" {
		return errors.New("--bigquery-project-id is required when --enable-measured-pods is true")
	}
	if o.producerOptions.bigQueryDatasetID == "" {
		return errors.New("--bigquery-dataset-id is required when --enable-measured-pods is true")
	}
}
816-856: Unused parameterprojectIDin writeMeasuredPodsToBigQuery.The
projectIDparameter is explicitly ignored with_. If it's not needed (sincebqClientalready has the project context), consider removing it from the signature to reduce confusion.♻️ Remove unused parameter
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error { +func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {And update the call site at line 496:
- if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil { + if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {cmd/pod-scaler/admission.go (1)
321-321: Unused parametersauthoritativeCPUandauthoritativeMemory.These parameters are passed through but not used in the function body. The summary mentions they were added as new fields, but the implementation ignores them with
_. If these are placeholders for future work, consider adding a TODO comment.♻️ Add TODO comment or remove unused parameters
If these are intended for future use:
func mutatePodResources(pod *corev1.Pod, server *resourceServer, mutateResourceLimits bool, cpuCap int64, memoryCap string, authoritativeCPU, authoritativeMemory bool /* TODO: implement authoritative mode */, reporter results.PodScalerReporter, logger *logrus.Entry) {Otherwise, consider removing them until they're needed.
cmd/pod-scaler/measured.go (1)
398-406: Fallback to first container may not be ideal.When no duration data exists, the code falls back to the first container. However, if the first container happens to be a sidecar (e.g.,
istio-proxy), resources would be applied to the wrong container. Consider logging a warning when using the fallback.♻️ Add warning log for fallback case
// If we don't have duration data, just use the first container as a fallback. if longestContainer == nil && len(pod.Spec.Containers) > 0 { longestContainer = &pod.Spec.Containers[0] + logger.Warnf("No duration data available, falling back to first container: %s", longestContainer.Name) }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (4)
cmd/pod-scaler/producer.go (2)
- cmd/pod-scaler/storage.go (1): Cache (24-28)
- pkg/metrics/insights.go (1): Context (38-38)
cmd/pod-scaler/admission.go (1)
- cmd/pod-scaler/measured.go (6): BigQueryClient (94-101), NewMeasuredPodCache (70-75), NewBigQueryClient (104-137), ClassifyPod (265-295), AddPodAntiAffinity (300-357), ApplyMeasuredPodResources (363-437)
cmd/pod-scaler/measured_test.go (2)
- cmd/pod-scaler/measured.go (8): NewMeasuredPodCache (70-75), BigQueryClient (94-101), MeasuredPodData (44-59), ClassifyPod (265-295), PodScalerLabelKey (27-27), PodScalerLabelValueMeasured (33-33), PodScalerLabelValueNormal (30-30), AddPodAntiAffinity (300-357)
- pkg/pod-scaler/types.go (2): FullMetadata (374-385), MetadataFor (203-208)
cmd/pod-scaler/measured.go (2)
- pkg/api/types.go (1): Metadata (161-166)
- pkg/pod-scaler/types.go (2): FullMetadata (374-385), MetadataFor (203-208)
🔇 Additional comments (15)
cmd/pod-scaler/measured_test.go (3)
16-61: LGTM - Good test coverage for ShouldBeMeasured.The test covers the three key scenarios: no data (should measure), recent data (should not measure), and stale data (should measure). The time boundaries align with the 10-day retention constant.
63-149: LGTM - ClassifyPod test covers essential paths.Test cases appropriately cover:
- Pod with no cached data → measured classification
- Pod with recent data → normal classification
- Nil BigQuery client → defaults to normal (graceful degradation)
The use of
MetadataForin test case 2 ensures the metadata construction matches what happens in production.
212-253: LGTM - Cache test covers basic Get/Update operations.The test validates cache miss, update, and retrieval with correct field values.
cmd/pod-scaler/main.go (3)
54-62: LGTM - producerOptions struct extended appropriately.The new fields for measured pods configuration are logically grouped with other producer settings.
255-256: LGTM - Producer invocation updated with new parameters.The parameters are passed correctly to the
producefunction.
279-281: LGTM - Admission invocation updated with measured pods config.The comment clarifies the shared use of producerOptions, and the parameters are passed correctly.
cmd/pod-scaler/producer.go (2)
42-54: LGTM - PromQL escaping prevents injection.The escaping handles the critical characters (backslash, quote, newline, carriage return) in the correct order.
101-139: LGTM - BigQuery client lifecycle is well-managed.Good use of
sync.Oncefor cleanup andinterrupts.OnInterruptfor graceful shutdown. The comment explains the rationale for closing in once mode vs non-once mode.cmd/pod-scaler/admission.go (2)
43-55: LGTM - BigQuery client initialization with proper validation.The validation for required flags when
enableMeasuredPodsis true is appropriate here as a fatal error during admission startup.
117-126: LGTM - Measured pod logic integration is well-structured.The order of operations is correct:
- Classify pod (set label)
- Add anti-affinity (based on label)
- Apply measured resources (for "normal" pods with data)
- Then apply regular resource mutation
This ensures measured data takes precedence before histogram-based recommendations are applied.
cmd/pod-scaler/measured.go (5)
24-40: LGTM - Well-documented constants.The constants are appropriately named and documented with clear explanations of their purpose.
85-91: Cache.Update replaces the entire map atomically.The current implementation replaces the entire cache map on update. This is safe but means any concurrent
Getcalls see either the old or new data entirely. If incremental updates are needed in the future, consider a merge strategy.
146-163: BigQuery query uses parameterized values safely.The query uses
fmt.Sprintfwith values that come from struct fields (projectID,datasetID) and constants (MeasuredPodDataRetentionDays), not user input. This is safe from injection.
349-356: LGTM - Anti-affinity terms are merged correctly.The implementation properly appends to existing terms rather than overwriting, preserving any pre-existing anti-affinity rules on the pod.
422-428: LGTM - Resource application with 20% buffer.The 20% buffer on measured CPU/memory aligns with the existing
useOursIfLargerlogic in admission.go (line 246), providing consistency in safety margins.
Force-pushed from 515ce35 to 8fe3660
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @cmd/pod-scaler/producer.go:
- Around line 42-54: The escapePromQLLabelValue function currently escapes
backslashes, double quotes, newlines and carriage returns but omits tab
characters; update escapePromQLLabelValue to also replace tab characters by
adding a replacement of "\t" with `\t` (ensure this replacement is applied
alongside the existing strings.ReplaceAll calls in the function so tabs are
properly escaped in PromQL label values).
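A sketch of the escape helper with the tab case added, in the spirit of the prompt above; the function name comes from the PR, but treat the body as an illustrative reconstruction. Ordering matters: backslashes must be escaped before the other replacements.

```go
package promql

import "strings"

// escapePromQLLabelValue escapes characters that would otherwise break a
// quoted PromQL label matcher. The tab replacement at the end is the addition
// the comment above asks for.
func escapePromQLLabelValue(s string) string {
	s = strings.ReplaceAll(s, `\`, `\\`) // must run first to avoid double-escaping
	s = strings.ReplaceAll(s, `"`, `\"`)
	s = strings.ReplaceAll(s, "\n", `\n`)
	s = strings.ReplaceAll(s, "\r", `\r`)
	s = strings.ReplaceAll(s, "\t", `\t`)
	return s
}
```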
🧹 Nitpick comments (6)
cmd/pod-scaler/main.go (1)
81-86: Consider adding validation for BigQuery flags when measured pods are enabled.The flag bindings are correct, but there's no validation in
validate()to ensurebigquery-project-idandbigquery-dataset-idare provided whenenable-measured-podsis true for producer mode. The admission mode validates this (in admission.go line 45-47), but the producer mode relies on the condition check in producer.go line 104 which silently skips BigQuery setup if IDs are empty.This asymmetry could lead to confusion where measured pods is "enabled" but silently does nothing in producer mode.
Suggested validation in validate()
func (o *options) validate() error { switch o.mode { case "producer": - return o.kubernetesOptions.Validate(false) + if err := o.kubernetesOptions.Validate(false); err != nil { + return err + } + if o.producerOptions.enableMeasuredPods { + if o.producerOptions.bigQueryProjectID == "" || o.producerOptions.bigQueryDatasetID == "" { + return errors.New("--bigquery-project-id and --bigquery-dataset-id are required when --enable-measured-pods is true") + } + } case "consumer.ui":cmd/pod-scaler/producer.go (2)
529-800: Consider extracting helper functions from this large function.
extractMeasuredPodDatais ~270 lines and handles multiple responsibilities:
- Pod metadata extraction
- CPU metrics collection and min/max calculation
- Memory metrics collection and min/max calculation
- Container duration extraction
- Node-level validation metrics
- Record assembly
The min/max calculation loops (lines 591-617 and 619-645) are nearly identical and could be extracted.
Example helper for min/max extraction
// findMinMax extracts min and max values from a Prometheus matrix for a given value extractor func findMinMax[T constraints.Ordered](matrix model.Matrix, containerKey string, getValue func(model.SampleValue) T) (map[string]T, map[string]T) { minByContainer := make(map[string]T) maxByContainer := make(map[string]T) for _, stream := range matrix { container := string(stream.Metric[containerKey]) var minVal, maxVal T first := true for _, sample := range stream.Values { val := getValue(sample.Value) if first { minVal, maxVal = val, val first = false } else { if val < minVal { minVal = val } if val > maxVal { maxVal = val } } } // Only record if we have data if !first { minByContainer[container] = minVal maxByContainer[container] = maxVal } } return minByContainer, maxByContainer }
824-866: Unused parameter should be removed.The
projectIDparameter is explicitly ignored with_. If it's not needed, remove it from the function signature.Suggested fix
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error { +func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {And update the call site at line 497:
- if err := writeMeasuredPodsToBigQuery(ctx, bqClient, projectID, datasetID, allMeasuredPods, logger); err != nil { + if err := writeMeasuredPodsToBigQuery(ctx, bqClient, datasetID, allMeasuredPods, logger); err != nil {cmd/pod-scaler/admission.go (1)
36-66: Function has too many parameters - consider using an options struct.The
admit()function now has 16 parameters, which makes it difficult to maintain and call correctly. Consider grouping related parameters into configuration structs.Suggested refactor using options struct
type admitOptions struct { port int healthPort int certDir string mutateResourceLimits bool cpuCap int64 memoryCap string cpuPriorityScheduling int64 authoritativeCPURequests bool authoritativeMemoryRequests bool enableMeasuredPods bool bigQueryProjectID string bigQueryDatasetID string bigQueryCredentialsFile string } func admit(opts admitOptions, client buildclientv1.BuildV1Interface, loaders map[string][]*cacheReloader, reporter results.PodScalerReporter) { // ... }cmd/pod-scaler/measured.go (2)
146-165: Use parameterized queries to prevent SQL injection risks.While the values being interpolated (
projectID,datasetID,MeasuredPodDataRetentionDays) are from trusted configuration, using string interpolation in SQL queries is a risky pattern. BigQuery supports parameterized queries that would be safer.Suggested safer query construction
- query := bq.client.Query(fmt.Sprintf(` + queryStr := ` SELECT org, repo, branch, target, container, pod_name, MAX(max_cpu) as max_cpu, MAX(max_memory) as max_memory, MAX(created) as last_measured, ANY_VALUE(container_durations) as container_durations FROM - `+"`%s.%s.ci_operator_metrics`"+` + ` + fmt.Sprintf("`%s.%s.ci_operator_metrics`", bq.projectID, bq.datasetID) + ` WHERE pod_scaler_label = 'measured' - AND created >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL %d DAY) + AND created >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL @retention_days DAY) GROUP BY org, repo, branch, target, container, pod_name - `, bq.projectID, bq.datasetID, MeasuredPodDataRetentionDays)) + ` + query := bq.client.Query(queryStr) + query.Parameters = []bigquery.QueryParameter{ + {Name: "retention_days", Value: MeasuredPodDataRetentionDays}, + }Note: Table names cannot be parameterized in BigQuery, so the project/dataset interpolation must remain as-is.
404-421: Fallback to first container when no duration data is reasonable.If no container duration data exists, using the first container as fallback is a sensible default. However, this means the first container might get resource increases even if it's a sidecar.
Consider logging at info/warn level when falling back, so operators can identify pods with missing duration data.
Suggested improvement
// If we don't have duration data, just use the first container as a fallback. if longestContainer == nil && len(pod.Spec.Containers) > 0 { longestContainer = &pod.Spec.Containers[0] + logger.Infof("No container duration data available, falling back to first container: %s", longestContainer.Name) }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
✅ Files skipped from review due to trivial changes (1)
- cmd/pod-scaler/measured_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/main.go
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (2)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (6)
BigQueryClient (94-101), NewMeasuredPodCache (70-75), NewBigQueryClient (104-137), ClassifyPod (271-301), AddPodAntiAffinity (306-363), ApplyMeasuredPodResources (369-443)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata (161-166)
pkg/pod-scaler/types.go (2)
FullMetadata (374-385), MetadataFor (203-208)
🔇 Additional comments (15)
cmd/pod-scaler/main.go (3)
54-62: New BigQuery configuration fields look correct. The producerOptions struct is cleanly extended with the measured pods configuration. The field naming is consistent and clear.
255-256: Function call updated correctly for new parameters. The produce() invocation properly passes all new BigQuery configuration fields from producerOptions.
279-281: Admission wiring is correct. The admit() call correctly passes the measured pods configuration including BigQuery credentials. The comment explaining shared use of producerOptions is helpful.
cmd/pod-scaler/producer.go (2)
101-139: BigQuery client lifecycle management is well-designed. The use of sync.Once for client cleanup and registering with interrupts.OnInterrupt ensures proper resource management. The handling of both once-mode and continuous-mode is correct.
433-506: Measured pod collection logic is sound. The function correctly:
- Uses a 4-hour query window for recently completed pods
- Deduplicates pods by namespace/name
- Handles errors gracefully per-cluster without failing entirely
- Batches all data before writing to BigQuery
One observation: the query window of 4 hours might miss pods that ran longer if the collection runs every 2 hours. This seems intentional for catching "recently completed" pods.
cmd/pod-scaler/admission.go (3)
43-55: BigQuery client initialization with validation is correct. The validation ensures BigQuery IDs are provided when measured pods is enabled, and properly creates the client with cache. Good error handling with logrus.Fatal for configuration errors.
117-127: Measured pod handling flow is well-structured. The order of operations is correct:
- ClassifyPod - determine if pod needs measurement
- AddPodAntiAffinity - set scheduling constraints
- ApplyMeasuredPodResources - apply measured resource requests
- mutatePodResources - apply regular pod-scaler recommendations
This ensures measured data is applied before the regular mutation logic.
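A minimal sketch of that ordering, with the four steps passed in as plain functions because the PR's real signatures for ClassifyPod, AddPodAntiAffinity, ApplyMeasuredPodResources and mutatePodResources are not reproduced here:

package podscaler

import corev1 "k8s.io/api/core/v1"

type classification string

const (
	classificationMeasured classification = "measured"
	classificationNormal   classification = "normal"
)

// mutate wires the four steps in the order described above; only the ordering
// is the point, the callbacks stand in for the real helpers.
func mutate(
	pod *corev1.Pod,
	classify func(*corev1.Pod) classification,
	addAntiAffinity func(*corev1.Pod, classification),
	applyMeasured func(*corev1.Pod) error,
	applyRecommendations func(*corev1.Pod) error,
) error {
	c := classify(pod)      // 1. decide whether the pod still needs measuring
	addAntiAffinity(pod, c) // 2. pin scheduling constraints before resources change
	// 3. measured utilization goes in first so the regular mutation sees it
	if err := applyMeasured(pod); err != nil {
		return err
	}
	// 4. the standard pod-scaler recommendation pass runs last
	return applyRecommendations(pod)
}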
321-340: Remove unused parameters authoritativeCPU and authoritativeMemory from the function signature. These parameters are passed to the function at line 126 but explicitly ignored with _ comments. Either remove them from the function signature and update the call site, or implement their intended functionality if needed.
cmd/pod-scaler/measured.go (7)
24-40: Constants are well-documented and appropriately scoped. The constants clearly explain their purpose. The 10-day retention and 24-hour refresh interval seem reasonable for the use case.
61-91: Thread-safe cache implementation is correct. Good use of sync.RWMutex for concurrent access. The Update() method atomically replaces the entire map, which is safe.
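A small sketch of the RWMutex-guarded cache pattern being praised here; the type and field names are illustrative rather than the PR's actual MeasuredPodCache:

package podscaler

import "sync"

// measuredPodRecord stands in for the cached measurement entry; the real
// cached data carries more fields.
type measuredPodRecord struct {
	MaxCPU    float64
	MaxMemory float64
}

type measuredPodCache struct {
	mu   sync.RWMutex
	data map[string]measuredPodRecord
}

// Get takes only a read lock, so concurrent admission requests don't serialize.
func (c *measuredPodCache) Get(key string) (measuredPodRecord, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	rec, ok := c.data[key]
	return rec, ok
}

// Update swaps the whole map under the write lock, so readers never observe a
// half-refreshed view.
func (c *measuredPodCache) Update(fresh map[string]measuredPodRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data = fresh
}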
103-137: BigQuery client initialization starts refresh in constructor. The constructor performs an initial refresh and schedules periodic refreshes. Note that if the initial refresh fails, it only logs a warning and continues - this is intentional to allow startup even if BigQuery is temporarily unavailable, but it means the cache will be empty until the next successful refresh.
248-261: ShouldBeMeasured logic is clear and correct. Returns true if:
- No existing measurement data exists, OR
- The last measurement is older than 10 days
This aligns with the PR objective of re-measuring stale data.
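A sketch of that two-branch rule, assuming the 10-day retention constant and a simple last-measured timestamp:

package podscaler

import "time"

// measuredPodDataRetention assumes the 10-day window that MeasuredPodDataRetentionDays encodes.
const measuredPodDataRetention = 10 * 24 * time.Hour

// shouldBeMeasured reproduces the rule: measure when there is no data at all,
// or when the newest data point has aged past the retention window.
func shouldBeMeasured(hasData bool, lastMeasured time.Time) bool {
	if !hasData {
		return true
	}
	return time.Since(lastMeasured) > measuredPodDataRetention
}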
306-363: Anti-affinity rules correctly implement isolation requirements. The logic is sound:
- Measured pods avoid ALL pods with the pod-scaler label (complete isolation)
- Normal pods avoid only measured pods (allows normal pods to share nodes)
- Existing anti-affinity terms are preserved via merge
This ensures measured pods get dedicated nodes for accurate resource measurement.
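For reference, a hedged sketch of what such anti-affinity terms can look like with the Kubernetes API; the label key pod-scaler and the value measured are assumptions, not necessarily the PR's exact strings:

package podscaler

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podScalerLabelKey is an assumed label key standing in for whatever the PR uses.
const podScalerLabelKey = "pod-scaler"

// measuredPodTerm keeps a measured pod off any node that already runs a pod
// carrying the pod-scaler label, which is what gives it a node to itself.
func measuredPodTerm() corev1.PodAffinityTerm {
	return corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchExpressions: []metav1.LabelSelectorRequirement{{
				Key:      podScalerLabelKey,
				Operator: metav1.LabelSelectorOpExists,
			}},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
}

// normalPodTerm only repels measured pods, so normal pods can still pack
// together on shared nodes.
func normalPodTerm() corev1.PodAffinityTerm {
	return corev1.PodAffinityTerm{
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{podScalerLabelKey: "measured"},
		},
		TopologyKey: "kubernetes.io/hostname",
	}
}

Attaching the first term to measured pods and the second to normal pods under requiredDuringSchedulingIgnoredDuringExecution yields the asymmetry described above.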
386-402: Loop continues after finding a container with longer duration. The loop iterates through all containers even after finding a longer-running one. This is correct behavior since we need to check all containers to find the actual longest, but the early continue on line 392 when !exists is efficient.
428-442: Resource application with 20% safety buffer is consistent with existing code. The 20% buffer (multiplier of 1.2) matches the existing pattern in useOursIfLarger (admission.go line 246). Good consistency in approach.
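A one-liner showing the 20% headroom as a Quantity operation, assuming the buffer is applied to a measured peak:

package podscaler

import "k8s.io/apimachinery/pkg/api/resource"

// withSafetyBuffer adds 20% headroom on top of a measured value; working in
// milli-units keeps the arithmetic exact for CPU quantities.
func withSafetyBuffer(measured resource.Quantity) resource.Quantity {
	return *resource.NewMilliQuantity(measured.MilliValue()*120/100, measured.Format)
}

For example, a measured peak of 1000m CPU comes out as a 1200m request.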
// escapePromQLLabelValue escapes special characters in PromQL label values.
// PromQL label values in selectors must be quoted and have backslashes and quotes escaped.
func escapePromQLLabelValue(value string) string {
	// Replace backslashes first (before replacing quotes)
	value = strings.ReplaceAll(value, `\`, `\\`)
	// Escape double quotes
	value = strings.ReplaceAll(value, `"`, `\"`)
	// Escape newlines
	value = strings.ReplaceAll(value, "\n", `\n`)
	// Escape carriage returns
	value = strings.ReplaceAll(value, "\r", `\r`)
	return value
}
Missing tab character escape in PromQL label value sanitization.
The function escapes backslashes, quotes, newlines, and carriage returns, but misses tab characters (\t) which are also problematic in PromQL label values.
Suggested fix
func escapePromQLLabelValue(value string) string {
// Replace backslashes first (before replacing quotes)
value = strings.ReplaceAll(value, `\`, `\\`)
// Escape double quotes
value = strings.ReplaceAll(value, `"`, `\"`)
// Escape newlines
value = strings.ReplaceAll(value, "\n", `\n`)
// Escape carriage returns
value = strings.ReplaceAll(value, "\r", `\r`)
+ // Escape tabs
+ value = strings.ReplaceAll(value, "\t", `\t`)
return value
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// escapePromQLLabelValue escapes special characters in PromQL label values.
// PromQL label values in selectors must be quoted and have backslashes and quotes escaped.
func escapePromQLLabelValue(value string) string {
	// Replace backslashes first (before replacing quotes)
	value = strings.ReplaceAll(value, `\`, `\\`)
	// Escape double quotes
	value = strings.ReplaceAll(value, `"`, `\"`)
	// Escape newlines
	value = strings.ReplaceAll(value, "\n", `\n`)
	// Escape carriage returns
	value = strings.ReplaceAll(value, "\r", `\r`)
	// Escape tabs
	value = strings.ReplaceAll(value, "\t", `\t`)
	return value
}
🤖 Prompt for AI Agents
In @cmd/pod-scaler/producer.go around lines 42 - 54, The escapePromQLLabelValue
function currently escapes backslashes, double quotes, newlines and carriage
returns but omits tab characters; update escapePromQLLabelValue to also replace
tab characters by adding a replacement of "\t" with `\t` (ensure this
replacement is applied alongside the existing strings.ReplaceAll calls in the
function so tabs are properly escaped in PromQL label values).
|
Scheduling required tests: Scheduling tests matching the |
Force-pushed from 8fe3660 to bfc4f68
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@cmd/pod-scaler/admission.go`:
- Around line 63-64: The AddPodAntiAffinity merge incorrectly appends
requiredTerms to itself; in the logic where existingTerms is read from
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
replace the buggy append(requiredTerms, requiredTerms...) with
append(requiredTerms, existingTerms...) so existing terms are merged into
requiredTerms (ensure you reference AddPodAntiAffinity and the local variables
requiredTerms and existingTerms when making the change).
♻️ Duplicate comments (3)
cmd/pod-scaler/producer.go (1)
41-53: Missing tab character escape in PromQL label value sanitization.Per a previous review comment that wasn't fully addressed, the function escapes backslashes, quotes, newlines, and carriage returns, but misses tab characters (
\t) which can also be problematic in PromQL label values.Suggested fix
 func escapePromQLLabelValue(value string) string {
 	// Replace backslashes first (before replacing quotes)
 	value = strings.ReplaceAll(value, `\`, `\\`)
 	// Escape double quotes
 	value = strings.ReplaceAll(value, `"`, `\"`)
 	// Escape newlines
 	value = strings.ReplaceAll(value, "\n", `\n`)
 	// Escape carriage returns
 	value = strings.ReplaceAll(value, "\r", `\r`)
+	// Escape tabs
+	value = strings.ReplaceAll(value, "\t", `\t`)
 	return value
 }
cmd/pod-scaler/measured.go (2)
96-104: Add aClose()method to release the underlying BigQuery client.The
BigQueryClientstruct holds a*bigquery.Clientbut doesn't expose a method to close it. This can leave resources open for the application's lifetime. Add aClose()method:func (bq *BigQueryClient) Close() error { return bq.client.Close() }Then ensure callers invoke it during shutdown.
127-130: Consider retry logic or a "not ready" state for initial BigQuery refresh failure.If BigQuery is unavailable at startup,
Refresh()fails with only a warning. The cache remains empty, so all pods will initially be classified based on the absence of data. ForShouldBeMeasured(), this returnstruefor all pods (since no data exists), potentially overwhelming isolated nodes.Consider either:
- Implementing retry logic with backoff before proceeding
- Tracking a "ready" state and defaulting to a safe behavior until refresh succeeds
- Failing fast if the feature is enabled but BigQuery isn't functional
🧹 Nitpick comments (3)
cmd/pod-scaler/main.go (1)
255-269: Variable shadowing:cacheshadows the outerCachevariable.The variable
cacheon line 260 shadows thecache Cachedeclared on line 185. While this doesn't cause a bug (since the outercacheis already used before this block), it reduces readability. Consider renaming tomeasuredPodCachefor clarity.Suggested rename
- cache := NewMeasuredPodCache(logrus.WithField("component", "measured-pods-producer")) + measuredPodCache := NewMeasuredPodCache(logrus.WithField("component", "measured-pods-producer")) var err error - bqClient, err = NewBigQueryClient(opts.producerOptions.bigQueryProjectID, opts.producerOptions.bigQueryDatasetID, opts.producerOptions.bigQueryCredentialsFile, cache, logrus.WithField("component", "measured-pods-producer")) + bqClient, err = NewBigQueryClient(opts.producerOptions.bigQueryProjectID, opts.producerOptions.bigQueryDatasetID, opts.producerOptions.bigQueryCredentialsFile, measuredPodCache, logrus.WithField("component", "measured-pods-producer"))cmd/pod-scaler/producer.go (1)
838-838: Unused parameterprojectIDin function signature.The
projectIDparameter is declared but explicitly ignored with_. Since the BigQuery client already has the project context from initialization, this parameter is redundant. Consider removing it from the signature.Suggested fix
-func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, _ /* projectID */, datasetID string, pods []measuredPodData, logger *logrus.Entry) error { +func writeMeasuredPodsToBigQuery(ctx context.Context, bqClient *bigquery.Client, datasetID string, pods []measuredPodData, logger *logrus.Entry) error {And update the call site at line 485:
- if err := writeMeasuredPodsToBigQuery(ctx, bqClient.client, bqClient.projectID, bqClient.datasetID, allMeasuredPods, logger); err != nil { + if err := writeMeasuredPodsToBigQuery(ctx, bqClient.client, bqClient.datasetID, allMeasuredPods, logger); err != nil {cmd/pod-scaler/measured.go (1)
405-416: Consider logging when falling back to first container.The fallback path silently uses the first container with measured data when duration data is unavailable. Adding a debug log here would help with troubleshooting when resource recommendations aren't applied as expected.
💡 Suggested enhancement
 	// If we don't have duration data, just use the first container with measured data as a fallback.
 	if longestContainer == nil && len(pod.Spec.Containers) > 0 {
+		logger.Debug("No container duration data available, falling back to first container with measured data")
 		for i := range pod.Spec.Containers {
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (6)
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/main.go
- cmd/pod-scaler/measured.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
- pkg/manifestpusher/manifestpusher.go
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
- cmd/pod-scaler/main.go
- pkg/manifestpusher/manifestpusher.go
- cmd/pod-scaler/measured_test.go
- cmd/pod-scaler/producer.go
- cmd/pod-scaler/admission.go
- cmd/pod-scaler/measured.go
🧬 Code graph analysis (3)
cmd/pod-scaler/main.go (1)
cmd/pod-scaler/measured.go (3)
BigQueryClient (97-104), NewMeasuredPodCache (73-78), NewBigQueryClient (107-140)
cmd/pod-scaler/admission.go (1)
cmd/pod-scaler/measured.go (4)
BigQueryClient (97-104), ClassifyPod (274-301), AddPodAntiAffinity (306-362), ApplyMeasuredPodResources (368-443)
cmd/pod-scaler/measured.go (2)
pkg/api/types.go (1)
Metadata (161-166)
pkg/pod-scaler/types.go (2)
FullMetadata (374-385), MetadataFor (203-208)
🔇 Additional comments (16)
pkg/manifestpusher/manifestpusher.go (1)
44-56: LGTM! Architecture extraction from nodeSelector with proper validation, and the image reference construction logic is correct.
cmd/pod-scaler/measured_test.go (4)
16-61: LGTM! The test cases cover the key scenarios for ShouldBeMeasured: no data (should measure), recent data (should not measure), and stale data (should measure). The threshold logic is properly tested.
63-149: LGTM! Test case 2 correctly uses MetadataFor to construct metadata that matches production code behavior (addressing the previous review concern). The three test cases cover: no data (measured), recent data (normal), and nil BigQuery client (normal fallback).
151-210: LGTM! Anti-affinity tests verify the expected behavior for measured pods (avoid all pod-scaler pods), normal pods (avoid measured pods), and unlabeled pods (no anti-affinity).
212-253: LGTM! Cache tests verify basic Get/Update operations and field values.
cmd/pod-scaler/main.go (1)
293-308: LGTM! The admission path correctly initializes the BigQuery client when the measured pods feature is enabled, and passes it to the admit function.
cmd/pod-scaler/admission.go (2)
101-110: LGTM! The measured pods integration flow is correctly ordered: classify the pod first, then add anti-affinity rules, then apply measured resources before the standard mutation path. This ensures measured pod resources are set before mutatePodResources runs.
36-47: LGTM! The admit function signature and podMutator struct are correctly extended to include the BigQuery client for measured pods support.
cmd/pod-scaler/producer.go (4)
89-127: LGTM! The BigQuery client lifecycle management is well implemented:
- Uses sync.Once to ensure the client is closed exactly once
- Registers cleanup via interrupts.OnInterrupt for graceful shutdown
- In once mode, explicitly closes after work completes
This addresses the previous review concern about premature client closure.
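A compact sketch of the close-exactly-once pattern, with a plain interface standing in for the real *bigquery.Client and interrupts.OnInterrupt:

package main

import (
	"log"
	"sync"
)

// onceCloser shows the pattern: both an interrupt handler and the end of a
// --once run can call close() without double-closing the underlying client.
type onceCloser struct {
	once   sync.Once
	client interface{ Close() error }
}

func (o *onceCloser) close() {
	o.once.Do(func() {
		if err := o.client.Close(); err != nil {
			log.Printf("failed to close BigQuery client: %v", err)
		}
	})
}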
421-494: LGTM! The collectMeasuredPodMetrics function correctly:
- Uses the proper label_pod_scaler prefix for kube_pod_labels queries
- Deduplicates pods by namespace/name (sketched below)
- Aggregates data from all clusters before writing to BigQuery
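The de-duplication step is simple enough to sketch; keying by namespace/name is the assumption here, and the helper name is hypothetical:

package main

import corev1 "k8s.io/api/core/v1"

// dedupeByNamespacedName keeps one entry per namespace/name, mirroring the
// de-duplication applied when the same pod is returned by more than one query.
func dedupeByNamespacedName(pods []corev1.Pod) []corev1.Pod {
	seen := make(map[string]struct{}, len(pods))
	out := make([]corev1.Pod, 0, len(pods))
	for _, p := range pods {
		key := p.Namespace + "/" + p.Name
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, p)
	}
	return out
}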
554-589: LGTM! The PromQL queries are now correctly formed without the invalid max_over_time wrapper with timestamps. The code properly uses QueryRange and computes min/max from the returned samples in Go.
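A sketch of computing the peak in Go from a range-query result, using the Prometheus model types; the real code may track min and max together:

package main

import "github.com/prometheus/common/model"

// maxOverMatrix computes the peak from the samples a range query returns,
// rather than asking PromQL for max_over_time at a single timestamp.
func maxOverMatrix(matrix model.Matrix) model.SampleValue {
	var peak model.SampleValue
	for _, stream := range matrix {
		for _, pair := range stream.Values {
			if pair.Value > peak {
				peak = pair.Value
			}
		}
	}
	return peak
}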
661-735: LGTM! The node-level metrics queries now correctly omit the namespace filter, capturing total node utilization for isolation validation as suggested in previous reviews.
cmd/pod-scaler/measured.go (4)
251-269: LGTM! The ShouldBeMeasured logic correctly identifies pods needing measurement (no data or stale data beyond retention period). The implementation is clean and the cutoff calculation is correct.
271-301: LGTM! The classification logic correctly defaults to "normal" when BigQuery is unavailable, preventing isolated nodes from being overwhelmed. The per-container check with early exit is efficient.
303-362: LGTM! The anti-affinity logic now correctly merges with existing terms rather than overwriting them. The differentiated rules for measured vs normal pods ensure proper isolation.
386-403: Pod name in cache key may cause lookup failures for pods with instance-specific names. The cache lookup includes pod.ObjectMeta.Name in the FullMetadata key (via MetadataFor). The code already recognizes this issue for certain pod types: RPM repo and ProwJob pods have their pod names cleared from the key since those names are dynamically generated or irrelevant.
Consider whether regular measured pod metadata should also exclude the pod instance name to match measured data across pod lifecycle variations, consistent with how the code handles RPM repo and ProwJob pods.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
bqClient *BigQueryClient
reporter results.PodScalerReporter
Bug in AddPodAntiAffinity referenced from this file: existing terms are duplicated instead of merged.
In cmd/pod-scaler/measured.go lines 355-357 (used by this admission flow), there's a bug in merging anti-affinity terms:
existingTerms := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
if existingTerms != nil {
requiredTerms = append(requiredTerms, requiredTerms...) // BUG: appends requiredTerms to itself
}
This should append existingTerms to requiredTerms, not requiredTerms to itself.
Proposed fix in measured.go
// Merge with existing anti-affinity terms instead of overwriting
existingTerms := pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution
if existingTerms != nil {
- requiredTerms = append(requiredTerms, requiredTerms...)
+ requiredTerms = append(existingTerms, requiredTerms...)
}
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution = requiredTerms🤖 Prompt for AI Agents
In `@cmd/pod-scaler/admission.go` around lines 63 - 64, The AddPodAntiAffinity
merge incorrectly appends requiredTerms to itself; in the logic where
existingTerms is read from
pod.Spec.Affinity.PodAntiAffinity.RequiredDuringSchedulingIgnoredDuringExecution,
replace the buggy append(requiredTerms, requiredTerms...) with
append(requiredTerms, existingTerms...) so existing terms are merged into
requiredTerms (ensure you reference AddPodAntiAffinity and the local variables
requiredTerms and existingTerms when making the change).
err = wait.ExponentialBackoff(backoff, func() (bool, error) {
	digest, length, err = registry.PushManifestList(
		"", // username: we don't we use basic auth
		"", // password: "
		types.YAMLInput{Image: targetImage, Manifests: srcImages},
		false,        // --ignore-missing. We don't want to ignore missing images.
		true,         // --insecure to allow pushing to the local registry.
		false,        // --plain-http is false by default in manifest-tool. False for the OpenShift registry.
		types.Docker, // we only need docker type manifest.
		m.dockercfgPath,
	)
	if err != nil {
		// Check if the error indicates missing images (common race condition)
		errStr := err.Error()
		if strings.Contains(errStr, "no image found in manifest list") ||
			strings.Contains(errStr, "inspect of image") ||
			strings.Contains(errStr, "failed to pull image") ||
			strings.Contains(errStr, "choosing an image from manifest list") ||
			strings.Contains(errStr, "PullBuilderImageFailed") {
			m.logger.Warnf("Images not yet available in registry, retrying: %v", err)
			return false, nil // Retry
		}
		// For other errors, fail immediately
		return false, err
	}
	return true, nil
})

if err != nil {
-	return err
+	return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, err)
Retry error handling loses the underlying cause on timeout.
When wait.ExponentialBackoff exhausts all retries, it returns wait.ErrWaitTimeout instead of the last error from the condition function. This means the error message at line 108 will say "failed to push manifest list... context deadline exceeded" rather than the actual registry error that caused retries.
Proposed fix to preserve the last error
var digest string
var length int
var err error
+ var lastErr error
err = wait.ExponentialBackoff(backoff, func() (bool, error) {
- digest, length, err = registry.PushManifestList(
+ digest, length, lastErr = registry.PushManifestList(
"", // username: we don't we use basic auth
"", // password: "
types.YAMLInput{Image: targetImage, Manifests: srcImages},
false, // --ignore-missing. We don't want to ignore missing images.
true, // --insecure to allow pushing to the local registry.
false, // --plain-http is false by default in manifest-tool. False for the OpenShift registry.
types.Docker, // we only need docker type manifest.
m.dockercfgPath,
)
- if err != nil {
+ if lastErr != nil {
// Check if the error indicates missing images (common race condition)
- errStr := err.Error()
+ errStr := lastErr.Error()
if strings.Contains(errStr, "no image found in manifest list") ||
strings.Contains(errStr, "inspect of image") ||
strings.Contains(errStr, "failed to pull image") ||
strings.Contains(errStr, "choosing an image from manifest list") ||
strings.Contains(errStr, "PullBuilderImageFailed") {
- m.logger.Warnf("Images not yet available in registry, retrying: %v", err)
+ m.logger.Warnf("Images not yet available in registry, retrying: %v", lastErr)
return false, nil // Retry
}
// For other errors, fail immediately
- return false, err
+ return false, lastErr
}
return true, nil
})
if err != nil {
- return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, err)
+ if errors.Is(err, wait.ErrWaitTimeout) && lastErr != nil {
+ return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, lastErr)
+ }
+ return fmt.Errorf("failed to push manifest list for %s: %w", targetImageRef, err)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
var digest string
var length int
var err error
var lastErr error
err = wait.ExponentialBackoff(backoff, func() (bool, error) {
	digest, length, lastErr = registry.PushManifestList(
		"", // username: we don't we use basic auth
		"", // password: "
		types.YAMLInput{Image: targetImage, Manifests: srcImages},
		false,        // --ignore-missing. We don't want to ignore missing images.
		true,         // --insecure to allow pushing to the local registry.
		false,        // --plain-http is false by default in manifest-tool. False for the OpenShift registry.
		types.Docker, // we only need docker type manifest.
		m.dockercfgPath,
	)
	if lastErr != nil {
		// Check if the error indicates missing images (common race condition)
		errStr := lastErr.Error()
		if strings.Contains(errStr, "no image found in manifest list") ||
			strings.Contains(errStr, "inspect of image") ||
			strings.Contains(errStr, "failed to pull image") ||
			strings.Contains(errStr, "choosing an image from manifest list") ||
			strings.Contains(errStr, "PullBuilderImageFailed") {
			m.logger.Warnf("Images not yet available in registry, retrying: %v", lastErr)
			return false, nil // Retry
		}
		// For other errors, fail immediately
		return false, lastErr
	}
	return true, nil
})
if err != nil {
	if errors.Is(err, wait.ErrWaitTimeout) && lastErr != nil {
		return fmt.Errorf("failed to push manifest list for %s after retries: %w", targetImageRef, lastErr)
	}
	return fmt.Errorf("failed to push manifest list for %s: %w", targetImageRef, err)
}
@deepsm007: The following tests failed, say
Full PR test history. Your PR dashboard.
Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Implements the measured pods feature to address CPU bottleneck issues from node contention. Pods are automatically classified as "normal" or "measured" based on measurement freshness (>10 days), with measured pods running on isolated nodes via pod anti-affinity rules. Integrates BigQuery to query and cache measured pod data from the ci_operator_metrics table, and adds a data collection component that queries Prometheus for completed measured pods and writes max CPU/memory usage to BigQuery. Applies measured resource recommendations to the longest-running containers using actual measured utilization instead of Prometheus data skewed by node contention. This enables accurate resource recommendations and addresses the CPU bottleneck issue (DPTP-4613) blocking authoritative CPU mode.
/cc @openshift/test-platform
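As a rough illustration of the "longest-running container" selection the description mentions, assuming per-container durations keyed by name (the helper and its signature are hypothetical):

package main

import corev1 "k8s.io/api/core/v1"

// pickLongestRunningContainer chooses the container with the largest recorded
// duration and falls back to the first container when no durations are known.
func pickLongestRunningContainer(pod *corev1.Pod, durations map[string]float64) *corev1.Container {
	if len(pod.Spec.Containers) == 0 {
		return nil
	}
	longest := &pod.Spec.Containers[0]
	best := durations[longest.Name]
	for i := range pod.Spec.Containers {
		c := &pod.Spec.Containers[i]
		if d, ok := durations[c.Name]; ok && d > best {
			best, longest = d, c
		}
	}
	return longest
}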