Conversation
There was a problem hiding this comment.
Pull request overview
Adds support for overriding the container image used by the ClusterLoader2 CRI/image-pull benchmark so pipelines can exercise large, multi-layer “customer-like” images.
Changes:
- Plumbs a new
test_image/CL2_TEST_IMAGEparameter from the pipeline step into the CRI override generator (cri.py) and CL2 config. - Updates the deployment template to use a configurable image (AKS/Linux/memory path) and adds a topology spread constraint to improve node distribution.
- Refines the containerd throughput “AvgPerNode” PromQL query to exclude nodes with no pulls.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| steps/engine/clusterloader2/cri/execute.yml | Passes --test_image into the CRI override step via env/config. |
| modules/python/clusterloader2/cri/cri.py | Adds --test_image CLI arg and writes CL2_TEST_IMAGE into overrides. |
| modules/python/clusterloader2/cri/config/deployment_template.yaml | Introduces TestImage template param, uses it for the AKS/Linux/memory image, and adds topology spread constraints. |
| modules/python/clusterloader2/cri/config/containerd-measurements.yaml | Filters “AvgPerNode” to active nodes in PromQL. |
| modules/python/clusterloader2/cri/config/config.yaml | Wires CL2_TEST_IMAGE into the deployment template fill map. |
Comments suppressed due to low confidence (1)
modules/python/clusterloader2/cri/config/deployment_template.yaml:59
- When
TestImageis not the default, this template no longer renders an explicitcommand/argsfor the memory container (thestressblock is gated byif eq $TestImage ...). That changes the benchmark from a known long-running workload to whatever the image entrypoint does, which can cause early exits or add non-pull-related variance. Consider keeping a stable long-running command for all images, or making the command configurable alongsideTestImage.
{{if eq $TestImage "e2e-test-images/resource-consumer:1.13"}}
command:
- stress
args:
- --vm
| # Spread pods across nodes to ensure each node pulls the image | ||
| topologySpreadConstraints: | ||
| - maxSkew: 1 | ||
| topologyKey: kubernetes.io/hostname | ||
| whenUnsatisfiable: ScheduleAnyway |
There was a problem hiding this comment.
The comment says this topologySpreadConstraints block “ensure[s] each node pulls the image”, but with whenUnsatisfiable: ScheduleAnyway it’s only best-effort (it won’t guarantee one pod per node). Consider either softening the wording, or switching to DoNotSchedule if you need a hard guarantee.
| node_count, node_per_step, max_pods, repeats, operation_timeout, | ||
| load_type, scale_enabled, pod_startup_latency_threshold, provider, | ||
| registry_endpoint, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file): | ||
| registry_endpoint, test_image, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file): |
There was a problem hiding this comment.
override_config_clusterloader2 now requires the new test_image parameter, but existing in-repo callers (e.g., modules/python/tests/test_cri.py) invoke it without test_image, which will raise a TypeError and break the unit test suite. Consider making test_image optional with a default (matching the CLI default) and/or updating all internal call sites/tests to pass it explicitly.
| registry_endpoint, test_image, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file): | |
| registry_endpoint, test_image=None, os_type=None, scrape_kubelets=False, scrape_containerd=False, containerd_scrape_interval=None, host_network=False, override_file=None): |
| {{if eq $Provider "aks"}} | ||
| image: {{$RegistryEndpoint}}/e2e-test-images/resource-consumer:1.13 | ||
| image: {{$RegistryEndpoint}}/{{$TestImage}} | ||
| {{else}} | ||
| image: registry.k8s.io/e2e-test-images/resource-consumer:1.13 | ||
| {{end}} |
There was a problem hiding this comment.
TestImage is added as a template parameter, but it’s only used in the Linux/memory branch (Provider == aks). The CPU branch (and non-AKS providers) still uses a hard-coded image, so setting CL2_TEST_IMAGE will be ignored in those cases. If custom image support is intended generally for this scenario, apply it consistently (or document the limitation).
| image-pull-10pods-authenticated: | ||
| node_count: 10 | ||
| max_pods: 30 | ||
| max_pods: 26 |
There was a problem hiding this comment.
Read the summary, its because of node resource constrain.
There was a problem hiding this comment.
Is node autoscaling enabled on the node pool?
There was a problem hiding this comment.
No, auto scaling is not enabled as shown in this line.
Summary
Enables the use of custom container images in image pull benchmarks, allowing more realistic testing with large, multi-layer images that match real customer workloads.
Changes
TestImageparameter todeployment_template.yamlto allow overriding the default test imagecri.pyandexecute.ymlto pass through thetest_imageparameter from pipeline configurationbenchmark/customer-replica:v1(10GB, 79-layer image matching customer manifest)operation_timeout: 30m,pod_startup_latency_threshold: 600smax_podsfrom 30 to 26 to prevent pod scheduling failures due to node resource constraintsTesting
Validated with 10-node cluster pulling 10GB custom image: