Add custom image support #1054

Open
jasminetMSFT wants to merge 9 commits into main from jasminet/custom-image-support

Conversation

@jasminetMSFT
Contributor

Summary

Enables the use of custom container images in image pull benchmarks, allowing more realistic testing with large, multi-layer images that match real customer workloads.

Changes

  • Custom image support in deployment templates: Added TestImage parameter to deployment_template.yaml to allow overriding the default test image
  • CRI module updates: Extended cri.py and execute.yml to pass through the test_image parameter from pipeline configuration
  • Updated image-pull-n10 pipeline:
    • Configured to use benchmark/customer-replica:v1 (10GB, 79-layer image matching customer manifest)
    • Increased timeouts: operation_timeout: 30m, pod_startup_latency_threshold: 600s
    • Reduced max_pods from 30 to 26 to prevent pod scheduling failures due to node resource constraints
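The parameter pass-through described above can be sketched roughly as follows. This is a minimal illustration, not the code in cri.py: the function names are hypothetical, and the default value is an assumption taken from the image path that appears in the existing template.

```python
import argparse

# Assumed default: the image path already hard-coded in the deployment template.
DEFAULT_TEST_IMAGE = "e2e-test-images/resource-consumer:1.13"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the new flag."""
    parser = argparse.ArgumentParser(description="CRI override sketch")
    parser.add_argument(
        "--test_image",
        default=DEFAULT_TEST_IMAGE,
        help="Image path appended to the registry endpoint",
    )
    return parser


def override_lines(test_image: str) -> list[str]:
    """Return the override entries this sketch would write for the image."""
    return [f"CL2_TEST_IMAGE: {test_image}"]


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--test_image", "benchmark/customer-replica:v1"]
    )
    print("\n".join(override_lines(args.test_image)))
```

Omitting the flag falls back to the default image, so pipelines that do not set test_image would keep their current behavior under this assumption.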

Testing

Validated with a 10-node cluster pulling the 10GB custom image.

jasminetMSFT marked this pull request as ready for review February 12, 2026 00:48
Copilot AI review requested due to automatic review settings February 12, 2026 00:48
Contributor

Copilot AI left a comment

Pull request overview

Adds support for overriding the container image used by the ClusterLoader2 CRI/image-pull benchmark so pipelines can exercise large, multi-layer “customer-like” images.

Changes:

  • Plumbs a new test_image/CL2_TEST_IMAGE parameter from the pipeline step into the CRI override generator (cri.py) and CL2 config.
  • Updates the deployment template to use a configurable image (AKS/Linux/memory path) and adds a topology spread constraint to improve node distribution.
  • Refines the containerd throughput “AvgPerNode” PromQL query to exclude nodes with no pulls.
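Why excluding idle nodes matters for an "AvgPerNode" figure can be shown with plain arithmetic. The sketch below is not the PromQL from containerd-measurements.yaml; the node names and throughput values are made up for illustration.

```python
def avg_per_node(throughput_by_node: dict[str, float]) -> tuple[float, float]:
    """Return (average over all nodes, average over nodes that pulled)."""
    values = list(throughput_by_node.values())
    active = [v for v in values if v > 0]
    avg_all = sum(values) / len(values) if values else 0.0
    avg_active = sum(active) / len(active) if active else 0.0
    return avg_all, avg_active


# Hypothetical sample: two nodes pulled the image, two did not.
sample = {"node-0": 120.0, "node-1": 80.0, "node-2": 0.0, "node-3": 0.0}
print(avg_per_node(sample))  # (50.0, 100.0)
```

Averaging over all four nodes halves the reported per-node throughput even though each active node pulled at 80 to 120 units, which is the distortion the refined query avoids.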

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| steps/engine/clusterloader2/cri/execute.yml | Passes --test_image into the CRI override step via env/config. |
| modules/python/clusterloader2/cri/cri.py | Adds --test_image CLI arg and writes CL2_TEST_IMAGE into overrides. |
| modules/python/clusterloader2/cri/config/deployment_template.yaml | Introduces TestImage template param, uses it for the AKS/Linux/memory image, and adds topology spread constraints. |
| modules/python/clusterloader2/cri/config/containerd-measurements.yaml | Filters “AvgPerNode” to active nodes in PromQL. |
| modules/python/clusterloader2/cri/config/config.yaml | Wires CL2_TEST_IMAGE into the deployment template fill map. |
Comments suppressed due to low confidence (1)

modules/python/clusterloader2/cri/config/deployment_template.yaml:59

  • When TestImage is not the default, this template no longer renders an explicit command/args for the memory container (the stress block is gated by if eq $TestImage ...). That changes the benchmark from a known long-running workload to whatever the image entrypoint does, which can cause early exits or add non-pull-related variance. Consider keeping a stable long-running command for all images, or making the command configurable alongside TestImage.
          {{if eq $TestImage "e2e-test-images/resource-consumer:1.13"}}
        command:
          - stress
        args:
          - --vm

Comment on lines +105 to +109
# Spread pods across nodes to ensure each node pulls the image
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway

Copilot AI Feb 12, 2026

The comment says this topologySpreadConstraints block “ensure[s] each node pulls the image”, but with whenUnsatisfiable: ScheduleAnyway it’s only best-effort (it won’t guarantee one pod per node). Consider either softening the wording, or switching to DoNotSchedule if you need a hard guarantee.
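The difference between the two whenUnsatisfiable modes can be sketched with a simplified skew check. This is an illustration, not scheduler code: it assumes every node is its own kubernetes.io/hostname domain, treats all nodes as eligible, and ignores resource limits; the function names are hypothetical.

```python
def skew_if_placed(pods_per_node: dict[str, int], node: str) -> int:
    """Skew of `node` after adding one pod: its count minus the global minimum."""
    return pods_per_node[node] + 1 - min(pods_per_node.values())


def hard_candidates(pods_per_node: dict[str, int], max_skew: int = 1) -> list[str]:
    """Nodes a DoNotSchedule constraint would still allow."""
    return [n for n in pods_per_node if skew_if_placed(pods_per_node, n) <= max_skew]


def soft_candidates(pods_per_node: dict[str, int]) -> list[str]:
    """With ScheduleAnyway every node stays eligible; skew only affects ranking."""
    return sorted(pods_per_node, key=lambda n: skew_if_placed(pods_per_node, n))


# Hypothetical state: node-2 has not received a pod yet.
state = {"node-0": 2, "node-1": 1, "node-2": 0}
print(hard_candidates(state))  # ['node-2']
print(soft_candidates(state))  # ['node-2', 'node-1', 'node-0']
```

Under the soft mode node-2 is only preferred, so another scoring factor could still place the pod elsewhere and leave a node without a pull.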

  node_count, node_per_step, max_pods, repeats, operation_timeout,
  load_type, scale_enabled, pod_startup_latency_threshold, provider,
- registry_endpoint, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file):
+ registry_endpoint, test_image, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file):

Copilot AI Feb 12, 2026

override_config_clusterloader2 now requires the new test_image parameter, but existing in-repo callers (e.g., modules/python/tests/test_cri.py) invoke it without test_image, which will raise a TypeError and break the unit test suite. Consider making test_image optional with a default (matching the CLI default) and/or updating all internal call sites/tests to pass it explicitly.
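The failure mode and the optional-parameter fix can be reproduced with two stand-in functions. Both signatures are hypothetical reductions of override_config_clusterloader2, and the default image is an assumption based on the path in the template.

```python
# Assumed default, taken from the image path in the deployment template.
DEFAULT_TEST_IMAGE = "e2e-test-images/resource-consumer:1.13"


def strict_signature(registry_endpoint, test_image, os_type):
    """Stand-in for the signature with a new required parameter."""
    return test_image


def tolerant_signature(registry_endpoint, test_image=None, os_type=None):
    """Stand-in for the suggested signature; None falls back to the default."""
    return test_image or DEFAULT_TEST_IMAGE


def call_like_old_test(func):
    """Call `func` the way a caller written before the change would."""
    try:
        return func("registry.example", os_type="linux")
    except TypeError as exc:
        return f"TypeError: {exc}"


print(call_like_old_test(strict_signature))
print(call_like_old_test(tolerant_signature))
```

The first call reports a missing required argument, while the second returns the default image, so existing tests would keep passing without edits.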

Suggested change
- registry_endpoint, test_image, os_type, scrape_kubelets, scrape_containerd, containerd_scrape_interval, host_network, override_file):
+ registry_endpoint, test_image=None, os_type=None, scrape_kubelets=False, scrape_containerd=False, containerd_scrape_interval=None, host_network=False, override_file=None):

Comment on lines 50 to 54
  {{if eq $Provider "aks"}}
- image: {{$RegistryEndpoint}}/e2e-test-images/resource-consumer:1.13
+ image: {{$RegistryEndpoint}}/{{$TestImage}}
  {{else}}
  image: registry.k8s.io/e2e-test-images/resource-consumer:1.13
  {{end}}

Copilot AI Feb 12, 2026

TestImage is added as a template parameter, but it’s only used in the Linux/memory branch (Provider == aks). The CPU branch (and non-AKS providers) still uses a hard-coded image, so setting CL2_TEST_IMAGE will be ignored in those cases. If custom image support is intended generally for this scenario, apply it consistently (or document the limitation).
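The branch behavior in the quoted template lines can be mirrored in a few lines of Python to make the limitation concrete. This is a sketch of the Linux/memory branch only; the function name and the registry endpoint value are hypothetical.

```python
DEFAULT_IMAGE = "e2e-test-images/resource-consumer:1.13"


def rendered_image(provider: str, registry_endpoint: str,
                   test_image: str = DEFAULT_IMAGE) -> str:
    """Mirror the template: only the aks branch consults test_image."""
    if provider == "aks":
        return f"{registry_endpoint}/{test_image}"
    # Non-AKS providers still get the hard-coded upstream image.
    return f"registry.k8s.io/{DEFAULT_IMAGE}"


custom = "benchmark/customer-replica:v1"
print(rendered_image("aks", "registry.example.io", custom))
print(rendered_image("aws", "registry.example.io", custom))
```

The second call prints the default resource-consumer image even though a custom image was requested, which is the silent fallback the comment describes.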

  image-pull-10pods-authenticated:
    node_count: 10
-   max_pods: 30
+   max_pods: 26

Why are we reducing the pod count?

See the summary: it's because of node resource constraints.

Is node autoscaling enabled on the node pool?

Contributor Author

No, autoscaling is not enabled, as shown in this line.

jasminetMSFT changed the title from "Jasminet/custom image support" to "Add custom image support" Feb 15, 2026