[ON HOLD] Modify kubelet with flags using Helm chart and jq during validation phase to improve node pressure management#1034
[ON HOLD] Modify kubelet with flags using Helm chart and jq during validation phase to improve node pressure management#1034pdamianov-dev wants to merge 33 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new “CRI pressure” execution path that can modify kubelet flags via a privileged DaemonSet before running ClusterLoader2, enabling node pressure testing for node hardening work.
Changes:
- Switches the k8s-resource-pressure topology to use a new
cri-pressureengine template. - Adds a new pipeline step template that applies a kubelet-modifying DaemonSet before running the benchmark.
- Extends
cri.pywith amodify-kubeletsubcommand and adds DaemonSet-manifest generation inclusterloader2/utils.py.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| steps/topology/k8s-resource-pressure/execute-clusterloader2.yml | Routes the topology to the new cri-pressure execution template. |
| steps/engine/clusterloader2/cri-pressure/execute.yml | Adds a pre-benchmark step to apply a kubelet config updater DaemonSet, then runs CL2. |
| modules/python/clusterloader2/utils.py | Adds DaemonSet YAML generation for updating kubelet flags (but also modifies imports/initialization). |
| modules/python/clusterloader2/cri/cri.py | Adds modify-kubelet CLI command and wires it to the DaemonSet generator. |
…/telescope into pdamianov-dev/k8s-pressure
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| "--registry_info", type=str, help="Container registry information scraped", | ||
| ) | ||
|
|
||
| # Sub-command for modify-kubelet |
There was a problem hiding this comment.
if you need to apply daemonset before benchmarking start, you can apply it at validation stage rather than in the engine e.g : https://github.com/Azure/telescope/pull/1045/changes
https://github.com/Azure/telescope/blob/main/steps/topology/karpenter/validate-resources.yml
I am not exactly sure what is this for
if you need to modify the node image, can do something like this https://github.com/Azure/telescope/pull/1044/changes
…/telescope into pdamianov-dev/k8s-pressure
…/telescope into pdamianov-dev/k8s-pressure
Removed the modify_kubelet_clusterloader2 function and its associated functionality.
| @@ -1,8 +1,34 @@ | |||
| trigger: none | |||
|
|
|||
| parameters: | |||
There was a problem hiding this comment.
We do not need parameters , simply enable them in your test
n1-p60-memory-managed:
node_count: 1
max_pods: 60
repeats: 1
operation_timeout: 3m
load_type: memory
pod_startup_latency_threshold: 23s
kubernetes_version: "1.34"
k8s_os_disk_type: Managed
scrape_kubelets: True
enable_custom_kubelet: false
kubelet_config_type: "eviction-hard" # eviction-soft, eviction-soft-grace-period
in validate-resources
can call
$KUBELET_CONFIG_TYPE
if KUBELET_CONFIG_TYPE == ëviction-hard :
- override the memory, nodefs and pid
else if
else if
so on and so on
There was a problem hiding this comment.
What if we want to pass in different flags with each run without changing the pipeline definition?
There was a problem hiding this comment.
Our intention may not be to schedule this pipeline, we will most likely do it manually or have a separate pipeline trigger
There was a problem hiding this comment.
what if i set a queue time variable instead?
There was a problem hiding this comment.
The variable group are usually for secret key values like credentials and masked value of aks_custom_header. Since you only care about manual run, you can have 3 stages eviction-hard, eviction-soft. eviction-soft-grace-period. And you can select only the one which you want to trigger. Another option is to use the test branch and you can override whatever values you want for manual testing. And when you are sure of the values, you can submit schedule PR to the main branch


This is related to the creation of a suitable framework to run pressure tests against node images as part of Node Hardening efforts.
Supporting documents: