feat(agents): Add verifier agent package by yuwenma · Pull Request #5 · kubernetes-sigs/devops-bench

yuwenma · 2026-06-08T21:00:21Z

This PR adds the verification engine. It provides a structured framework that scenario managers and task evaluators (to be added separately) use to validate cluster state (e.g., workload readiness, replica convergence) during and after chaos disruptions or operational tasks (also to be added separately).

k8s-ci-robot · 2026-06-08T21:00:28Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yuwenma
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR introduces a new devops_bench.agents.verifier package that provides a structured verification engine (via Pydantic specs and concrete verifiers) for validating Kubernetes cluster state during/after scenarios and disruptions.

Changes:

Added VerifierAgent.wait_for_condition() to execute single or compound verification specs with aggregated results.
Implemented two initial verifiers: pod health (pod_healthy) and deployment replica convergence (scaling_complete).
Added package documentation and unit tests for the verifiers.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

File	Description
devops_bench/agents/verifier/__init__.py	Adds verifier package marker.
devops_bench/agents/verifier/base.py	Defines `BaseVerifier` and the recursive `VerificationResult` model.
devops_bench/agents/verifier/spec.py	Defines Pydantic `VerificationSpec` and the supported spec union.
devops_bench/agents/verifier/verifier.py	Adds `VerifierAgent` orchestrator for single/compound specs and aggregated results.
devops_bench/agents/verifier/pod_healthy.py	Implements pod readiness verification via `kubectl wait` with polling fallback.
devops_bench/agents/verifier/scaling_complete.py	Implements deployment scaling convergence verification via polling `kubectl get`.
devops_bench/agents/verifier/test_verifier.py	Adds unit tests covering core verifier behaviors and compound specs.
devops_bench/agents/verifier/README.md	Documents the spec/result APIs, examples, and extension workflow.
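
As a rough illustration of what a recursive result model with aggregation might look like, here is a minimal sketch. It uses stdlib dataclasses rather than the PR's actual models, and the `Result` class, its fields, and the `aggregate` helper are hypothetical names chosen for this example, not the API defined in `base.py` or `verifier.py`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Result:
    """Hypothetical stand-in for a recursive verification result."""
    success: bool
    reason: str = ""
    # Nested results keyed by spec id; empty for a single (leaf) check.
    children: Dict[str, "Result"] = field(default_factory=dict)


def aggregate(children: Dict[str, Result]) -> Result:
    """Combine child results: the parent succeeds only if every child does."""
    failed = [name for name, res in children.items() if not res.success]
    reason = "all checks passed" if not failed else "failed: " + ", ".join(failed)
    return Result(success=not failed, reason=reason, children=dict(children))
```

Because `aggregate` returns another `Result`, a compound result can itself be nested inside a larger one, which is the property the word "recursive" refers to here.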

pradeepvrd · 2026-06-10T03:25:35Z

+SingleVerificationSpec = Union[PodHealthyVerifier, ScalingCompleteVerifier]
+
+# Top-level VerificationSpec which can parse a dict, a list, or a single checker spec.
+class VerificationSpec(RootModel[Union[Dict[str, SingleVerificationSpec], List[SingleVerificationSpec], SingleVerificationSpec]]):


Wondering if we need to support all these variants. Wouldn't it be simpler to just support Dict? We can model the other types as a dict with some constraints on the caller, which would be much easier to maintain.

One big advantage of the K8s API (spec and status) is its strongly structured format, even in YAML/JSON config. Golang enforces that by defining the structures and building its own k8s/yaml to parse the types. But Python has no corresponding type check if we simply use the built-in json parser lib. So I apply a structured type to enforce the same level of K8s YAML validation.

My question was from an API perspective: do we want to support something so complex? Can we not define Dict[str, SingleVerificationSpec] as the interface? We enforce an id for any spec that we validate, and it keeps the interface and validation much simpler. Unless there's a strong reason for us to handle all these different data structures as input, I propose we keep the API interface very simple.

Good point. If we want to follow the K8s API convention https://github.com/kubernetes/community/blob/main/contributors/devel/sig-architecture/api-conventions.md#spec-and-status, we should define Dict[str, SingleVerificationSpec] because the interface has to be deterministic. If we don't plan to eventually converge to a k8s CRD, I'm fine with loosening the constraints here.

I don't think we aim to converge to k8s CRD. That assumes that k8s event streams are the only source of verification which is not likely to be the case. We might have to verify a variety of conditions for state convergence.

I consider this more akin to unstructured.Unstructured, which is essentially map[string]interface{}. At this point we do not have clarity on the full API range of the verifier, so I suggest we keep it simple and lean on a bit more specification on the client side.

Thanks for the clarification. Simplified the verifier Spec.
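
For illustration, a minimal sketch of the dict-only interface proposed in this thread, where every spec is keyed by a caller-supplied id and any other top-level shape is rejected. It uses stdlib dataclasses as a stand-in for the PR's Pydantic models. The class names, the `type` discriminator, the field names other than `min_replicas`, and the `parse_specs` helper are assumptions made for this example, not the schema that was merged.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class PodHealthySpec:
    """Hypothetical stand-in for the PR's pod_healthy spec."""
    namespace: str
    selector: str


@dataclass(frozen=True)
class ScalingCompleteSpec:
    """Hypothetical stand-in for the PR's scaling_complete spec."""
    namespace: str
    deployment: str
    min_replicas: int


SingleSpec = Union[PodHealthySpec, ScalingCompleteSpec]

# Maps the assumed "type" discriminator to its spec class.
_SPEC_TYPES = {
    "pod_healthy": PodHealthySpec,
    "scaling_complete": ScalingCompleteSpec,
}


def parse_specs(raw: Dict[str, dict]) -> Dict[str, SingleSpec]:
    """Parse a dict of spec id -> spec body; lists and bare specs are rejected."""
    if not isinstance(raw, dict):
        raise TypeError("verification spec must be a dict keyed by spec id")
    parsed: Dict[str, SingleSpec] = {}
    for spec_id, body in raw.items():
        if not isinstance(body, dict):
            raise TypeError(f"spec {spec_id!r} must be a dict")
        fields = dict(body)  # copy so the caller's input is not mutated
        kind = fields.pop("type", None)
        if kind not in _SPEC_TYPES:
            raise ValueError(f"spec {spec_id!r}: unknown type {kind!r}")
        # Missing or unexpected fields raise TypeError from the constructor.
        parsed[spec_id] = _SPEC_TYPES[kind](**fields)
    return parsed
```

With a single accepted shape, callers that have one spec or a list of specs wrap them in a dict with ids of their choosing, which is the "bit more specification on the client side" trade-off mentioned above.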

linux-foundation-easycla · 2026-06-10T16:25:27Z

The committers listed above are authorized under a signed CLA.

✅ login: yuwenma / name: yuwenma (7e4eafd, adb1c87)

janetkuo · 2026-06-10T17:17:33Z

@yuwenma are the Copilot review comments addressed? If so, would you resolve them to make it clear?

yuwenma · 2026-06-10T17:25:27Z

@yuwenma are the Copilot review comments addressed? If so, would you resolve them to make it clear?

They are all resolved and marked now.

pradeepvrd · 2026-06-10T23:29:36Z

@yuwenma Can you please squash the changes to keep commits atomic?

janetkuo · 2026-06-10T22:29:57Z

+                # Catch timeout or error and fallback to polling check
+                # Check status details
+                details = self._get_pods_details(timeout=min(KUBECTL_DEFAULT_TIMEOUT, remaining))
+                last_details = details


When remaining is very small (e.g., 0.01 seconds) near the end of the timeout budget, kubectl get pods will fail unconditionally.

Consider doing something similar to ScalingCompleteVerifier (which uses timeout=max(1.0, remaining)) for the fallback check, and adding an explicit final check after the loop to ensure a reliable final evaluation.

good catch. Done
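
The suggestion in this thread could be sketched roughly as follows: each status check gets a floored per-call timeout, and one explicit check runs after the loop so the final verdict never depends on a near-zero budget. This is not the PR's code. `poll_until`, the injectable `clock` and `sleep` parameters, and both constant values are assumptions for this example; `check` stands in for a call such as `_get_pods_details`.

```python
import time
from typing import Callable

MIN_CHECK_TIMEOUT = 1.0        # assumed floor, mirroring max(1.0, remaining)
KUBECTL_DEFAULT_TIMEOUT = 10.0  # assumed value; the PR's constant may differ


def poll_until(
    check: Callable[[float], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check(per_call_timeout) until it passes or the budget runs out."""
    deadline = clock() + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            break
        # Cap at the default timeout, but never drop below the floor, so a
        # tiny remaining budget cannot make the check fail unconditionally.
        per_call = max(MIN_CHECK_TIMEOUT, min(KUBECTL_DEFAULT_TIMEOUT, remaining))
        if check(per_call):
            return True
        sleep(min(interval, max(0.0, deadline - clock())))
    # Explicit final evaluation after the loop, with a workable timeout.
    return check(MIN_CHECK_TIMEOUT)
```

Note the trade-off: flooring the per-call timeout means the overall wait can overrun the requested budget by up to roughly one floor interval, in exchange for a reliable last evaluation.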

janetkuo · 2026-06-10T22:29:57Z

+            dep_data = json.loads(result.stdout)
+            ready_replicas = dep_data.get("status", {}).get("readyReplicas", 0)
+            success = ready_replicas >= self.min_replicas
+            reason = (


Verify status.observedGeneration >= metadata.generation to be fully robust against race conditions where kubectl get deployment is queried before the controller has observed the latest scaling action.

Good point! Thanks
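
A minimal sketch of the generation check described above, assuming the input is the JSON that `kubectl get deployment -o json` prints. The function name `scaling_converged` and its return shape are hypothetical; `metadata.generation`, `status.observedGeneration`, and `status.readyReplicas` are standard Deployment fields.

```python
import json
from typing import Tuple


def scaling_converged(deployment_json: str, min_replicas: int) -> Tuple[bool, str]:
    """Return (success, reason) for a Deployment's JSON representation."""
    dep = json.loads(deployment_json)
    generation = dep.get("metadata", {}).get("generation", 0)
    status = dep.get("status", {})
    observed = status.get("observedGeneration") or 0

    # If the controller has not yet observed the latest spec, the replica
    # counts in status still describe the previous generation.
    if observed < generation:
        return False, f"controller observed generation {observed}, want {generation}"

    ready = status.get("readyReplicas") or 0  # field is omitted when zero
    if ready < min_replicas:
        return False, f"{ready} ready replicas, want at least {min_replicas}"
    return True, f"{ready} ready replicas at observed generation {observed}"
```

Checking the generation first avoids reporting success from a stale status, for example right after a scale-up when `readyReplicas` still reflects the old replica count.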

janetkuo · 2026-06-12T01:37:00Z

Just noticed #8 depends on this PR. Is there a specific reason why the verifier agent doesn't use kubewatch directly, instead of breaking up into 2 commits/PRs?

pradeepvrd · 2026-06-10T23:28:38Z

Can you also lock the dependencies and add uv.lock here?

If you enable the pre-commit hooks as described in AGENTS.md, they will automatically block you from pushing.

Woohoo, +1 to AGENTS.md.

Done

pradeepvrd · 2026-06-12T00:22:05Z

+SingleVerificationSpec = Union[PodHealthyVerifier, ScalingCompleteVerifier]
+
+# Top-level VerificationSpec which can parse a dict, a list, or a single checker spec.
+class VerificationSpec(RootModel[Union[Dict[str, SingleVerificationSpec], List[SingleVerificationSpec], SingleVerificationSpec]]):


I don't think we aim to converge to k8s CRD. That assumes that k8s event streams are the only source of verification which is not likely to be the case. We might have to verify a variety of conditions for state convergence.

I consider this more akin to unstructured.Unstructured, which is essentially map[string]interface{}. At this point we do not have clarity on the full API range of the verifier, so I suggest we keep it simple and lean on a bit more specification on the client side.

yuwenma · 2026-06-12T16:40:54Z

Just noticed #8 depends on this PR. Is there a specific reason why the verifier agent doesn't use kubewatch directly, instead of breaking up into 2 commits/PRs?

These two PRs separate the pure migration (previously approved in gke-labs) and the new features to avoid dup-review.

janetkuo

These two PRs separate the pure migration (previously approved in gke-labs) and the new features to avoid dup-review.

@yuwenma let's just keep PR #8 (with 2 commits) and squash at merge time (I've applied the squash-merge label on PR #8).

If they're merged as 2 distinct commits, the main branch history will contain an intermediate commit with legacy kubectl wait polling loops that are immediately deleted in the next commit. The main branch should only reflect the final, working architectural state.

Would you rebase PR #8 to resolve the current merge conflicts with main, ensure the review feedback from this PR is carried over, and we move the review there?

janetkuo · 2026-06-15T19:15:50Z

As suggested above, let's move the review / discussion to #8.

k8s-ci-robot requested a review from janetkuo June 8, 2026 21:00

k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) labels Jun 8, 2026

janetkuo requested a review from Copilot June 8, 2026 21:37

Copilot started reviewing on behalf of janetkuo June 8, 2026 21:37

Copilot AI reviewed Jun 8, 2026

pradeepvrd reviewed Jun 10, 2026

Comment thread devops_bench/agents/verifier/test_verifier.py Outdated

pradeepvrd reviewed Jun 10, 2026

k8s-ci-robot added the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) and removed the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Jun 10, 2026

yuwenma force-pushed the migrate-verifier branch from e0d6f5d to 7e4eafd June 10, 2026 17:03

k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) and removed the cncf-cla: no label (indicates the PR's author has not signed the CNCF CLA) Jun 10, 2026

yuwenma force-pushed the migrate-verifier branch from 7e4eafd to dccb67f June 11, 2026 14:45

yuwenma mentioned this pull request Jun 11, 2026

feat(verifier): implement event-driven watching using kubewatch #8

Open

janetkuo reviewed Jun 11, 2026

pradeepvrd reviewed Jun 12, 2026

yuwenma requested review from janetkuo and pradeepvrd June 12, 2026 17:52

janetkuo requested changes Jun 12, 2026

yuwenma force-pushed the migrate-verifier branch from dccb67f to bb8e962 June 12, 2026 19:19

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Jun 12, 2026

feat(agents): migrate verifier agent package from gke-labs

cc2c404

yuwenma force-pushed the migrate-verifier branch from bb8e962 to cc2c404 June 12, 2026 20:38

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Jun 12, 2026

janetkuo closed this Jun 15, 2026
