Skip to content

feat(agents): Add verifier agent package#5

Closed
yuwenma wants to merge 1 commit into
kubernetes-sigs:mainfrom
yuwenma:migrate-verifier
Closed

feat(agents): Add verifier agent package#5
yuwenma wants to merge 1 commit into
kubernetes-sigs:mainfrom
yuwenma:migrate-verifier

Conversation

@yuwenma

@yuwenma yuwenma commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

This PR adds the verification engine. It provides a structured framework used by scenario managers and task evaluators (will be added separately) to validate cluster state (e.g., workload readiness, replica convergence) during and after chaos disruptions or operational tasks (will be added separately).

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yuwenma
Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from janetkuo June 8, 2026 21:00
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 8, 2026
@janetkuo janetkuo requested a review from Copilot June 8, 2026 21:37

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new devops_bench.agents.verifier package that provides a structured verification engine (via Pydantic specs and concrete verifiers) for validating Kubernetes cluster state during/after scenarios and disruptions.

Changes:

  • Added VerifierAgent.wait_for_condition() to execute single or compound verification specs with aggregated results.
  • Implemented two initial verifiers: pod health (pod_healthy) and deployment replica convergence (scaling_complete).
  • Added package documentation and unit tests for the verifiers.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
devops_bench/agents/verifier/init.py Adds verifier package marker.
devops_bench/agents/verifier/base.py Defines BaseVerifier and the recursive VerificationResult model.
devops_bench/agents/verifier/spec.py Defines Pydantic VerificationSpec and the supported spec union.
devops_bench/agents/verifier/verifier.py Adds VerifierAgent orchestrator for single/compound specs and aggregated results.
devops_bench/agents/verifier/pod_healthy.py Implements pod readiness verification via kubectl wait with polling fallback.
devops_bench/agents/verifier/scaling_complete.py Implements deployment scaling convergence verification via polling kubectl get.
devops_bench/agents/verifier/test_verifier.py Adds unit tests covering core verifier behaviors and compound specs.
devops_bench/agents/verifier/README.md Documents the spec/result APIs, examples, and extension workflow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread devops_bench/agents/verifier/verifier.py Outdated
Comment thread devops_bench/agents/verifier/verifier.py Outdated
Comment thread devops_bench/agents/verifier/pod_healthy.py Outdated
Comment thread devops_bench/agents/verifier/test_verifier.py Outdated
Comment thread devops_bench/agents/verifier/test_verifier.py Outdated
Comment thread devops_bench/agents/verifier/README.md Outdated
Comment thread devops_bench/agents/verifier/README.md Outdated
Comment thread devops_bench/agents/verifier/README.md Outdated
Comment thread devops_bench/agents/verifier/README.md Outdated
Comment thread devops_bench/agents/verifier/README.md Outdated
Comment thread devops_bench/agents/verifier/test_verifier.py Outdated
Comment thread devops_bench/agents/verifier/verifiers/scaling_complete_verifier.py
Comment thread devops_bench/agents/verifier/pod_healthy.py Outdated
Comment thread devops_bench/agents/verifier/scaling_complete.py Outdated
Comment thread devops_bench/agents/verifier/pod_healthy.py Outdated
Comment thread devops_bench/agents/verifier/scaling_complete.py Outdated
Comment thread devops_bench/agents/verifier/verifier.py Outdated
Comment thread devops_bench/agents/verifier/spec.py Outdated
SingleVerificationSpec = Union[PodHealthyVerifier, ScalingCompleteVerifier]

# Top-level VerificationSpec which can parse a dict, a list, or a single checker spec.
class VerificationSpec(RootModel[Union[Dict[str, SingleVerificationSpec], List[SingleVerificationSpec], SingleVerificationSpec]]):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we need to support all these variants. Wouldn't it be simpler to just support Dict? We can model the other types as a dict with some constraint on the caller and can be much easier to maintain.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big advantage of K8s API (spec and status) is its strong structured format even in YAML/JSON config. Golang enforces that by defining the structures and built it own k8s/yaml to parse the type. But Python does not have corrsponding type check if simply using the built in json parser lib. So I apply a structured type to enforce the same level of K8s YAML validaiton.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question was from an API perspective, do we want to support something so complex. Can we not define Dict[str, SingleVerificationSpec] as the interface? We enforce an id for any spec that we validate and it keeps interface and validation much simpler. Unless there's a strong reason for us to handle all these different data structures as input, I propose we keep the API interface very simple.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. If we want to follow the K8s API convension https://github.com/kubernetes/community/blob/main/contributors/devel/sig-architecture/api-conventions.md#spec-and-status, we should define the Dict[str, SingleVerificationSpec] because the interface has to be deterministic. If we don't plan to eventually converge to k8s CRD, I'm fine to loose the constraints here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we aim to converge to k8s CRD. That assumes that k8s event streams are the only source of verification which is not likely to be the case. We might have to verify a variety of conditions for state convergence.

I consider this more akin to unstructured.Unstructured which is essentially map[[string]interface{}. At this point we do not have clarity on the full API range of the verifier so I suggest we keep it simple and lean on a bit more specification on client side.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. Simplified the verifier Spec.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 10, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 10, 2026
@yuwenma yuwenma force-pushed the migrate-verifier branch from e0d6f5d to 7e4eafd Compare June 10, 2026 17:03
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 10, 2026
@janetkuo

Copy link
Copy Markdown
Member

@yuwenma are the Copilot review comments addressed? If so, would you resolve them to make it clear?

@yuwenma

yuwenma commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@yuwenma are the Copilot review comments addressed? If so, would you resolve them to make it clear?

They are all resolved and marked now.

@pradeepvrd

Copy link
Copy Markdown
Contributor

@yuwenma Can you please squash the changes to keep commits atomic.

# Catch timeout or error and fallback to polling check
# Check status details
details = self._get_pods_details(timeout=min(KUBECTL_DEFAULT_TIMEOUT, remaining))
last_details = details

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When remaining is very small (e.g., 0.01 seconds) near the end of the timeout budget, kubectl get pods will fail unconditionally.

Consider doing something similar as ScalingCompleteVerifier (which uses timeout=max(1.0, remaining)) for the fallback check, and adding an explicit final check after the loop to ensure a reliable final evaluation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch. Done

dep_data = json.loads(result.stdout)
ready_replicas = dep_data.get("status", {}).get("readyReplicas", 0)
success = ready_replicas >= self.min_replicas
reason = (

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verify status.observedGeneration >= metadata.generation to be fully robust against race conditions where kubectl get deployment is queried before the controller has observed the latest scaling action.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Thanks

@janetkuo

Copy link
Copy Markdown
Member

Just noticed #8 depends on this PR. Is there a specific reason why the verifier agent doesn't use kubewatch directly, instead of breaking up into 2 commits/PRs?

Comment thread pyproject.toml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also lock the dependencies and add uv.lock here?

If you ensure you are enabling the pre-commit hooks as described in the AGENTS.md, it will block you from pushing automatically.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo, +1 to AGENTS.md.

Done

Comment thread devops_bench/agents/verifier/spec.py Outdated
SingleVerificationSpec = Union[PodHealthyVerifier, ScalingCompleteVerifier]

# Top-level VerificationSpec which can parse a dict, a list, or a single checker spec.
class VerificationSpec(RootModel[Union[Dict[str, SingleVerificationSpec], List[SingleVerificationSpec], SingleVerificationSpec]]):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we aim to converge to k8s CRD. That assumes that k8s event streams are the only source of verification which is not likely to be the case. We might have to verify a variety of conditions for state convergence.

I consider this more akin to unstructured.Unstructured which is essentially map[[string]interface{}. At this point we do not have clarity on the full API range of the verifier so I suggest we keep it simple and lean on a bit more specification on client side.

@yuwenma

yuwenma commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Just noticed #8 depends on this PR. Is there a specific reason why the verifier agent doesn't use kubewatch directly, instead of breaking up into 2 commits/PRs?

These two PRs separate the pure migration (previously approved in gke-labs) and the new features to avoid dup-review.

@yuwenma yuwenma requested review from janetkuo and pradeepvrd June 12, 2026 17:52

@janetkuo janetkuo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two PRs separate the pure migration (previously approved in gke-labs) and the new features to avoid dup-review.

@yuwenma let's just keep PR #8 (with 2 commits) and squash at merge time (I've applied the squash-merge label on PR #8).

If they're merged as 2 distinct commits, the main branch history will contain an intermediate commit with legacy kubectl wait polling loops that are immediately deleted in the next commit. The main branch should only reflect the final, working architectural state.

Would you rebase PR #8 to resolve the current merge conflicts with main, ensure the review feedback from this PR is carried over, and we move the review there?

@yuwenma yuwenma force-pushed the migrate-verifier branch from dccb67f to bb8e962 Compare June 12, 2026 19:19
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2026
@yuwenma yuwenma force-pushed the migrate-verifier branch from bb8e962 to cc2c404 Compare June 12, 2026 20:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2026
@janetkuo

Copy link
Copy Markdown
Member

As suggested above, let's move the review / discussion to #8.

@janetkuo janetkuo closed this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants