feat: add EvaluationPlugin for agent invocation evaluation and retry by afarntrog · Pull Request #5 · afarntrog/evals

afarntrog · 2026-03-17T20:24:39Z

Introduce an EvaluationPlugin that hooks into agent invocations to evaluate outputs against expected results and automatically retries with improved system prompts on failure.

- Add EvaluationPlugin class that wraps the agent's __call__ to intercept invocations, run evaluators, and retry with LLM-suggested prompt improvements when evaluations fail
- Add improvement suggestion prompt template for generating better system prompts based on evaluation feedback
- Add comprehensive test suite covering plugin initialization, wrapping, evaluation execution, retry logic, and edge cases
- Update ruff config to ignore line-length in plugin prompt templates
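
For reviewers who want a mental model of the evaluate-and-retry flow described above, here is a minimal, self-contained sketch. It is not the code in this PR: ToyAgent, wrap_with_evaluation, and the evaluate and suggest_prompt callables are hypothetical names, and a plain function stands in for the LLM call that would suggest an improved system prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyAgent:
    """Hypothetical stand-in for an agent; not the strands-agents Agent class."""

    system_prompt: str

    def __call__(self, prompt: str) -> str:
        # Toy behaviour: the agent only answers tersely when told to be concise.
        if "concise" in self.system_prompt:
            return "4"
        return "The answer is probably four."


def wrap_with_evaluation(
    agent: ToyAgent,
    evaluate: Callable[[str, str], bool],
    suggest_prompt: Callable[[str, str, str], str],
    expected: str,
    max_retries: int = 2,
) -> Callable[[str], str]:
    """Return a callable that evaluates each output and retries on failure."""

    def invoke(prompt: str) -> str:
        output = agent(prompt)
        for _ in range(max_retries):
            if evaluate(output, expected):
                break
            # On failure, swap in a suggested system prompt and invoke again.
            agent.system_prompt = suggest_prompt(agent.system_prompt, output, expected)
            output = agent(prompt)
        return output

    return invoke


# Usage: exact-match evaluator and a fixed "improvement" (both assumptions).
agent = ToyAgent(system_prompt="You are a helpful assistant.")
run = wrap_with_evaluation(
    agent,
    evaluate=lambda out, exp: out == exp,
    suggest_prompt=lambda current, out, exp: current + " Be concise.",
    expected="4",
)
result = run("What is 2 + 2?")
```

The sketch caps retries with max_retries so a prompt that never passes evaluation cannot loop forever; whether the actual plugin exposes a similar limit is not stated in this description.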

Description

Related Issues

Documentation PR

Type of Change

Bug fix
New feature
Breaking change
Documentation update
Other (please describe):

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

I ran hatch run prepare

Checklist

I have read the CONTRIBUTING document
I have added any necessary tests that prove my fix is effective or my feature works
I have updated the documentation accordingly
I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
My changes generate no new warnings
Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Add comprehensive docstrings to all methods in EvaluationPlugin
- Change _suggest_improvements to return ImprovementSuggestion instead of str
- Add debug logging with reasoning when applying improved system prompt
- Replace Union[X, Y] with modern X | Y syntax
- Use dict default in kwargs.get() instead of
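
To illustrate why returning a structured object rather than a bare str helps with the debug logging mentioned above, here is a small sketch. The fields of ImprovementSuggestion are not shown in this PR description, so system_prompt and reasoning are assumptions, and apply_suggestion is a hypothetical helper rather than a method of the plugin.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ImprovementSuggestion:
    """Assumed shape: the PR names this type but does not list its fields."""

    system_prompt: str
    reasoning: str


def apply_suggestion(current_prompt: str, suggestion: ImprovementSuggestion) -> str:
    """Log why the prompt is changing, then return the new prompt."""
    # With a bare str return there would be no reasoning available to log here.
    logger.debug(
        "Replacing system prompt %r; reasoning: %s",
        current_prompt,
        suggestion.reasoning,
    )
    return suggestion.system_prompt


suggestion = ImprovementSuggestion(
    system_prompt="Answer with a single number.",
    reasoning="The previous output was verbose and failed an exact-match check.",
)
new_prompt = apply_suggestion("You are a helpful assistant.", suggestion)
```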

afarntrog had a problem deploying to auto-approve March 17, 2026 20:24 — with GitHub Actions Failure

rm print

2e8d943

afarntrog had a problem deploying to auto-approve March 18, 2026 14:22 — with GitHub Actions Failure

afarntrog had a problem deploying to auto-approve March 18, 2026 14:44 — with GitHub Actions Failure

feat: add plugins module and bump strands-agents to >=1.28.0

e2fa123

- Export new module from the package's public API
- Bump minimum strands-agents dependency from 1.0.0 to 1.28.0 to support functionality required by the plugins module

afarntrog had a problem deploying to auto-approve March 18, 2026 15:29 — with GitHub Actions Failure

afarntrog wants to merge 4 commits into main from wip/evaluation-plugin
