AI assisted Code Generation: Testing Harness Literature Review #70

thomas-bartlett · 2026-05-05T15:24:09Z

thomas-bartlett
May 5, 2026
Maintainer

Before building a testing harness for secure AI-assisted code development, we should review the existing landscape of AI agent evaluation harnesses and identify what can be reused.

Starting examples:

Suggested prompts for the review:

Which existing harnesses evaluate agentic workflows, not just standalone model responses?
Which support code generation or code review tasks?
How do they represent fixtures, prompts, expected outcomes, and repeated runs?
How do they score results: static analysis, tests, human review, rubric scoring, SARIF, or another oracle?
For Inspect AI specifically, could secure-development guidance fit as a solver, a scorer, or both?

Desired output: a short landscape review with a recommendation on what the SIG should reuse versus what we need to define ourselves.
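
As a starting point for the fixture and repeated-run question above, here is a minimal sketch of how a harness-agnostic test case and its runs could be represented. Every name in it (`EvalCase`, `RunResult`, `pass_rate`) is hypothetical and not taken from any existing harness; the `oracle` field simply records which of the scoring methods listed above produced the verdict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One fixture: the code snapshot, the prompt, and the expected outcome."""
    case_id: str
    fixture_path: str                     # hypothetical path to a repo snapshot
    prompt: str
    forbidden_cwes: frozenset = frozenset()   # weaknesses the output must not contain
    oracle: str = "tests"                 # e.g. "tests", "static-analysis", "rubric"


@dataclass
class RunResult:
    """Verdict of a single run of a single case."""
    case_id: str
    run_index: int
    passed: bool


def pass_rate(results: list, case_id: str) -> float:
    """Fraction of repeated runs of one case that passed (0.0 if never run)."""
    runs = [r for r in results if r.case_id == case_id]
    if not runs:
        return 0.0
    return sum(r.passed for r in runs) / len(runs)
```

Keeping runs separate from cases means the same fixture can be replayed any number of times, and agent nondeterminism shows up as a pass rate instead of a single pass/fail.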

bact · 2026-05-05T15:25:53Z

bact
May 5, 2026

For Inspect AI, I can see that CodeGuard could act as either a solver or a scorer in its evaluation framework.

See more
https://inspect.aisi.org.uk/reference/inspect_ai.solver.html
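
To make the two options concrete, here is a plain-Python sketch of the two roles. It deliberately does not use the Inspect AI API: `State`, `guidance_solver`, and `banned_call_scorer` are hypothetical stand-ins, and the guidance text and banned-call list are invented for illustration. The point is only that a solver changes what the model sees before generation, while a scorer grades what came out.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """Hypothetical stand-in for the per-sample state a harness passes around."""
    messages: list = field(default_factory=list)
    output: str = ""


def guidance_solver(state: State, guidance: str) -> State:
    """Solver role: prepend secure-coding guidance before the model is called."""
    state.messages.insert(0, f"SYSTEM: {guidance}")
    return state


def banned_call_scorer(state: State, banned: tuple) -> bool:
    """Scorer role: pass only if the generated code avoids every banned call."""
    return not any(call in state.output for call in banned)


state = State(messages=["USER: write a function that lists a directory"])
state = guidance_solver(state, "Never pass untrusted input to a shell.")
# The model call is stubbed out here; a real harness would generate at this point.
state.output = "subprocess.run(['ls', path], check=True)"
passed = banned_call_scorer(state, banned=("os.system(", "shell=True"))
```

One possible design is to use both: the solver variant measures whether the guidance changes behavior, and the scorer variant can also be run without the solver to get a baseline.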

1 reply

thomas-bartlett May 5, 2026
Maintainer Author

Good idea! I added that in the main post.

kcalloway · 2026-05-12T15:12:41Z

kcalloway
May 12, 2026

I did a lit review, but I would like to know what others have learned. Should we schedule a meeting to discuss our findings? Is the deliverable a Google Doc?

3 replies

bact May 19, 2026

Google Docs is a very good option (unless some people can't access it).

kcalloway May 19, 2026

Here's a lit review and NotebookLM

bact May 19, 2026

Thank you. I will look at it.

thomas-bartlett · 2026-05-26T14:42:25Z

thomas-bartlett
May 26, 2026
Maintainer Author

Posting this here in case it is helpful for others. Here is a summary of my research review that I have been iterating on:

Proposed Evaluation Stack for CodeGuard

We have been reviewing open-source and modular tools that could help us test CodeGuard against coding agents, tool usage, security-sensitive workflows, and agent behavior over time.

The goal is not to pick tools based on popularity alone. We want a stack that is modular, maintainable, mature enough for core use, and flexible enough to support research-oriented security evaluation.

| Tool | Maintainer / Origin | Potential Role | Pros | Cons / Risks |
| --- | --- | --- | --- | --- |
| Inspect AI | UK AI Security Institute / Meridian Labs | Primary evaluation harness | Modular eval framework with datasets, solvers, scorers, tools, agents, sandboxing, and multi-turn support. Strong fit for structured agent evaluation. | Smaller ecosystem than some developer-focused tools. We may need to build CodeGuard-specific adapters and scorers. |
| Promptfoo | Promptfoo | CI regression and red-team testing | Practical, developer-friendly, and easy to integrate into CI. Useful for prompt tests, model comparisons, assertions, and lightweight red-team checks. | Less suited as the single source of truth for deeper repo-level or sandboxed agent evaluations. |
| DeepEval | Confident AI | Agent and tool-use metrics | Pytest-style framework with useful metrics for task completion, tool correctness, argument correctness, and agent behavior. | Best used as a metrics layer rather than the primary harness. |
| Ragas | Exploding Gradients / open-source community | Tool-call and trajectory scoring | Useful when we have expected tool calls, reference answers, or trajectories. Good fit for focused scoring of agent/tool behavior. | More useful for metric evaluation than end-to-end orchestration. |
| Langfuse, Phoenix, or MLflow | Langfuse, Arize, LF Projects / Databricks ecosystem | Observability and trace review | Helpful for inspecting runs, comparing experiments, debugging failures, and storing evaluation traces. | We should choose one observability layer instead of adopting several overlapping tools. |
| garak | NVIDIA / open-source community | LLM vulnerability scanning | Useful for adversarial probes such as prompt injection, leakage, jailbreaks, and other LLM security failure modes. | Focused more on model/application vulnerabilities than full coding-agent workflows. |
| PyRIT | Microsoft | Red-team test generation and orchestration | Strong fit for generating and managing adversarial security test cases that can feed into repeatable evaluations. | More of a red-team framework than a general evaluation harness. |
| RAMPART | Microsoft | Research pilot for agent safety/security testing | Directionally relevant: pytest-native safety and security testing for agents, with an emphasis on turning red-team findings into repeatable tests. | Still new. Treat as a research pilot or watchlist item, not a core dependency yet. |
| CodeQL, Semgrep, and SARIF | GitHub, Semgrep, OASIS standard | Security evidence and static-analysis output | Useful for grounding results in scanner findings, regression checks, and normalized security output. SARIF gives us a standard way to aggregate findings. | Static analysis will not catch every agent failure mode, so these should be supporting signals rather than the only scoring mechanism. |

Recommendation

Use Inspect AI as the primary evaluation backbone. It provides the best foundation for modular, repeatable, tool-aware evaluations and gives us room to build CodeGuard-specific tasks, scorers, and sandboxed workflows.

Pair it with Promptfoo for lightweight CI regression tests and quick red-team checks. Add DeepEval or Ragas where we need focused scoring for tool calls, arguments, trajectories, or task completion.

For debugging and experiment review, choose one observability layer (likely Langfuse, Phoenix, or MLflow) rather than adopting multiple overlapping platforms.

For security-specific testing, use garak and PyRIT as mature adversarial testing inputs. Treat RAMPART as a promising research pilot because it is highly relevant to agent safety testing, but still too new to rely on as a core dependency.

In short: Inspect AI + Promptfoo + one observability layer + CodeQL/Semgrep/SARIF + garak/PyRIT gives us a balanced stack across maturity, modularity, CI integration, traceability, and security evaluation.
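
Since the stack leans on SARIF as the common output format, here is a minimal sketch of how scanner output from a baseline scan and a scan of agent-modified code could be compared. It reads only the `runs[].results[].ruleId` structure defined by SARIF 2.1.0; the function names and the inline sample documents are made up for illustration, and real CodeQL or Semgrep output carries many more fields.

```python
import json
from collections import Counter


def rule_counts(sarif_text: str) -> Counter:
    """Count findings per rule ID across all runs in one SARIF document."""
    doc = json.loads(sarif_text)
    counts = Counter()
    for run in doc.get("runs", []):
        for result in run.get("results", []):
            counts[result.get("ruleId", "<no-rule-id>")] += 1
    return counts


def new_findings(before: Counter, after: Counter) -> dict:
    """Rules whose finding count went up between two scans."""
    return {rule: after[rule] - before[rule]
            for rule in after if after[rule] > before[rule]}


# Illustrative documents, trimmed to the fields this sketch reads.
baseline = json.dumps({"version": "2.1.0", "runs": [
    {"results": [{"ruleId": "py/sql-injection"}]}]})
agent_run = json.dumps({"version": "2.1.0", "runs": [
    {"results": [{"ruleId": "py/sql-injection"},
                 {"ruleId": "py/command-line-injection"}]}]})

regressions = new_findings(rule_counts(baseline), rule_counts(agent_run))
```

A per-rule diff like this is a supporting signal only: it says the agent introduced a finding the scanner can see, not that the change is otherwise safe.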

0 replies

bact · 2026-05-27T16:05:11Z

bact
May 27, 2026

One thing we also have to think about is what to measure.
Below is a quick thought on metrics we could use. We may want to review more code-correction benchmarks (apart from SWE-bench) to find further metrics.

Metrics for secure AI code generation benchmarking

The metric categories could be derived from the goal of secure AI code generation. For an AI code assistant that helps with existing code, the goal breaks down into four activities:

1. Correctly identify the security issue
2. Correctly identify the locations in the code of the security issue from (1)
3. Correctly fix the security issue without introducing a new one (make the code more secure)
4. Retain the existing expected behavior of the code (functionality intact, tests pass)

Measurement

Measurement of (1) can use a broad security category or a specific CWE ID, scored with precision and recall.
Measurement of (2) can be at the level of lines of code (LOC). We could give a lower score for a wrong line inside the correct block, function, or module, and allow a small error window (for example, +/- 1 line). If we want to be very specific, we could also score at character level.
- To make the measurement more precise, should we normalize (format) the code and remove all comments and blank lines?
Measurement of (4) will rely on an extensive test suite: how many test cases fail after the fix in (3)?
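
The measurements for (1), (2), and (4) can be sketched as below. This is one possible scoring scheme, not an agreed definition: the function names, the partial-credit value of 0.5 for a hit inside the right function, and the default window of one line are all assumptions chosen for illustration.

```python
def precision_recall(predicted: set, actual: set) -> tuple:
    """Measurement (1): precision and recall of predicted CWE IDs."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


def location_score(predicted_line: int, actual_line: int,
                   function_span: tuple, window: int = 1) -> float:
    """Measurement (2): 1.0 inside the line window, 0.5 if only the
    enclosing function matches, otherwise 0.0."""
    if abs(predicted_line - actual_line) <= window:
        return 1.0
    start, end = function_span
    if start <= predicted_line <= end:
        return 0.5
    return 0.0


def regression_count(before: dict, after: dict) -> int:
    """Measurement (4): tests that passed before the fix but fail after it.
    A test missing from `after` is counted as failing."""
    return sum(1 for name, ok in before.items()
               if ok and not after.get(name, False))
```

For example, predicting `{"CWE-89", "CWE-79"}` when only CWE-89 is present gives precision 0.5 and recall 1.0, and a prediction of line 50 for an issue on line 41 inside a function spanning lines 30 to 60 scores 0.5. Line numbers only compare cleanly if both sides use the same normalization, which is why the question about stripping comments and blank lines matters.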

0 replies
