AI assisted Code Generation: Testing Harness Literature Review #70
Replies: 4 comments 4 replies
-
|
For Inspect AI, it looks like CodeGuard could be plugged in as either a Solver or a Scorer in its evaluation framework. |
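To make the two placements concrete, here is a rough stdlib-only sketch. It does not use Inspect AI's actual API: `State`, `fake_model`, `plain_solver`, `codeguard_solver`, `codeguard_scorer`, and the guidance text are all hypothetical stand-ins. The only assumption carried over is that a solver shapes how the output is produced, while a scorer grades the output afterwards.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical task state: the prompt sent to the model and its output."""
    prompt: str
    output: str = ""


# Illustrative guidance text, not an actual CodeGuard rule.
GUIDANCE = "Use parameterized queries; never concatenate SQL strings."


def fake_model(prompt: str) -> str:
    """Placeholder for a model call: emits safe code only when guided."""
    if "parameterized" in prompt:
        return 'cursor.execute("SELECT * FROM users WHERE id = ?", (uid,))'
    return 'cursor.execute("SELECT * FROM users WHERE id = " + uid)'


def plain_solver(state: State) -> State:
    """Baseline: generate with the prompt unchanged."""
    state.output = fake_model(state.prompt)
    return state


def codeguard_solver(state: State) -> State:
    """Solver placement: add guidance to the prompt before generation."""
    state.prompt = f"{GUIDANCE}\n\n{state.prompt}"
    state.output = fake_model(state.prompt)
    return state


def codeguard_scorer(state: State) -> bool:
    """Scorer placement: grade the finished output against a rule.

    The check is deliberately crude: it flags a string literal that is
    concatenated with a variable.
    """
    return '" +' not in state.output


task = "Write a query that loads a user by id."
baseline_ok = codeguard_scorer(plain_solver(State(task)))
guided_ok = codeguard_scorer(codeguard_solver(State(task)))
```

In this toy setup the scorer can grade any solver, so the two placements are not mutually exclusive: the same rules could guide generation in one run and grade an unguided baseline in another.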
-
|
I did a lit review, but I would like to know what others have learned. Should we schedule a meeting to discuss our findings? Is the deliverable a Google Doc? |
-
|
Posting this here in case it is helpful for others. Here is a summary of my research review that I have been iterating on:
Proposed Evaluation Stack for CodeGuard
We have been reviewing open-source and modular tools that could help us test CodeGuard against coding agents, tool usage, security-sensitive workflows, and agent behavior over time. The goal is not to pick tools based on popularity alone. We want a stack that is modular, maintainable, mature enough for core use, and flexible enough to support research-oriented security evaluation.
Recommendation
Use Inspect AI as the primary evaluation backbone. It provides the best foundation for modular, repeatable, tool-aware evaluations and gives us room to build CodeGuard-specific tasks, scorers, and sandboxed workflows.
Pair it with Promptfoo for lightweight CI regression tests and quick red-team checks. Add DeepEval or Ragas where we need focused scoring for tool calls, arguments, trajectories, or task completion.
For debugging and experiment review, choose one observability layer (likely Langfuse, Phoenix, or MLflow) rather than adopting multiple overlapping platforms.
For security-specific testing, use garak and PyRIT as mature adversarial testing inputs. Treat RAMPART as a promising research pilot: it is highly relevant to agent safety testing, but still too new to rely on as a core dependency.
In short: Inspect AI + Promptfoo + one observability layer + CodeQL/Semgrep/SARIF + garak/PyRIT gives us a balanced stack across maturity, modularity, CI integration, traceability, and security evaluation. |
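Since SARIF is the piece of this stack that the static analyzers can share as an output format, one way to feed their findings into an evaluation is to count results per tool and severity. The sketch below is only an illustration under stated assumptions: the function name `count_sarif_levels`, the tool name `ExampleScanner`, and the rule IDs are made up, and the input is a hand-written SARIF 2.1.0 log rather than real scanner output.

```python
import json
from collections import Counter


def count_sarif_levels(sarif_text: str) -> Counter:
    """Count SARIF results keyed by (tool name, level)."""
    log = json.loads(sarif_text)
    counts: Counter = Counter()
    for run in log.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name", "unknown")
        for result in run.get("results", []):
            # A result without an explicit level is treated as "warning",
            # the SARIF default when no rule configuration overrides it.
            level = result.get("level", "warning")
            counts[(tool, level)] += 1
    return counts


# Hand-written sample log; tool name and rule IDs are hypothetical.
SAMPLE = json.dumps({
    "version": "2.1.0",
    "runs": [{
        "tool": {"driver": {"name": "ExampleScanner"}},
        "results": [
            {"ruleId": "EX001", "level": "error",
             "message": {"text": "SQL built by string concatenation"}},
            {"ruleId": "EX002",
             "message": {"text": "Hard-coded credential"}},
        ],
    }],
})

counts = count_sarif_levels(SAMPLE)
for (tool, level), n in sorted(counts.items()):
    print(f"{tool}: {level} = {n}")
```

Comparing these counts for code generated with and without CodeGuard would be one simple signal; whether it is the right one depends on the metrics discussion in this thread.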
-
|
One thing we also have to think about is what to measure.
Metrics for secure AI code generation benchmarking
Metric categories can be derived from the goals of secure AI code generation. For an AI code assistant that helps with existing code, the goal could be broken down into four activities:
Measurement
|
-
Before building a testing harness for secure AI-assisted code development, we should review the existing landscape of AI agent evaluation harnesses and identify what can be reused.
Starting examples:
Suggested prompts for the review:
Desired output: a short landscape review with a recommendation on what the SIG should reuse versus what we need to define ourselves.