feat: eval consolidation, nested markers, and AAP spec updates #2
Merged
Separate AAP specification into two focused documents:
- aap-spec-init.md: guidance for generating markup with target markers during artifact creation
- aap-spec-maintain.md: guidance for responding with edit envelopes during artifact maintenance

This improves clarity on agent responsibilities based on the artifact lifecycle phase.
Update main AAP specification to reflect metadata structure changes. The operation field is replaced with a meta object that holds format and other metadata. This aligns with the actual implementation and makes the envelope structure clearer for edit operations.
Implement proper handling of nested <aap:target> markers by tracking nesting depth rather than matching the first closing tag. This allows targets to safely contain other targets. Adds find_matching_close() function to locate the correct closing tag when multiple targets are nested. Applies to both Rust (src/markers.rs) and Python (evals/src/aap_evals/markers.py) implementations. Includes additional test for nested target outer extraction.
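The depth-tracking idea can be sketched in a few lines of Python. This is a minimal illustration, not the code from evals/src/aap_evals/markers.py: the signature of find_matching_close(), the hypothetical extract_target() helper, and the `id="..."` attribute form of the opening tag are all assumptions made for the example.

```python
from typing import Optional

# Assumed marker syntax: <aap:target id="...">...</aap:target>
OPEN_PREFIX = "<aap:target"
CLOSE_TAG = "</aap:target>"


def find_matching_close(text: str, start: int) -> int:
    """Return the index of the close tag that matches an open tag whose
    content begins at `start`, or -1 if the markers are unbalanced."""
    depth = 1  # we are already inside one target
    pos = start
    while True:
        next_open = text.find(OPEN_PREFIX, pos)
        next_close = text.find(CLOSE_TAG, pos)
        if next_close == -1:
            return -1  # ran out of closing tags before depth reached zero
        if next_open != -1 and next_open < next_close:
            # A nested target opens before the next close: go one level deeper.
            depth += 1
            pos = next_open + len(OPEN_PREFIX)
        else:
            depth -= 1
            if depth == 0:
                return next_close
            pos = next_close + len(CLOSE_TAG)


def extract_target(text: str, target_id: str) -> Optional[str]:
    """Hypothetical helper: return the content of the target with this id."""
    open_tag = f'<aap:target id="{target_id}">'
    open_at = text.find(open_tag)
    if open_at == -1:
        return None
    content_start = open_at + len(open_tag)
    close_at = find_matching_close(text, content_start)
    if close_at == -1:
        return None
    return text[content_start:close_at]
```

Matching the first closing tag would cut the outer target off at the inner target's close; counting depth keeps the inner target intact inside the outer extraction.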
Refactor apply_edit to handle delete operations differently from other operations. Delete now uses inclusive range (markers included) while replace/insert operations use exclusive range (markers excluded). Add find_by_id_inclusive() to Resolve trait and TextResolver to support this distinction. Add resolve_target_inclusive() helper function. Restructure operation matching to split delete from other op types. Update test assertion to reflect that delete removes both content and markers completely.
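The inclusive versus exclusive distinction can be illustrated with a small Python sketch. It is not the Rust implementation: the function signatures, the resolve_target() name for the exclusive case, and the `id="..."` marker form are assumptions, nesting is ignored for brevity, and only delete and replace are shown because the insert semantics are not described here.

```python
from typing import Tuple

CLOSE_TAG = "</aap:target>"  # assumed marker syntax


def resolve_target(text: str, target_id: str) -> Tuple[int, int]:
    """Exclusive range: only the content between the markers."""
    open_tag = f'<aap:target id="{target_id}">'
    open_at = text.index(open_tag)  # raises ValueError if the id is absent
    content_start = open_at + len(open_tag)
    close_at = text.index(CLOSE_TAG, content_start)
    return content_start, close_at


def resolve_target_inclusive(text: str, target_id: str) -> Tuple[int, int]:
    """Inclusive range: the opening marker, the content, and the closing marker."""
    open_tag = f'<aap:target id="{target_id}">'
    open_at = text.index(open_tag)
    close_at = text.index(CLOSE_TAG, open_at + len(open_tag))
    return open_at, close_at + len(CLOSE_TAG)


def apply_edit(text: str, op: str, target_id: str, content: str = "") -> str:
    """Apply a delete or replace operation to the named target."""
    if op == "delete":
        # Delete removes the markers along with the content.
        start, end = resolve_target_inclusive(text, target_id)
        return text[:start] + text[end:]
    if op == "replace":
        # Replace keeps the markers so the target stays addressable.
        start, end = resolve_target(text, target_id)
        return text[:start] + content + text[end:]
    raise ValueError(f"unsupported op in this sketch: {op}")
```

With an exclusive range, a delete would leave an empty pair of markers behind; the inclusive range removes the target entirely, which matches the updated test assertion described above.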
Update evals configuration:
- Change default Google provider from gemini-3.1-flash-lite-preview to gemini-2.5-flash
- Load separated AAP specification files (init and maintain) instead of single spec file
- Pass appropriate spec to each agent based on task phase

This aligns with the split AAP specification and uses more current model defaults.
Rebuild compiled Python extension following updates to marker handling logic. Recompilation captures nested target marker support.
Clean up experiment data directory by removing old evaluation results and intermediate outputs. Consolidate experiment runs, keeping only the final turn outputs and updated HTML artifacts. Removes eval.json and metrics.json files for experiments that have been consolidated. Removes intermediate turn outputs (turns 1-4 for aap, turns 3-4 for base runs) that are no longer needed. Updates remaining artifacts with new evaluation results.
Update Cargo.lock with resolved dependency versions following changes to project dependencies.
Summary
Test plan