
feat: eval consolidation, nested markers, and AAP spec updates #2

Merged
urmzd merged 9 commits into main from feat/evals-and-markers on Apr 3, 2026
urmzd commented Apr 3, 2026

Summary

  • docs(aap): Split specification into init and maintain phases; add meta field and clearer envelope structure
  • feat(markers): Support nested target markers with depth counting
  • feat(apply): Use inclusive range for delete operations
  • build/test(evals): Update Google provider default, split spec loading, rebuild Rust Python extension, consolidate experiment outputs
  • fix: Clean up broken outputs and update dependency versions

Test plan

  • Verify eval experiments still run correctly with consolidated outputs
  • Test nested marker parsing with depth counting
  • Confirm inclusive range delete operations work as expected
  • Review AAP spec changes for correctness

urmzd added 9 commits April 3, 2026 03:12
Separate AAP specification into two focused documents:
- aap-spec-init.md: guidance for generating markup with target
  markers during artifact creation
- aap-spec-maintain.md: guidance for responding with edit
  envelopes during artifact maintenance

This improves clarity on agent responsibilities based on the
artifact lifecycle phase.

Update the main AAP specification to reflect changes to the
metadata structure. The operation field is replaced with a meta
object that holds the format and other metadata. This aligns the
spec with the actual implementation and makes the envelope
structure clearer for edit operations.

Implement proper handling of nested <aap:target> markers by
tracking nesting depth rather than matching the first closing
tag. This allows targets to safely contain other targets.

Adds a find_matching_close() function to locate the correct
closing tag when multiple targets are nested. Applies to both the
Rust (src/markers.rs) and Python (evals/src/aap_evals/markers.py)
implementations.

Adds a test for extracting an outer target that contains a
nested target.
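
The depth-counting approach described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: only the name find_matching_close() comes from the commit message, while the marker syntax (`<aap:target id="...">` closed by `</aap:target>`), the function signature, and the `extract_target` helper are assumptions made for the example.

```python
import re
from typing import Optional

# Assumed marker syntax; the real grammar lives in the AAP spec.
CLOSE_TAG = "</aap:target>"
TAG_RE = re.compile(r"<aap:target\b[^>]*>|</aap:target>")


def find_matching_close(text: str, start: int) -> Optional[int]:
    """Return the index of the close tag matching an open tag whose
    content begins at `start`, or None if the target is unbalanced."""
    depth = 1  # we are already inside one open target
    for match in TAG_RE.finditer(text, start):
        if match.group() == CLOSE_TAG:
            depth -= 1
            if depth == 0:
                return match.start()
        else:
            depth += 1  # a nested target opened
    return None


def extract_target(text: str, target_id: str) -> Optional[str]:
    """Hypothetical helper: return the content of the target with this id."""
    open_re = re.compile(
        r'<aap:target\s+id="' + re.escape(target_id) + r'"[^>]*>'
    )
    opened = open_re.search(text)
    if opened is None:
        return None
    close = find_matching_close(text, opened.end())
    if close is None:
        return None
    return text[opened.end():close]
```

With a document such as `<aap:target id="outer">A<aap:target id="inner">B</aap:target>C</aap:target>`, matching the first closing tag would cut the outer target off after `B`; counting depth lets the outer extraction include the whole inner target.
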
Refactor apply_edit to handle delete operations differently
from other operations. Delete now uses inclusive range (markers
included) while replace/insert operations use exclusive range
(markers excluded).

Add find_by_id_inclusive() to Resolve trait and TextResolver
to support this distinction. Add resolve_target_inclusive()
helper function. Restructure operation matching to split delete
from other op types.

Update test assertion to reflect that delete removes both
content and markers completely.
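
To make the inclusive versus exclusive distinction concrete, here is a minimal Python sketch. It is not the Rust implementation from this PR: the `resolve_target` and `apply_edit` signatures, the `inclusive` flag, and the marker syntax are assumptions for illustration, and only the delete and replace operations are shown.

```python
import re
from typing import Optional, Tuple

# Assumed marker syntax; the real grammar lives in the AAP spec.
CLOSE_TAG = "</aap:target>"
TAG_RE = re.compile(r"<aap:target\b[^>]*>|</aap:target>")


def _find_close(text: str, start: int) -> Optional[int]:
    """Depth-counting search for the close tag matching content at `start`."""
    depth = 1
    for match in TAG_RE.finditer(text, start):
        depth += -1 if match.group() == CLOSE_TAG else 1
        if depth == 0:
            return match.start()
    return None


def resolve_target(
    text: str, target_id: str, inclusive: bool = False
) -> Optional[Tuple[int, int]]:
    """Return (start, end) for a target.

    Exclusive: only the content between the markers.
    Inclusive: the markers themselves are part of the range.
    """
    open_re = re.compile(
        r'<aap:target\s+id="' + re.escape(target_id) + r'"[^>]*>'
    )
    opened = open_re.search(text)
    if opened is None:
        return None
    close = _find_close(text, opened.end())
    if close is None:
        return None
    if inclusive:
        return opened.start(), close + len(CLOSE_TAG)
    return opened.end(), close


def apply_edit(text: str, op: str, target_id: str, content: str = "") -> str:
    """Hypothetical edit applier covering only delete and replace."""
    if op not in ("delete", "replace"):
        raise ValueError(f"unsupported op in this sketch: {op}")
    # Delete removes content and markers; replace keeps the markers.
    span = resolve_target(text, target_id, inclusive=(op == "delete"))
    if span is None:
        raise KeyError(target_id)
    start, end = span
    return text[:start] + ("" if op == "delete" else content) + text[end:]
```

Keeping the markers on replace means the target stays addressable for later edits, while removing them on delete avoids leaving an empty target behind.
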
Update evals configuration:
- Change default Google provider from gemini-3.1-flash-lite-preview
  to gemini-2.5-flash
- Load separated AAP specification files (init and maintain) instead
  of single spec file
- Pass appropriate spec to each agent based on task phase

This aligns with the split AAP specification and uses more
current model defaults.
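
Loading the two spec files and selecting one per task phase might look like the sketch below. The file names come from the commit message above; the phase names, the `load_specs` and `spec_for_phase` helpers, and the directory argument are assumptions, not the evals package's actual API.

```python
from pathlib import Path
from typing import Dict

# File names taken from the spec split; the phase keys are assumed.
SPEC_FILES = {
    "init": "aap-spec-init.md",
    "maintain": "aap-spec-maintain.md",
}


def load_specs(spec_dir: str) -> Dict[str, str]:
    """Read both spec documents from `spec_dir`, keyed by phase."""
    base = Path(spec_dir)
    return {
        phase: (base / name).read_text(encoding="utf-8")
        for phase, name in SPEC_FILES.items()
    }


def spec_for_phase(specs: Dict[str, str], phase: str) -> str:
    """Pick the spec text an agent should receive for the given phase."""
    if phase not in specs:
        raise ValueError(f"unknown phase: {phase!r}")
    return specs[phase]
```

An agent creating an artifact would then be given `spec_for_phase(specs, "init")`, and an agent editing an existing artifact `spec_for_phase(specs, "maintain")`.
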
Rebuild the compiled Python extension following updates to the
marker handling logic, so that it picks up nested target marker
support.

Clean up experiment data directory by removing old evaluation
results and intermediate outputs. Consolidate experiment runs,
keeping only the final turn outputs and updated HTML artifacts.

Removes eval.json and metrics.json files for experiments that
have been consolidated. Removes intermediate turn outputs
(turns 1-4 for aap, turns 3-4 for base runs) that are no longer
needed. Updates remaining artifacts with new evaluation results.

Update Cargo.lock with resolved dependency versions following
changes to project dependencies.

urmzd merged commit d1158a3 into main on Apr 3, 2026
1 check passed
urmzd deleted the feat/evals-and-markers branch on April 3, 2026 08:15