14 changes: 13 additions & 1 deletion .env.sample
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,16 @@ LOGFIRE_ENABLED=false
JUDGE_MODEL=openrouter:google/gemini-3-flash-preview

# Default model used by Agents
MODEL_NAME=openrouter:google/gemini-3-flash-preview
MODEL_NAME=openrouter:google/gemini-3-flash-preview

# Default spec & exploration models used by CodeMode pipeline
# Spec agent generates tests
# Exploration agent generates snippets to discover behaviors (code-execution)
SPEC_MODEL=openrouter:anthropic/claude-opus-4.6
EXPLORATION_MODEL=openrouter:anthropic/claude-sonnet-4.6

# Default spec & exploration models used by CodeMode benchmark pipeline
# NOTE: Models should be comma-separated; the number of spec models must equal the number of exploration models
# spec[i] will be mapped to exploration[i] (Case N)
BENCHMARK_SPEC_MODELS=openrouter:anthropic/claude-opus-4.6
BENCHMARK_EXPLORATION_MODELS=openrouter:anthropic/claude-sonnet-4.6
4 changes: 2 additions & 2 deletions .github/workflows/tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
python-version: ["3.11", "3.12", "3.13", "3.14"]

steps:
- uses: actions/checkout@v4
Expand All @@ -30,7 +30,7 @@ jobs:
- name: Install dependencies
run: |
source venv/bin/activate
uv pip install -e ".[dev]"
uv pip install -e ".[all]"

- name: Run tests
run: |
4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -69,3 +69,7 @@ evaluations/
# !!
TODO
docs/FIXTURE_GENERATION_RFC.md

# Benchmarks
benchmark*
important-links.md
3 changes: 3 additions & 0 deletions .gitmodules
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,6 @@
[submodule "skills/vowel-core"]
path = skills/vowel-core
url = https://github.com/fswair/vowel-core.git
[submodule "codemode-benchmark"]
path = codemode-benchmark
url = https://github.com/fswair/codemode-benchmark
9 changes: 9 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,5 +30,14 @@ This document contains concise rules for how agents should inspect and use this
- If you have questions or uncertainty, consult `README.md` and the relevant docs pages.
- Check `TODO` for pending tasks or known issues.

## Critical Thinking & Intellectual Honesty

- **Never defer to the user's idea just because they said it.** Evaluate every proposal — yours or the user's — on its own merits: trade-offs, costs, complexity, correctness.
- **If the user's idea has flaws, say so.** Explain why with concrete reasoning (performance, token cost, latency, maintainability, correctness risk). Do not soften criticism to be agreeable.
- **If your own idea has flaws, admit it first.** Don't wait for the user to find the holes. Present disadvantages upfront.
- **When comparing approaches, use structured analysis:** list pros/cons for each, identify the real trade-offs, and state which you'd pick and why — before asking for input.
- **"You're right" must be earned.** If you catch yourself agreeing immediately, stop and ask: "Did I actually evaluate this, or am I just being agreeable?" If the latter, go back and do the analysis.
- **The user is a collaborator, not an authority.** Good ideas win regardless of who proposed them. Bad ideas lose regardless of who proposed them.

These rules help agents use the project consistently and safely.

198 changes: 198 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
# CHANGELOG

## codemode_driven_generation

This document summarizes the main features added or improved on this branch.

## 1) Executor and ExecutionSession protocols

- The code execution interface was formalized using Protocols.
- The Executor async/sync API was standardized:
- execute(...)
- execute_sync(...)
- create_session(...)
- ExecutionSession now compiles/executes setup code once and supports multi-snippet feed execution.
- This reduces repeated parse/compile overhead while exploring the same function.
- The run_sync helper was hardened for running-loop environments via nest-asyncio.
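
The protocol shape described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `ExecutionResult`, `ExecSession`, `ExecExecutor`, the `feed` method name, and all signatures are assumptions made for the example. It shows setup code being compiled and executed once, with several snippets then fed against the same namespace.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass
class ExecutionResult:
    """Hypothetical result container; the real project may expose different fields."""
    value: Any = None
    error: Optional[str] = None


class ExecutionSession(Protocol):
    def feed(self, snippet: str) -> ExecutionResult: ...


class Executor(Protocol):
    async def execute(self, code: str) -> ExecutionResult: ...
    def execute_sync(self, code: str) -> ExecutionResult: ...
    def create_session(self, setup_code: str) -> ExecutionSession: ...


class ExecSession:
    """Runs the setup code once, then evaluates each snippet in the same namespace."""

    def __init__(self, setup_code: str) -> None:
        self._namespace: dict[str, Any] = {}
        exec(compile(setup_code, "<setup>", "exec"), self._namespace)

    def feed(self, snippet: str) -> ExecutionResult:
        try:
            return ExecutionResult(value=eval(snippet, self._namespace))
        except Exception as exc:  # report the error instead of raising
            return ExecutionResult(error=f"{type(exc).__name__}: {exc}")


class ExecExecutor:
    """Toy executor that satisfies the Executor protocol structurally."""

    def execute_sync(self, code: str) -> ExecutionResult:
        return ExecSession("").feed(code)

    async def execute(self, code: str) -> ExecutionResult:
        return self.execute_sync(code)

    def create_session(self, setup_code: str) -> ExecSession:
        return ExecSession(setup_code)


executor: Executor = ExecExecutor()
session = executor.create_session("def double(x):\n    return x * 2")
results = [session.feed("double(2)"), session.feed("double('ab')"), session.feed("double(None)")]
```

The function definition is parsed once in `create_session`; each `feed` call only evaluates a short expression, which is where the reduced parse/compile overhead would come from.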

## 2) MontyExecutor, DefaultExecutor, MontySession, FallbackSession structures

- MontyExecutor was added:
- sandboxed execution via pydantic-monty,
- ResourceLimits support (timeout/memory),
- stdout capture and normalized error typing/messages.
- DefaultExecutor was added/improved:
- pure Python exec-based fallback execution,
- last-expression capture (__result__) and stdout capture.
- MontyReplSession (MontySession role) was added:
- one-time setup load, reusable feed-run model.
- FallbackSession was added:
- Session-level fallback: if Monty session initialization fails, switch entirely to DefaultSession.
- Snippet-level fallback: if Monty returns ModuleNotFoundError for a snippet, rerun that snippet via fallback executor.
- Executor/fallback wiring was simplified through resolve_executors.
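
A sketch of the snippet-level fallback described above, under stated assumptions: `sandbox_run` and `plain_run` are hypothetical stand-ins for the sandboxed and pure-Python executors, and `run_with_fallback` is an illustrative name. Only `ModuleNotFoundError` triggers the rerun; any other error propagates unchanged.

```python
from typing import Any, Callable

Runner = Callable[[str], Any]


def sandbox_run(snippet: str) -> Any:
    """Stand-in for a sandbox that cannot import modules (hypothetical)."""
    if "import " in snippet:
        raise ModuleNotFoundError("imports are not available in the sandbox")
    return eval(snippet, {"__builtins__": {"len": len}})


def plain_run(snippet: str) -> Any:
    """Stand-in for an exec-based fallback that reads back __result__."""
    namespace: dict[str, Any] = {}
    exec(snippet, namespace)
    return namespace.get("__result__")


def run_with_fallback(snippet: str, primary: Runner, fallback: Runner) -> tuple[Any, str]:
    """Try the primary runner; rerun on the fallback only for ModuleNotFoundError."""
    try:
        return primary(snippet), "primary"
    except ModuleNotFoundError:
        return fallback(snippet), "fallback"


in_sandbox = run_with_fallback("len('abc')", sandbox_run, plain_run)
via_fallback = run_with_fallback(
    "import math\n__result__ = math.floor(2.7)", sandbox_run, plain_run
)
```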

## 3) Main implementation: CodeModeGenerator

- Two-phase exploration-guided generation flow:
- Phase 1: behavior exploration (exploration snippets + error snippets)
- Phase 2: spec generation from verified observations
- Lazy Agent architecture:
- explorer_agent (ExplorationPlan)
- spec_agent (EvalsSource or EvalsBundle)
- Prompt layers were clearly separated:
- exploration prompt: coverage, diversity, duplicate prevention
- spec prompt: expected values from verified outputs only
- A refinement loop was added:
- generate -> run -> failure_context -> regenerate
- Optional duration injection and a final summary run were added at the end.
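
The generate -> run -> failure_context -> regenerate loop can be sketched as below. The `refine` function, its parameters, and the toy callables are assumptions for illustration; in the real pipeline the generator is an LLM agent and the runner executes the evals.

```python
from typing import Callable


def refine(
    generate: Callable[[str], str],
    run: Callable[[str], list[str]],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Regenerate a spec until it runs without failures or the rounds run out."""
    failure_context = ""
    spec = ""
    for round_no in range(1, max_rounds + 1):
        spec = generate(failure_context)
        failures = run(spec)
        if not failures:
            return spec, round_no
        # Feed the observed failures back into the next generation attempt.
        failure_context = "\n".join(failures)
    return spec, max_rounds


def toy_generate(failure_context: str) -> str:
    # A toy generator: guesses wrong first, corrects itself once it sees the failure.
    return "expected: 4" if "got 4" in failure_context else "expected: 5"


def toy_run(spec: str) -> list[str]:
    # A toy runner: the function under test always returns 4.
    return [] if spec == "expected: 4" else [f"{spec}, got 4"]


final_spec, rounds_used = refine(toy_generate, toy_run)
```

Bounding the loop with `max_rounds` keeps cost predictable when a spec never converges.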

## 4) Runtime hierarchy and utility usage

CodeMode hierarchy:

1. explore()
2. generate_spec()
3. validate_and_fix_spec()
4. validate_expected_values()
5. inject_missing_error_cases()
6. inject_durations() (optional)
7. validation/refinement with RunEvals

Utilities used:

- build_call_code
- build_failure_context
- validate_and_fix_spec
- validate_expected_values
- inject_missing_error_cases
- inject_durations

## 5) Cost Manager

- Generation/run cost tracking was added for CodeMode.
- Features:
- generation_id and run_id lifecycle management,
- step-level usage/cost recording,
- model price resolution (genai-prices or costs.yml),
- atomic/locked JSON persistence,
- generation-level and run-level totals,
- status tracking: running/completed/failed.
- The CLI costs command now supports list/by-generation/by-run views.
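
One common way to get atomic JSON persistence is to write to a temporary file and rename it over the target. The sketch below assumes that approach; the function names, the JSON layout, and the `cost_usd` field are hypothetical, and file locking is left out for brevity.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def record_step(path: Path, generation_id: str, step: str, cost_usd: float) -> dict:
    """Append a step-level cost and refresh the generation-level total."""
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    generation = data.setdefault(
        generation_id, {"status": "running", "steps": [], "total": 0.0}
    )
    generation["steps"].append({"step": step, "cost_usd": cost_usd})
    generation["total"] = round(sum(s["cost_usd"] for s in generation["steps"]), 6)
    write_json_atomic(path, data)
    return generation


with tempfile.TemporaryDirectory() as tmp:
    costs_file = Path(tmp) / "costs.json"
    record_step(costs_file, "gen-1", "explore", 0.01)
    summary = record_step(costs_file, "gen-1", "spec", 0.02)
    persisted = json.loads(costs_file.read_text(encoding="utf-8"))
```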

## 6) Serializer syntax and YAML-native serializer registry

- Top-level serializers registry support was added at EvalsFile level.
- Per-eval serializer references are now supported via serializer:.
- SerializerSpec was clarified with one-of behavior:
- schema (string or dict)
- serializer (callable import path)
- not both at the same time.
- Runtime resolver additions:
- import-path resolution,
- cached imports (_import_path_cached),
- per-eval resolution (_resolve_yaml_serializer_entry).
- Precedence between programmatic serializer maps and YAML serializer registry was defined.
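
The one-of rule for SerializerSpec could be enforced along these lines. `SerializerSpecSketch` is a hypothetical stand-in built on a plain dataclass. Rejecting the case where neither field is set is an assumption; the notes above only state that both must not be set together.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass(frozen=True)
class SerializerSpecSketch:
    """Illustrative only: exactly one of schema / serializer may be set."""

    schema: Optional[Union[str, dict[str, Any]]] = None
    serializer: Optional[str] = None  # callable import path, e.g. "pkg.module.fn"

    def __post_init__(self) -> None:
        # True when both are set or both are missing.
        if (self.schema is None) == (self.serializer is None):
            raise ValueError("set exactly one of 'schema' or 'serializer'")


by_schema = SerializerSpecSketch(schema="int")
by_callable = SerializerSpecSketch(serializer="myapp.serializers.to_query")
```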

## 7) Spec model / Exploration model separation

- Model separation in CodeModeGenerator constructor was formalized:
- spec_model
- exploration_model
- use_model_spec output mode was clarified:
- use_model_spec=True: structured output mode (schema/model output via EvalsBundle)
- use_model_spec=False: YAML string output mode (via EvalsSource.yaml_spec)
- Keeping use_model_spec=False is highly recommended.
- Model resolution order and env fallback logic were added.
- Cost tracking now supports separate model usage across separate steps.

## 8) Adding executor/fallback executor to utilities

- Utility flows were updated to accept executor and fallback executor parameters.
- Monty -> Default fallback behavior was generalized in execution-aware paths.
- Executor behavior was centralized across run_evals and validation stages.

## 9) YAML schema generator

- Runtime-model-driven schema generation was improved:
- supports top-level fixtures + serializers,
- preserves function-level EvalsMapValue behavior.
- Schema cache strategy was updated:
- content-hash-based filename (reduces stale editor cache issues).
- File header updates are handled safely via materialize_yaml_with_schema_header.
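
A content-hash-based filename can be derived as in the sketch below. The `schema_filename` helper, the `vowel-schema` prefix, and the 12-character digest length are assumptions for illustration; the point is that the name changes exactly when the schema content changes, so an editor cannot keep serving a stale cached copy under the same name.

```python
import hashlib
import json


def schema_filename(schema: dict, prefix: str = "vowel-schema") -> str:
    """Build a filename from a hash of the canonical JSON form of the schema."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}-{digest}.json"


name_a = schema_filename({"type": "object", "title": "Evals"})
name_b = schema_filename({"title": "Evals", "type": "object"})
name_c = schema_filename({"type": "object", "title": "Evals v2"})
```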

## 10) CLI commands: schema, costs

- vowel schema <file>:
- update schema header after YAML + pydantic validation
- vowel schema --create [path]:
- direct schema JSON generation
- vowel costs:
- --list
- --by-generation
- --by-run
- --generation <id>
- --run <id>

## 11) module.function -> function alias support

- Alias support was added for programmatic mapping resolution:
- function map
- serializer schema map
- serializer function map
- Behavior:
- exact match first,
- short-name fallback,
- explicit error for ambiguous reverse short-name mapping.
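
The lookup order above (exact match, then short-name fallback, then an explicit error on ambiguity) might look like this. `resolve_key` is a hypothetical helper, not the project's resolver, and the error types are assumptions.

```python
def resolve_key(name: str, mapping: dict[str, object]) -> str:
    """Return the mapping key that matches name, preferring an exact match."""
    if name in mapping:
        return name
    short = name.rsplit(".", 1)[-1]
    candidates = [key for key in mapping if key.rsplit(".", 1)[-1] == short]
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise KeyError(f"no mapping entry matches {name!r}")
    raise ValueError(f"{name!r} is ambiguous; candidates: {sorted(candidates)}")


exact = resolve_key("utils.add", {"utils.add": 1, "add": 2})
short_fallback = resolve_key("utils.add", {"add": 1})
reverse = resolve_key("add", {"utils.add": 1})
```

Raising on ambiguity instead of picking the first candidate avoids silently binding an eval to the wrong function.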

## 12) Feedback-guided exploration

- A targeted Round-2 exploration flow was added:
- build cluster summaries from Round-1 results,
- generate snippets focused on uncovered behavior classes.
- Duplicate/semantic repetition minimization was reinforced at prompt level.
- Distinct failure-mode coverage was improved for error snippets.
- Additional rounds now measure value via new-behavior counting.
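
New-behavior counting can be approximated by clustering results into coarse behavior classes and counting the classes a later round adds. The result dictionaries and the `behavior_key` clustering rule below are assumptions chosen to keep the example small.

```python
def behavior_key(result: dict) -> tuple[str, str]:
    """Cluster a result by outcome kind: error type, or type of the returned value."""
    if result.get("error"):
        return ("error", result["error"])
    return ("ok", type(result["value"]).__name__)


def count_new_behaviors(previous: list[dict], current: list[dict]) -> int:
    """Count behavior classes seen in current that were absent from previous."""
    seen = {behavior_key(r) for r in previous}
    return len({behavior_key(r) for r in current} - seen)


round_1 = [{"value": 1}, {"value": 2}, {"error": "TypeError"}]
round_2 = [{"value": 3}, {"value": "x"}, {"error": "ValueError"}, {"error": "TypeError"}]
new_in_round_2 = count_new_behaviors(round_1, round_2)
```

A round that adds no new classes is a reasonable signal to stop exploring.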

## 13) Assertion + serializer integration

- AssertionEvaluator input context is now serializer-aware.
- Assertions now see serialized input for schema, serial_fn, and nested/dict schema modes.
- This behavior is covered by regression tests.

## 14) LLM Judge env-ref improvements

- create_llm_judge now supports $ENV_VAR resolution for rubric/model fields.
- Missing env refs now produce clearer errors.
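
A minimal sketch of `$ENV_VAR` resolution with a descriptive error, assuming a value counts as a reference only when it starts with `$`. `resolve_env_ref`, the `VOWEL_DEMO_*` variable names, and the error wording are illustrative and are not the behavior of `create_llm_judge` itself.

```python
import os


def resolve_env_ref(value: str, field: str) -> str:
    """Resolve '$NAME' from the environment; pass other values through unchanged."""
    if not value.startswith("$"):
        return value
    name = value[1:]
    resolved = os.environ.get(name)
    if not resolved:
        raise ValueError(
            f"{field!r} references ${name}, but that environment variable is not set"
        )
    return resolved


os.environ["VOWEL_DEMO_JUDGE_MODEL"] = "openrouter:google/gemini-3-flash-preview"
model = resolve_env_ref("$VOWEL_DEMO_JUDGE_MODEL", "model")
rubric = resolve_env_ref("Answer must be polite.", "rubric")
```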

## 15) Examples, documentation, and test coverage

- A runnable native serializer + fixture example was added.
- README and serializer docs were updated with serializer/assertion context notes.
- Meaningful id fields were added to eval cases under examples.
- New/updated tests include:
- test_schema
- test_llm_judge_env_refs
- serializer assertion regressions
- YAML/native serializer parsing tests

## 16) Fixture scope alias support

- Fixture scopes now support clearer canonical names:
- case
- eval
- file
- Backward-compatible aliases are still accepted:
- function (alias of case)
- module (alias of eval)
- session (alias of file)
- At parse time, canonical names are normalized to legacy internal runtime values:
- case -> function
- eval -> module
- file -> session
- This keeps existing runtime lifecycle behavior unchanged while allowing more descriptive scope names in YAML.
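
The parse-time normalization can be sketched as a small lookup. `normalize_scope` is a hypothetical name; the mapping itself is taken from the list above.

```python
CANONICAL_TO_LEGACY = {"case": "function", "eval": "module", "file": "session"}
LEGACY_SCOPES = set(CANONICAL_TO_LEGACY.values())


def normalize_scope(scope: str) -> str:
    """Map canonical scope names to the legacy runtime values; accept legacy names as-is."""
    if scope in CANONICAL_TO_LEGACY:
        return CANONICAL_TO_LEGACY[scope]
    if scope in LEGACY_SCOPES:
        return scope
    allowed = sorted(CANONICAL_TO_LEGACY) + sorted(LEGACY_SCOPES)
    raise ValueError(f"unknown fixture scope {scope!r}; expected one of {allowed}")


normalized = [normalize_scope(s) for s in ("case", "eval", "file", "module")]
```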

Note: The old names will be deprecated after v1.0.0.

## Note

This changelog is based on features observed and validated in code on this branch, without using git history.
9 changes: 9 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,5 +30,14 @@ Claude-type agents working with this repository should follow these steps:
- If you have questions or uncertainty, consult `README.md` and the relevant docs pages.
- Check `TODO` for pending tasks or known issues.

## Critical Thinking & Intellectual Honesty

- **Never defer to the user's idea just because they said it.** Evaluate every proposal — yours or the user's — on its own merits: trade-offs, costs, complexity, correctness.
- **If the user's idea has flaws, say so.** Explain why with concrete reasoning (performance, token cost, latency, maintainability, correctness risk). Do not soften criticism to be agreeable.
- **If your own idea has flaws, admit it first.** Don't wait for the user to find the holes. Present disadvantages upfront.
- **When comparing approaches, use structured analysis:** list pros/cons for each, identify the real trade-offs, and state which you'd pick and why — before asking for input.
- **"You're right" must be earned.** If you catch yourself agreeing immediately, stop and ask: "Did I actually evaluate this, or am I just being agreeable?" If the latter, go back and do the analysis.
- **The user is a collaborator, not an authority.** Good ideas win regardless of who proposed them. Bad ideas lose regardless of who proposed them.

These guidelines are intended to help Claude agents use the repository consistently.

50 changes: 49 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ pip install -e ".[all]"
## Quick Start

> **Note:**
> For a deeper understanding of how vowel handles fixtures, see the examples in [`db_fixture.yml`](./db_fixture.yml) and [`db.py`](./db.py). These files demonstrate the underlying mechanics of fixture setup and usage.
> For a deeper understanding of how vowel handles fixtures, see the examples in [`examples/db_fixtures`](./examples/db_fixtures/). These examples demonstrate the underlying mechanics of fixture setup and usage.

> **Tip:**
> To enable YAML schema validation in your editor, place `vowel-schema.json` in your project directory.
Expand Down Expand Up @@ -122,6 +122,8 @@ summary = (
summary.print()
```

> **Name matching note:** If your YAML uses `module.function`, programmatic mappings can use either the exact key (`module.function`) or the short function name (`function`) in `.with_functions(...)`.

---

## Features
Expand Down Expand Up @@ -181,6 +183,29 @@ def query_user(user_id: int, *, db: dict) -> dict | None:
return db["users"].get(user_id)
```

Fixture scope aliases:
- Preferred scope names: `case`, `eval`, `file`
- Backward-compatible aliases: `function`, `module`, `session`
- Normalization mapping: `case -> function`, `eval -> module`, `file -> session`

Example:

```yaml
fixtures:
temp_data:
setup: myapp.make_temp_data
scope: case

db:
setup: myapp.setup_db
teardown: myapp.close_db
scope: eval

cache:
setup: myapp.setup_cache
scope: file
```

> **Full reference:** [docs/FIXTURES.md](https://github.com/fswair/vowel/blob/main/docs/FIXTURES.md)

### Input Serializers
Expand All @@ -196,6 +221,26 @@ summary = (
)
```

> **Serializer key matching:** Serializer mappings follow the same rule as `.with_functions(...)` — both `module.function` and short `function` keys are accepted.

> **Assertion context and serializers:** When a serializer is configured, assertion evaluators use the serialized value for `input` (not raw YAML). This applies to schema mode, `serial_fn`, and nested/dict schemas.

Runnable example (YAML-native serializers + fixtures):

```bash
vowel examples/serializers/db_query_evals.yml
```

This example demonstrates:
- top-level `serializers:` registry with both `schema` and `serializer` entries,
- per-eval `serializer:` references,
- fixture class lifecycle wiring with `cls` + `teardown`,
- assertion checks that read serialized `input` values.

See:
- `examples/serializers/db_query_evals.yml`
- `examples/serializers/util.py`

> **Full reference:** [docs/SERIALIZERS.md](https://github.com/fswair/vowel/blob/main/docs/SERIALIZERS.md)

### AI-Powered Generation
Expand Down Expand Up @@ -259,6 +304,9 @@ vowel evals.yml --dry-run # Show plan without running
vowel evals.yml --export-json out.json # Export results
vowel evals.yml -v # Verbose summary
vowel evals.yml -v --hide-report # Verbose, hide pydantic_evals report
vowel schema examples/serializers/db_query_evals.yml # Validate + update schema header
vowel schema --create # Generate vowel-schema.json
vowel costs --list # List tracked generation/run costs
```

> **Full reference:** [docs/CLI.md](https://github.com/fswair/vowel/blob/main/docs/CLI.md)
2 changes: 1 addition & 1 deletion VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0.3.5
0.4.0