diff --git a/README.md b/README.md
index 8d09d42..1bfbe7b 100644
--- a/README.md
+++ b/README.md
@@ -257,6 +257,15 @@ a `.scm` import query - no other changes needed.
 
 ---
 
+## Experiments archive
+
+Nine rounds of paired A/B agent comparisons drove the rule library
+to its current state. The full archive (comparison reports, agent
+deliverables, run logs, and analysis charts) lives at
+[`experiments/`](experiments/README.md).
+
+---
+
 ## License
 
 MIT, see [`LICENSE`](LICENSE).
diff --git a/README.zh-TW.md b/README.zh-TW.md
index 6d5059e..15984ad 100644
--- a/README.zh-TW.md
+++ b/README.zh-TW.md
@@ -241,6 +241,13 @@ import query.
 
 ---
 
+## Experiment data
+
+Nine rounds of paired A/B agent comparison experiments drove the evolution of the rule library. The full data package (analysis reports, agent deliverables, run logs, analysis charts) is in
+[`experiments/`](experiments/README.md).
+
+---
+
 ## License
 
 MIT, see [`LICENSE`](LICENSE).
diff --git a/experiments/31flashlite-amb-b b/experiments/31flashlite-amb-b
deleted file mode 160000
index 3c7bd61..0000000
--- a/experiments/31flashlite-amb-b
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 3c7bd6132df55a86b447d35a1789323385e93144
diff --git a/experiments/31flashlite-amb-b/README.md b/experiments/31flashlite-amb-b/README.md
new file mode 100644
index 0000000..aace7d8
--- /dev/null
+++ b/experiments/31flashlite-amb-b/README.md
@@ -0,0 +1,17 @@
+# URL Shortener REST API
+
+A minimal in-memory URL shortener REST API.
+
+## Requirements
+- Python 3.12+
+- Dependencies listed in requirements.txt
+
+## Setup
+1. pip install -r requirements.txt
+2. python app.py
+
+## Endpoints
+- POST /shorten {"url": "..."}
+- GET / 302 redirect
+- DELETE /
+- GET /list
diff --git a/experiments/31flashlite-amb-b/app.py b/experiments/31flashlite-amb-b/app.py
new file mode 100644
index 0000000..0c853c0
--- /dev/null
+++ b/experiments/31flashlite-amb-b/app.py
@@ -0,0 +1,40 @@
+from flask import Flask, jsonify, request, redirect
+import secrets
+
+app = Flask(__name__)
+storage = {}
+
+@app.route('/shorten', methods=['POST'])
+def shorten():
+    data = request.get_json()
+    if not data or 'url' not in data:
+        return jsonify({"error": "Missing URL"}), 400
+    url = data['url']
+    if not (url.startswith('http://') or url.startswith('https://')):
+        return jsonify({"error": "Invalid URL"}), 400
+
+    code = secrets.token_urlsafe(6)[:6]
+    storage[code] = url
+    return jsonify({"code": code, "short_url": f"http://127.0.0.1:8080/{code}"}), 201
+
+@app.route('/<code_>', methods=['GET'])
+def redirect_to(code_):
+    url = storage.get(code_)
+    if not url:
+        return jsonify({"error": "Not found"}), 404
+    return redirect(url)
+
+@app.route('/<code_>', methods=['DELETE'])
+def delete(code_):
+    if code_ not in storage:
+        return jsonify({"error": "Not found"}), 404
+    del storage[code_]
+    return '', 204
+
+@app.route('/list', methods=['GET'])
+def list_entries():
+    entries = [{"code": k, "url": v} for k, v in storage.items()]
+    return jsonify({"entries": entries})
+
+if __name__ == '__main__':
+    app.run(port=8080)
diff --git a/experiments/31flashlite-amb-b/docs/superpowers/plans/2026-05-05-url-shortener-implementation.md b/experiments/31flashlite-amb-b/docs/superpowers/plans/2026-05-05-url-shortener-implementation.md
new file mode 100644
index 0000000..254d8c5
--- /dev/null
+++ b/experiments/31flashlite-amb-b/docs/superpowers/plans/2026-05-05-url-shortener-implementation.md
@@ -0,0 +1,223 @@
+# URL Shortener Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Build a minimal in-memory URL shortener REST API.
+
+**Architecture:** Use Flask for the REST API, a simple dictionary for storage, and `secrets` for URL generation.
+
+**Tech Stack:** Python 3.12, Flask, pytest.
+
+---
+
+### Task 1: Setup Flask App and Storage
+
+**Files:**
+- Create: `app.py`
+- Test: `tests.py`
+
+- [ ] **Step 1: Create initial `app.py`**
+
+```python
+from flask import Flask, jsonify, request
+
+app = Flask(__name__)
+storage = {}
+
+@app.route('/shorten', methods=['POST'])
+def shorten():
+    return jsonify({"message": "OK"}), 201
+
+if __name__ == '__main__':
+    app.run(port=8080)
+```
+
+- [ ] **Step 2: Create initial `tests.py`**
+
+```python
+import pytest
+from app import app
+
+@pytest.fixture
+def client():
+    return app.test_client()
+
+def test_shorten_endpoint(client):
+    response = client.post('/shorten', json={"url": "https://google.com"})
+    assert response.status_code == 201
+```
+
+- [ ] **Step 3: Verify app starts and tests pass**
+
+Run: `python3 -m pytest tests.py -v`
+Expected: PASS
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add app.py tests.py
+git commit -m "feat: setup basic flask app and test fixture"
+```
+
+### Task 2: Implement URL Shortening Logic
+
+**Files:**
+- Modify: `app.py`
+- Modify: `tests.py`
+
+- [ ] **Step 1: Update `tests.py` for validation and response**
+
+```python
+def test_shorten_validation(client):
+    response = client.post('/shorten', json={"url": "invalid"})
+    assert response.status_code == 400
+
+def test_shorten_success(client):
+    response = client.post('/shorten', json={"url": "https://google.com"})
+    assert response.status_code == 201
+    assert 'code' in response.get_json()
+    assert 'short_url' in response.get_json()
+```
+
+- [ ] **Step 2: Update `app.py` with validation and code generation**
+
+```python
+from flask import Flask, jsonify, request
+import secrets
+
+app = Flask(__name__)
+storage = {}
+
+@app.route('/shorten', methods=['POST'])
+def shorten():
+    data = request.get_json()
+    url = data.get('url', '')
+    if not (url.startswith('http://') or url.startswith('https://')):
+        return jsonify({"error": "Invalid URL"}), 400
+
+    code = secrets.token_urlsafe(6)[:6]
+    storage[code] = url
+    return jsonify({"code": code, "short_url": f"http://127.0.0.1:8080/{code}"}), 201
+
+if __name__ == '__main__':
+    app.run(port=8080)
+```
+
+- [ ] **Step 3: Run tests to verify**
+
+Run: `python3 -m pytest tests.py -v`
+Expected: PASS
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add app.py tests.py
+git commit -m "feat: implement shorten logic"
+```
+
+### Task 3: Implement Redirect (GET /)
+
+**Files:**
+- Modify: `app.py`
+- Modify: `tests.py`
+
+- [ ] **Step 1: Add redirect test**
+
+```python
+def test_redirect(client):
+    # Setup
+    client.post('/shorten', json={"url": "https://google.com"})
+    # Get code
+    res = client.get('/list')
+    code = res.get_json()['entries'][0]['code']
+
+    # Test
+    response = client.get(f'/{code}')
+    assert response.status_code == 302
+    assert response.location == "https://google.com"
+
+def test_redirect_not_found(client):
+    response = client.get('/nonexistent')
+    assert response.status_code == 404
+```
+
+- [ ] **Step 2: Add redirect logic**
+
+```python
+@app.route('/<code_>', methods=['GET'])
+def redirect_to(code_):
+    url = storage.get(code_)
+    if not url:
+        return jsonify({"error": "Not found"}), 404
+    from flask import redirect
+    return redirect(url)
+```
+
+- [ ] **Step 3: Run tests to verify**
+
+Run: `python3 -m pytest tests.py -v`
+Expected: PASS
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add app.py tests.py
+git commit -m "feat: implement redirect"
+```
+
+### Task 4: Implement Delete and List
+
+**Files:**
+- Modify: `app.py`
+- Modify: `tests.py`
+
+- [ ] **Step 1: Add delete and list tests**
+
+```python
+def test_list(client):
+    client.post('/shorten', json={"url": "https://a.com"})
+    response = client.get('/list')
+    assert response.status_code == 200
+    assert len(response.get_json()['entries']) >= 1
+
+def test_delete(client):
+    # Setup
+    client.post('/shorten', json={"url": "https://b.com"})
+    res = client.get('/list')
+    code = res.get_json()['entries'][0]['code']
+
+    # Test
+    response = client.delete(f'/{code}')
+    assert response.status_code == 204
+
+    # Verify deletion
+    assert client.get(f'/{code}').status_code == 404
+```
+
+- [ ] **Step 2: Add delete and list logic**
+
+```python
+@app.route('/<code_>', methods=['DELETE'])
+def delete(code_):
+    if code_ not in storage:
+        return jsonify({"error": "Not found"}), 404
+    del storage[code_]
+    return '', 204
+
+@app.route('/list', methods=['GET'])
+def list_entries():
+    entries = [{"code": k, "url": v} for k, v in storage.items()]
+    return jsonify({"entries": entries})
+```
+
+- [ ] **Step 3: Run tests to verify**
+
+Run: `python3 -m pytest tests.py -v`
+Expected: PASS
+
+- [ ] **Step 4: Commit**
+
+```bash
+git add app.py tests.py
+git commit -m "feat: implement delete and list"
+```
diff --git a/experiments/31flashlite-amb-b/docs/superpowers/specs/2026-05-05-url-shortener-design.md b/experiments/31flashlite-amb-b/docs/superpowers/specs/2026-05-05-url-shortener-design.md
new file mode 100644
index 0000000..78b9eb9
--- /dev/null
+++ b/experiments/31flashlite-amb-b/docs/superpowers/specs/2026-05-05-url-shortener-design.md
@@ -0,0 +1,29 @@
+# Design: URL Shortener REST API
+
+## Architecture
+- Framework: Flask
+- Storage: Python `dict` in-memory.
+- Random Code Generation: `secrets.token_urlsafe(6)` for unguessable 6-char alphanumeric codes.
+
+## Endpoints
+- `POST /shorten`:
+  - Input: `{"url": "..."}`
+  - Logic: Validate URL format (must start with `http://` or `https://`). Generate 6-char code. Store mapping `code -> url`.
+  - Returns: `{"code": "...", "short_url": "..."}`
+- `GET /`:
+  - Logic: Look up code. If found, 302 redirect to original URL. If not found, 404.
+- `DELETE /`:
+  - Logic: Look up code. If found, delete. Returns 204. If not found, 404.
+- `GET /list`:
+  - Logic: Iterate dict.
+  - Returns: `{"entries": [{"code": "...", "url": "..."}, ...]}`
+
+## Storage
+- `storage = {}` (global/module-level dict)
+
+## Error Handling
+- Invalid URL (POST): 400
+- Not found (GET/DELETE): 404
+
+## Testing
+- Using `pytest` with `Flask`'s `test_client`.
diff --git a/experiments/31flashlite-amb-b/requirements.txt b/experiments/31flashlite-amb-b/requirements.txt
new file mode 100644
index 0000000..4649a3c
--- /dev/null
+++ b/experiments/31flashlite-amb-b/requirements.txt
@@ -0,0 +1,2 @@
+flask
+pytest
diff --git a/experiments/31flashlite-amb-b/tests.py b/experiments/31flashlite-amb-b/tests.py
new file mode 100644
index 0000000..773ca81
--- /dev/null
+++ b/experiments/31flashlite-amb-b/tests.py
@@ -0,0 +1,32 @@
+import pytest
+from app import app
+
+@pytest.fixture
+def client():
+    return app.test_client()
+
+def test_shorten_flow(client):
+    # Shorten
+    res = client.post('/shorten', json={"url": "https://google.com"})
+    assert res.status_code == 201
+    code = res.get_json()['code']
+
+    # List
+    res = client.get('/list')
+    assert len(res.get_json()['entries']) == 1
+
+    # Redirect
+    res = client.get(f'/{code}')
+    assert res.status_code == 302
+
+    # Delete
+    res = client.delete(f'/{code}')
+    assert res.status_code == 204
+
+    # Verify gone
+    res = client.get(f'/{code}')
+    assert res.status_code == 404
+
+def test_invalid_url(client):
+    res = client.post('/shorten', json={"url": "ftp://bad.com"})
+    assert res.status_code == 400
diff --git a/experiments/README.md b/experiments/README.md
index 39be74a..484419f 100644
--- a/experiments/README.md
+++ b/experiments/README.md
@@ -3,65 +3,228 @@
 Empirical validation runs for the [Aegis](../README.md) judgment-free
 LLM-agent fact layer. Nine rounds of paired A/B comparisons across
 Anthropic Haiku / Sonnet, OpenAI GPT-5 / Codex / GPT-5.4-mini, and
-Google Gemini 2.5/3 Flash family on four task shapes:
+Google Gemini 2.5/3 Flash family on four task shapes.
-- **Plan A** - ambiguous-spec greenfield (URL shortener)
-- **Plan B** - brownfield with planted SEC bugs (Python `auth.py`)
-- **Plan C** - multi-module refactor (notifications feature)
-- **Round 9** - Go + Java brownfield (multi-language SEC dispatch validation)
+Each round runs the **same task twice with the same model**: variant
+**A** without the Aegis MCP server, variant **B** with it.
 
-Each round runs the same task twice with the same model: variant **A**
-without the Aegis MCP server, variant **B** with it. The deliverables
-get committed alongside the agent's `run.log` (where available) so the
-behaviour is reproducible.
+---
+
+## Overview chart
+
+```
+Total round directories: 52 (26 paired × 2 variants)
+
+Rounds by task shape:
+                 ┌──────────────────────────────┐
+  Plan A         │██████████████    14 dirs (7 pairs)
+  ambiguous spec │
+                 ├──────────────────────────────┤
+  Plan B         │████████████████  16 dirs (8 pairs)
+  brownfield     │
+                 ├──────────────────────────────┤
+  Plan C         │████████████████  16 dirs (8 pairs)
+  multi-module   │
+                 ├──────────────────────────────┤
+  Initial        │██████            6 dirs (3 pairs)
+  (Round 1-2)    │
+                 └──────────────────────────────┘
+
+Models tested: 11
+  Haiku · Sonnet · GPT-5.2 · GPT-5.3-codex · GPT-5.4 (codex)
+  GPT-5.4-mini · Gemini 2.5 Flash · 2.5 Flash-Lite
+  3 Flash · 3.1 Flash-Lite · (Gemma family - skipped due to API errors)
+```
+
+---
+
+## Chart 1 - Plan B brownfield: 0/3 → 3/3 fix rate
+
+The headline result. Each starting `auth.py` has **3 planted SEC bugs**:
+`md5` password hash, timing-unsafe `==` compare, weak RNG for session
+token. Cells show **bugs remaining** out of 3 (lower = better, 0 = all
+fixed). The right column shows how many bugs Aegis pointed out during
+the run; the agent then fixed those plus often more.
+ +``` +Model │ A (no aegis) │ B (with aegis) │ Δ │ aegis hits + │ bugs left /3 │ bugs left /3 │ │ +───────────────────┼──────────────────┼──────────────────┼────────┼──────── +Haiku │ 1 ▓░░ 1 ▓░░ 0 1* +Sonnet │ 3 ▓▓▓ 1 ▓░░ +2 3 +Gemini 2.5 Flash │ 3 ▓▓▓ 3 ▓▓▓ 0 2* +GPT-5.2 │ 3 ▓▓▓ 0 ░░░ +3 3 +GPT-5.3-codex │ 3 ▓▓▓ 0 ░░░ +3 3 +GPT-5.4-mini │ 3 ▓▓▓ 0 ░░░ +3 3 +GPT-5.4-mini Go │ 3 ▓▓▓ 2 ▓▓░ +1 2** (md5 missed) +GPT-5.4-mini Java │ 2 ▓▓░ 0 ░░░ +2 1** (only SEC012 fired) + │ +───────────────────┴──────────────────┴──────────────────┴────────┴──────── + + * = Plan B was a re-run after rules were tightened — original Haiku + run was on a slimmer rule set. + ** = Aegis ran with a coverage gap (SEC009 multi-language, SEC010 + Java enclosing-context). Both are now fixed in PRs #11 / #12. +``` + +**Two of the most striking observations:** + +1. **GPT-5.4-mini Java** — Aegis only flagged 1 out of 3 bugs (SEC012 + timing-unsafe), but the agent fixed all 3. The remaining md5 → + SHA-256 and `new Random()` → `SecureRandom` were **self-driven** + once Aegis put the agent into security-review mode. (The "one + finding triggers the cascade" mechanism.) +2. **GPT-5.2** — Went well beyond the prompt: replaced md5 with + OWASP-recommended **PBKDF2-HMAC-SHA256 + 16-byte salt + 210k + iterations** instead of plain SHA-256. Aegis only said "md5 is + weak"; the agent decided what to escalate to. + +--- + +## Chart 2 — Plan C multi-module: anti-paralysis ritual + +Plan C (add a `notifications` feature to a 5-module Python project) +has **clean starting code** — no planted bugs. Aegis can't show off +its rule library here, but a third ROI mechanism surfaced anyway. 
+ +``` +Model │ A (no aegis) │ B (with aegis) + │ │ +codex (GPT-5.4) │ ✓ 5 tests pass │ ✓ 5 tests pass +Gemini 2.5 Flash │ ✓ 5 tests pass │ ✓ 5 tests pass +Gemini 2.5 Flash-Lite │ ✓ 4 tests pass │ ✓ 4 tests pass +Gemini 3.1 Flash-Lite │ ✗ task abandoned │ ✗ task abandoned ← preview-mode planning loop, both stuck +Gemini 3 Flash │ ✓ 5 tests pass │ ✓ 5 tests pass +GPT-5.2 │ ✓ 5 tests pass │ ✓ 5 tests pass +GPT-5.3-codex │ ✓ 4 tests pass │ ✓ 4 tests pass +GPT-5.4-mini │ ✗ task abandoned │ ✓ 5 tests pass ← THIS PAIR is the finding + │ (no notifications.py + │ no tests.py; + │ 24k tokens spent + │ on design proposals) + +Cycle introductions 16/16 = 0 (clean architecture is self-stabilizing) +Public symbol breaks 16/16 = 0 (no agent removed an existing public name) +``` + +**Key data point**: **GPT-5.4-mini's A variant abandoned the task** +— spent 24,051 tokens describing two design alternatives and asking +"approve this design?" without ever writing code. The B variant of +the **same model** completed the task because the prompt's `REQUIRED +workflow: run aegis_validate.py after every .py file you write` made +file-writing mandatory. Even though Aegis surfaced 0 security +findings on the (clean) starting code, the **ritual itself** kept +the weak model action-oriented. + +--- + +## Chart 3 — The three Aegis ROI mechanisms + +Discovered empirically across the 9 rounds. Only mechanism 1 was the +designer's stated intent. + +```mermaid +flowchart TD + Trigger[Agent edits a file] --> MCP[validate_file MCP call] + MCP --> Findings{Findings emitted?} + + Findings -->|Security: SEC009/010/012/etc.| ROI1[Mechanism 1: rule-hit → fix
Plan B: 0/3 → 3/3 across 3 models] + Findings -->|Workspace: cycle / symbol_removed| ROI2[Mechanism 2: structural guardrail
0/14 hits — dead weight on clean code
but would catch real cycles] + Findings -->|nothing — empty result| ROI3[Mechanism 3: anti-paralysis ritual
Forced write-then-validate cycle
prevents weak-model planning loops] + + ROI1 --> Cascade[Often triggers cascade:
1 finding → agent rewrites whole file
g52: md5 → PBKDF2+salt+210k iter] + ROI3 --> Saved[Plan C: g54mini-mc-a abandoned
g54mini-mc-b completed same task] + + style ROI1 fill:#d4edda,stroke:#155724,color:#000 + style ROI2 fill:#fff3cd,stroke:#856404,color:#000 + style ROI3 fill:#d1ecf1,stroke:#0c5460,color:#000 +``` + +| Mechanism | Trigger | Evidence | Designed in? | +|:---:|---|---|:---:| +| **1. Rule-hit → fix** | brownfield + planted SEC bug | Plan B 3/3 models 0/3 → 3/3 | ✅ | +| **2. Structural guardrail** | cycle / public_symbol_removed | 0/14 hits (clean code = silent) | ✅ | +| **3. Anti-paralysis ritual** | weak model + any task | Plan C g54mini A abandoned vs B completed | ❌ emergent | + +--- + +## Chart 4 — Direct lineage: experiment finding → Aegis code change + +Every recent SEC PR has a specific experiment trigger. The +dogfooding loop in action: + +``` +Round 8 codex Plan A + │ + ▼ FP discovered: SEC010 fires on `secrets.choice` (the SECURE choice) + │ — agent spent a turn "fixing" already-secure code + │ + ▼─────────────────────────► PR #9: secrets./os.urandom/crypto. allowlist + +2 regression tests + + +Round 9 Go brownfield + │ + ▼ FN discovered: SEC009 doesn't fire on Go `md5.Sum(...)` + │ — agent kept md5 because aegis didn't surface it + │ + ▼─────────────────────────► PR #12: SEC009 language-aware dispatch + +8 multi-language tests + + enclosing_security_context + function-name check + + +Round 9 Java brownfield + │ + ▼ FN discovered: SEC010 inner-block `break` hides + │ `int idx = new Random().nextInt(...)` inside + │ `generateSessionToken()` + │ + ▼─────────────────────────► PR #11: enclosing_token_context + walks past inner blocks + + reads function name + +3 regression tests +``` + +| Round | Discovered | Fixed in | +|---|---|---| +| Round 8 codex Plan A | SEC010 false-positive on `secrets.choice` | PR #9 | +| Round 9 Go / Java | SEC009 multi-language coverage = 0 | PR #12 | +| Round 9 Java | SEC010 inner-block `break` hides production case | PR #11 | +| Plan A 32+ runs | SEC010 needles too narrow (URL shorteners) | PR #6 | +| Plan A entropy bypass | SEC002 
misses placeholder-shaped strings | PR #6 | +| Plan B 6 runs | "What aegis is NOT" missing in README | PR #6 | + +PR #6 — #12 (the post-experiment SEC coverage and B-class rule +batches) all traced back to specific findings in this archive. + +--- ## Files - [`comparison-report.md`](comparison-report.md) — the 1199-line - rolling analysis. Round 1 → Round 9 commentary, including the - three Aegis ROI mechanisms surfaced from the data: - 1. **Rule-hit → fix** (brownfield Plan B: 0/3 → 3/3 across 3 models) - 2. **Structural guardrail** (cycle / public_symbol_removed — - dead weight on clean architectures, 0/14 hits) - 3. **Anti-paralysis ritual** (weak models complete tasks they - would otherwise abandon; Round 8 Plan C surfaced this) + rolling Round 1 → Round 9 analysis - `starting-code/` — Plan B Python brownfield fixture (3 planted SEC bugs) - `starting-go/` — Round 9 Go brownfield fixture - `starting-java/` — Round 9 Java brownfield fixture - `starting-multi/` — Plan C 5-module fixture - `prompt-*.txt` — the prompts handed to each agent. `-a.txt` is the no-aegis variant; `-b.txt` adds the `REQUIRED workflow: run - aegis_validate.py after every write` ritual instruction. + aegis_validate.py after every write` ritual instruction - `aegis_validate.py` — Python wrapper around `aegis-mcp` stdio - JSON-RPC. The agents run it after every file write. -- `eval_round_*.sh` — analysis scripts that compare each round's - before/after state against the planted bugs. + JSON-RPC. The agents run it after every file write +- `eval_round_*.sh` — analysis scripts - `--/` — one directory per agent run. - Contains the agent's deliverables plus `run.log` for codex-driven - rounds. 
Naming convention: + Naming convention: - models: `haiku` / `sonnet` / `flash` (Gemini 2.5) / `25flash` / `25fl` (Gemini 2.5 Flash-Lite) / `3flash` / - `31flashlite` / `codex` (GPT-5.4) / `g52` (GPT-5.2) / - `g53codex` (GPT-5.3-codex) / `g54mini` (GPT-5.4-mini) - - tasks: `amb` (Plan A) / `bf` (Plan B) / `mc` (Plan C - multi-module) / `bf-go` / `bf-java` (Round 9) + `31flashlite` (Gemini 3.1 Flash-Lite Preview) / `codex` + (GPT-5.4) / `g52` (GPT-5.2) / `g53codex` (GPT-5.3-codex) / + `g54mini` (GPT-5.4-mini) + - tasks: `amb` (Plan A) / `bf` (Plan B Python) / `mc` (Plan C + multi-module) / `bf-go` (Round 9 Go) / `bf-java` (Round 9 Java) - variants: `a` (no Aegis) / `b` (with Aegis MCP) -## Findings that drove rule changes back into Aegis - -The dogfooding loop: every round caught at least one Aegis -false-positive or false-negative that became a code change. - -| Round | Discovered | Fixed in | -|---|---|---| -| Round 8 codex | SEC010 false-positive on `secrets.choice` | aegis PR #9 | -| Round 9 Go/Java | SEC009 multi-language coverage = 0 | aegis PR #12 | -| Round 9 Java | SEC010 inner-block `break` hides production case | aegis PR #11 | - -PR #6 — #12 (the post-experiment SEC coverage and B-class rule -batches) all traced back to specific false-positives or false- -negatives surfaced in this archive. - ## Reproducing a run ```bash