This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
GitHub: https://github.com/JoshuaOliphant/reading-tracker
```bash
# Install dependencies
uv sync

# Create .env file with your API key (required for app and LLM-graded evals)
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Run the application (must run from repo root — data/ path is relative)
uv run uvicorn app.main:app --reload

# Run all tests
uv run pytest

# Run message passing evaluation tests
uv run pytest evals/test_message_passing.py -v -s
```

This is a hexagonal agent application implementing a multi-agent system with message passing between AI agents acting as "prompt objects."
```
Browser (HTMX) → FastAPI (/agent) → SavedViewsManager (fast path)
                                  → AgentRouter (slow path) → UI Agent
                                                            → Recommender Agent
                                                            → Insights Agent
```
- Fast path: `SavedViewsManager.find_matching_view()` checks for cached user-saved views
- Slow path: `AgentRouter.process_user_message()` routes to the UI agent, which may delegate to specialists
- AgentRouter (`router.py`): Coordinates agents, routes inter-agent messages, maintains a MessageLog for debugging
- BaseAgent (`base_agent.py`): Abstract base providing the `message_agent` tool for inter-agent communication
- UIAgent (`ui_agent.py`): Handles user interaction, generates HTML, coordinates specialists
- RecommenderAgent (`recommender_agent.py`): Book recommendations based on reading history
- InsightsAgent (`insights_agent.py`): Analyzes reading patterns and behavior
Agents communicate via semantic messages, not method calls. The `message_agent` tool routes through the router:
```python
await router.route_agent_message(from_agent="ui", to_agent="recommender", message="...")
```

MCP tool definitions for book CRUD operations. Each tool returns structured JSON that agents transform into HTML. Tools: `list_books`, `get_book`, `create_book`, `update_book`, `delete_book`, `search_books`, `get_stats`.
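As a rough sketch of how this message routing could work (all class names below are hypothetical stand-ins; only `route_agent_message` appears in the codebase):

```python
import asyncio

class MessageLog:
    """Records inter-agent messages (the real log backs /debug/messages)."""
    def __init__(self):
        self.entries = []

    def record(self, from_agent, to_agent, message):
        self.entries.append({"from": from_agent, "to": to_agent, "message": message})

class SketchRouter:
    """Minimal router: log the message, look up the target agent, forward."""
    def __init__(self, agents):
        self.agents = agents              # e.g. {"ui": ..., "recommender": ...}
        self.message_log = MessageLog()

    async def route_agent_message(self, from_agent, to_agent, message):
        self.message_log.record(from_agent, to_agent, message)
        return await self.agents[to_agent].handle(message)

class EchoAgent:
    """Stand-in for a specialist agent."""
    async def handle(self, message):
        return f"recommender saw: {message}"

async def demo():
    router = SketchRouter({"recommender": EchoAgent()})
    return await router.route_agent_message("ui", "recommender", "suggest a sci-fi book")

print(asyncio.run(demo()))  # recommender saw: suggest a sci-fi book
```

The point of routing through the router rather than calling agents directly is that every inter-agent message passes one choke point, so the MessageLog sees all traffic for debugging.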
Markdown files defining agent personalities and UI generation rules:
- `ui.md`: HTML component patterns, HTMX integration, design system
- `recommender.md`: Recommendation strategy and tone
- `insights.md`: Pattern analysis approach
Progressive UI caching system. Users can save agent-generated views for instant loading:
- Exact phrase match (highest priority) or keyword match (fallback)
- Dynamic views inject fresh data on load via `{{placeholder}}` syntax
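A minimal sketch of both behaviors, assuming a simple list-of-dicts view store (function names and the view schema here are illustrative, not the real implementation):

```python
import re

def find_matching_view(saved_views, user_message):
    """Two-tier match: exact phrase first, then keyword overlap as fallback."""
    msg = user_message.lower().strip()
    for view in saved_views:                          # exact phrase (highest priority)
        if view["phrase"].lower() == msg:
            return view
    msg_words = set(msg.split())
    for view in saved_views:                          # keyword fallback
        if set(view.get("keywords", [])) & msg_words:
            return view
    return None

def render_view(html, data):
    """Replace {{placeholder}} tokens with fresh data at load time."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(data.get(m.group(1), "")), html)

views = [{"phrase": "show my stats", "keywords": ["stats"],
          "html": "<p>{{total}} books</p>"}]
view = find_matching_view(views, "show my stats")
print(render_view(view["html"], {"total": 12}))  # <p>12 books</p>
```

Because placeholders are resolved at load time, a saved view stays "live": the cached HTML shell is instant, but the numbers inside it reflect the current database.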
FastAPI endpoints with HTMX integration:
- `GET /` - Main page with BASE_TEMPLATE shell
- `POST /agent` - Message processing (checks saved views first)
- `GET /views/list`, `POST /views/save`, `GET /views/load/{id}` - View management
- `GET /debug/messages`, `GET /debug/agents`, `GET /debug/views` - Debug endpoints
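The saved-views-first dispatch in `POST /agent` can be sketched as follows (the handler name and the stub components are assumptions; only the fast/slow split comes from this document):

```python
import asyncio

async def handle_agent_message(message, saved_views_manager, router):
    """Sketch of POST /agent dispatch: saved-view fast path, agent slow path."""
    view = saved_views_manager.find_matching_view(message)
    if view is not None:
        return view["html"]                            # fast path: no LLM call
    return await router.process_user_message(message)  # slow path: UI agent

# Tiny stand-ins to show the control flow
class StubViews:
    def find_matching_view(self, message):
        return {"html": "<p>cached</p>"} if message == "show my books" else None

class StubRouter:
    async def process_user_message(self, message):
        return f"<p>agent handled: {message}</p>"

print(asyncio.run(handle_agent_message("show my books", StubViews(), StubRouter())))
print(asyncio.run(handle_agent_message("something new", StubViews(), StubRouter())))
```

The first call returns cached HTML without touching an agent; the second falls through to the router.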
SQLite database and JSON in data/:
- `data/reading_list.db` - SQLite database for books (ACID-compliant persistence)
- `data/saved_views.json` - Cached UI views
Comprehensive eval suite following Anthropic's eval guide.
```bash
# All evals (~20min with agent tests, or fast subset below)
uv run pytest evals/ -v

# Grader and transcript unit tests (fast, no LLM calls)
uv run pytest evals/test_graders.py evals/test_transcript.py -v

# Individual eval suites (see Eval Architecture table for file purposes)
uv run pytest evals/test_<suite>.py -v -s

# LLM grader calibration (standalone script, not pytest)
uv run python evals/calibrate_llm_grader.py

# Enable transcript capture to disk
CAPTURE_TRANSCRIPTS=1 uv run pytest evals/ -v -s
```

| Component | File(s) | Purpose |
|---|---|---|
| Transcript Capture | `evals/transcript.py` | Records tool calls, agent messages, DB state before/after, timing |
| Reusable Graders | `evals/graders.py` | StateCheck, ToolWasCalled, ToolNotCalled, HTMLContains, PartialCredit, CompositeGrader |
| Declarative Cases | `evals/datasets/crud_cases.py` | Case and Dataset classes for data-driven test definitions |
| LLM Grading | `evals/test_llm_graded.py` | Claude-based subjective quality assessment with calibration examples |
| Calibration | `evals/calibrate_llm_grader.py` | Validates LLM grader against 10 hand-labeled examples (target: >90% agreement) |
| Shared Fixtures | `evals/conftest.py` | DB isolation, router instances, transcript capture, .env loading |
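To make the declarative-case and grader pieces concrete, here is a hedged sketch of how a case might combine graders (the field names, grader constructors, and uniform averaging are assumptions; the real classes in `evals/graders.py` and `evals/datasets/crud_cases.py` may differ):

```python
from dataclasses import dataclass

@dataclass
class Case:
    """A data-driven eval case: a prompt plus the graders that judge the outcome."""
    prompt: str
    graders: list

class ToolWasCalled:
    """Passes only if the named tool actually appears in the transcript."""
    def __init__(self, tool_name):
        self.tool_name = tool_name
    def grade(self, transcript):
        return 1.0 if self.tool_name in transcript["tool_calls"] else 0.0

class StateCheck:
    """Passes only if a predicate holds on the post-run database state."""
    def __init__(self, predicate):
        self.predicate = predicate
    def grade(self, transcript):
        return 1.0 if self.predicate(transcript["db_after"]) else 0.0

def composite_grade(case, transcript):
    """Average the sub-grader scores (a real CompositeGrader might weight them)."""
    scores = [g.grade(transcript) for g in case.graders]
    return sum(scores) / len(scores)

case = Case("add Dune by Frank Herbert",
            [ToolWasCalled("create_book"), StateCheck(lambda db: "Dune" in db)])
transcript = {"tool_calls": ["create_book"], "db_after": ["Dune"]}
print(composite_grade(case, transcript))  # 1.0
```

Note how both graders inspect the captured transcript, never the agent's own claims: that is the "state-based outcomes" principle below in miniature.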
- State-based outcomes: Verify database state, not agent UI claims
- Tool usage verification: Assert tools are actually called, not just plausible UI generated
- pass@k / pass^k: Reliability metrics across k trials (configurable via `EVAL_TRIALS` env var)
- Negative testing: Verify the agent asks for clarification, doesn't hallucinate, and doesn't expose internals
- Partial credit: Multi-step tasks scored 0.0-1.0 with weighted steps, not just pass/fail
- LLM grading: Subjective quality (tone, helpfulness, accessibility) graded by Claude with calibrated rubrics
- Per-test isolation: Fresh database state per test via `clean_db`/`seeded_db` fixtures
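The pass@k and pass^k metrics above can be sketched in a few lines (function names are illustrative; the real suite's implementation may differ):

```python
def pass_at_k(trial_results):
    """pass@k: the case passes if ANY of its k trials passed (optimistic)."""
    return 1.0 if any(trial_results) else 0.0

def pass_hat_k(trial_results):
    """pass^k: the case passes only if ALL k trials passed (reliability)."""
    return 1.0 if all(trial_results) else 0.0

def aggregate(metric, cases):
    """Mean of a per-case metric over the whole suite."""
    return sum(metric(trials) for trials in cases) / len(cases)

cases = [[True, True, False], [True, True, True]]  # two cases, k=3 trials each
print(aggregate(pass_at_k, cases))   # 1.0
print(aggregate(pass_hat_k, cases))  # 0.5
```

The gap between the two numbers is the useful signal: a high pass@k with a low pass^k means the agent can do the task but not dependably.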
- All code files start with 2-line ABOUTME comments explaining the file's purpose
- Agents output raw HTML (skill files enforce "never use markdown code fences")
- HTMX attributes (`hx-post`, `hx-target`, `hx-vals`, `hx-indicator`) drive UI interactions
- Tool responses are structured JSON; agents transform data into HTML presentation
- Async-first: All database and agent code uses `async`/`await`; tests use `@pytest.mark.asyncio` (no auto mode configured)
- CWD matters: `DATABASE_PATH = Path("data/reading_list.db")` is relative — app and tests must run from repo root
- `python-dotenv` is a transitive dep: Used in `app/main.py` and `evals/conftest.py` but comes through `uvicorn[standard]`, not declared directly. If the dep chain changes, `load_dotenv()` will break silently
- LLM-graded evals fail silently without an API key: `llm_grade()` returns a default pass when `ANTHROPIC_API_KEY` is missing — tests appear green but skip actual grading. Always check that `.env` is loaded
- State isolation in agent evals: Each eval case needs a fresh `AgentRouter()` and clean DB state — agents retain conversational context across calls to `process_user_message()`
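The state-isolation gotcha above can be honored with a small helper like this (the function and its schema are a hypothetical sketch, not the real `clean_db`/`seeded_db` fixtures; `make_router` stands in for the `AgentRouter` constructor):

```python
import os
import sqlite3
import tempfile

def fresh_eval_context(make_router):
    """Per-case isolation: a throwaway SQLite file plus a brand-new router.
    make_router is any zero-arg factory; the real suite would pass AgentRouter."""
    fd, db_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
    conn.commit()
    conn.close()
    return db_path, make_router()  # never reuse a router: it keeps chat history
```

Each call yields a distinct database file and a distinct router instance, so no conversational context or rows leak between eval cases.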