This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
cc-plugin-eval is a 4-stage evaluation framework for testing Claude Code plugin component triggering. It evaluates whether skills, agents, commands, hooks, and MCP servers correctly activate when expected.
Requirements: Node.js >= 20.0.0, Anthropic API key (in .env as ANTHROPIC_API_KEY)
```bash
# Build & Dev
npm run build            # Compile TypeScript to dist/
npm run dev              # Watch mode

# Lint & Type Check
npm run lint             # ESLint
npm run lint:fix         # Auto-fix
npm run format           # Prettier auto-fix
npm run format:check     # Prettier check only
npm run typecheck        # tsc --noEmit

# Test
npm run test             # All tests (Vitest)
npm run test:watch       # Watch mode
npm run test:coverage    # With coverage
npm run test:ui          # Visual test UI (opens browser)

# Single test file
npx vitest run tests/unit/stages/1-analysis/skill-analyzer.test.ts

# Tests matching pattern
npx vitest run -t "SkillAnalyzer"

# E2E tests (requires API key, costs money)
RUN_E2E_TESTS=true npm test -- tests/e2e/
RUN_E2E_TESTS=true E2E_MAX_COST_USD=2.00 npm test -- tests/e2e/
```

Test behavior: Parallel execution, randomized order, 30s timeout. CI retries failed tests twice.
```bash
npx prettier --check "src/**/*.ts" "*.json" "*.md"
markdownlint "*.md"
uvx yamllint -c .yamllint.yml config.yaml .yamllint.yml
actionlint .github/workflows/*.yml
```

```bash
cc-plugin-eval run -p ./plugin            # Full pipeline
cc-plugin-eval analyze -p ./plugin        # Stage 1 only
cc-plugin-eval generate -p ./plugin       # Stages 1-2
cc-plugin-eval execute -p ./plugin        # Stages 1-3
cc-plugin-eval run -p ./plugin --dry-run  # Cost estimation only
cc-plugin-eval resume -r <run-id>         # Resume interrupted run
cc-plugin-eval run -p ./plugin --fast     # Re-run failed scenarios only
```

| Stage | Purpose | Output |
|---|---|---|
| 1. Analysis | Parse plugin structure, extract triggers | analysis.json |
| 2. Generation | Create test scenarios (LLM for skills/agents, deterministic for commands/hooks/MCP) | scenarios.json |
| 3. Execution | Run scenarios via Claude Agent SDK with tool capture | transcripts/ |
| 4. Evaluation | Programmatic detection first, LLM judge fallback, metrics calculation | evaluation.json |
| Component | File | Main Export |
|---|---|---|
| CLI | `src/index.ts` | Commander program |
| Stage 1 | `src/stages/1-analysis/index.ts` | `runAnalysis()` |
| Stage 2 | `src/stages/2-generation/index.ts` | `runGeneration()` |
| Stage 3 | `src/stages/3-execution/index.ts` | `runExecution()` |
| Stage 4 | `src/stages/4-evaluation/index.ts` | `runEvaluation()` |
| Detection | `src/stages/4-evaluation/programmatic-detector.ts` | `detectAllComponents()` |
| Conflict Tracking | `src/stages/4-evaluation/conflict-tracker.ts` | `calculateConflictSeverity()` |
| Metrics | `src/stages/4-evaluation/metrics.ts` | `calculateEvalMetrics()` |
| State | `src/state/state-manager.ts` | `loadState()`, `saveState()` |
This project has MCP tools configured for efficient code exploration and editing.
Use this decision tree to pick the right tool:
| Task | Tool | Why |
|---|---|---|
| Edit any file | `edit_file` (Morph) | 10,500 tok/s, 98% accuracy, partial snippets |
| Find symbol by exact name | `find_symbol` (Serena) | LSP-powered precision, no false positives |
| Find all callers/usages | `find_referencing_symbols` (Serena) | Semantic analysis of symbol relationships |
| Explore unfamiliar code | `warpgrep_codebase_search` (Morph) | Autonomous sub-agent, parallel search |
| Understand file structure | `get_symbols_overview` (Serena) | Quick overview without reading full file |
| Rename across codebase | `rename_symbol` (Serena) | Safe refactoring with LSP |
| Exact text pattern | `rg` or Grep | Fast literal/regex matching |
| Edit a complete method | `replace_symbol_body` (Serena) | When replacing entire function body |
When to use Morph tools:

- `edit_file`: ALL file edits (faster than search-and-replace, handles partial context)
- `warpgrep_codebase_search`: Exploring code you don't know ("how does X work?", "where is Y handled?")
When to use Serena tools:

- `find_symbol`: You know the symbol name and want its location/body
- `find_referencing_symbols`: Understanding call sites before refactoring
- `replace_symbol_body`: Replacing a complete function/method (cleaner than `edit_file` for whole symbols)
- `get_symbols_overview`: Understanding a file's structure without reading it
```typescript
// ALWAYS include the instruction parameter - it disambiguates edits
edit_file({
  path: "/abs/path/to/file.ts",
  instruction: "Add timeout to fetch call", // Required for clarity
  code_edit: `
export async function fetchData(url: string) {
  // ... existing code ...
  const response = await fetch(url, {
    headers,
    timeout: 5000 // added timeout
  });
  // ... existing code ...
}
`,
});
```

Key patterns:
- Use `// ... existing code ...` with hints: `// ... keep validation logic ...`
- Batch all edits to the same file in one call
- Preserve exact indentation from the original file
- For deletions: show 1-2 context lines and omit the deleted code
- Use `dryRun: true` to preview changes without applying
| Scenario | Use This |
|---|---|
| "Where is the auth flow?" | `warpgrep_codebase_search` |
| "Find the runEvaluation function" | `find_symbol` |
| "What calls detectFromCaptures?" | `find_referencing_symbols` |
| "How does conflict detection work?" | `warpgrep_codebase_search` |
| "Show me all exports in types/index.ts" | `get_symbols_overview` |
Understanding a stage: Use get_symbols_overview on the stage's index.ts, then find_referencing_symbols on the main export to see how it integrates with the pipeline.
Refactoring types: Use find_referencing_symbols on a type from src/types/ to find all usages before making changes.
Tracing detection logic: The detection flow is detectAllComponents → detectFromCaptures / detectFromTranscript → type-specific detectors. Agent detection uses SubagentStart/SubagentStop hooks. Use find_symbol to navigate this chain.
Adding a new component type: Follow the type through all four stages using find_referencing_symbols on similar component types (e.g., trace how hooks is handled to understand where to add mcp_servers).
Symbol-First Philosophy: Never read entire source files when you can use symbolic tools. Use get_symbols_overview to understand file structure, then find_symbol with include_body=true only for the specific symbols you need.
Name Path Patterns: Serena uses hierarchical name paths like `ClassName/methodName`. Examples:

- `runEvaluation` - Matches any symbol named `runEvaluation`
- `ProgrammaticDetector/detectFromCaptures` - Matches method in class
- `/ClassName/method` - Absolute path (exact match required)
- Use `substring_matching=true` for partial matches: `detect` finds `detectFromCaptures`, `detectAllComponents`
Key Parameters:
| Parameter | Use Case |
|---|---|
| `depth=1` | Get class methods: `find_symbol("ClassName", depth=1)` |
| `include_body=true` | Get actual code (use sparingly) |
| `relative_path` | Restrict search scope for speed |
| `restrict_search_to_code_files` | In `search_for_pattern`, limits to TypeScript files |
Non-Code File Search: Use search_for_pattern (not find_symbol) for YAML, JSON, markdown:
`search_for_pattern("pattern", paths_include_glob="*.json")`
Serena Memories: This project has pre-built memories in .serena/memories/. Read relevant ones before major changes:
| Memory | When to Read |
|---|---|
| `architecture_decisions` | Before changing detection logic or pipeline structure |
| `testing_patterns` | Before writing tests |
| `code_style` | Before writing new code |
Thinking Tools: Use Serena's thinking tools at key points:
- `think_about_collected_information` - After searching, before acting
- `think_about_task_adherence` - Before making edits
- `think_about_whether_you_are_done` - Before completing a task
```text
src/
├── index.ts              # CLI entry point (env.ts MUST be first import)
├── env.ts                # Environment setup (dotenv loading)
├── config/               # Configuration loading with Zod validation
│   ├── defaults.ts       # Default configuration values
│   ├── loader.ts         # YAML/JSON config loading
│   ├── pricing.ts        # Model pricing for cost estimation
│   └── schema.ts         # Zod validation schemas
├── stages/
│   ├── 1-analysis/       # Plugin parsing, trigger extraction
│   ├── 2-generation/     # Scenario generation (LLM + deterministic)
│   ├── 3-execution/      # Agent SDK integration, tool capture
│   └── 4-evaluation/     # Programmatic detection, LLM judge, metrics
├── state/                # Resume capability, checkpointing
├── types/                # TypeScript interfaces
└── utils/                # Retry, concurrency, logging utilities

tests/
├── unit/                 # Unit tests (mirror src/ structure)
├── integration/          # Integration tests for full stages
├── e2e/                  # End-to-end tests (real SDK calls)
├── mocks/                # Mock implementations for testing
└── fixtures/             # Test data and mock plugins
```
When adding support for a new plugin component type (e.g., a new kind of trigger):
- Define types in `src/types/`
- Create analyzer in `src/stages/1-analysis/`
- Create scenario generator in `src/stages/2-generation/`
- Extend detection in `src/stages/4-evaluation/programmatic-detector.ts`
- Update `AnalysisOutput` in `src/types/state.ts`
- Add to pipeline in `src/stages/{1,2,4}-*/index.ts`
- Add state migration in `src/state/state-manager.ts` (provide defaults for legacy state)
- Add tests
When adding new component types, update `migrateState()` in `src/state/state-manager.ts` to provide defaults (e.g., `hooks: legacyComponents.hooks ?? []`) so existing state files remain compatible.
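The defaulting pattern above can be sketched as follows. This is an illustrative example only: the interfaces and the `migrateComponents` helper are hypothetical shapes, not the actual `state-manager.ts` code.

```typescript
// Hypothetical legacy state shape: newer component fields may be absent
// in state files written by older versions.
interface LegacyComponents {
  skills: string[];
  agents: string[];
  hooks?: string[];        // added in a later version
  mcp_servers?: string[];  // added in a later version
}

interface Components {
  skills: string[];
  agents: string[];
  hooks: string[];
  mcp_servers: string[];
}

function migrateComponents(legacy: LegacyComponents): Components {
  return {
    skills: legacy.skills,
    agents: legacy.agents,
    // Default newer fields so pre-existing state files stay loadable.
    hooks: legacy.hooks ?? [],
    mcp_servers: legacy.mcp_servers ?? [],
  };
}
```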
The CLI uses a handler map in src/index.ts for stage-based resume. State files are stored at results/<plugin-name>/<run-id>/state.json.
Enable: `scope.hooks: true`

Hooks use the `EventType::Matcher` format (e.g., `"PreToolUse::Write|Edit"`). Detection happens via `SDKHookResponseMessage` events with 100% confidence. Scenarios are generated deterministically via tool-to-prompt mapping.
Limitation: Session lifecycle hooks (SessionStart, SessionEnd) fire once per session.
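A matcher like `"PreToolUse::Write|Edit"` splits into an event name and a tool-name alternation. The sketch below shows one plausible way to interpret it; `parseHookSpec` and `hookApplies` are illustrative helpers, not exports of this codebase.

```typescript
interface HookSpec {
  event: string;       // e.g. "PreToolUse"
  toolPattern: RegExp; // e.g. /^(Write|Edit)$/
}

// Split "EventType::Matcher" into its two halves and anchor the
// tool alternation so "Write|Edit" matches exactly one tool name.
function parseHookSpec(spec: string): HookSpec {
  const [event, matcher] = spec.split("::");
  return { event, toolPattern: new RegExp(`^(${matcher})$`) };
}

function hookApplies(spec: HookSpec, event: string, tool: string): boolean {
  return spec.event === event && spec.toolPattern.test(tool);
}
```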
Enable: `scope.mcp_servers: true`

Tools are detected via the pattern `mcp__<server>__<tool>`. Scenarios are generated deterministically (zero LLM cost). The SDK auto-connects to servers defined in `.mcp.json`.
Limitation: Tool schemas are not validated.
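Splitting the `mcp__<server>__<tool>` pattern can be sketched like this; `parseMcpToolName` is a hypothetical helper used only to illustrate the naming convention.

```typescript
// Split an MCP tool name into server and tool parts, using the first
// "__" after the "mcp__" prefix as the delimiter. Returns null for
// names that are not MCP tools (e.g. built-in tools like "Write").
function parseMcpToolName(name: string): { server: string; tool: string } | null {
  if (!name.startsWith("mcp__")) return null;
  const rest = name.slice("mcp__".length);
  const sep = rest.indexOf("__");
  if (sep === -1) return null;
  return { server: rest.slice(0, sep), tool: rest.slice(sep + 2) };
}
```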
Use cause chains for error context. See src/config/loader.ts:ConfigLoadError for the pattern.
Use type guards for tool detection in src/stages/4-evaluation/programmatic-detector.ts. Examples include isSkillInput() and isTaskInput().
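The type-guard style looks roughly like the sketch below. The field name `skill_name` is illustrative; check the real shapes in `src/types/` before relying on them.

```typescript
interface SkillInput {
  skill_name: string;
}

// Narrow an unknown tool-input payload to SkillInput. Returning
// `value is SkillInput` lets TypeScript narrow the type at call sites.
function isSkillInput(value: unknown): value is SkillInput {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { skill_name?: unknown }).skill_name === "string"
  );
}
```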
Use src/utils/concurrency.ts for controlled parallel execution with progress callbacks. The utility handles error aggregation and respects concurrency limits.
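A compact worker-pool limiter illustrates the pattern; this is a minimal sketch, not the actual `src/utils/concurrency.ts` API.

```typescript
// Map over items with at most `limit` tasks in flight, preserving
// input order in the results and reporting progress as tasks finish.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
  onProgress?: (done: number, total: number) => void,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  let done = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index (single-threaded, so safe)
      results[i] = await fn(items[i]);
      onProgress?.(++done, items.length);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```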
Use src/utils/retry.ts for API calls. It implements exponential backoff with configurable max attempts and handles transient failures gracefully.
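The backoff pattern can be sketched as below; the real `src/utils/retry.ts` likely differs in signature and options.

```typescript
// Retry an async operation with exponential backoff plus small jitter.
// Delays grow as baseDelayMs * 2^attempt; the last failure is rethrown.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```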
All configuration uses Zod schemas in src/config/. The loader validates at runtime and provides clear error messages for invalid configuration.
Unit tests live in tests/unit/ and mirror the src/ structure. They use Vitest with vi.mock() for dependencies.
Integration tests in tests/integration/ test full stage execution with real fixtures but mocked LLM calls.
E2E tests in tests/e2e/ make real API calls and cost money. They are skipped by default and enabled via RUN_E2E_TESTS=true. Budget limits are enforced via E2E_MAX_COST_USD.
Test fixtures live in tests/fixtures/. Sample transcripts are in tests/fixtures/sample-transcripts/. Mock plugins are in tests/fixtures/valid-plugin/.
The project uses GitHub Actions for CI. Key workflows:
| Workflow | Purpose |
|---|---|
| `ci.yml` | Build, lint, typecheck, test on PR and push |
| `ci-failure-analysis.yml` | AI analysis of CI failures |
| `claude-pr-review.yml` | AI-powered code review on PRs |
| `claude-issue-analysis.yml` | AI-powered issue analysis |
| `claude.yml` | Claude Code interactive workflow |
| `semantic-labeler.yml` | Auto-label issues and PRs based on content |
| `markdownlint.yml` | Markdown linting |
| `yaml-lint.yml` | YAML linting |
| `validate-workflows.yml` | Validate GitHub Actions workflows with actionlint |
| `links.yml` | Check for broken links in documentation |
| `sync-labels.yml` | Sync repository labels from labels.yml |
| `stale.yml` | Mark and close stale issues/PRs |
| `greet.yml` | Welcome new contributors |
CI runs tests in parallel with randomized order. Failed tests are retried twice before marking as failed.
Use GraphQL mutations to set up issue dependencies (blocked by / blocks relationships).
Get issue node IDs:
```bash
gh issue list --state open --json number,id | jq -r '.[] | "\(.number)\t\(.id)"'
```

Add a blocking relationship (`issueId` is blocked by `blockingIssueId`):

```bash
gh api graphql -f query='
mutation {
  addBlockedBy(input: {
    issueId: "I_kwDO...",
    blockingIssueId: "I_kwDO..."
  }) {
    issue { number title }
    blockingIssue { number title }
  }
}'
```

Remove a blocking relationship:

```bash
gh api graphql -f query='
mutation {
  removeBlockedBy(input: {
    issueId: "I_kwDO...",
    blockingIssueId: "I_kwDO..."
  }) {
    issue { number title }
    blockingIssue { number title }
  }
}'
```

Example: To make #205 block #207 (meaning #207 is blocked by #205):

- `issueId` = #207's node ID (the blocked issue)
- `blockingIssueId` = #205's node ID (the blocking issue)