Verify whether AI coding agents actually follow the instruction files they're given.
Every AI coding agent reads an instruction file. None of them prove they followed it.
You write CLAUDE.md or AGENTS.md with specific rules: camelCase variables, no any types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.
RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Binary pass/fail, with file paths and line numbers as evidence. No LLM evaluation, no judgment calls. Deterministic and reproducible.
```shell
npm install -g ruleprobe
```

Or run it directly:
```shell
npx ruleprobe --help
```

Note: The examples below reflect the current development HEAD (53 matchers, 9 categories). The published npm v0.1.0 shipped with 15 matchers. A new release will follow.
Parse an instruction file to see what rules RuleProbe can extract. This is real output from parsing the repo's included example instruction file:
```shell
ruleprobe parse docs/example-instructions.md
```

Extracted 32 rules:

```text
forbidden-no-any-type-2
  Category: forbidden-pattern
  Verifier: ast
  Pattern: no-any (*.ts)
  Source: "- No any types anywhere in the codebase"

error-no-empty-catch-6
  Category: error-handling
  Verifier: ast
  Pattern: no-empty-catch (*.ts)
  Source: "- No empty catch blocks; always handle or rethrow errors"

naming-kebab-case-files-17
  Category: naming
  Verifier: filesystem
  Pattern: kebab-case (filenames)
  Source: "- File names: kebab-case (e.g., user-service.ts, api-handler.ts)"

dependency-pinned-versions-34
  Category: dependency
  Verifier: filesystem
  Pattern: pinned-dependencies (package.json)
  Source: "- All dependencies pinned to exact versions, no ^ or ~ ranges"

...
```
Verify agent output against those rules. This is ruleprobe verifying its own source code:
```shell
ruleprobe verify docs/example-instructions.md ./src --format text
```

```text
RuleProbe Adherence Report
Agent: unknown | Model: unknown | Task: manual
Rules: 32 total | 23 passed | 9 failed | Score: 72%

FAIL error-handling/error-no-empty-catch-6
  commands/run.ts:148 - found: empty catch block
  utils/safe-path.ts:116 - found: empty catch block
  verifier/ast-verifier.ts:248 - found: empty catch block
PASS forbidden-pattern/forbidden-no-any-type-2
PASS structure/structure-strict-mode-1
PASS structure/structure-named-exports-only-3
PASS naming/naming-kebab-case-files-17
FAIL naming/naming-camelcase-variables-18
  verifier/treesitter-loader.ts:75 - found: ParserCtor
  verifier/treesitter-loader.ts:76 - found: LanguageRef
PASS naming/naming-pascalcase-types-20
PASS test-requirement/test-files-exist-25
FAIL structure/structure-no-barrel-files-24
  ast-checks/index.ts:5 - found: barrel file with 24 re-exports
  llm/index.ts:7 - found: barrel file with 9 re-exports
PASS import-pattern/import-no-path-aliases-28
PASS forbidden-pattern/forbidden-no-console-log-4
PASS structure/structure-max-file-length-22
PASS structure/structure-jsdoc-required-21
PASS dependency/dependency-pinned-versions-34
...

By Category:
  naming: 2/4 (50%)
  forbidden-pattern: 4/4 (100%)
  structure: 4/5 (80%)
  import-pattern: 4/4 (100%)
  test-requirement: 2/2 (100%)
  error-handling: 1/2 (50%)
  type-safety: 2/4 (50%)
  code-style: 2/5 (40%)
  dependency: 2/2 (100%)
```
Every failure includes the file, line number, and what was found. No ambiguity.
Parse. Reads 6 instruction file formats (CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, .windsurfrules) and extracts rules that can be checked mechanically. Subjective instructions like "write clean code" are reported as unparseable so you know what was skipped.
Verify. Runs each extracted rule against a directory of agent-generated code. Checks use AST parsing via ts-morph, file system inspection, and regex pattern matching. No LLM evaluation at any stage by default; results are deterministic and identical across runs.
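The determinism claim comes down to plain string, filesystem, and AST checks: same input, same result, every run. As an illustration only (this is a sketch, not RuleProbe's actual implementation), a regex-backed forbidden-pattern check that produces file-and-line evidence fits in a few lines of TypeScript:

```typescript
// Sketch of a deterministic regex check: scan source lines for a
// forbidden pattern and report each hit with file, line, and match.
// Names and shapes here are illustrative, not RuleProbe's real API.
interface Violation {
  file: string;
  line: number; // 1-based line number
  found: string; // the matched text, used as evidence
}

function checkForbiddenPattern(
  file: string,
  source: string,
  pattern: RegExp,
): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, i) => {
    const match = text.match(pattern);
    if (match) {
      violations.push({ file, line: i + 1, found: match[0] });
    }
  });
  return violations;
}

const result = checkForbiddenPattern(
  "src/app.ts",
  "const x = 1;\nconsole.log(x);\n",
  /console\.log/,
);
console.log(result); // [{ file: "src/app.ts", line: 2, found: "console.log" }]
```

Because the check is a pure function of the source text, two runs over the same directory can never disagree, which is what makes the pass/fail verdicts reproducible.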
LLM Extract (opt-in). Pass --llm-extract to send unparseable lines through an OpenAI-compatible API for a second extraction pass. LLM-extracted rules are labeled with extractionMethod: 'llm' and confidence: 'medium', and default to warning severity. Requires OPENAI_API_KEY env var. No LLM dependency is installed by default.
Compare. Point RuleProbe at outputs from two or more agents and get a side-by-side comparison table showing which rules each one followed. Useful for evaluating agents on the same task, or tracking adherence over time.
GitHub Action. Ships as a composite action you can drop into any repo. Runs ruleprobe verify on every PR, posts results as a comment, and optionally outputs reviewdog rdjson format for inline annotations. No API keys needed beyond GITHUB_TOKEN.
RuleProbe auto-discovers a config file in the working directory (or any parent). You can also pass --config <path> explicitly. Supported file names, in priority order:
1. `ruleprobe.config.ts`
2. `ruleprobe.config.js`
3. `ruleprobe.config.json`
4. `.ruleproberc.json`
A config file lets you add custom rules, override extracted rules, or exclude rules entirely:
```typescript
// ruleprobe.config.ts
import { defineConfig } from 'ruleprobe';

export default defineConfig({
  // Add rules that the parser can't extract from your instruction file
  rules: [
    {
      id: 'custom-no-lodash',
      category: 'import-pattern',
      description: 'Ban lodash imports',
      verifier: 'regex',
      pattern: { type: 'banned-import', target: '*.ts', expected: 'lodash', scope: 'file' },
    },
  ],
  // Change severity or expected values on extracted rules
  overrides: [
    { ruleId: 'naming-camelcase', severity: 'warning' },
    { ruleId: 'structure-max-file-length', expected: '500' },
  ],
  // Remove rules you don't want checked
  exclude: ['forbidden-no-console-log'],
});
```

`defineConfig()` is a no-op passthrough that provides type checking in TypeScript configs. JSON configs work without it.
Custom rules use the same verifier types (ast, regex, filesystem) and pattern types as extracted rules. Any pattern type listed in the Supported Rule Types table works as a custom rule pattern.
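For projects that prefer JSON, the same configuration can be expressed in `.ruleproberc.json`. This is a sketch mirroring the TypeScript example above, assuming the JSON shape matches the `defineConfig` argument one-to-one:

```json
{
  "rules": [
    {
      "id": "custom-no-lodash",
      "category": "import-pattern",
      "description": "Ban lodash imports",
      "verifier": "regex",
      "pattern": { "type": "banned-import", "target": "*.ts", "expected": "lodash", "scope": "file" }
    }
  ],
  "overrides": [{ "ruleId": "naming-camelcase", "severity": "warning" }],
  "exclude": ["forbidden-no-console-log"]
}
```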
Extract rules from an instruction file.
```shell
ruleprobe parse CLAUDE.md --format json
ruleprobe parse AGENTS.md --show-unparseable
ruleprobe parse AGENTS.md --llm-extract --show-unparseable
```

`--format json|text` controls output format. `--show-unparseable` includes lines that couldn't be converted to rules. `--llm-extract` sends unparseable lines to an OpenAI-compatible API for additional extraction (requires `OPENAI_API_KEY`).
Check agent output against extracted rules.
```shell
ruleprobe verify CLAUDE.md ./output --format text
ruleprobe verify AGENTS.md ./output --agent claude --model opus-4 --format json --output report.json
ruleprobe verify AGENTS.md ./output --format markdown --severity error
ruleprobe verify AGENTS.md ./output --format rdjson
ruleprobe verify AGENTS.md ./output --config ruleprobe.config.ts
ruleprobe verify AGENTS.md ./output --llm-extract
ruleprobe verify AGENTS.md ./output --rubric-decompose
ruleprobe verify AGENTS.md ./output --project tsconfig.json
```

`--agent` and `--model` tag the report metadata. `--severity error|warning|all` filters results. `--output` writes to a file instead of stdout. `--format rdjson` produces reviewdog-compatible diagnostics. `--config` loads a specific config file (otherwise auto-discovered). `--llm-extract` runs unparseable lines through an LLM for additional rule extraction. `--rubric-decompose` uses an LLM to break subjective instructions into weighted concrete checks (tagged with `extractionMethod: 'rubric'` and `confidence: 'low'`). Both `--llm-extract` and `--rubric-decompose` require `OPENAI_API_KEY`. `--project` enables type-aware AST checks (implicit any, unused exports, unresolved imports) using the specified `tsconfig.json`.
Exit codes: `0` all rules passed, `1` violations found, `2` execution error.
Compare multiple agent outputs against the same rules.
```shell
ruleprobe compare AGENTS.md ./claude-output ./copilot-output --agents claude,copilot --format markdown
```

List available task templates or output a specific task prompt. Three templates ship with v0.1.0: rest-endpoint, utility-module, react-component.
```shell
ruleprobe tasks
ruleprobe task rest-endpoint
```

Invoke an AI agent on a task template, verify the output, and print the report in one step. Requires `@anthropic-ai/claude-agent-sdk` and `ANTHROPIC_API_KEY` for SDK mode. Alternatively, use `--watch` to point at a directory where you (or another agent) will write output manually.
```shell
# SDK mode: invoke Claude, verify, report
ruleprobe run CLAUDE.md --task rest-endpoint --agent claude-code --model sonnet --format text

# Watch mode: wait for output in a directory, then verify
ruleprobe run CLAUDE.md --watch ./agent-output --timeout 300 --format json
```

Options: `--task`, `--agent`, `--model`, `--format`, `--output-dir`, `--watch`, `--timeout`, `--allow-symlinks`, `--config`.
Drop this into `.github/workflows/ruleprobe.yml`:

```yaml
name: RuleProbe
on: [pull_request]
jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/ruleprobe@v1
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

That's it. No API keys, no LLM calls, deterministic results, runs in seconds.

Note: `@v1` tracks the latest v1.x release. Pin to a specific tag (e.g., `@v1.0.0`) for reproducible builds.
Full options
```yaml
- uses: moonrunnerkc/ruleprobe@v1
  with:
    instruction-file: AGENTS.md
    output-dir: src
    agent: ci
    model: unknown
    format: text
    severity: all
    fail-on-violation: "true"
    post-comment: "true"
    reviewdog-format: "false"
```

| Input | Default | Description |
|---|---|---|
| `instruction-file` | (required) | Path to instruction file |
| `output-dir` | `src` | Directory containing code to verify |
| `agent` | `ci` | Agent identifier for report metadata |
| `model` | `unknown` | Model identifier for report metadata |
| `format` | `text` | Report format: text, json, or markdown |
| `severity` | `all` | Filter: error, warning, or all |
| `fail-on-violation` | `true` | Fail the check on any violation |
| `post-comment` | `true` | Post results as a PR comment |
| `reviewdog-format` | `false` | Also output rdjson for reviewdog |
Outputs: `score`, `passed`, `failed`, `total` (available to downstream steps).
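Those outputs can drive later workflow steps. A sketch of wiring them into a follow-up step (the step `id` and step name here are illustrative):

```yaml
steps:
  - uses: moonrunnerkc/ruleprobe@v1
    id: ruleprobe
    with:
      instruction-file: AGENTS.md
      output-dir: src
  - name: Print adherence score
    run: >
      echo "Score ${{ steps.ruleprobe.outputs.score }}%
      (${{ steps.ruleprobe.outputs.passed }}/${{ steps.ruleprobe.outputs.total }} passed,
      ${{ steps.ruleprobe.outputs.failed }} failed)"
```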
The programmatic API covers the full pipeline:

| Function | Purpose |
|---|---|
| `parseInstructionFile(path)` | Parse an instruction file into a RuleSet |
| `verifyOutput(ruleSet, dir)` | Run rules against a code directory |
| `generateReport(run, ruleSet, results)` | Build an AdherenceReport with summary stats |
| `formatReport(report, format)` | Render as text, JSON, markdown, or rdjson |
| `extractRules(markdown, fileType)` | Extract rules from raw markdown content |
| `defineConfig(config)` | Type-safe config helper for ruleprobe.config.ts |
| `loadConfig(path?, searchDir?)` | Load and validate a config file |
| `applyConfig(ruleSet, config)` | Merge custom rules, overrides, and exclusions into a RuleSet |
| `extractWithLlm(ruleSet, options)` | Run LLM extraction on unparseable lines |
| `createOpenAiProvider(config?)` | Create an OpenAI-compatible LLM provider |
```typescript
import { parseInstructionFile, verifyOutput, generateReport, formatReport } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const results = verifyOutput(ruleSet, './agent-output');
const report = generateReport(
  { agent: 'claude-code', model: 'opus-4', taskTemplateId: 'rest-endpoint',
    outputDir: './agent-output', timestamp: new Date().toISOString(), durationSeconds: null },
  ruleSet,
  results,
);
console.log(formatReport(report, 'text'));
```

LLM-assisted extraction (opt-in):
```typescript
import { parseInstructionFile, extractWithLlm, createOpenAiProvider } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const provider = createOpenAiProvider({ model: 'gpt-4o-mini' });
const enhanced = await extractWithLlm(ruleSet, { provider });
// enhanced.rules now includes LLM-extracted rules with extractionMethod: 'llm'
```

```mermaid
flowchart LR
  A[Instruction File] --> B[Rule Parser]
  B --> C[RuleSet]
  D[Agent Output] --> E[Verifier]
  C --> E
  E --> F[Adherence Report]
```
The parser reads your instruction file and identifies lines that map to deterministic checks (naming conventions, forbidden patterns, structural requirements). Each rule gets a category, a verifier type, and a pattern. The verifier walks the agent's output directory, runs AST checks via ts-morph for code structure rules, file system checks for naming and test file requirements, and regex checks for line length and content patterns. The report collects pass/fail results with evidence for every rule.
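The data flowing between those stages can be sketched as TypeScript shapes. These are illustrative only, not the package's actual type definitions; the summary-score arithmetic matches the report shown earlier (passed over total, rounded):

```typescript
// Illustrative shapes for the parser -> verifier -> report pipeline.
// Field names and types are assumptions, not RuleProbe's real exports.
type Verifier = 'ast' | 'filesystem' | 'regex';

interface Rule {
  id: string;        // e.g. 'forbidden-no-any-type-2'
  category: string;  // e.g. 'forbidden-pattern', 'naming'
  verifier: Verifier;
  pattern: string;   // e.g. 'no-any', 'kebab-case'
  source: string;    // the instruction line the rule came from
}

interface RuleResult {
  ruleId: string;
  passed: boolean;
  evidence: { file: string; line: number; found: string }[]; // empty when passed
}

// The report's score is derived from the results: passed / total, rounded.
function score(results: RuleResult[]): number {
  const passed = results.filter((r) => r.passed).length;
  return Math.round((passed / results.length) * 100);
}

console.log(score([
  { ruleId: 'a', passed: true, evidence: [] },
  { ruleId: 'b', passed: true, evidence: [] },
  { ruleId: 'c', passed: false, evidence: [{ file: 'x.ts', line: 1, found: 'any' }] },
])); // 67
```

With this arithmetic, the 23-of-32 report above rounds to 72%, matching the sample output.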
53 built-in matchers across 9 categories:
| Category | Count | Verifier(s) |
|---|---|---|
| naming | 7 | AST, Filesystem, Tree-sitter |
| forbidden-pattern | 5 | AST, Regex |
| structure | 9 | AST, Filesystem |
| test-requirement | 5 | AST, Filesystem, Regex |
| import-pattern | 6 | AST, Regex |
| error-handling | 2 | AST |
| type-safety | 5 | AST, Regex |
| code-style | 10 | AST, Regex, Tree-sitter |
| dependency | 1 | Filesystem |
Full table with example instructions and check details: docs/matchers.md
Most of RuleProbe works offline with no API keys. Two opt-in features use external APIs:
| Feature | Flag(s) | Required env var | When you need it |
|---|---|---|---|
| LLM rule extraction | `--llm-extract` | `OPENAI_API_KEY` | Extracting rules from unparseable instruction lines |
| Rubric decomposition | `--rubric-decompose` | `OPENAI_API_KEY` | Breaking subjective rules into concrete checks |
| Agent invocation (SDK mode) | `ruleprobe run --agent claude-code` | `ANTHROPIC_API_KEY` | Invoking Claude to generate code, then verifying |
| GitHub Action | `uses: moonrunnerkc/ruleprobe@v1` | `GITHUB_TOKEN` | CI, PR comments |
`parse`, `verify`, `compare`, `tasks`, and `task` work entirely offline. No key needed.
Python and Go get naming and function-length checks via tree-sitter WASM grammars. The grammar packages (tree-sitter-python, tree-sitter-go, web-tree-sitter) ship as regular dependencies; no extra install step is required. WASM binaries are loaded at runtime from the installed packages. If loading fails (unsupported platform, missing native build), tree-sitter checks are skipped and other verifiers still run.
RuleProbe never executes scanned code, never makes network calls (unless you opt in with --llm-extract, --rubric-decompose, or ruleprobe run), and never modifies files in the scanned directory. User-supplied paths are resolved and bounded to the working directory; symlinks outside the project are skipped unless you pass --allow-symlinks. All dependencies are pinned to exact versions. See SECURITY.md for the full model.
What v0.1.0 doesn't do, stated plainly.
- TypeScript gets the deepest coverage. ts-morph gives full AST analysis for TypeScript and JavaScript: naming, forbidden patterns, structure, imports, type-safety, and code-style checks. Python and Go get naming and function-length checks via tree-sitter WASM grammars (grammar packages ship as regular dependencies; see the Tree-sitter Support section). Everything else falls back to regex (line length, comments, semicolons). No Rust, Java, or C# AST support yet.
- Subjective rules stay subjective. "Write clean code" has no deterministic check. The `--rubric-decompose` flag on the `verify` command uses an LLM to break subjective instructions into weighted concrete checks (max function length, no magic numbers, etc.), tagged with `extractionMethod: 'rubric'` and `confidence: 'low'`. This is a proxy, not a direct evaluation. Lines with no measurable proxy stay in the unparseable array. Requires `OPENAI_API_KEY`.
- Agent invocation covers Claude SDK and watch mode only. The `run` command invokes agents via the Claude Agent SDK (requires `ANTHROPIC_API_KEY`) or watches a directory for output. Copilot, Cursor, and other agent SDKs are not integrated; use `--watch` mode for those.
- Type-aware checks require `--project`. Three checks (implicit any, unused exports, unresolved imports) need the TypeChecker, which requires a `tsconfig.json`. Without `--project`, ts-morph parses files in isolation and these checks are skipped.
- 53 matchers, not infinite. The parser skips lines it can't confidently map to a check. Use `--show-unparseable` to see what was missed, and `--llm-extract` or `--rubric-decompose` to handle the remainder.
See docs/case-study-v0.1.0.md for a comparison of two agents on the rest-endpoint task template against 10 rules.
```shell
git clone https://github.com/moonrunnerkc/ruleprobe.git
cd ruleprobe && npm install
npm test
```

Issues and pull requests welcome at github.com/moonrunnerkc/ruleprobe.