
RuleProbe

Verify whether AI coding agents actually follow the instruction files they're given.


Why

Every AI coding agent reads an instruction file. None of them prove they followed it.

You write CLAUDE.md or AGENTS.md with specific rules: camelCase variables, no any types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.

RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Binary pass/fail, with file paths and line numbers as evidence. No LLM evaluation, no judgment calls. Deterministic and reproducible.

Quick Start

npm install -g ruleprobe

Or run it directly:

npx ruleprobe --help

Note: The examples below reflect the current development HEAD (53 matchers, 9 categories). The published npm v0.1.0 shipped with 15 matchers. A new release will follow.

Parse an instruction file to see what rules RuleProbe can extract. This is real output from parsing the repo's included example instruction file:

ruleprobe parse docs/example-instructions.md
Extracted 32 rules:

  forbidden-no-any-type-2
    Category: forbidden-pattern
    Verifier: ast
    Pattern:  no-any (*.ts)
    Source:    "- No any types anywhere in the codebase"

  error-no-empty-catch-6
    Category: error-handling
    Verifier: ast
    Pattern:  no-empty-catch (*.ts)
    Source:    "- No empty catch blocks; always handle or rethrow errors"

  naming-kebab-case-files-17
    Category: naming
    Verifier: filesystem
    Pattern:  kebab-case (filenames)
    Source:    "- File names: kebab-case (e.g., user-service.ts, api-handler.ts)"

  dependency-pinned-versions-34
    Category: dependency
    Verifier: filesystem
    Pattern:  pinned-dependencies (package.json)
    Source:    "- All dependencies pinned to exact versions, no ^ or ~ ranges"
  ...

Verify agent output against those rules. This is ruleprobe verifying its own source code:

ruleprobe verify docs/example-instructions.md ./src --format text
RuleProbe Adherence Report
Agent: unknown | Model: unknown | Task: manual

Rules: 32 total | 23 passed | 9 failed | Score: 72%

FAIL  error-handling/error-no-empty-catch-6
      commands/run.ts:148 - found: empty catch block
      utils/safe-path.ts:116 - found: empty catch block
      verifier/ast-verifier.ts:248 - found: empty catch block
PASS  forbidden-pattern/forbidden-no-any-type-2
PASS  structure/structure-strict-mode-1
PASS  structure/structure-named-exports-only-3
PASS  naming/naming-kebab-case-files-17
FAIL  naming/naming-camelcase-variables-18
      verifier/treesitter-loader.ts:75 - found: ParserCtor
      verifier/treesitter-loader.ts:76 - found: LanguageRef
PASS  naming/naming-pascalcase-types-20
PASS  test-requirement/test-files-exist-25
FAIL  structure/structure-no-barrel-files-24
      ast-checks/index.ts:5 - found: barrel file with 24 re-exports
      llm/index.ts:7 - found: barrel file with 9 re-exports
PASS  import-pattern/import-no-path-aliases-28
PASS  forbidden-pattern/forbidden-no-console-log-4
PASS  structure/structure-max-file-length-22
PASS  structure/structure-jsdoc-required-21
PASS  dependency/dependency-pinned-versions-34
...

By Category:
  naming:             2/4 (50%)
  forbidden-pattern:  4/4 (100%)
  structure:          4/5 (80%)
  import-pattern:     4/4 (100%)
  test-requirement:   2/2 (100%)
  error-handling:     1/2 (50%)
  type-safety:        2/4 (50%)
  code-style:         2/5 (40%)
  dependency:         2/2 (100%)

Every failure includes the file, line number, and what was found. No ambiguity.

What It Does

Parse. Reads 6 instruction file formats (CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, .windsurfrules) and extracts rules that can be checked mechanically. Subjective instructions like "write clean code" are reported as unparseable so you know what was skipped.

Verify. Runs each extracted rule against a directory of agent-generated code. Checks use AST parsing via ts-morph, file system inspection, and regex pattern matching. No LLM evaluation at any stage by default; results are deterministic and identical across runs.

LLM Extract (opt-in). Pass --llm-extract to send unparseable lines through an OpenAI-compatible API for a second extraction pass. LLM-extracted rules are labeled with extractionMethod: 'llm' and confidence: 'medium', and default to warning severity. Requires OPENAI_API_KEY env var. No LLM dependency is installed by default.

Compare. Point RuleProbe at outputs from two or more agents and get a side-by-side comparison table showing which rules each one followed. Useful for evaluating agents on the same task, or tracking adherence over time.

GitHub Action. Ships as a composite action you can drop into any repo. Runs ruleprobe verify on every PR, posts results as a comment, and optionally outputs reviewdog rdjson format for inline annotations. No API keys needed beyond GITHUB_TOKEN.

Configuration

RuleProbe auto-discovers a config file in the working directory (or any parent). You can also pass --config <path> explicitly. Supported file names, in priority order:

  • ruleprobe.config.ts
  • ruleprobe.config.js
  • ruleprobe.config.json
  • .ruleproberc.json

A config file lets you add custom rules, override extracted rules, or exclude rules entirely:

// ruleprobe.config.ts
import { defineConfig } from 'ruleprobe';

export default defineConfig({
  // Add rules that the parser can't extract from your instruction file
  rules: [
    {
      id: 'custom-no-lodash',
      category: 'import-pattern',
      description: 'Ban lodash imports',
      verifier: 'regex',
      pattern: { type: 'banned-import', target: '*.ts', expected: 'lodash', scope: 'file' },
    },
  ],

  // Change severity or expected values on extracted rules
  overrides: [
    { ruleId: 'naming-camelcase', severity: 'warning' },
    { ruleId: 'structure-max-file-length', expected: '500' },
  ],

  // Remove rules you don't want checked
  exclude: ['forbidden-no-console-log'],
});

defineConfig() is a no-op passthrough that provides type checking in TypeScript configs. JSON configs work without it.

Custom rules use the same verifier types (ast, regex, filesystem) and pattern types as extracted rules. Any pattern type listed in the Supported Rule Types table works as a custom rule pattern.
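
If you prefer JSON, a .ruleproberc.json sketch mirroring the TypeScript example above (JSON configs use the same fields, minus the defineConfig() wrapper; the rule here is illustrative):

```json
{
  "rules": [
    {
      "id": "custom-no-lodash",
      "category": "import-pattern",
      "description": "Ban lodash imports",
      "verifier": "regex",
      "pattern": { "type": "banned-import", "target": "*.ts", "expected": "lodash", "scope": "file" }
    }
  ],
  "overrides": [{ "ruleId": "naming-camelcase", "severity": "warning" }],
  "exclude": ["forbidden-no-console-log"]
}
```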

CLI Reference

ruleprobe parse <instruction-file>

Extract rules from an instruction file.

ruleprobe parse CLAUDE.md --format json
ruleprobe parse AGENTS.md --show-unparseable
ruleprobe parse AGENTS.md --llm-extract --show-unparseable

--format json|text controls output format. --show-unparseable includes lines that couldn't be converted to rules. --llm-extract sends unparseable lines to an OpenAI-compatible API for additional extraction (requires OPENAI_API_KEY).

ruleprobe verify <instruction-file> <output-dir>

Check agent output against extracted rules.

ruleprobe verify CLAUDE.md ./output --format text
ruleprobe verify AGENTS.md ./output --agent claude --model opus-4 --format json --output report.json
ruleprobe verify AGENTS.md ./output --format markdown --severity error
ruleprobe verify AGENTS.md ./output --format rdjson
ruleprobe verify AGENTS.md ./output --config ruleprobe.config.ts
ruleprobe verify AGENTS.md ./output --llm-extract
ruleprobe verify AGENTS.md ./output --rubric-decompose
ruleprobe verify AGENTS.md ./output --project tsconfig.json

--agent and --model tag the report metadata. --severity error|warning|all filters results. --output writes to a file instead of stdout. --format rdjson produces reviewdog-compatible diagnostics. --config loads a specific config file (otherwise auto-discovered). --llm-extract runs unparseable lines through an LLM for additional rule extraction. --rubric-decompose uses an LLM to break subjective instructions into weighted concrete checks (tagged with extractionMethod: 'rubric' and confidence: 'low'). Both --llm-extract and --rubric-decompose require OPENAI_API_KEY. --project enables type-aware AST checks (implicit any, unused exports, unresolved imports) using the specified tsconfig.json.

Exit codes: 0 all rules passed, 1 violations found, 2 execution error.
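
These exit codes make the command easy to gate on in scripts. A minimal sketch; describeExit is an illustrative helper, not part of ruleprobe — only the exit-code meanings come from the docs:

```typescript
// Map ruleprobe's documented exit codes to a human-readable CI outcome.
function describeExit(code: number): string {
  switch (code) {
    case 0:
      return 'all rules passed';
    case 1:
      return 'violations found';
    case 2:
      return 'execution error';
    default:
      return `unexpected exit code ${code}`;
  }
}

// Wiring it to a real run might look like (commented out here):
//   const { status } = spawnSync('ruleprobe', ['verify', 'AGENTS.md', './src']);
//   console.log(describeExit(status ?? 2));
console.log(describeExit(1)); // violations found
```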

ruleprobe compare <instruction-file> <dirs...>

Compare multiple agent outputs against the same rules.

ruleprobe compare AGENTS.md ./claude-output ./copilot-output --agents claude,copilot --format markdown

ruleprobe tasks / ruleprobe task <id>

List available task templates or output a specific task prompt. Three templates ship with v0.1.0: rest-endpoint, utility-module, react-component.

ruleprobe tasks
ruleprobe task rest-endpoint

ruleprobe run <instruction-file>

Invoke an AI agent on a task template, verify the output, and print the report in one step. Requires @anthropic-ai/claude-agent-sdk and ANTHROPIC_API_KEY for SDK mode. Alternatively, use --watch to point at a directory where you (or another agent) will write output manually.

# SDK mode: invoke Claude, verify, report
ruleprobe run CLAUDE.md --task rest-endpoint --agent claude-code --model sonnet --format text

# Watch mode: wait for output in a directory, then verify
ruleprobe run CLAUDE.md --watch ./agent-output --timeout 300 --format json

Options: --task, --agent, --model, --format, --output-dir, --watch, --timeout, --allow-symlinks, --config.

GitHub Action

Drop this into .github/workflows/ruleprobe.yml:

name: RuleProbe
on: [pull_request]
jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/ruleprobe@v1
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

That's it. No API keys, no LLM calls, deterministic results, runs in seconds.

Note: @v1 tracks the latest v1.x release. Pin to a specific tag (e.g., @v1.0.0) for reproducible builds.

Full options:

- uses: moonrunnerkc/ruleprobe@v1
  with:
    instruction-file: AGENTS.md
    output-dir: src
    agent: ci
    model: unknown
    format: text
    severity: all
    fail-on-violation: "true"
    post-comment: "true"
    reviewdog-format: "false"

Input              Default     Description
instruction-file   (required)  Path to instruction file
output-dir         src         Directory containing code to verify
agent              ci          Agent identifier for report metadata
model              unknown     Model identifier for report metadata
format             text        Report format: text, json, or markdown
severity           all         Filter: error, warning, or all
fail-on-violation  true        Fail the check on any violation
post-comment       true        Post results as a PR comment
reviewdog-format   false       Also output rdjson for reviewdog

Outputs: score, passed, failed, total (available to downstream steps).
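
The outputs can drive later steps. A sketch continuing the steps list from the workflow above — the step id and the 80% threshold are illustrative choices, not defaults (fail-on-violation is set to "false" so the custom gate below, not the action itself, decides the outcome):

```yaml
      - uses: moonrunnerkc/ruleprobe@v1
        id: ruleprobe
        with:
          instruction-file: AGENTS.md
          output-dir: src
          fail-on-violation: "false"
      - name: Require 80% adherence
        if: ${{ steps.ruleprobe.outputs.score < 80 }}
        run: |
          echo "Adherence ${{ steps.ruleprobe.outputs.score }}% (${{ steps.ruleprobe.outputs.failed }} of ${{ steps.ruleprobe.outputs.total }} rules failed)"
          exit 1
```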

Programmatic API

These functions cover the full pipeline, plus config and LLM helpers:

Function                               Purpose
parseInstructionFile(path)             Parse an instruction file into a RuleSet
verifyOutput(ruleSet, dir)             Run rules against a code directory
generateReport(run, ruleSet, results)  Build an AdherenceReport with summary stats
formatReport(report, format)           Render as text, JSON, markdown, or rdjson
extractRules(markdown, fileType)       Extract rules from raw markdown content
defineConfig(config)                   Type-safe config helper for ruleprobe.config.ts
loadConfig(path?, searchDir?)          Load and validate a config file
applyConfig(ruleSet, config)           Merge custom rules, overrides, and exclusions into a RuleSet
extractWithLlm(ruleSet, options)       Run LLM extraction on unparseable lines
createOpenAiProvider(config?)          Create an OpenAI-compatible LLM provider

import { parseInstructionFile, verifyOutput, generateReport, formatReport } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const results = verifyOutput(ruleSet, './agent-output');
const report = generateReport(
  { agent: 'claude-code', model: 'opus-4', taskTemplateId: 'rest-endpoint',
    outputDir: './agent-output', timestamp: new Date().toISOString(), durationSeconds: null },
  ruleSet,
  results,
);
console.log(formatReport(report, 'text'));

LLM-assisted extraction (opt-in):

import { parseInstructionFile, extractWithLlm, createOpenAiProvider } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const provider = createOpenAiProvider({ model: 'gpt-4o-mini' });
const enhanced = await extractWithLlm(ruleSet, { provider });
// enhanced.rules now includes LLM-extracted rules with extractionMethod: 'llm'
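
Downstream code can treat deterministic and LLM-extracted rules differently. A sketch — the rule objects here are illustrative mocks, and only the extractionMethod field is documented behavior:

```typescript
// Separate deterministic rules from LLM-extracted ones before acting on them.
type Rule = { id: string; extractionMethod?: 'llm' | 'rubric' };

function splitByMethod(rules: Rule[]) {
  return {
    deterministic: rules.filter((r) => r.extractionMethod === undefined),
    llm: rules.filter((r) => r.extractionMethod === 'llm'),
  };
}

const { deterministic, llm } = splitByMethod([
  { id: 'naming-camelcase' },                       // parser-extracted (mock)
  { id: 'llm-rule-1', extractionMethod: 'llm' },    // LLM-extracted (mock)
]);
console.log(deterministic.length, llm.length); // 1 1
```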

How It Works

flowchart LR
    A[Instruction File] --> B[Rule Parser]
    B --> C[RuleSet]
    D[Agent Output] --> E[Verifier]
    C --> E
    E --> F[Adherence Report]

The parser reads your instruction file and identifies lines that map to deterministic checks (naming conventions, forbidden patterns, structural requirements). Each rule gets a category, a verifier type, and a pattern. The verifier walks the agent's output directory, runs AST checks via ts-morph for code structure rules, file system checks for naming and test file requirements, and regex checks for line length and content patterns. The report collects pass/fail results with evidence for every rule.
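
The dispatch step can be pictured as a small sketch. This is not ruleprobe's actual internals — names and shapes here are illustrative, and only the three verifier types (ast, regex, filesystem) come from the docs:

```typescript
// Illustrative sketch: route each rule to a check by its verifier type.
type Verifier = 'ast' | 'regex' | 'filesystem';
type Rule = { id: string; verifier: Verifier };
type Result = { ruleId: string; passed: boolean; evidence: string[] };

// Stub checks standing in for ts-morph, regex, and filesystem walks.
const checks: Record<Verifier, (rule: Rule, dir: string) => Result> = {
  ast: (rule, _dir) => ({ ruleId: rule.id, passed: true, evidence: [] }),
  regex: (rule, _dir) => ({ ruleId: rule.id, passed: true, evidence: [] }),
  filesystem: (rule, _dir) => ({ ruleId: rule.id, passed: true, evidence: [] }),
};

function verifyAll(rules: Rule[], dir: string): Result[] {
  return rules.map((rule) => checks[rule.verifier](rule, dir));
}

const results = verifyAll(
  [{ id: 'forbidden-no-any-type-2', verifier: 'ast' }],
  './agent-output',
);
console.log(results.length); // 1
```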

Supported Rule Types

53 built-in matchers across 9 categories:

Category           Count  Verifier(s)
naming             7      AST, Filesystem, Tree-sitter
forbidden-pattern  5      AST, Regex
structure          9      AST, Filesystem
test-requirement   5      AST, Filesystem, Regex
import-pattern     6      AST, Regex
error-handling     2      AST
type-safety        5      AST, Regex
code-style         10     AST, Regex, Tree-sitter
dependency         1      Filesystem

Full table with example instructions and check details: docs/matchers.md

Authentication

Most of RuleProbe works offline with no API keys. The opt-in features below reach external APIs or need credentials:

Feature                      Flag(s)                            Required env var   When you need it
LLM rule extraction          --llm-extract                      OPENAI_API_KEY     Extracting rules from unparseable instruction lines
Rubric decomposition         --rubric-decompose                 OPENAI_API_KEY     Breaking subjective rules into concrete checks
Agent invocation (SDK mode)  ruleprobe run --agent claude-code  ANTHROPIC_API_KEY  Invoking Claude to generate code, then verifying
GitHub Action                uses: moonrunnerkc/ruleprobe@v1    GITHUB_TOKEN       CI, PR comments

parse, verify, compare, tasks, and task work entirely offline. No key needed.

Tree-sitter Support

Python and Go get naming and function-length checks via tree-sitter WASM grammars. The grammar packages (tree-sitter-python, tree-sitter-go, web-tree-sitter) ship as regular dependencies; no extra install step is required. WASM binaries are loaded at runtime from the installed packages. If loading fails (unsupported platform, missing native build), tree-sitter checks are skipped and other verifiers still run.

Security

RuleProbe never executes scanned code, never makes network calls (unless you opt in with --llm-extract, --rubric-decompose, or ruleprobe run), and never modifies files in the scanned directory. User-supplied paths are resolved and bounded to the working directory; symlinks outside the project are skipped unless you pass --allow-symlinks. All dependencies are pinned to exact versions. See SECURITY.md for the full model.

Limitations

What RuleProbe doesn't do yet, stated plainly.

  • TypeScript gets the deepest coverage. ts-morph gives full AST analysis for TypeScript and JavaScript: naming, forbidden patterns, structure, imports, type-safety, and code-style checks. Python and Go get naming and function-length checks via tree-sitter WASM grammars (grammar packages ship as regular dependencies; see the Tree-sitter Support section). Everything else falls back to regex (line length, comments, semicolons). No Rust, Java, or C# AST support yet.
  • Subjective rules stay subjective. "Write clean code" has no deterministic check. The --rubric-decompose flag on the verify command uses an LLM to break subjective instructions into weighted concrete checks (max function length, no magic numbers, etc.), tagged with extractionMethod: 'rubric' and confidence: 'low'. This is a proxy, not a direct evaluation. Lines with no measurable proxy stay in the unparseable array. Requires OPENAI_API_KEY.
  • Agent invocation covers Claude SDK and watch mode only. The run command invokes agents via the Claude Agent SDK (requires ANTHROPIC_API_KEY) or watches a directory for output. Copilot, Cursor, and other agent SDKs are not integrated; use --watch mode for those.
  • Type-aware checks require --project. Three checks (implicit any, unused exports, unresolved imports) need the TypeChecker, which requires a tsconfig.json. Without --project, ts-morph parses files in isolation and these checks are skipped.
  • 53 matchers, not infinite. The parser skips lines it can't confidently map to a check. Use --show-unparseable to see what was missed, and --llm-extract or --rubric-decompose to handle the remainder.
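
As an illustration of what the type-aware pass targets, assuming a tsconfig with noImplicitAny (the function here is a made-up example, not ruleprobe code):

```typescript
// Without --project, ts-morph parses each file in isolation and this check
// is skipped. With --project and noImplicitAny, an untyped parameter like
// the one in this comment would be flagged:
//
//   function handle(req) { return req.path; }  // 'req' implicitly has 'any'
//
// The annotated form passes the check:
function handle(req: { path: string }): string {
  return req.path;
}

console.log(handle({ path: '/users' })); // prints "/users"
```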

Case Study

See docs/case-study-v0.1.0.md for a comparison of two agents on the rest-endpoint task template against 10 rules.

Contributing

git clone https://github.com/moonrunnerkc/ruleprobe.git
cd ruleprobe && npm install
npm test

Issues and pull requests welcome at github.com/moonrunnerkc/ruleprobe.

License

MIT