procurement-agent-evals

A bid/no-bid agent for federal contract solicitations, paired with a functional eval harness that grades the agent's recommendations against a labeled dataset.

The interesting part isn't the agent — it's the harness. Anyone can prompt an LLM to say "bid" or "no bid." The question a buyer (or a hiring manager) actually cares about is: how often is it right, and where does it fail? This repo answers that with a confusion matrix and precision/recall, not a vibe check.

What it does

Given a company profile (NAICS codes, capabilities, set-aside certifications, contract capacity) and a solicitation, the agent returns a structured recommendation:

Recommendation(
    decision=Decision.BID,
    confidence=Confidence.HIGH,
    rationale="Strong cybersecurity fit; eligible SBA set-aside; within capacity.",
    key_factors=["NAICS 541512 matches", "capability match in scope of work"],
)
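As a rough illustration of the shape above, here is a minimal sketch using only the standard library. The repo describes Pydantic-typed output; dataclasses stand in for it here so the snippet runs with no dependencies. The enum string values and the `MEDIUM` and `LOW` members are assumptions, since only `Decision.BID` and `Confidence.HIGH` appear in the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(str, Enum):
    # Values mirror the bid/no_bid labels used in the dataset.
    BID = "bid"
    NO_BID = "no_bid"


class Confidence(str, Enum):
    # Only HIGH appears in the README example; MEDIUM and LOW are assumed.
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class Recommendation:
    decision: Decision
    confidence: Confidence
    rationale: str
    key_factors: list[str] = field(default_factory=list)


rec = Recommendation(
    decision=Decision.BID,
    confidence=Confidence.HIGH,
    rationale="Strong cybersecurity fit; eligible SBA set-aside; within capacity.",
    key_factors=["NAICS 541512 matches", "capability match in scope of work"],
)
print(rec.decision.value, rec.confidence.value)
```

Typing the decision as an enum rather than free text is what lets a harness compare predictions to labels with a plain equality check.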

The eval harness runs the agent over a labeled set of solicitations and reports accuracy, precision, recall, F1, and a confusion matrix.

Design

  • Pluggable backends behind one interface (Backend):
    • AnthropicBackend — calls Claude (claude-opus-4-7) with adaptive thinking, a prompt-cached company profile, and Pydantic-typed structured output.
    • HeuristicBackend — a transparent rule-based baseline. Deterministic, offline, and a useful yardstick to measure the LLM against.
  • Offline by default. The agent's LLM call sits behind the Backend interface, so the test suite and CI run entirely on the heuristic backend — no API key, no network.
  • Composes with sam-gov-mcp. The Solicitation model mirrors that server's normalized opportunity shape, so opportunities pulled from SAM.gov can feed straight into this agent — without a hard dependency between the two repos.
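The pluggable-backend idea can be sketched with a structural interface. Everything below is hypothetical and not taken from the repo: the `recommend` method name, the dict-shaped inputs, the `naics` and `naics_codes` keys, and the toy `NaicsOnlyBackend` are all assumptions, and the real `Backend` returns a structured recommendation rather than a bare string.

```python
from typing import Protocol


class Backend(Protocol):
    """Anything with a matching recommend() method can be evaluated."""

    def recommend(self, profile: dict, solicitation: dict) -> str: ...


class NaicsOnlyBackend:
    """Toy rule: bid only when the solicitation's NAICS code is in the profile."""

    def recommend(self, profile: dict, solicitation: dict) -> str:
        return "bid" if solicitation["naics"] in profile["naics_codes"] else "no_bid"


def run_eval(backend: Backend, profile: dict, labeled: list[tuple[dict, str]]) -> float:
    """Return the fraction of solicitations where the backend matches the label."""
    hits = sum(backend.recommend(profile, sol) == label for sol, label in labeled)
    return hits / len(labeled)


# Two made-up labeled rows, for illustration only.
profile = {"naics_codes": {"541512"}}
labeled = [
    ({"naics": "541512"}, "bid"),
    ({"naics": "236220"}, "no_bid"),
]
accuracy = run_eval(NaicsOnlyBackend(), profile, labeled)
print(accuracy)
```

Because `run_eval` depends only on the interface, an LLM-backed class with the same method could be swapped in without touching the harness, which is what allows CI to stay offline.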

Install

git clone https://github.com/ab75173/procurement-agent-evals.git
cd procurement-agent-evals
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Run the evals

Offline heuristic baseline (no key needed):

procurement-agent-evals --backend heuristic --format md

Against Claude (needs ANTHROPIC_API_KEY):

export ANTHROPIC_API_KEY=...        # or put it in a gitignored .env
procurement-agent-evals --backend anthropic --model claude-opus-4-7

Sample output (heuristic baseline):

# Bid/No-Bid Eval — `heuristic` backend

**Accuracy:** 92.9%  (13/14)
**Precision:** 87.5%  •  **Recall:** 100.0%  •  **F1:** 0.93

| | predicted bid | predicted no_bid |
|---|---|---|
| **actual bid** | 7 | 0 |
| **actual no_bid** | 1 | 6 |

The heuristic's single miss is the deliberate trap case (SOL-014): a hardware-resale requirement filed under a services NAICS code. The NAICS code matches on paper, so the heuristic predicts bid, but the scope of work is wrong and the label is no_bid. That is the one false positive in the matrix above, and exactly the kind of judgment call an LLM backend should get right where keyword screening doesn't.
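To check the reported numbers by hand, the metrics can be recomputed from the four cells of the confusion matrix, treating bid as the positive class. This is a worked-arithmetic sketch, not the harness's own code.

```python
# Counts from the heuristic confusion matrix, with "bid" as the positive class.
tp, fn = 7, 0  # actual bid: predicted bid / predicted no_bid
fp, tn = 1, 6  # actual no_bid: predicted bid / predicted no_bid

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)  # of the predicted bids, how many were right
recall = tp / (tp + fn)     # of the actual bids, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.1%} precision={precision:.1%} recall={recall:.1%} f1={f1:.2f}")
# prints: accuracy=92.9% precision=87.5% recall=100.0% f1=0.93
```

The single false positive lowers precision to 7/8 while recall stays at 7/7, which matches the sample output.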

The dataset

data/company_profile.json plus data/solicitations.jsonl — 14 solicitations, each with a ground-truth bid/no_bid label and a one-line rationale. The set is balanced (7/7) and spans eligibility traps (set-asides the company can't claim), out-of-scope work, oversized contracts, and adjacent-NAICS judgment calls. Point the CLI at your own data with --data-dir.
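For anyone preparing their own data directory, a loader along these lines can validate labels and check class balance before running the evals. The field names (`id`, `label`, `rationale`) and the three sample rows are assumptions for illustration; check them against the actual schema of data/solicitations.jsonl.

```python
import io
import json
from collections import Counter

# Three made-up rows standing in for a solicitations.jsonl file.
sample = io.StringIO(
    '{"id": "SOL-A", "label": "bid", "rationale": "Core NAICS and capability match."}\n'
    '{"id": "SOL-B", "label": "no_bid", "rationale": "Set-aside the company cannot claim."}\n'
    '{"id": "SOL-C", "label": "no_bid", "rationale": "Exceeds contract capacity."}\n'
)

VALID_LABELS = {"bid", "no_bid"}


def load_labeled(stream):
    """Parse one JSON object per non-empty line and reject unknown labels."""
    rows = [json.loads(line) for line in stream if line.strip()]
    for row in rows:
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"{row['id']}: unexpected label {row['label']!r}")
    return rows


rows = load_labeled(sample)
counts = Counter(row["label"] for row in rows)
print(dict(counts))
```

A skewed label count is worth noticing early, since accuracy on an unbalanced set can look strong even when one class is mostly missed.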

Develop

pytest        # full suite, fully offline
ruff check .  # lint

License

MIT — see LICENSE.
