procurement-agent-evals

A bid/no-bid agent for federal contract solicitations, paired with a functional eval harness that grades the agent's recommendations against a labeled dataset.

The interesting part isn't the agent — it's the harness. Anyone can prompt an LLM to say "bid" or "no bid." The question a buyer (or a hiring manager) actually cares about is: how often is it right, and where does it fail? This repo answers that with a confusion matrix and precision/recall, not a vibe check.

What it does

Given a company profile (NAICS codes, capabilities, set-aside certifications, contract capacity) and a solicitation, the agent returns a structured recommendation:

Recommendation(
    decision=Decision.BID,
    confidence=Confidence.HIGH,
    rationale="Strong cybersecurity fit; eligible SBA set-aside; within capacity.",
    key_factors=["NAICS 541512 matches", "capability match in scope of work"],
)
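The shape above can be sketched with standard-library types. This is a minimal illustration, not the repo's actual model: the repo describes Pydantic-typed output, while this sketch uses dataclasses to stay dependency-free. The enum string values and the LOW and MEDIUM confidence levels are assumptions; only BID and HIGH appear in the example above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(str, Enum):
    # String values are assumed to match the dataset's bid/no_bid labels.
    BID = "bid"
    NO_BID = "no_bid"


class Confidence(str, Enum):
    # Only HIGH appears in the README; LOW and MEDIUM are assumed.
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class Recommendation:
    decision: Decision
    confidence: Confidence
    rationale: str
    key_factors: list[str] = field(default_factory=list)


rec = Recommendation(
    decision=Decision.BID,
    confidence=Confidence.HIGH,
    rationale="Strong cybersecurity fit; eligible SBA set-aside; within capacity.",
    key_factors=["NAICS 541512 matches", "capability match in scope of work"],
)
print(rec.decision.value, rec.confidence.value)
```

Typed enums rather than free-form strings mean a grader can compare decisions by identity instead of parsing text.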

The eval harness runs the agent over a labeled set of solicitations and reports accuracy, precision, recall, F1, and a confusion matrix.
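As a rough sketch of how those metrics fall out of (actual, predicted) label pairs, treating bid as the positive class: the function name `score` and the dictionary keys are hypothetical, not the harness's real API.

```python
def score(pairs):
    """Compute binary metrics from (actual, predicted) pairs.

    Labels are 'bid' or 'no_bid'; 'bid' is treated as the positive class.
    """
    tp = fp = fn = tn = 0
    for actual, predicted in pairs:
        if actual == "bid" and predicted == "bid":
            tp += 1
        elif actual == "no_bid" and predicted == "bid":
            fp += 1
        elif actual == "bid" and predicted == "no_bid":
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# 7 correct bids, 1 false positive, 6 correct no-bids.
pairs = [("bid", "bid")] * 7 + [("no_bid", "bid")] + [("no_bid", "no_bid")] * 6
result = score(pairs)
print(result)
```

The guards against zero denominators matter for small datasets, where a backend that never predicts bid would otherwise raise on precision.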

Design

Pluggable backends behind one interface (Backend):
- AnthropicBackend: calls Claude (claude-opus-4-7) with adaptive thinking, a prompt-cached company profile, and Pydantic-typed structured output.
- HeuristicBackend: a transparent rule-based baseline. Deterministic, offline, and a useful yardstick to measure the LLM against.

Offline by default. The agent's LLM call sits behind the Backend interface, so the test suite and CI run entirely on the heuristic backend, with no API key and no network.

Composes with sam-gov-mcp. The Solicitation model mirrors that server's normalized opportunity shape, so opportunities pulled from SAM.gov can feed straight into this agent without a hard dependency between the two repos.
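One way the pluggable-backend idea could look, sketched with `typing.Protocol`. Everything here is hypothetical: the method name `recommend`, the dictionary field names, and the toy rule are illustrative assumptions, not the repo's actual Backend signature or HeuristicBackend logic.

```python
from typing import Optional, Protocol


class Backend(Protocol):
    # Hypothetical signature: the real interface may differ.
    def recommend(self, profile: dict, solicitation: dict) -> str: ...


class NaicsHeuristicBackend:
    """Toy rule: bid only if the NAICS code matches and any set-aside is held."""

    def recommend(self, profile: dict, solicitation: dict) -> str:
        naics_ok = solicitation["naics"] in profile["naics_codes"]
        set_aside: Optional[str] = solicitation.get("set_aside")
        eligible = set_aside is None or set_aside in profile["certifications"]
        return "bid" if naics_ok and eligible else "no_bid"


def run(backend: Backend, profile: dict, solicitations: list) -> list:
    # The caller depends only on the interface, so an LLM-backed class
    # with the same method could be swapped in without changing this code.
    return [backend.recommend(profile, s) for s in solicitations]


profile = {"naics_codes": ["541512"], "certifications": ["8(a)"]}
solicitations = [
    {"naics": "541512", "set_aside": None},
    {"naics": "541512", "set_aside": "SDVOSB"},
    {"naics": "236220"},
]
decisions = run(NaicsHeuristicBackend(), profile, solicitations)
print(decisions)
```

Because a rule like this checks codes rather than scope of work, it would also say bid on a requirement that is filed under a matching NAICS code but asks for something the company does not do.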

Install

git clone https://github.com/ab75173/procurement-agent-evals.git
cd procurement-agent-evals
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

Run the evals

Offline heuristic baseline (no key needed):

procurement-agent-evals --backend heuristic --format md

Against Claude (needs ANTHROPIC_API_KEY):

export ANTHROPIC_API_KEY=...        # or put it in a gitignored .env
procurement-agent-evals --backend anthropic --model claude-opus-4-7

Sample output (heuristic baseline):

# Bid/No-Bid Eval — `heuristic` backend

**Accuracy:** 92.9%  (13/14)
**Precision:** 87.5%  •  **Recall:** 100.0%  •  **F1:** 0.93

| | predicted bid | predicted no_bid |
|---|---|---|
| **actual bid** | 7 | 0 |
| **actual no_bid** | 1 | 6 |

The heuristic's single miss is a false positive on the deliberate trap case (SOL-014): a hardware-resale requirement filed under a services NAICS code. It matches on paper but is the wrong scope of work, exactly the kind of judgment call an LLM backend should get right where keyword screening doesn't.

The dataset

The dataset is data/company_profile.json plus data/solicitations.jsonl: 14 solicitations, each with a ground-truth bid/no_bid label and a one-line rationale. The set is balanced (7 bid, 7 no_bid) and spans eligibility traps (set-asides the company can't claim), out-of-scope work, oversized contracts, and adjacent-NAICS judgment calls. Point the CLI at your own data with --data-dir.
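If you bring your own data, a quick sanity check on the labels before running the evals can save a confusing report. The sketch below parses JSONL and counts labels; the field names `id`, `label`, and `rationale` and the two sample rows are invented for illustration and may not match the real schema.

```python
import io
import json
from collections import Counter

VALID_LABELS = {"bid", "no_bid"}


def load_jsonl(stream):
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in stream if line.strip()]


def label_counts(rows):
    """Count labels and reject anything other than bid/no_bid."""
    counts = Counter(row["label"] for row in rows)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return counts


# Invented sample rows standing in for a solicitations.jsonl file.
sample = io.StringIO(
    '{"id": "SOL-001", "label": "bid", "rationale": "Core NAICS match."}\n'
    "\n"
    '{"id": "SOL-002", "label": "no_bid", "rationale": "Set-aside not held."}\n'
)
rows = load_jsonl(sample)
counts = label_counts(rows)
print(dict(counts))
```

To check a real file, pass an open file handle in place of the `io.StringIO` object.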

Develop

pytest        # full suite, fully offline
ruff check .  # lint

License

MIT — see LICENSE.
