A bid/no-bid agent for federal contract solicitations, paired with a functional eval harness that grades the agent's recommendations against a labeled dataset.
The interesting part isn't the agent — it's the harness. Anyone can prompt an LLM to say "bid" or "no bid." The question a buyer (or a hiring manager) actually cares about is: how often is it right, and where does it fail? This repo answers that with a confusion matrix and precision/recall, not a vibe check.
Given a company profile (NAICS codes, capabilities, set-aside certifications, contract capacity) and a solicitation, the agent returns a structured recommendation:

    Recommendation(
        decision=Decision.BID,
        confidence=Confidence.HIGH,
        rationale="Strong cybersecurity fit; eligible SBA set-aside; within capacity.",
        key_factors=["NAICS 541512 matches", "capability match in scope of work"],
    )

The eval harness runs the agent over a labeled set of solicitations and reports accuracy, precision, recall, F1, and a confusion matrix.
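The metrics the harness reports can all be derived from four counts. The sketch below is one way to compute them, not the repo's actual implementation: `EvalReport` and `evaluate` are hypothetical names, and treating `bid` as the positive class is an assumption (it is consistent with the heuristic sample output shown later in this README, whose counts the usage lines reuse).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical container for a 2x2 confusion matrix ("bid" is positive)."""

    tp: int  # predicted bid, actual bid
    fp: int  # predicted bid, actual no_bid
    fn: int  # predicted no_bid, actual bid
    tn: int  # predicted no_bid, actual no_bid

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def evaluate(predicted: list[str], actual: list[str]) -> EvalReport:
    """Tally the confusion matrix from paired predictions and labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    return EvalReport(
        tp=sum(p == "bid" and a == "bid" for p, a in pairs),
        fp=sum(p == "bid" and a == "no_bid" for p, a in pairs),
        fn=sum(p == "no_bid" and a == "bid" for p, a in pairs),
        tn=sum(p == "no_bid" and a == "no_bid" for p, a in pairs),
    )


# Counts from the heuristic sample output: 7 bids right, 6 no_bids right,
# and 1 no_bid wrongly predicted as bid.
actual = ["bid"] * 7 + ["no_bid"] * 7
predicted = ["bid"] * 7 + ["bid"] + ["no_bid"] * 6
report = evaluate(predicted, actual)
print(f"{report.accuracy:.1%} {report.precision:.1%} {report.recall:.1%} {report.f1:.2f}")
# prints: 92.9% 87.5% 100.0% 0.93
```

Guarding each denominator keeps the properties defined when a backend never predicts `bid`, or when the dataset has no positive examples.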
- Pluggable backends behind one interface (`Backend`):
  - `AnthropicBackend`: calls Claude (`claude-opus-4-7`) with adaptive thinking, a prompt-cached company profile, and Pydantic-typed structured output.
  - `HeuristicBackend`: a transparent rule-based baseline. Deterministic, offline, and a useful yardstick to measure the LLM against.
- Offline by default. The agent's LLM call sits behind the `Backend` interface, so the test suite and CI run entirely on the heuristic backend, with no API key and no network.
- Composes with `sam-gov-mcp`. The `Solicitation` model mirrors that server's normalized opportunity shape, so opportunities pulled from SAM.gov can feed straight into this agent without a hard dependency between the two repos.
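To make the pluggable-backend idea concrete, here is a minimal sketch using `typing.Protocol`. It is an illustration under assumptions, not the repo's code: the method name `recommend`, the dict-shaped inputs and their keys, the `NaicsOnlyBackend` class, and every enum member other than `Decision.BID` and `Confidence.HIGH` are hypothetical. The repo describes its structured output as Pydantic-typed; plain dataclasses stand in here so the example needs only the standard library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class Decision(str, Enum):
    BID = "bid"
    NO_BID = "no_bid"  # assumed member name


class Confidence(str, Enum):
    HIGH = "high"
    MEDIUM = "medium"  # assumed member
    LOW = "low"        # assumed member


@dataclass
class Recommendation:
    decision: Decision
    confidence: Confidence
    rationale: str
    key_factors: list[str] = field(default_factory=list)


class Backend(Protocol):
    """Anything with a matching `recommend` method counts as a backend."""

    def recommend(self, profile: dict, solicitation: dict) -> Recommendation: ...


class NaicsOnlyBackend:
    """Toy rule-based backend: bid only when the NAICS code is in the profile."""

    def recommend(self, profile: dict, solicitation: dict) -> Recommendation:
        code = solicitation["naics"]  # assumed field name
        if code in profile["naics_codes"]:  # assumed field name
            factor = f"NAICS {code} matches"
            return Recommendation(Decision.BID, Confidence.MEDIUM, factor, [factor])
        factor = f"NAICS {code} is outside the company profile"
        return Recommendation(Decision.NO_BID, Confidence.HIGH, factor, [factor])


def run_agent(backend: Backend, profile: dict, solicitation: dict) -> Recommendation:
    """The caller depends only on the interface, never on a concrete backend."""
    return backend.recommend(profile, solicitation)


profile = {"naics_codes": ["541512"]}
rec = run_agent(NaicsOnlyBackend(), profile, {"naics": "541512"})
print(rec.decision.value, rec.key_factors)
```

Because `run_agent` only sees the protocol, a test suite can pass in a deterministic backend like this one while production code passes an LLM-backed one.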
Install from source in a virtual environment:

    git clone https://github.com/ab75173/procurement-agent-evals.git
    cd procurement-agent-evals
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -e ".[dev]"

Offline heuristic baseline (no key needed):

    procurement-agent-evals --backend heuristic --format md

Against Claude (needs `ANTHROPIC_API_KEY`):

    export ANTHROPIC_API_KEY=...  # or put it in a gitignored .env
    procurement-agent-evals --backend anthropic --model claude-opus-4-7

Sample output (heuristic baseline):
# Bid/No-Bid Eval — `heuristic` backend
**Accuracy:** 92.9% (13/14)
**Precision:** 87.5% • **Recall:** 100.0% • **F1:** 0.93
| | predicted bid | predicted no_bid |
|---|---|---|
| **actual bid** | 7 | 0 |
| **actual no_bid** | 1 | 6 |
The heuristic's single miss is the deliberate trap case (SOL-014): a hardware-resale
requirement filed under a services NAICS code. It matches on paper but is the wrong scope of
work — exactly the kind of judgment call an LLM backend should get right where keyword
screening doesn't.
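Aggregate scores say how often a backend is wrong; listing the misclassified IDs says where. The sketch below shows one way to surface misses like SOL-014. The `Result` tuple and the `misses` function are hypothetical names, and the first two rows are invented placeholders rather than entries from the dataset.

```python
from typing import NamedTuple


class Result(NamedTuple):
    solicitation_id: str
    predicted: str
    actual: str


def misses(results: list[Result]) -> dict[str, list[str]]:
    """Group misclassified solicitation IDs by error type."""
    out: dict[str, list[str]] = {"false_bid": [], "missed_bid": []}
    for r in results:
        if r.predicted == "bid" and r.actual == "no_bid":
            out["false_bid"].append(r.solicitation_id)
        elif r.predicted == "no_bid" and r.actual == "bid":
            out["missed_bid"].append(r.solicitation_id)
    return out


results = [
    Result("SOL-001", "bid", "bid"),        # invented placeholder row
    Result("SOL-002", "no_bid", "no_bid"),  # invented placeholder row
    Result("SOL-014", "bid", "no_bid"),     # the trap case: predicted bid, labeled no_bid
]
print(misses(results))
# prints: {'false_bid': ['SOL-014'], 'missed_bid': []}
```

Separating false bids from missed bids matters because the two errors cost different things: one wastes proposal effort, the other forfeits a winnable contract.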
`data/company_profile.json` plus `data/solicitations.jsonl`: 14 solicitations, each with a
ground-truth `bid`/`no_bid` label and a one-line rationale. The set is balanced (7/7) and
spans eligibility traps (set-asides the company can't claim), out-of-scope work, oversized
contracts, and adjacent-NAICS judgment calls. Point the CLI at your own data with
`--data-dir`.
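Before pointing the CLI at your own data, it can help to confirm that the labels are balanced. This is a rough sketch that assumes each JSONL row stores its ground-truth label under a `label` key; that field name, the `id` field, and both function names are guesses, so check them against the real schema. The demo writes two made-up rows to a temporary file rather than reading the repo's data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_solicitations(path: Path) -> list[dict]:
    """Read one JSON object per non-blank line (the JSONL convention)."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def label_balance(rows: list[dict], label_key: str = "label") -> Counter:
    """Count rows per ground-truth label; `label_key` is an assumed field name."""
    return Counter(row[label_key] for row in rows)


# Demo on a throwaway file with two made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "solicitations.jsonl"
    sample.write_text(
        '{"id": "SOL-A", "label": "bid"}\n{"id": "SOL-B", "label": "no_bid"}\n',
        encoding="utf-8",
    )
    balance = label_balance(load_solicitations(sample))

print(balance["bid"], balance["no_bid"])
```

A heavily skewed set makes accuracy misleading, since a backend that always answers `no_bid` would score well on a dataset that is mostly `no_bid`.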

    pytest        # full suite, fully offline
    ruff check .  # lint

MIT; see LICENSE.