tool-forge

Post-train a small instruction-tuned LLM for reliable multi-step tool use, evaluate it on the public Berkeley Function Calling Leaderboard (BFCL-V4), and serve it behind an OpenAI-compatible API with a minimal ReAct loop — all on a single 12 GB RTX 4070.

Base model: Qwen3-4B-Instruct-2507 (non-thinking, function-calling mode). The résumé artifact is a base → SFT → aligned BFCL accuracy table.

The constraint drives everything

One consumer GPU. On a 4070 that also drives the desktop, ~2 GB goes to the display, leaving ~10 GB usable. Every choice (4-bit QLoRA, adapter-disabled reference model, short context, batch size 1 with gradient accumulation) exists to fit that budget. Measured: serving bf16 Qwen3-4B-Instruct-2507 uses ~7.64 GiB for weights plus ~2.5 GiB for KV cache (18.5k tokens) at --gpu-memory-utilization 0.90 and --max-model-len 12288.
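As a rough cross-check of the measured KV-cache figure, the per-token footprint can be estimated from the model shape. The sketch below assumes the commonly published Qwen3-4B shape (36 layers, 8 KV heads, head dimension 128); those constants are not stated in this README, so treat them as assumptions and read them from the model's config.json if in doubt.

```python
# Assumed Qwen3-4B shape; these constants are NOT taken from this README.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16


def kv_bytes_per_token() -> int:
    """Bytes of KV cache one token occupies: keys + values for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_cache_gib(num_tokens: int) -> float:
    """KV cache size in GiB for a given number of cached tokens."""
    return num_tokens * kv_bytes_per_token() / 2**30


print(kv_bytes_per_token())             # 147456 bytes, i.e. 144 KiB per token
print(round(kv_cache_gib(18_500), 2))   # 2.54
```

Under those assumptions, 18.5k cached tokens come to roughly 2.5 GiB, which is in line with the measured number.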

Pipeline (do not reorder)

0. Baseline: serve the untouched base, run BFCL-V4, record per-category accuracy before any training.
1. Data: normalize xLAM/ToolACE to one schema, render to the chat template, validate every gold call through the verifier.
2. SFT: QLoRA, train-on-completions (mask prompt tokens).
3. Alignment: DPO (mine near-miss preference pairs) → GRPO (verifiable reward) as a stretch.
4. Eval: BFCL per-category + custom metrics (JSON-validity %, schema-compliance %, hallucinated-tool rate).
5. Serve + agent: merge adapter, serve via vLLM, run a ReAct loop.
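The "train-on-completions" idea in the SFT step can be illustrated with plain lists. This is a minimal sketch, not the project's training code: `build_labels` is a hypothetical helper and the token ids are toy values. In practice the training library usually builds these labels itself.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and completion; supervise only the completion tokens."""
    input_ids = prompt_ids + completion_ids
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


# Toy token ids, not real tokenizer output.
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

The model still attends to the prompt (tool specs and user query), but gradients only flow from the tool-call tokens it is supposed to produce.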

Architecture

The spine is schema.py — every module speaks its types; nothing speaks "xLAM-ese" past normalize. The pure core (schema / verify / normalize / split / format) is I/O-free and unit-tested; the edge (HF load, W&B, vLLM, trainers) is where side effects live.

Data flow (✅ built · 🔨 in progress · ⬜ planned):

flowchart LR
    XLAM[("xLAM rows<br/>HF, gated")]
    NORM["normalize.py ✅<br/>row → Conversation"]
    VERIFY["verify.py ✅<br/>validate gold calls"]
    SPLIT["split.py ✅<br/>seeded → train/dev/test"]
    FORMAT["format.py ✅<br/>→ chat template"]
    WANDB[("W&B artifact")]
    SCHEMA["schema.py ✅<br/>shared types"]

    XLAM --> NORM --> VERIFY --> SPLIT --> FORMAT --> WANDB
    SCHEMA -. types .-> NORM
    SCHEMA -. types .-> VERIFY
    SCHEMA -. types .-> SPLIT
    SCHEMA -. types .-> FORMAT
    VERIFY -. quarantine counts<br/>converter feedback .-> NORM

verify.py is reused three times — Phase 1 (filter gold), Phase 3 (grade generated calls, strict), Phase 5 (guard agent calls) — which is why it stays pure and dependency-free.
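To make the "pure and dependency-free" point concrete, here is a simplified, hand-rolled stand-in for such a verifier, plus one of the custom metrics built on top of it. It is a sketch under assumptions: the outcome member names, `verify_call`, and `hallucinated_tool_rate` are hypothetical, `tools` maps a tool name directly to its parameter schema, and only a small subset of JSON Schema is checked. The real verify.py is described as a JSON-Schema verifier and may differ substantially.

```python
from enum import Enum
from typing import Any


class VerificationOutcome(Enum):
    # Member names are hypothetical; the README only names the type.
    OK = "ok"
    UNKNOWN_TOOL = "unknown_tool"
    MISSING_REQUIRED = "missing_required"
    UNEXPECTED_ARGUMENT = "unexpected_argument"
    WRONG_TYPE = "wrong_type"


# Subset of JSON Schema type names mapped to Python types.
_JSON_TYPES: dict[str, Any] = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def verify_call(name: str, arguments: dict[str, Any], tools: dict[str, dict[str, Any]]) -> VerificationOutcome:
    """Check one call against the offered tools. No I/O, no global state."""
    schema = tools.get(name)
    if schema is None:
        return VerificationOutcome.UNKNOWN_TOOL
    properties = schema.get("properties", {})
    for required in schema.get("required", []):
        if required not in arguments:
            return VerificationOutcome.MISSING_REQUIRED
    for key, value in arguments.items():
        if key not in properties:
            return VerificationOutcome.UNEXPECTED_ARGUMENT
        type_name = properties[key].get("type")
        expected = _JSON_TYPES.get(type_name) if isinstance(type_name, str) else None
        if expected is None:
            continue  # untyped or unsupported type: not checked in this sketch
        # bool is a subclass of int in Python, so reject it explicitly for non-boolean fields.
        if isinstance(value, bool) and expected is not bool:
            return VerificationOutcome.WRONG_TYPE
        if not isinstance(value, expected):
            return VerificationOutcome.WRONG_TYPE
    return VerificationOutcome.OK


def hallucinated_tool_rate(outcomes: list[VerificationOutcome]) -> float:
    """Fraction of calls that named a tool that was never offered."""
    if not outcomes:
        return 0.0
    return sum(o is VerificationOutcome.UNKNOWN_TOOL for o in outcomes) / len(outcomes)


tools = {"get_weather": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}
outcomes = [
    verify_call("get_weather", {"city": "Paris"}, tools),  # valid call
    verify_call("get_wether", {"city": "Paris"}, tools),   # misspelled tool name
    verify_call("get_weather", {}, tools),                 # missing required argument
    verify_call("get_weather", {"city": 7}, tools),        # wrong argument type
]
print([o.value for o in outcomes])
print(hallucinated_tool_rate(outcomes))  # 0.25
```

Because the function takes plain data and returns an enum, the same code path can filter gold data, grade generations, and guard agent calls without pulling in any trainer or server dependency.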

Core types (schema.py):

classDiagram
    class ToolSpec {
        +str name
        +str description
        +dict parameters
    }
    class ToolCall {
        +str name
        +dict arguments
        +str id
    }
    class Conversation {
        +int id
        +str query
        +dict~str,ToolSpec~ tools
        +tuple~ToolCall~ gold_calls
    }
    class PreferencePair {
        +str query
        +dict~str,ToolSpec~ tools
        +ToolCall chosen
        +ToolCall rejected
        +VerificationOutcome rejection_reason
    }
    class VerificationResult {
        +VerificationOutcome result
        +str detail
    }
    Conversation o-- ToolSpec : tools
    Conversation o-- ToolCall : gold_calls
    PreferencePair o-- ToolSpec : tools
    PreferencePair o-- ToolCall : chosen / rejected
    VerificationResult ..> VerificationOutcome

Conversation (SFT) and PreferencePair (DPO) are siblings — both carry query + tools; one holds gold calls, the other a chosen/rejected pair.
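The sibling relationship can be sketched in Python from the fields in the class diagram. This is an illustration only: it uses stdlib frozen dataclasses rather than pydantic to stay dependency-free, and the `VerificationOutcome` members shown are hypothetical. The actual schema.py may be organized differently.

```python
from dataclasses import FrozenInstanceError, dataclass
from enum import Enum
from typing import Any


class VerificationOutcome(Enum):
    # Member names are hypothetical; the README only names the type.
    OK = "ok"
    MISSING_REQUIRED = "missing_required"


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    parameters: dict[str, Any]  # JSON Schema for the arguments


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, Any]
    id: str


@dataclass(frozen=True)
class Conversation:  # one SFT example
    id: int
    query: str
    tools: dict[str, ToolSpec]
    gold_calls: tuple[ToolCall, ...]


@dataclass(frozen=True)
class PreferencePair:  # one DPO example
    query: str
    tools: dict[str, ToolSpec]
    chosen: ToolCall
    rejected: ToolCall
    rejection_reason: VerificationOutcome


weather = ToolSpec(
    "get_weather",
    "Current weather for a city.",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
gold = ToolCall("get_weather", {"city": "Paris"}, "call_0")
conv = Conversation(0, "Weather in Paris?", {weather.name: weather}, (gold,))

# A preference pair reuses the same query and tools; only the payload differs.
pair = PreferencePair(
    conv.query,
    conv.tools,
    chosen=gold,
    rejected=ToolCall("get_weather", {}, "call_1"),
    rejection_reason=VerificationOutcome.MISSING_REQUIRED,
)
print(pair.query == conv.query)  # True
```

Freezing the types keeps the pure core honest: downstream modules can read a `Conversation` but cannot mutate it in place.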

Results

Base model Qwen/Qwen3-4B-Instruct-2507-FC, untouched, on BFCL-V4. Repro: greedy (temperature=0.0), bfcl-eval==2025.12.17, vLLM 0.21.0, bf16, --max-model-len 12288.

Stage	Non-Live AST	Live	Multi-turn	Overall (full V4)
base	88.31%	76.31%	17.50%	partial¹
+ SFT	—	—	—	—
+ aligned	—	—	—	—

¹ BFCL's single "Overall" also blends in the agentic section, which is not run here, so it is not meaningful as a headline; track the section columns instead. Agentic (web search, memory) is excluded by design (out of training distribution; non-deterministic external services).

Base per-category breakdown (single-turn)

Non-Live (hand-written)	Acc	Live (real prompts)	Acc
simple_python	95.25%	live_simple	78.68%
simple_java	64.00%	live_multiple	76.16%
simple_javascript	68.00%	live_parallel	62.50%
multiple	94.50%	live_parallel_multiple	66.67%
parallel	93.00%	live_relevance	87.50%
parallel_multiple	90.00%	live_irrelevance	81.22%
irrelevance	89.17%

Reading: already strong on Python AST (90–95%); weak on non-Python (java 64%, javascript 68%) and live parallel calls (62–67%) — the headroom SFT/DPO must target.

Base per-category breakdown (multi-turn)

Multi-turn category	Acc
base	25.50%
miss_func	21.50%
miss_param	17.50%
long_context	5.50%
overall	17.50%

Reading: the expected difficulty gradient, with base highest and long_context lowest. Stateful multi-step execution is the hardest BFCL section for a 4B model; the clean spread (not a flat near-zero) suggests the harness is scoring real behavior. This is the biggest headroom for SFT/alignment.
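The section headlines can be cross-checked against the per-category rows. The arithmetic below is an observation about the numbers in the two breakdown tables, not a statement of how BFCL's scoring code is written: the Non-Live AST figure is consistent with an unweighted mean over simple (itself the mean of the three languages), multiple, parallel, and parallel_multiple, with irrelevance left out, and the multi-turn overall is consistent with an unweighted mean of its four categories.

```python
from statistics import mean

# Single-turn, Non-Live: average the three "simple" languages first.
simple = mean([95.25, 64.00, 68.00])                 # python, java, javascript
non_live_ast = mean([simple, 94.50, 93.00, 90.00])   # simple, multiple, parallel, parallel_multiple

# Multi-turn: base, miss_func, miss_param, long_context.
multi_turn = mean([25.50, 21.50, 17.50, 5.50])

print(f"{non_live_ast:.2f}")  # 88.31
print(f"{multi_turn:.2f}")    # 17.50
```

The Live headline (76.31%) is not reproduced by an unweighted mean of the four live call categories (that gives about 71.00%), which suggests it is weighted by category size; the per-category sample counts are not listed here, so that remains an assumption.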

Status

✅ Scaffold: uv (Python 3.12.11), ruff + mypy --strict + pytest, all green.
✅ Pure core: schema.py (ToolSpec / ToolCall / PreferencePair), verify.py (JSON-Schema verifier), full tests.
✅ GPU verified, vLLM serving Qwen3-4B-Instruct-2507 confirmed on WSL2.
✅ Baseline (single-turn): Non-Live AST 88.31%, Live 76.31% (Qwen3-4B-Instruct-2507-FC, greedy).
✅ Baseline (multi-turn): Overall 17.50% (base 25.50% / miss_func 21.50% / miss_param 17.50% / long_context 5.50%).
🔨 Phase 1 (data pipeline): xLAM normalize → Conversation done — full-corpus gold-call validity 98.43% (1.57% genuine xLAM noise quarantined via verify). split.py (seeded sklearn) + format.py (byte-exact Qwen3-FC render) done. W&B artifact build next.

Setup

uv sync                 # core deps + dev tools (ruff, mypy, pytest)
uv run ruff check . && uv run mypy && uv run pytest

Serving + baseline eval (WSL2 — separate from the training env)

vLLM is installed imperatively (not in the lockfile) because serving and training have conflicting torch pins. The self-healing recipe lives in scripts/; run from a clean GPU:

uv pip install "vllm==0.21.0" --torch-backend auto
./scripts/serve_baseline.sh                  # terminal A: serve, wait for "Application startup complete"
./scripts/run_baseline_auto.sh single_turn   # terminal B: generate (resume loop survives WSL hangs)
./scripts/eval_bfcl.sh single_turn           # score → bfcl_scores/ (CPU-only; reads sealed gold)

Hard-won WSL2 pins, all encoded in serve_baseline.sh:

vllm==0.21.0 — 0.23 + CUDA 13 crashes on WSL2 (UVA unavailable).
VLLM_USE_FLASHINFER_SAMPLER=0 and no --kv-cache-dtype fp8 — anything that makes vLLM JIT-compile a FlashInfer kernel needs nvcc, which isn't installed. fp8 forces FlashInfer; bf16 KV stays on prebuilt FLASH_ATTN.
--max-model-len 12288 (not 4096) — BFCL reserves up to 4096 output tokens against the model's native 262K context, so the served window must hold prompt + 4096; 4096 overflows real prompts.
--gpu-memory-utilization 0.90 --enforce-eager — fits bf16 weights + KV in ~10 GB usable.
LD_LIBRARY_PATH → venv nvidia/*/lib — CUDA libs ship as pip wheels in the venv; after a WSL/shell restart the linker can't find libcudart.so.13 until this is re-exported.
Reap orphans after a crash — vllm serve spawns an EngineCore child that Ctrl-C does not kill (it keeps holding VRAM): pkill -9 -f 'EngineCore' (also bin/vllm, bin/bfcl).
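The --max-model-len reasoning in the list above comes down to simple budget arithmetic. A small sketch, with `prompt_budget` and `fits` as hypothetical helper names and the 4096-token reservation taken from the note above:

```python
RESERVED_OUTPUT_TOKENS = 4096  # output budget BFCL may request, per the note above


def prompt_budget(max_model_len: int, reserved_output: int = RESERVED_OUTPUT_TOKENS) -> int:
    """Tokens left for the prompt once the output reservation is subtracted."""
    return max(max_model_len - reserved_output, 0)


def fits(prompt_tokens: int, max_model_len: int) -> bool:
    """True if prompt plus the reserved output fits inside the served window."""
    return prompt_tokens <= prompt_budget(max_model_len)


print(prompt_budget(4096))    # 0: the reservation alone fills the window
print(prompt_budget(12288))   # 8192
print(fits(3000, 4096), fits(3000, 12288))  # False True
```

With a 4096-token window there is no room left for any prompt, while 12288 leaves 8192 tokens for tool specs and conversation history.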

Eval is pinned to bfcl-eval==2025.12.17. --skip-server-setup points BFCL at the running server; greedy decoding (--temperature 0.0) for reproducibility. Result/score dirs use absolute paths because BFCL resolves relative --result-dir/--score-dir against its package root, not your CWD. Agentic categories (web search, memory) are excluded — they need live external services.

Stack

Python 3.12 · uv · transformers + trl + peft + bitsandbytes (Unsloth backend) · vllm · bfcl-eval · pydantic v2 · jsonschema. Quality gates: ruff, mypy --strict, pytest.
