Paper: Hallucination Mitigation in LLM-Based Tool Recommendation: A Cross-Provider Architectural Ablation Study Across Two Model Generations
Authors: Lavdim Menxhiqi and Galia Marinova
Status: Under review (2026)
This repository contains the complete replication package for reproducing all results, tables, and figures reported in the paper. The study evaluates a database-grounded anti-hallucination architecture across 3 LLM providers, 2 model generations, and 6 ablation configurations, totaling 6,912 API calls with a 100% success rate.
| Finding | Detail |
|---|---|
| Architecture effectiveness | Generation-independent (avg hallucination rate at C5, standard mode: 6.8 % in Gen1, 7.5 % in Gen2) |
| Best model | GPT-5.2 at 3.3 % hallucination rate |
| Thinking vs standard | Standard models match or beat thinking variants |
| C3 anomaly | JSON enforcement alone increases hallucination (+9-15 pp) |
| Minimum effective config | M1 + M2 (C4) captures the bulk of full-architecture benefit |
llm-tool-recommendation-replication/
|
|-- README.md # This file
|-- LICENSE # CC BY 4.0
|-- CITATION.cff # Citation metadata
|-- DATA_DICTIONARY.md # JSON result field descriptions
|-- .gitignore # Git ignore rules
|
|-- data/
| |-- results/
| | |-- phase1/ # Gen1 standard models (72 files, 2592 queries)
| | |-- phase2/ # Gen1 thinking models (24 files, 864 queries)
| | |-- phase3a/ # Gen2 standard models (72 files, 2592 queries)
| | |-- phase3b/ # Gen2 thinking models (24 files, 864 queries)
| |-- tool-inventory/
| | |-- tool_inventory.json # Verified engineering tools (ground truth)
|
|-- analysis/ # Python analysis scripts (standard library; scipy for statistical_analysis.py)
| |-- analyze_results.py # Single-phase analysis
| |-- analyze_cross_generational.py # Cross-generational analysis
| |-- statistical_analysis.py # Wilcoxon, Friedman, effect sizes, CIs
| |-- classify_hallucinations.py # H1/H2/near-miss classification
| |-- compute_h2_rates.py # H2-only operational rates
| |-- compute_reviewer_metrics.py # Query-level metrics (P_any, E[H/Q])
| |-- compute_zero_diff.py # Zero-difference pair analysis
| |-- sensitivity_analysis.py # Detection threshold sweep
| |-- paired_wilcoxon_check.py # Standard vs thinking Wilcoxon tests
| |-- wilcoxon_and_expected_h.py # Combined tests and expected hallucinations
|
|-- evaluation-harness/ # .NET 8 evaluation framework (C#)
| |-- README.md # Setup instructions for re-running experiments
| |-- CadcomOnline.Evaluation.csproj # Project file
| |-- Program.cs # CLI entry point
| |-- appsettings.example.json # Configuration template (no secrets)
| |-- Clients/ # LLM provider clients
| |-- Services/ # Core evaluation services
| |-- Models/ # Data models
|
|-- figures/ # Interactive HTML visualizations
|
|-- supplementary/ # Supplementary materials
|
|-- database/ # Database schema and seed data
| |-- schema.sql # PostgreSQL table definitions
| |-- seed_tools.sql # Tool inventory insert script
|
|-- scripts/ # Utility scripts
    |-- extract_tool_inventory.py # Extract tool inventory from results
    |-- validate_results.py # Verify result file integrity
All analysis scripts require Python 3.10+ and use only the standard library, except statistical_analysis.py, which also requires scipy.
git clone https://github.com/nauka-lm/llm-tool-recommendation-replication.git
cd llm-tool-recommendation-replication

cd analysis
python analyze_cross_generational.py
python statistical_analysis.py
python classify_hallucinations.py

Open any HTML file from figures/ in a web browser.
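Before running the analysis scripts, it can be useful to confirm that the clone contains the number of result files listed in the repository layout (72, 24, 72, and 24 per phase). The sketch below is not a substitute for scripts/validate_results.py, whose checks are not described here. It assumes the result files carry a `.json` extension and sit directly inside each phase directory; both are assumptions, as are the function names.

```python
from pathlib import Path

# File counts per phase, taken from the repository layout in this README.
EXPECTED_FILES = {"phase1": 72, "phase2": 24, "phase3a": 72, "phase3b": 24}


def count_result_files(results_dir: Path, pattern: str = "*.json") -> dict[str, int]:
    """Count files matching `pattern` in each phase directory.

    The `.json` pattern and the flat directory layout are assumptions.
    A missing phase directory simply yields a count of 0.
    """
    return {
        phase: sum(1 for _ in (results_dir / phase).glob(pattern))
        for phase in EXPECTED_FILES
    }


def find_mismatches(counts: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {phase: (found, expected)} for every phase whose count is off."""
    return {
        phase: (counts.get(phase, 0), expected)
        for phase, expected in EXPECTED_FILES.items()
        if counts.get(phase, 0) != expected
    }


if __name__ == "__main__":
    # Run from the repository root.
    mismatches = find_mismatches(count_result_files(Path("data/results")))
    print(mismatches or "all phase directories contain the expected number of files")
```

An empty result from `find_mismatches` only says that the counts agree with the layout; it says nothing about the contents of the files.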
| Generation | Provider | Standard Model | Thinking Model |
|---|---|---|---|
| Gen1 | OpenAI | GPT-4.1 | o4-mini |
| Gen1 | Anthropic | Claude Sonnet 4.5 | Claude Sonnet 4.5 (thinking) |
| Gen1 | Google | Gemini 2.5 Flash-Lite | Gemini 2.5 Flash * |
| Gen2 | OpenAI | GPT-5.2 | GPT-5.2 (reasoning_effort=high) |
| Gen2 | Anthropic | Claude Sonnet 4.6 | Claude Sonnet 4.6 (adaptive) |
| Gen2 | Google | Gemini 3.1 Flash-Lite | Gemini 3.1 Flash-Lite (thinkingLevel=high) |
* Gen1 Google thinking uses Flash (not Flash-Lite) because Flash-Lite 2.5 does not support extended thinking. Gen2 Flash-Lite 3.1 supports both modes natively.
| Config | Context Builder (M1) | Closed Vocabulary (M2) | JSON Enforcement (M3) | Description |
|---|---|---|---|---|
| C0 | Off | Off | Off | Ungrounded baseline |
| C1 | On | Off | Off | Database context only |
| C2 | Off | On | Off | Tool name whitelist only |
| C3 | Off | Off | On | JSON format only |
| C4 | On | On | Off | Context + vocabulary |
| C5 | On | On | On | Full architecture |
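The configuration table can be read as three on/off switches per configuration. The sketch below encodes it that way; the class and field names are illustrative and are not taken from the evaluation harness. It also makes one property of the design explicit: the six configurations cover six of the eight possible on/off combinations, and the two that are absent are M1 + M3 and M2 + M3.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One row of the ablation table (names here are illustrative)."""
    name: str
    context_builder: bool    # M1
    closed_vocabulary: bool  # M2
    json_enforcement: bool   # M3

    @property
    def flags(self) -> tuple[bool, bool, bool]:
        return (self.context_builder, self.closed_vocabulary, self.json_enforcement)

    @property
    def active_mechanisms(self) -> list[str]:
        return [label for label, on in zip(("M1", "M2", "M3"), self.flags) if on]


CONFIGS = [
    AblationConfig("C0", False, False, False),  # Ungrounded baseline
    AblationConfig("C1", True, False, False),   # Database context only
    AblationConfig("C2", False, True, False),   # Tool name whitelist only
    AblationConfig("C3", False, False, True),   # JSON format only
    AblationConfig("C4", True, True, False),    # Context + vocabulary
    AblationConfig("C5", True, True, True),     # Full architecture
]

# Combinations of (M1, M2, M3) that the study does not evaluate.
covered = {cfg.flags for cfg in CONFIGS}
untested = [combo for combo in product((False, True), repeat=3) if combo not in covered]

for cfg in CONFIGS:
    print(cfg.name, "+".join(cfg.active_mechanisms) or "none")
```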
| Domain | Description | Tools in DB |
|---|---|---|
| D1 | PCB Design Tool Selection | 8 |
| D2 | PCB Design Calculators | 15 |
| D3 | SMPS Design (Power Supplies, Converters, Regulators) | 8 |
| D4 | Transformer Design | 10 |
| Metric | Formula | Ideal |
|---|---|---|
| Hallucination Rate (HR) | Hallucinated tools / Total mentioned tools | 0 % |
| Grounding Rate (GR) | 100 % − HR | 100 % |
| Workflow Coverage (WC) | Distinct workflow stages / 6 | 100 % |
| Response Consistency (RC) | Jaccard similarity across repetitions | 100 % |
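The four formulas can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the harness implementation: tool matching here is exact string membership (the paper's detector uses a threshold, see sensitivity_analysis.py), RC is taken as the mean pairwise Jaccard similarity over repetitions (the table does not say how repetitions are aggregated), and the empty-input conventions are choices made for this sketch. All rates are returned as fractions in [0, 1].

```python
from itertools import combinations


def hallucination_rate(mentioned: list[str], inventory: set[str]) -> float:
    """HR = hallucinated tools / total mentioned tools (0.0 if none mentioned)."""
    if not mentioned:
        return 0.0
    hallucinated = sum(1 for tool in mentioned if tool not in inventory)
    return hallucinated / len(mentioned)


def grounding_rate(mentioned: list[str], inventory: set[str]) -> float:
    """GR = 1 - HR."""
    return 1.0 - hallucination_rate(mentioned, inventory)


def workflow_coverage(stages_covered: set[str], total_stages: int = 6) -> float:
    """WC = distinct workflow stages covered / 6."""
    return len(stages_covered) / total_stages


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 1.0 when both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def response_consistency(repetitions: list[set[str]]) -> float:
    """Mean pairwise Jaccard similarity across repetitions (an assumption)."""
    pairs = list(combinations(repetitions, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder tool names, not entries from tool_inventory.json.
inventory = {"ToolA", "ToolB", "ToolC"}
mentioned = ["ToolA", "ToolB", "ToolX", "ToolY"]
print(hallucination_rate(mentioned, inventory))  # 0.5
print(grounding_rate(mentioned, inventory))      # 0.5
```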
| Provider | Standard | Thinking |
|---|---|---|
| OpenAI (GPT-4.1 / o4-mini) | 5.3 % | 6.6 % |
| Anthropic (Claude Sonnet 4.5) | 11.3 % | 11.3 % |
| Google (Flash-Lite / Flash) | 3.8 % | 5.5 % |
| Cross-provider average | 6.8 % | 7.8 % |
| Provider | Standard | Thinking |
|---|---|---|
| OpenAI (GPT-5.2) | 3.3 % | 3.3 % |
| Anthropic (Claude Sonnet 4.6) | 13.3 % | 14.9 % |
| Google (Gemini 3.1 Flash-Lite) | 5.9 % | 5.2 % |
| Cross-provider average | 7.5 % | 7.8 % |
Under the full architecture (C5), the cross-provider average hallucination rate in standard mode differs by 0.7 percentage points between Gen1 and Gen2 (6.8 % → 7.5 %), which supports the generational-stability finding.
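The cross-provider averages in the two tables above are consistent with an unweighted mean of the three per-provider rates, rounded to one decimal place. The sketch below recomputes them from the tabulated values only; it does not read the result files, and the dictionary layout and function name are illustrative.

```python
# C5 hallucination rates in percent, copied from the two tables above.
C5_HALLUCINATION_RATE = {
    ("Gen1", "standard"): {"OpenAI": 5.3, "Anthropic": 11.3, "Google": 3.8},
    ("Gen1", "thinking"): {"OpenAI": 6.6, "Anthropic": 11.3, "Google": 5.5},
    ("Gen2", "standard"): {"OpenAI": 3.3, "Anthropic": 13.3, "Google": 5.9},
    ("Gen2", "thinking"): {"OpenAI": 3.3, "Anthropic": 14.9, "Google": 5.2},
}


def cross_provider_average(rates: dict[str, float]) -> float:
    """Unweighted mean over providers, rounded to one decimal place."""
    return round(sum(rates.values()) / len(rates), 1)


averages = {key: cross_provider_average(r) for key, r in C5_HALLUCINATION_RATE.items()}
for (generation, mode), avg in averages.items():
    print(f"{generation} {mode}: {avg} %")

# Generational shift in standard mode, in percentage points.
shift = round(averages[("Gen2", "standard")] - averages[("Gen1", "standard")], 1)
print(f"Gen1 -> Gen2 standard-mode shift: {shift} pp")  # 0.7 pp
```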
Re-executing the 6,912 API calls is not required to verify the paper's claims. If you want to re-run the experiments, you need:
- .NET 8.0 SDK
- PostgreSQL 14+
- API keys for OpenAI, Anthropic, and Google
- Set up the database:
  psql -U postgres -d your_database -f database/schema.sql
  psql -U postgres -d your_database -f database/seed_tools.sql
- Configure API keys:
  cd evaluation-harness
  cp appsettings.example.json appsettings.json
  # Edit appsettings.json with your API keys and database connection
- Build and run:
  dotnet build
  dotnet run
See evaluation-harness/README.md for detailed instructions.
| Phase | API Calls | Estimated Cost | Estimated Time |
|---|---|---|---|
| Phase 1 (Gen1 standard) | 2,592 | ~$15 | ~3 hours |
| Phase 2 (Gen1 thinking) | 864 | ~$8 | ~1.5 hours |
| Phase 3a (Gen2 standard) | 2,592 | ~$25 | ~3 hours |
| Phase 3b (Gen2 thinking) | 864 | ~$15 | ~1.5 hours |
| Total | 6,912 | ~$63 | ~9 hours (sequential) |
The CLI mode supports parallel execution per (provider, category, domain) to compress wall-clock time.
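The totals in the table can be checked with simple arithmetic. The sketch below uses only figures stated in this README: the per-phase call counts, cost and time estimates, and the file counts from the repository layout. Dividing calls by files gives 36 calls per result file in every phase. The time total is the sequential estimate and does not model the parallel CLI mode.

```python
# Per-phase figures copied from the table above; file counts come from
# the repository layout (data/results/phase*/).
PHASES = {
    "Phase 1 (Gen1 standard)":  {"calls": 2592, "files": 72, "cost_usd": 15, "hours": 3.0},
    "Phase 2 (Gen1 thinking)":  {"calls": 864,  "files": 24, "cost_usd": 8,  "hours": 1.5},
    "Phase 3a (Gen2 standard)": {"calls": 2592, "files": 72, "cost_usd": 25, "hours": 3.0},
    "Phase 3b (Gen2 thinking)": {"calls": 864,  "files": 24, "cost_usd": 15, "hours": 1.5},
}

total_calls = sum(p["calls"] for p in PHASES.values())
total_cost = sum(p["cost_usd"] for p in PHASES.values())
total_hours = sum(p["hours"] for p in PHASES.values())
calls_per_file = {name: p["calls"] // p["files"] for name, p in PHASES.items()}

print(total_calls)   # 6912
print(total_cost)    # 63
print(total_hours)   # 9.0
print(set(calls_per_file.values()))  # {36}
```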
| Script | Paper Section | Description |
|---|---|---|
| analyze_results.py | Tables 6–8 | Single-phase ablation analysis |
| analyze_cross_generational.py | Tables 9–13, Figs 7–9 | Cross-generational comparison |
| statistical_analysis.py | Sections IV–V | Wilcoxon signed-rank, Friedman, effect sizes |
| classify_hallucinations.py | Section V.4 | H1/H2/near-miss/non-specific classification |
| compute_h2_rates.py | Section V.4 | H2-only operational hallucination rates |
| compute_reviewer_metrics.py | Tables 7, 12 | P_any, P_any_H2, E[H/Q] metrics |
| compute_zero_diff.py | Section III.7 | Zero-difference pair analysis |
| sensitivity_analysis.py | Section V.8 | Detection threshold sensitivity sweep |
| paired_wilcoxon_check.py | Section IV.4 | Standard vs thinking Wilcoxon tests |
| wilcoxon_and_expected_h.py | Tables 7, 12 | Combined statistical tests |
All scripts read from ../data/results/ and use only Python standard library modules, except statistical_analysis.py, which also uses scipy.
This replication package is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
- Lavdim Menxhiqi — lavdim.menxhiqi@ubt-uni.net (Technical University of Sofia, Bulgaria)
- Galia Marinova — gim@tu-sofia.bg (Technical University of Sofia, Bulgaria)