Built-in evaluation system for measuring and improving RAG quality.
QuantumRAG includes a comprehensive evaluation system that measures retrieval and generation quality across 6 metrics. It supports both synthetic QA generation and custom benchmark files.
```python
from quantumrag import Engine

engine = Engine()
engine.ingest("./docs")

result = engine.evaluate()
print(result.summary)

for metric in result.metrics:
    print(f"  {metric.name}: {metric.score:.2f}")

for suggestion in result.suggestions:
    print(f"  - {suggestion}")
```

| Metric | Description | Range |
|---|---|---|
| retrieval_recall | Percentage of relevant chunks retrieved | 0.0 - 1.0 |
| faithfulness | Percentage of answer supported by retrieved sources | 0.0 - 1.0 |
| answer_relevancy | Percentage of answer that directly addresses the query | 0.0 - 1.0 |
| completeness | Percentage of query aspects covered in the answer | 0.0 - 1.0 |
| latency | Average query processing time | seconds |
| cost | Average cost per query | USD |
- **Retrieval Recall**: Measures whether the retrieval pipeline finds the right chunks. High recall means relevant information isn't being missed.
- **Faithfulness**: Measures hallucination risk. Every claim in the answer should be supported by a retrieved source. Low faithfulness indicates the model is generating unsupported claims.
- **Answer Relevancy**: Measures how directly the answer addresses the query. Low relevancy indicates off-topic or overly broad responses.
- **Completeness**: Measures whether all aspects of a multi-part question are addressed. Low completeness indicates partial answers.
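To make the claim-level scoring concrete, here is a toy sketch of a faithfulness score: the fraction of answer claims supported by at least one retrieved source. This is illustrative only; QuantumRAG's actual scorer is not shown here, and the substring check is a crude stand-in for real entailment checking.

```python
# Toy faithfulness scorer: fraction of answer claims that appear in a
# retrieved source. Substring matching is a simplification; a real
# implementation would use semantic entailment, not exact containment.
def faithfulness(claims: list[str], sources: list[str]) -> float:
    if not claims:
        return 1.0  # an empty answer makes no unsupported claims
    supported = sum(any(claim in src for src in sources) for claim in claims)
    return supported / len(claims)

score = faithfulness(
    claims=["Jane Doe is the CTO", "Revenue doubled in 2024"],
    sources=["Jane Doe is the CTO of the company."],
)
print(score)  # → 0.5 (one of two claims supported)
```

A score of 0.5 here signals that half the answer's claims have no grounding in the retrieved context, i.e. likely hallucination.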
When no benchmark file is provided, QuantumRAG auto-generates QA pairs from indexed documents:
```python
result = engine.evaluate(sample_count=20)
```

The synthetic QA generator:
- Selects representative document sections
- Uses LLM to generate questions per section
- Extracts expected answers from chunk content
- Classifies difficulty (simple / medium / hard)
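The difficulty-classification step might look something like the following heuristic. This is a hypothetical sketch: the hop count (how many chunks the expected answer spans) and the keyword list are invented for illustration, not QuantumRAG's documented rules.

```python
# Hypothetical difficulty heuristic for generated QA pairs.
# answer_chunk_count and the reasoning-cue keywords are assumptions.
def classify_difficulty(question: str, answer_chunk_count: int) -> str:
    if answer_chunk_count >= 2:
        return "hard"    # answer requires synthesizing multiple chunks
    reasoning_cues = ("why", "compare", "how many", "difference")
    if any(cue in question.lower() for cue in reasoning_cues):
        return "medium"  # single chunk, but requires reasoning
    return "simple"      # direct fact lookup in one chunk

print(classify_difficulty("Who is the CTO?", 1))  # → simple
```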
Provide your own QA benchmark as JSON:

```json
[
  {
    "question": "What chunking strategies does QuantumRAG support?",
    "expected_answer": "auto, structural, semantic, fixed",
    "reference_chunks": ["chunk_id_1", "chunk_id_2"]
  },
  {
    "question": "Who is the CTO?",
    "expected_answer": "Jane Doe",
    "reference_chunks": ["chunk_id_3"]
  }
]
```

Then pass it to `evaluate()`:

```python
result = engine.evaluate(benchmark_file="benchmark.json")
```

Via CLI:

```shell
quantumrag evaluate --benchmark benchmark.json
```

Via API:

```shell
curl -X POST http://localhost:8000/v1/evaluate \
  -H "Content-Type: application/json" \
  -d '{"benchmark_file": "benchmark.json", "sample_count": 50}'
```

Compare two configurations side by side:
```python
from quantumrag.core.evaluate.evaluator import Evaluator

evaluator = Evaluator(engine)
comparison = await evaluator.compare(
    config_a={"retrieval": {"top_k": 5}},
    config_b={"retrieval": {"top_k": 10}},
    sample_count=20,
)
```

QuantumRAG includes a comprehensive E2E scenario test suite with 87 test cases across 16 categories and 4 difficulty levels.
| # | Category | Tests | Description |
|---|---|---|---|
| S1 | Factual Confirmation | 7 | Basic fact retrieval, personnel, dates |
| S2 | Multi-Hop Reasoning | 6 | Cross-document information fusion |
| S3 | Numerical Calculations | 6 | Math, percentages, comparisons |
| S4 | Temporal Reasoning | 6 | Timeline, changelog, version tracking |
| S5 | Negation/Exclusion | 5 | "Not supported", incomplete features |
| S6 | Cross-Document Synthesis | 5 | Multi-source data integration |
| S7 | Paraphrase Robustness | 6 | Colloquial and rephrased queries |
| S8 | Multi-Turn Conversation | 5 | Coreference resolution, entity tracking |
| S9 | Edge Cases | 7 | Boundary inputs, adversarial queries |
| S10 | Precision Search | 6 | Fine-grained detail extraction |
| S11 | Implicit Inference | 5 | Information not directly stated |
| S12 | Competitive Analysis | 3 | Market positioning, competitor comparison |
| S13 | Conditional Reasoning | 5 | IF/THEN scenarios, sufficiency checks |
| S14 | Multi-Constraint Filtering | 5 | Multiple criteria intersection |
| S15 | Derived Quantitative | 5 | Calculations from multiple sources |
| S16 | Cross-Verification | 4 | Consistency checks across documents |
| Level | Count | Description |
|---|---|---|
| Easy | 18 | Single-hop factual queries |
| Medium | 37 | Multi-step reasoning |
| Hard | 21 | Cross-document synthesis |
| Extreme | 6 | Complex conditional + aggregation |
```shell
uv run python tests/run_scenario_tests.py
```

Output includes per-scenario pass/fail with latency and confidence details.
A structured approach to RAG validation using real-world web content.
Each dataset tests specific RAG capabilities in isolation:
```shell
.venv/bin/python datasets/run_qa.py ds-001   # Run specific dataset
.venv/bin/python datasets/run_qa.py          # Auto-select latest
```

Features:
- Ingest caching: Skip re-ingest if sources unchanged (SHA256 hash check)
- Per-query timeout: 120s limit prevents slow queries from blocking
- Parallel execution: 3 concurrent queries
- Auto-graduation: When pass_rate >= threshold for min_runs, status updates automatically
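The ingest-caching check above reduces to hashing the source files and comparing against a recorded digest. A minimal sketch, assuming an illustrative cache-file location and helper names (not the harness's actual implementation):

```python
# Sketch of a SHA256 source-hash check used to skip re-ingest.
# Function names and the cache-file convention are illustrative.
import hashlib
from pathlib import Path

def sources_hash(source_dir: str) -> str:
    """Hash every file's name and bytes, in stable path order."""
    digest = hashlib.sha256()
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file():
            digest.update(path.name.encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def needs_reingest(source_dir: str, cache_file: str) -> bool:
    """True when sources changed since the recorded hash."""
    current = sources_hash(source_dir)
    cache = Path(cache_file)
    if cache.exists() and cache.read_text() == current:
        return False  # unchanged: skip re-ingest
    cache.write_text(current)
    return True
```

Sorting the paths makes the digest deterministic across filesystems, so an unchanged corpus always produces the same hash.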
| Dataset | Focus | Sources | Questions | Threshold |
|---|---|---|---|---|
| ds-001 | Multilingual + numerical precision | 4 | 20 | 85% |
| ds-002 | Type system + cross-topic confusion | 6 | 25 | 80% |
| ds-003 | Dense technical + cross-document | 7 | 30 | 75% |
| ds-004 | Table extraction + contradiction detection | 6 | 30 | 75% |
Merges all datasets into a single corpus to test retrieval under noise — what individual tests cannot catch.
```shell
.venv/bin/python datasets/run_qa_combined.py
```

Key metrics:
- Retrieval Recall: Did the engine find chunks from the correct source documents?
- Noise Ratio: What fraction of retrieved chunks are from irrelevant sources?
- Degradation: How much does pass rate drop vs individual tests?
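The first two metrics reduce to set arithmetic over source-document IDs. A hedged sketch (the runner's actual implementation may differ; the function and parameter names are illustrative):

```python
# Sketch of the combined-corpus retrieval metrics. Inputs are the
# source-document IDs of the retrieved chunks and the set of documents
# known to contain the answer.
def retrieval_metrics(retrieved_sources: list[str],
                      relevant_sources: set[str]) -> tuple[float, float]:
    found = set(retrieved_sources) & relevant_sources
    recall = len(found) / len(relevant_sources)  # coverage of relevant docs
    noise = (sum(1 for s in retrieved_sources if s not in relevant_sources)
             / len(retrieved_sources)) if retrieved_sources else 0.0
    return recall, noise

# Two chunks from ds-001, one from an irrelevant dataset:
recall, noise = retrieval_metrics(["ds1", "ds1", "ds9"], {"ds1", "ds2"})
# recall = 0.5 (ds2 never retrieved), noise ≈ 0.33 (one chunk off-topic)
```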
Baseline results (23 sources, ~300 chunks, 105 questions):
- Individual: 83% avg → Combined: 29% (a drop of 54 percentage points)
- Retrieval Recall: 9% — confirms retrieval is the key bottleneck at scale
- 68/75 failures are retrieval-caused, not generation-caused
/qa-create → /qa-run → /qa-analyze → /qa-improve → /qa-run (verify) → graduated
Full status: `datasets/STATUS.md`
```yaml
evaluation:
  auto_synthetic: true   # Auto-generate QA pairs when no benchmark provided
  metrics:
    - "retrieval_recall"
    - "faithfulness"
    - "answer_relevancy"
    - "completeness"
    - "latency"
    - "cost"
```