A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.
- Multi-field validation - Binary (True/False), scalar (single values), and list (multiple values) data types
- Partial labeling support - Handle datasets where different cases have labels for different subsets of fields
- Dual usage modes - Validate pre-computed results OR run live LLM inference with validation
- Comprehensive metrics - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
- Confidence analysis - Automatic performance breakdown by confidence levels
- Statistical uncertainty - Non-parametric bootstrap confidence intervals for all performance metrics
- Production ready - Parallel processing, intelligent caching, detailed progress tracking
# Install from PyPI
pip install llmvalidate
# OR install from source
pip install -r requirements.txt  # Python 3.11+ required

python runme.py

Processes the included samples.csv (14 test cases covering all validation scenarios) and outputs timestamped results to validation_results/samples/:
- Results CSV - Row-by-row comparison with confusion matrix counts and item-level details
- Metrics CSV - Aggregated performance statistics with confidence breakdowns
- CI Metrics CSV - Confidence intervals for metrics
| Rows | Field Type | Test Scenarios |
|---|---|---|
| 1-4 | Binary (Has metastasis) | True Positive, True Negative, False Positive, False Negative |
| 5-9 | Scalar (Diagnosis, Histology) | Correct, incorrect, missing, spurious, and empty extractions |
| 10-14 | List (Treatment Drugs, Test Results) | Perfect match, spurious items, missing items, correct empty, mixed results |
When you have LLM predictions in `Res: {Field Name}` columns:
import pandas as pd
from src.validation import validate
df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain: "Field Name" and "Res: Field Name" columns
results_df, metrics_df = validate(
source_df=df,
fields=["Diagnosis", "Treatment"], # or None for auto-detection
structure_callback=None,
output_folder="validation_results"
)

from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result
def llm_callback(row, i, raw_text_column_name):
raw_text = row[raw_text_column_name]
# Your LLM inference logic here
result = StructuredResult(
groups=[StructuredGroup(
group_name="medical",
fields=[
StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
StructuredField(name="Treatment", value=["Drug A"], confidence="Medium")
]
)]
)
return flatten_structured_result(result), {}
results_df, metrics_df = validate(
source_df=df,
fields=["Diagnosis", "Treatment"],
structure_callback=llm_callback,
raw_text_column_name="medical_report",
output_folder="validation_results",
max_workers=4
)

- Unique index - Each row must have a unique identifier (e.g., "Patient ID")
- Label columns - Ground truth values for each field you want to validate
- Result columns (Mode 1 only) - LLM predictions as `Res: {Field Name}` columns
- Raw text column (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")
| Type | Description | Label Examples | Result Examples |
|---|---|---|---|
| Binary | True/False detection | True, False | True, False |
| Scalar | Single text/numeric value | "Lung Cancer", 42 | "Breast Cancer", 38 |
| List | Multiple values | ["Drug A", "Drug B"], "['Item1', 'Item2']" | ["Drug A"], [] |
- `"-"` = Labeled as "No information is available in the source document"
- `null`/empty/`NaN` = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
- Lists - Can be Python lists `["a", "b"]` or stringified `"['a', 'b']"` (auto-converted)
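The stringified-list auto-conversion can be sketched with `ast.literal_eval`; the `to_list` helper below is hypothetical and only illustrates the idea (the framework's actual converter in src/utils.py may normalize values differently):

```python
import ast

def to_list(value):
    """Hypothetical helper: accept a real list or a stringified one."""
    if isinstance(value, list):
        return value  # already a Python list
    if isinstance(value, str) and value.strip().startswith("["):
        return ast.literal_eval(value)  # parse "['a', 'b']" into ['a', 'b']
    return [value]  # wrap a single scalar

print(to_list("['a', 'b']"))  # ['a', 'b']
print(to_list(["a", "b"]))    # ['a', 'b']
```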
The framework supports partial labeling scenarios where:
- Not every case needs labels for every field
- Different cases can have labels for different subsets of fields
- Missing labels (`null`/`NaN`) are handled gracefully in all metrics calculations
- Use `"-"` when the document explicitly lacks information about a field
- Use `null`/`NaN` when the field simply wasn't labeled for that case
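A partially labeled dataset might look like the following hypothetical example, where each case carries labels for only some fields:

```python
import numpy as np
import pandas as pd

# Hypothetical labels: P2's Diagnosis was never labeled (NaN),
# while P3's document explicitly lacks a Diagnosis ("-").
df = pd.DataFrame(
    {
        "Diagnosis": ["Lung Cancer", np.nan, "-"],
        "Treatment": [np.nan, "['Drug A']", "-"],
    },
    index=["P1", "P2", "P3"],
)

# Only non-null cells count as labeled; "-" still counts as a label.
labeled_diagnosis = df["Diagnosis"].notna().sum()
print(labeled_diagnosis)  # 2
```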
The framework generates two timestamped CSV files for each validation run:
Row-level analysis with detailed per-case metrics:
Original Data:
- All input columns (labels, raw text, etc.)
- `Res: {Field}` columns with LLM predictions
- `Res: {Field} confidence` and `Res: {Field} justification` (if available)
Binary Fields:
- `TP/FP/FN/TN: {Field}` - Confusion matrix counts (1 or 0 per row)
Non-Binary Fields:
- `Cor/Inc/Mis/Spu: {Field}` - Item counts per row
- `Cor/Inc/Mis/Spu: {Field} items` - Actual item lists
- `Precision/Recall/F1/F2: {Field}` - Per-row metrics (list fields only)
System Columns:
- `Sys: from cache` - Whether the result was served from cache (speeds up duplicate text)
- `Sys: exception` - Error information if processing failed
- `Sys: time taken` - Processing time per row in seconds
Aggregated statistics with confidence breakdowns:
Core Information:
- `field` - Field name being evaluated
- `confidence` - Confidence level ("Overall", "High", "Medium", "Low", etc.)
- `labeled cases` - Total rows with ground truth labels
- `field-present cases` - Rows where the document has information about the field (label is not `"-"`)
Binary Metrics: TP, TN, FP, FN, precision, recall, F1/F2, accuracy, specificity
Non-Binary Metrics: cor, inc, mis, spu, precision/recall/F1/F2 (micro), precision/recall/F1/F2 (macro)
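The difference between micro and macro aggregation can be sketched with per-row counts (an illustrative toy example, not the framework's internal code):

```python
# Two hypothetical rows with per-row correct (cor) and spurious (spu) counts.
rows = [
    {"cor": 2, "spu": 0},  # row precision = 1.0
    {"cor": 1, "spu": 3},  # row precision = 0.25
]

# Micro: pool counts across all rows, then compute the metric once.
total_cor = sum(r["cor"] for r in rows)
total_spu = sum(r["spu"] for r in rows)
micro_precision = total_cor / (total_cor + total_spu)

# Macro: compute the metric per row, then average the row-level values.
macro_precision = sum(r["cor"] / (r["cor"] + r["spu"]) for r in rows) / len(rows)

print(micro_precision)  # 3/6 = 0.5
print(macro_precision)  # (1.0 + 0.25) / 2 = 0.625
```

Micro weights each item equally (rows with many items dominate), while macro weights each row equally.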
For fields with True/False values (e.g., "Has metastasis"):
| Count | Definition | Example |
|---|---|---|
| TP (True Positive) | Correctly predicted positive | Label: True, Prediction: True → TP=1 |
| TN (True Negative) | Correctly predicted negative | Label: False, Prediction: False → TN=1 |
| FP (False Positive) | Incorrectly predicted positive | Label: False, Prediction: True → FP=1 |
| FN (False Negative) | Incorrectly predicted negative | Label: True, Prediction: False → FN=1 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of all positive predictions, how many were correct? |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found? |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall percentage of correct predictions |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many were correctly identified? |
For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):
| Count | Definition | Example |
|---|---|---|
| Correct (Cor) | Items extracted correctly | Label: ["DrugA", "DrugB"], Prediction: ["DrugA"] → Cor=1 |
| Missing (Mis) | Items present in label but not extracted | (Same example) → Mis=1 (DrugB missing) |
| Spurious (Spu) | Items extracted but not in label | Label: ["DrugA"], Prediction: ["DrugA", "DrugC"] → Spu=1 |
| Incorrect (Inc) | Wrong values for scalar fields | Label: "Cancer", Prediction: "Diabetes" → Inc=1 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | Cor / (Cor + Spu + Inc) | Of all extracted items, how many were correct? |
| Recall | Cor / (Cor + Mis + Inc) | Of all labeled items, how many were correctly extracted? |
Note: For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.
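For a list field, the Cor/Mis/Spu counts amount to a set comparison; the sketch below is illustrative (the framework's real matching may normalize strings or handle duplicates differently):

```python
# Hypothetical label and prediction for a list field.
label = ["DrugA", "DrugB"]
prediction = ["DrugA", "DrugC"]

cor = len(set(label) & set(prediction))  # items in both → correct
mis = len(set(label) - set(prediction))  # labeled but not extracted → missing
spu = len(set(prediction) - set(label))  # extracted but not labeled → spurious
inc = 0  # typically 0 for list fields

precision = cor / (cor + spu + inc)
recall = cor / (cor + mis + inc)
print(cor, mis, spu)       # 1 1 1
print(precision, recall)   # 0.5 0.5
```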
The following formulas apply to both binary classification and structured extraction metrics:
| Metric | Formula | Meaning |
|---|---|---|
| F1 Score | 2 × (P × R) / (P + R) | Balanced harmonic mean of precision and recall |
| F2 Score | 5 × (P × R) / (4P + R) | Recall-weighted F-score (emphasizes recall over precision) |
Where P = Precision and R = Recall (calculated differently for each metric type).
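Both scores are instances of the general F-beta formula, (1 + β²)PR / (β²P + R), with β = 1 and β = 2 respectively; a quick sketch:

```python
def f_beta(p: float, r: float, beta: float) -> float:
    """General F-beta score; beta=1 gives F1, beta=2 gives F2."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.8, 0.5  # example precision and recall
f1 = f_beta(p, r, 1)  # 2PR / (P + R)
f2 = f_beta(p, r, 2)  # 5PR / (4P + R)
print(round(f1, 4))  # 0.6154
print(round(f2, 4))  # 0.5405 — lower than F1, since recall is the weaker metric here
```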
The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.
from src.validation import bootstrap_CI
# After running validation to get results_df
ci_results = bootstrap_CI(
res_df=results_df, # Results from validate() function
fields=["diagnosis", "treatment"], # Fields to analyze (or None for auto-detect)
n_bootstrap=5000, # Number of bootstrap samples (default: 5000)
ci=0.95, # Confidence level (default: 0.95 for 95% CI)
random_state=42 # For reproducible results
)

- Resampling unit: Individual cases (not individual predictions)
- Resampling strategy: Sample with replacement to preserve original dataset size
- CI calculation: Percentile method using bootstrap distribution
- Partial labeling: Handles missing labels gracefully - cases with missing labels for specific fields are excluded from calculations for those fields only
- Metrics included: All validation metrics (precision, recall, F1, accuracy, etc.)
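A minimal sketch of the case-level percentile bootstrap, assuming a single per-case metric (bootstrap_CI itself handles many fields and metrics at once; the data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
per_case_correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])  # hypothetical per-case accuracy

n = len(per_case_correct)
boot_means = np.empty(5000)
for b in range(5000):
    # Resample cases with replacement, preserving the original dataset size.
    sample = rng.choice(per_case_correct, size=n, replace=True)
    boot_means[b] = sample.mean()

# Percentile method: take the 2.5th and 97.5th percentiles for a 95% CI.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(per_case_correct.mean(), lower, upper)
```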
The `bootstrap_CI()` function returns a DataFrame with confidence intervals for each field:
| Column | Description |
|---|---|
| `field` | Field name (including `exceptions` for system metrics and `N={n}; CI={level}%` for parameters) |
| `labeled cases` | Number of labeled cases in the dataset |
| `{metric}: mean` | Bootstrap mean estimate |
| `{metric}: lower` | Lower bound of confidence interval |
| `{metric}: upper` | Upper bound of confidence interval |
Example output:
field labeled cases precision (micro): mean precision (micro): lower precision (micro): upper
0 exceptions 1000 NaN NaN NaN
1 diagnosis 1000 0.82 0.79 0.85
2 treatment 1000 0.91 0.88 0.94
3 N=5000; CI=95% NaN NaN NaN NaN
The final row contains bootstrap parameters for reference: sample size (N) and confidence interval level (CI).
- Performance assessment: Quantify uncertainty in reported metrics
- Model comparison: Determine if performance differences are statistically significant
- Sample size planning: Understand precision of estimates with current dataset size
- Publication: Report confidence intervals alongside point estimates
validate(
source_df=df,
fields=["diagnosis", "treatment"],
structure_callback=callback,
max_workers=None, # Auto-detect CPU count (or specify number)
use_threads=True # True for I/O-bound (LLM API calls), False for CPU-bound
)

- Automatic caching - Identical raw text inputs are deduplicated and cached
- Progress tracking - Real-time progress bar for long-running validations
- Cache statistics - Check the `Sys: from cache` column in results to monitor cache hits
When LLM inference returns both extracted fields and their associated confidence levels, the framework automatically detects `Res: {Field} confidence` columns and generates:
- Separate metrics for each unique confidence level found in your data
- Overall metrics aggregating across all confidence levels
- Useful for setting confidence thresholds and analyzing prediction reliability
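The breakdown is essentially a group-by on the confidence column; a hypothetical sketch (column names and data are illustrative, not the framework's internal code):

```python
import pandas as pd

# Hypothetical per-row results: 1 = prediction matched the label.
results = pd.DataFrame({
    "correct": [1, 1, 0, 1, 0, 1],
    "Res: Diagnosis confidence": ["High", "High", "Low", "Medium", "Low", "High"],
})

# Accuracy per confidence level, plus the overall figure.
by_conf = results.groupby("Res: Diagnosis confidence")["correct"].mean()
print(by_conf.to_dict())         # {'High': 1.0, 'Low': 0.0, 'Medium': 1.0}
print(results["correct"].mean())  # overall accuracy across all levels
```

A pattern like this (high-confidence rows markedly more accurate) is what makes confidence thresholds useful.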
# Install development dependencies
pip install -r requirements.txt
# Run all tests
pytest
# Run with coverage reporting
pytest --cov=src
# Run specific test modules
pytest tests/validate_test.py # Core validation logic
pytest tests/compare_results_test.py # Comparison algorithms
pytest tests/compare_results_all_test.py # End-to-end comparisons

llm-validation-framework/
├── src/
│ ├── validation.py # Main validation pipeline and metrics calculation
│ ├── structured.py # Pydantic data models for LLM results
│ ├── utils.py # Utility functions (list conversion, flattening)
│ └── standardize.py # Data standardization helpers
├── tests/ # Comprehensive test suite
├── validation_results/ # Output directory (auto-created)
├── samples.csv # Demo dataset with all validation scenarios
├── runme.py # Demo script
└── requirements.txt # Dependencies (pandas, pydantic, tqdm, etc.)
| Error | Solution |
|---|---|
| "Cannot infer fields" | Ensure DataFrame has both `{Field}` and `Res: {Field}` columns when `structure_callback=None` |
| "Missing fields" | Verify the `fields` parameter contains column names that exist in your DataFrame |
| "Duplicate index" | Use `df.reset_index(drop=True)` or ensure your DataFrame index has unique values |
| Import/dependency errors | Run `pip install -r requirements.txt` and verify Python 3.11+ |
| Slow performance | Enable parallel processing with `max_workers=None` and `use_threads=True` for LLM API calls |
This project is licensed under the MIT License - see the LICENSE file for details.