A comprehensive Python framework for evaluating LLM-extracted structured data against ground truth labels. Supports binary classification, scalar values, and list fields with detailed performance metrics, confidence-based evaluation, and statistical uncertainty quantification via non-parametric bootstrap confidence intervals.
- Multi-field validation - Binary (True/False), scalar (single values), and list (multiple values) data types
- Partial labeling support - Handle datasets where different cases have labels for different subsets of fields
- Dual usage modes - Validate pre-computed results OR run live LLM inference with validation
- Comprehensive metrics - Precision, recall, F1/F2, accuracy, specificity with both micro and macro aggregation
- Confidence analysis - Automatic performance breakdown by confidence levels
- Statistical uncertainty - Non-parametric bootstrap confidence intervals for all performance metrics
- Production ready - Parallel processing, intelligent caching, detailed progress tracking
# Install from PyPI
pip install llmvalidate
# OR install from source
pip install -r requirements.txt  # Python 3.11+ required

python runme.py

Processes the included samples.csv (14 test cases covering all validation scenarios) and outputs timestamped results to validation_results/samples/:
- Results CSV - Row-by-row comparison with confusion matrix counts and item-level details
- Metrics CSV - Aggregated performance statistics with confidence breakdowns
- CI Metrics CSV - Confidence intervals for metrics
| Rows | Field Type | Test Scenarios |
|---|---|---|
| 1-4 | Binary (Has metastasis) | True Positive, True Negative, False Positive, False Negative |
| 5-9 | Scalar (Diagnosis, Histology) | Correct, incorrect, missing, spurious, and empty extractions |
| 10-14 | List (Treatment Drugs, Test Results) | Perfect match, spurious items, missing items, correct empty, mixed results |
When you have LLM predictions in `Res: {Field Name}` columns:
import pandas as pd
from src.validation import validate
df = pd.read_csv("data.csv", index_col="Patient ID")
# df must contain: "Field Name" and "Res: Field Name" columns
results_df, metrics_df = validate(
source_df=df,
fields=["Diagnosis", "Treatment"], # or None for auto-detection
structure_callback=None,
output_folder="validation_results"
)

from src.structured import StructuredResult, StructuredGroup, StructuredField
from src.utils import flatten_structured_result
def llm_callback(row, i, raw_text_column_name):
raw_text = row[raw_text_column_name]
# Your LLM inference logic here
result = StructuredResult(
groups=[StructuredGroup(
group_name="medical",
fields=[
StructuredField(name="Diagnosis", value="Cancer", confidence="High"),
StructuredField(name="Treatment", value=["Drug A"], confidence="Medium")
]
)]
)
return flatten_structured_result(result), {}
results_df, metrics_df = validate(
source_df=df,
fields=["Diagnosis", "Treatment"],
structure_callback=llm_callback,
raw_text_column_name="medical_report",
output_folder="validation_results",
max_workers=4
)

- Unique index - Each row must have a unique identifier (e.g., "Patient ID")
- Label columns - Ground truth values for each field you want to validate
- Result columns (Mode 1 only) - LLM predictions as `Res: {Field Name}` columns
- Raw text column (Mode 2 only) - Source text for LLM inference (e.g., "medical_report")
| Type | Description | Label Examples | Result Examples |
|---|---|---|---|
| Binary | True/False detection | True, False | True, False |
| Scalar | Single text/numeric value | "Lung Cancer", 42 | "Breast Cancer", 38 |
| List | Multiple values | ["Drug A", "Drug B"], "['Item1', 'Item2']" | ["Drug A"], [] |
- `"-"` = Labeled as "No information is available in the source document"
- `null`/empty/`NaN` = Field not labeled/evaluated (supports partial labeling where different cases may have labels for different field subsets)
- Lists - Can be Python lists `["a", "b"]` or stringified `"['a', 'b']"` (auto-converted)
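The stringified-list auto-conversion can be sketched with `ast.literal_eval`; the `to_list` helper below is hypothetical and only illustrates the idea (the framework's actual converter in src/utils.py may normalize values differently):

```python
import ast

def to_list(value):
    """Hypothetical helper: accept a real list or a stringified one."""
    if isinstance(value, list):
        return value  # already a Python list
    if isinstance(value, str) and value.strip().startswith("["):
        return ast.literal_eval(value)  # parse "['a', 'b']" into ['a', 'b']
    return [value]  # wrap a single scalar

print(to_list("['a', 'b']"))  # ['a', 'b']
print(to_list(["a", "b"]))    # ['a', 'b']
```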
The framework supports partial labeling scenarios where:
- Not every case needs labels for every field
- Different cases can have labels for different subsets of fields
- Missing labels (`null`/`NaN`) are handled gracefully in all metrics calculations
- Use `"-"` when the document explicitly lacks information about a field
- Use `null`/`NaN` when the field simply wasn't labeled for that case
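A partially labeled dataset might look like the following hypothetical example, where each case carries labels for only some fields:

```python
import numpy as np
import pandas as pd

# Hypothetical labels: P2's Diagnosis was never labeled (NaN),
# while P3's document explicitly lacks a Diagnosis ("-").
df = pd.DataFrame(
    {
        "Diagnosis": ["Lung Cancer", np.nan, "-"],
        "Treatment": [np.nan, "['Drug A']", "-"],
    },
    index=["P1", "P2", "P3"],
)

# Only non-null cells count as labeled; "-" still counts as a label.
labeled_diagnosis = df["Diagnosis"].notna().sum()
print(labeled_diagnosis)  # 2
```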
The framework generates two timestamped CSV files for each validation run:
Row-level analysis with detailed per-case metrics:
Original Data:
- All input columns (labels, raw text, etc.)
- `Res: {Field}` columns with LLM predictions
- `Res: {Field} confidence` and `Res: {Field} justification` (if available)
Binary Fields:
- `TP/FP/FN/TN: {Field}` - Confusion matrix counts (1 or 0 per row)
Non-Binary Fields:
- `Cor/Inc/Mis/Spu: {Field}` - Item counts per row
- `Cor/Inc/Mis/Spu: {Field} items` - Actual item lists
- `Precision/Recall/F1/F2: {Field}` - Per-row metrics (list fields only)
System Columns:
- `Sys: from cache` - Whether the result was served from cache (speeds up duplicate text)
- `Sys: exception` - Error information if processing failed
- `Sys: time taken` - Processing time per row in seconds
Aggregated statistics with confidence breakdowns:
Core Information:
- `field` - Field name being evaluated
- `confidence` - Confidence level ("Overall", "High", "Medium", "Low", etc.)
- `labeled cases` - Total rows with ground truth labels
- `field-present cases` - Rows where the document has information about the field (label is not `"-"`)
Binary Metrics: TP, TN, FP, FN, precision, recall, F1/F2, accuracy, specificity
Non-Binary Metrics: cor, inc, mis, spu, precision/recall/F1/F2 (micro), precision/recall/F1/F2 (macro)
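The difference between micro and macro aggregation can be sketched with per-row counts (an illustrative toy example, not the framework's internal code):

```python
# Two hypothetical rows with per-row correct (cor) and spurious (spu) counts.
rows = [
    {"cor": 2, "spu": 0},  # row precision = 1.0
    {"cor": 1, "spu": 3},  # row precision = 0.25
]

# Micro: pool counts across all rows, then compute the metric once.
total_cor = sum(r["cor"] for r in rows)
total_spu = sum(r["spu"] for r in rows)
micro_precision = total_cor / (total_cor + total_spu)

# Macro: compute the metric per row, then average the row-level values.
macro_precision = sum(r["cor"] / (r["cor"] + r["spu"]) for r in rows) / len(rows)

print(micro_precision)  # 3/6 = 0.5
print(macro_precision)  # (1.0 + 0.25) / 2 = 0.625
```

Micro weights each item equally (rows with many items dominate), while macro weights each row equally.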
For fields with True/False values (e.g., "Has metastasis"):
| Count | Definition | Example |
|---|---|---|
| TP (True Positive) | Correctly predicted positive | Label: True, Prediction: True → TP=1 |
| TN (True Negative) | Correctly predicted negative | Label: False, Prediction: False → TN=1 |
| FP (False Positive) | Incorrectly predicted positive | Label: False, Prediction: True → FP=1 |
| FN (False Negative) | Incorrectly predicted negative | Label: True, Prediction: False → FN=1 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | TP / (TP + FP) | Of all positive predictions, how many were correct? |
| Recall | TP / (TP + FN) | Of all actual positives, how many were found? |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall percentage of correct predictions |
| Specificity | TN / (TN + FP) | Of all actual negatives, how many were correctly identified? |
For scalar and list fields (e.g., "Diagnosis", "Treatment Drugs"):
| Count | Definition | Example |
|---|---|---|
| Correct (Cor) | Items extracted correctly | Label: ["DrugA", "DrugB"], Prediction: ["DrugA"] → Cor=1 |
| Missing (Mis) | Items present in label but not extracted | (Same example) → Mis=1 (DrugB missing) |
| Spurious (Spu) | Items extracted but not in label | Label: ["DrugA"], Prediction: ["DrugA", "DrugC"] → Spu=1 |
| Incorrect (Inc) | Wrong values for scalar fields | Label: "Cancer", Prediction: "Diabetes" → Inc=1 |
| Metric | Formula | Meaning |
|---|---|---|
| Precision | Cor / (Cor + Spu + Inc) | Of all extracted items, how many were correct? |
| Recall | Cor / (Cor + Mis + Inc) | Of all labeled items, how many were correctly extracted? |
Note: For scalar fields, Inc (incorrect) is used; for list fields, Inc is typically 0 since items are either correct, missing, or spurious.
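For a list field, the Cor/Mis/Spu counts amount to a set comparison; the sketch below is illustrative (the framework's real matching may normalize strings or handle duplicates differently):

```python
# Hypothetical label and prediction for a list field.
label = ["DrugA", "DrugB"]
prediction = ["DrugA", "DrugC"]

cor = len(set(label) & set(prediction))  # items in both → correct
mis = len(set(label) - set(prediction))  # labeled but not extracted → missing
spu = len(set(prediction) - set(label))  # extracted but not labeled → spurious
inc = 0  # typically 0 for list fields

precision = cor / (cor + spu + inc)
recall = cor / (cor + mis + inc)
print(cor, mis, spu)       # 1 1 1
print(precision, recall)   # 0.5 0.5
```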
The following formulas apply to both binary classification and structured extraction metrics:
| Metric | Formula | Meaning |
|---|---|---|
| F1 Score | 2 × (P × R) / (P + R) | Balanced harmonic mean of precision and recall |
| F2 Score | 5 × (P × R) / (4P + R) | Recall-weighted F-score (emphasizes recall over precision) |
Where P = Precision and R = Recall (calculated differently for each metric type).
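Both scores are instances of the general F-beta formula, (1 + β²)PR / (β²P + R), with β = 1 and β = 2 respectively; a quick sketch:

```python
def f_beta(p: float, r: float, beta: float) -> float:
    """General F-beta score; beta=1 gives F1, beta=2 gives F2."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

p, r = 0.8, 0.5  # example precision and recall
f1 = f_beta(p, r, 1)  # 2PR / (P + R)
f2 = f_beta(p, r, 2)  # 5PR / (4P + R)
print(round(f1, 4))  # 0.6154
print(round(f2, 4))  # 0.5405 — lower than F1, since recall is the weaker metric here
```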
The framework includes statistical confidence interval estimation using non-parametric bootstrap resampling at the case level. This provides uncertainty quantification for all validation metrics.
from src.validation import bootstrap_CI
# After running validation to get results_df
ci_results = bootstrap_CI(
res_df=results_df, # Results from validate() function
fields=["diagnosis", "treatment"], # Fields to analyze (or None for auto-detect)
n_bootstrap=5000, # Number of bootstrap samples (default: 5000)
ci=0.95, # Confidence level (default: 0.95 for 95% CI)
random_state=42 # For reproducible results
)

- Resampling unit: Individual cases (not individual predictions)
- Resampling strategy: Sample with replacement to preserve original dataset size
- CI calculation: Percentile method using bootstrap distribution
- Partial labeling: Handles missing labels gracefully - cases with missing labels for specific fields are excluded from calculations for those fields only
- Metrics included: All validation metrics (precision, recall, F1, accuracy, etc.)
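A minimal sketch of the case-level percentile bootstrap, assuming a single per-case metric (bootstrap_CI itself handles many fields and metrics at once; the data here is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
per_case_correct = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])  # hypothetical per-case accuracy

n = len(per_case_correct)
boot_means = np.empty(5000)
for b in range(5000):
    # Resample cases with replacement, preserving the original dataset size.
    sample = rng.choice(per_case_correct, size=n, replace=True)
    boot_means[b] = sample.mean()

# Percentile method: take the 2.5th and 97.5th percentiles for a 95% CI.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(per_case_correct.mean(), lower, upper)
```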
The `bootstrap_CI()` function returns a DataFrame with confidence intervals for each field:
| Column | Description |
|---|---|
| `field` | Field name (including `exceptions` for system metrics and `N={n}; CI={level}%` for parameters) |
| `labeled cases` | Number of labeled cases in the dataset |
| `{metric}: mean` | Bootstrap mean estimate |
| `{metric}: lower` | Lower bound of confidence interval |
| `{metric}: upper` | Upper bound of confidence interval |
Example output:
field labeled cases precision (micro): mean precision (micro): lower precision (micro): upper
0 exceptions 1000 NaN NaN NaN
1 diagnosis 1000 0.82 0.79 0.85
2 treatment 1000 0.91 0.88 0.94
3 N=5000; CI=95% NaN NaN NaN NaN
The final row contains bootstrap parameters for reference: sample size (N) and confidence interval level (CI).
- Performance assessment: Quantify uncertainty in reported metrics
- Model comparison: Determine if performance differences are statistically significant
- Sample size planning: Understand precision of estimates with current dataset size
- Publication: Report confidence intervals alongside point estimates
validate(
source_df=df,
fields=["diagnosis", "treatment"],
structure_callback=callback,
max_workers=None, # Auto-detect CPU count (or specify number)
use_threads=True # True for I/O-bound (LLM API calls), False for CPU-bound
)

- Automatic caching - Identical raw text inputs are deduplicated and cached
- Progress tracking - Real-time progress bar for long-running validations
- Cache statistics - Check the `Sys: from cache` column in results to monitor cache hits
When LLM inference returns both extracted fields and their associated confidence levels, the framework automatically detects `Res: {Field} confidence` columns and generates:
- Separate metrics for each unique confidence level found in your data
- Overall metrics aggregating across all confidence levels
- Useful for setting confidence thresholds and analyzing prediction reliability
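The breakdown is essentially a group-by on the confidence column; a hypothetical sketch (column names and data are illustrative, not the framework's internal code):

```python
import pandas as pd

# Hypothetical per-row results: 1 = prediction matched the label.
results = pd.DataFrame({
    "correct": [1, 1, 0, 1, 0, 1],
    "Res: Diagnosis confidence": ["High", "High", "Low", "Medium", "Low", "High"],
})

# Accuracy per confidence level, plus the overall figure.
by_conf = results.groupby("Res: Diagnosis confidence")["correct"].mean()
print(by_conf.to_dict())         # {'High': 1.0, 'Low': 0.0, 'Medium': 1.0}
print(results["correct"].mean())  # overall accuracy across all levels
```

A pattern like this (high-confidence rows markedly more accurate) is what makes confidence thresholds useful.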
# Install development dependencies
pip install -r requirements.txt
# Run all tests
pytest
# Run with coverage reporting
pytest --cov=src
# Run specific test modules
pytest tests/validate_test.py # Core validation logic
pytest tests/compare_results_test.py # Comparison algorithms
pytest tests/compare_results_all_test.py # End-to-end comparisons

llm-validation-framework/
├── src/
│ ├── validation.py # Main validation pipeline and metrics calculation
│ ├── structured.py # Pydantic data models for LLM results
│ ├── utils.py # Utility functions (list conversion, flattening)
│ └── standardize.py # Data standardization helpers
├── tests/ # Comprehensive test suite
├── validation_results/ # Output directory (auto-created)
├── samples.csv # Demo dataset with all validation scenarios
├── runme.py # Demo script
└── requirements.txt # Dependencies (pandas, pydantic, tqdm, etc.)
| Error | Solution |
|---|---|
| "Cannot infer fields" | Ensure DataFrame has both `{Field}` and `Res: {Field}` columns when `structure_callback=None` |
| "Missing fields" | Verify the `fields` parameter contains column names that exist in your DataFrame |
| "Duplicate index" | Use `df.reset_index(drop=True)` or ensure your DataFrame index has unique values |
| Import/dependency errors | Run `pip install -r requirements.txt` and verify Python 3.11+ |
| Slow performance | Enable parallel processing with `max_workers=None` and `use_threads=True` for LLM API calls |
This project is licensed under the MIT License - see the LICENSE file for details.