Tools to download eBible texts and format in-context learning examples.


eBible Tools

A comprehensive toolkit for biblical text processing, translation benchmarking, and semantic search with advanced query capabilities.

Features

  • Unified Query Interface: Switch between BM25, TF-IDF, and context-aware search methods
  • Four Specialized Benchmarks: Biblical recall, translation corrigibility, source effects, and prompt optimization
  • Multi-Model Comparison: Compare multiple LLM models side-by-side on the same benchmarks
  • Multi-Language Testing: Test translation benchmarks across ALL downloaded target languages
  • Multi-Provider LLM Support: Use 100+ LLM providers (OpenAI, Anthropic, Google, Groq, etc.) via OpenRouter with a single API key
  • Professional MT Metrics: Industry-standard evaluation including chrF+, Edit Distance, TER
  • Semantic Search: Advanced context-aware search with coverage weighting and branching
  • Corpus Management: Tools for downloading and processing biblical texts
  • Standardized XML Output: Consistent response formatting across all models

Quick Start

# Clone and setup
git clone <repository-url>
cd ebibletools
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

# Set up your API keys in .env file
cp benchmarks/env.template .env
# Edit .env with your API keys

# Download some biblical texts
python ebible_downloader.py

# Run benchmarks (from benchmarks directory)
cd benchmarks
python biblical_recall_benchmark.py --num-tests 20
python context_corrigibility_benchmark.py --model openai/gpt-4o --example-counts 0 3 5
python true_source_benchmark.py --num-tests 15
python power_prompt_benchmark.py --num-tests 12

Benchmarks

📖 Biblical Recall Benchmark

Tests how well models can recall biblical text when given verse references.

# Single model (old format automatically converted)
python biblical_recall_benchmark.py --num-tests 20 --model gpt-4o --output recall_results.json

# Single model (OpenRouter format)
python biblical_recall_benchmark.py --num-tests 20 --model openai/gpt-4o --output recall_results.json

# Multi-model comparison
python biblical_recall_benchmark.py --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 --num-tests 10

What it tests:

  • Given a biblical reference (e.g., "GEN 1:1", "HEB 4:12"), can the model quote the verse?
  • Uses aligned reference file (vref.txt) with the actual biblical corpus
  • Tests random verses from across the entire Bible (41,900+ verses)
  • Tests pure recall ability without completion hints
  • Multi-model support: Compare recall abilities across different models

Target Text: Uses the verses from the source file itself (eng-engULB.txt)

Metrics: chrF+, Edit Distance similarity
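
The lookup behind this benchmark is line-based. A minimal sketch, assuming the standard eBible alignment (line N of vref.txt names the verse stored on line N of each corpus file) and the default file locations from the project structure:

# Look up the reference text for a verse via the aligned vref.txt
with open("benchmarks/data/vref.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]
with open("Corpus/eng-engULB.txt", encoding="utf-8") as f:
    verses = [line.strip() for line in f]

line_no = refs.index("GEN 1:1")   # position of the reference
print(verses[line_no])            # the corresponding verse text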

🔄 Context Corrigibility Benchmark

Tests how in-context examples improve translation accuracy across ALL target languages.

# Single model - tests on ALL downloaded languages
python context_corrigibility_benchmark.py --model openai/gpt-4o --example-counts 0 3 5 --num-tests 10

# Multi-model comparison
python context_corrigibility_benchmark.py --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 --example-counts 0 3 --num-tests 5

What it tests:

  • Translation quality with 0, 3, 5 examples across ALL target languages
  • How much context examples improve performance per language
  • Corrigibility analysis (improvement measurement)
  • NEW: Tests on every downloaded target language file, not just one random language

Target Text: Uses ALL available target language files in the corpus (except source language)

Metrics: chrF+, Edit Distance with improvement tracking, organized by language
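
Corrigibility is reported as improvement over the zero-example baseline. A small sketch of that calculation with illustrative numbers (not real results):

# Improvement over the zero-shot baseline (illustrative chrF+ means)
mean_chrf_by_examples = {0: 0.41, 3: 0.47, 5: 0.50}
baseline = mean_chrf_by_examples[0]
improvement = {n: round(score - baseline, 3) for n, score in mean_chrf_by_examples.items()}
print(improvement)  # {0: 0.0, 3: 0.06, 5: 0.09}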

🎯 True Source Benchmark

Tests how having source text affects translation accuracy across ALL target languages.

# Single model - tests on ALL downloaded languages
python true_source_benchmark.py --num-tests 15 --model openai/gpt-4o --output source_results.json

# Multi-model comparison
python true_source_benchmark.py --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 --num-tests 10

What it tests:

  • With source: Normal translation with correct source text
  • Without source: Translation from memory (tests memorization vs. bias)
  • Measures the impact of source text availability across all languages
  • Reveals whether source text biases translation output
  • NEW: Tests on every downloaded target language file, not just one random language

Target Text: Uses ALL available target language files in the corpus (except source language)

Metrics: chrF+, Edit Distance with source effect analysis, organized by language

💪 Power Prompt Benchmark

Tests effectiveness of different prompt styles for translation across ALL target languages.

# Single model - tests on ALL downloaded languages
python power_prompt_benchmark.py --num-tests 12 --model gpt-4o --output prompt_results.json

# Multi-model comparison
python power_prompt_benchmark.py --models gpt-3.5-turbo claude-3-haiku-20240307 gpt-4o-mini gpt-4o claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 --num-tests 8

What it tests:

  • 4 key prompt styles (basic, expert, biblical scholar, direct) across all languages
  • Prompt effectiveness ranking per language and overall
  • Statistical comparison of prompt impact
  • NEW: Tests on every downloaded target language file, not just one random language

Target Text: Uses ALL available target language files in the corpus (except source language)

Metrics: chrF+, Edit Distance with overall ranking, organized by language

Recommended Model Testing Configuration

For comprehensive benchmarking, we recommend testing on this set of models covering different providers and capabilities:

# Recommended models for comprehensive testing
models_to_test = [
    "gpt-3.5-turbo",                    # OpenAI - Fast, cost-effective
    "claude-3-haiku-20240307",          # Anthropic - Fast, efficient  
    "gpt-4o-mini",                      # OpenAI - Balanced performance
    "gpt-4o",                           # OpenAI - High performance
    "claude-3-5-sonnet-20240620",       # Anthropic - Advanced reasoning
    "anthropic/claude-sonnet-4-20250514", # Anthropic - Latest model
]

Running Multi-Model Benchmarks

# Biblical Recall (multi-model)
python biblical_recall_benchmark.py \
  --models gpt-3.5-turbo claude-3-haiku-20240307 gpt-4o-mini gpt-4o claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 \
  --num-tests 20 \
  --output multi_model_recall.json

# Context Corrigibility (run separately for each model due to multi-language complexity)
for model in "gpt-3.5-turbo" "claude-3-haiku-20240307" "gpt-4o-mini" "gpt-4o" "claude-3-5-sonnet-20240620" "anthropic/claude-sonnet-4-20250514"; do
  python context_corrigibility_benchmark.py \
    --model "$model" \
    --example-counts 0 3 5 \
    --num-tests 10 \
    --output "context_${model//\//_}.json"
done

# True Source (run separately for each model)
for model in "gpt-3.5-turbo" "claude-3-haiku-20240307" "gpt-4o-mini" "gpt-4o" "claude-3-5-sonnet-20240620" "anthropic/claude-sonnet-4-20250514"; do
  python true_source_benchmark.py \
    --model "$model" \
    --num-tests 15 \
    --output "source_${model//\//_}.json"
done

# Power Prompt (run separately for each model)
for model in "gpt-3.5-turbo" "claude-3-haiku-20240307" "gpt-4o-mini" "gpt-4o" "claude-3-5-sonnet-20240620" "anthropic/claude-sonnet-4-20250514"; do
  python power_prompt_benchmark.py \
    --model "$model" \
    --num-tests 12 \
    --output "prompt_${model//\//_}.json"
done

Important Notes:

  • Biblical Recall supports multi-model comparison in a single run
  • Translation benchmarks (Context, True Source, Power Prompt) now test on ALL target languages, making them computationally intensive
  • For translation benchmarks, run each model separately to avoid timeouts
  • Results are organized by language with overall aggregated statistics

Multi-Language Testing

The translation benchmarks now test on ALL downloaded target language files instead of just one random language. This provides comprehensive coverage but requires more time and API calls.

What this means:

  • If you have 50 language files downloaded, each translation benchmark will test on all 50 languages
  • Results include both per-language detailed results and overall aggregated statistics
  • Much more comprehensive evaluation of model translation capabilities
  • Significantly longer runtime (plan accordingly)

Performance Considerations:

  • Start with smaller --num-tests values (5-10) when testing multiple models
  • Consider running benchmarks overnight for comprehensive results
  • Monitor API usage and costs when testing multiple models on all languages
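
A rough way to budget a run is to multiply the knobs above; the benchmarks' exact request counts may differ, but the scaling is multiplicative:

# Back-of-the-envelope API-call estimate for one context corrigibility run
num_languages = 50          # downloaded target language files
num_tests = 10              # --num-tests
example_counts = [0, 3, 5]  # --example-counts
num_models = 1

calls = num_languages * num_tests * len(example_counts) * num_models
print(f"~{calls} translation requests")  # ~1500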

Query System

The unified query system supports three different search methods:

BM25 Query (Default)

from query import Query

# Pure BM25 scoring
query = Query(method='bm25')
results = query.search_by_text("love your neighbor", limit=5)

TF-IDF Query

# TF-IDF based similarity
query = Query(method='tfidf')
results = query.search_by_text("forgiveness", limit=5)

Context Query (Advanced)

# BM25 + branching search with coverage weighting
query = Query(method='context')
results = query.search_by_text("faith and hope", limit=5)

All methods support the same interface:

  • search_by_text(text, limit=10) - Search by text content
  • search_by_line(line_number, limit=10) - Search by line reference
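
Both entry points behave the same regardless of the method chosen. A short sketch (the line number passed to search_by_line is illustrative and assumes it exists in the loaded corpus):

from query import Query

# Any of 'bm25', 'tfidf', or 'context' works here without changing the calls below
query = Query(method='bm25')

# Find verses similar to a free-text query
text_results = query.search_by_text("love your neighbor", limit=3)

# Find verses similar to an existing corpus line
line_results = query.search_by_line(100, limit=3)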

Translation Quality Metrics

The metrics/ module provides comprehensive MT evaluation with professional-grade implementations:

Quick Example

from metrics import chrF_plus, normalized_edit_distance, ter_score

hypothesis = "The cat sits on the mat."
reference = "The cat sat on the mat."

print(f"chrF+: {chrF_plus(hypothesis, reference):.4f}")
print(f"Edit Distance: {normalized_edit_distance(hypothesis, reference):.4f}")
print(f"TER: {ter_score(hypothesis, reference):.4f}")

Available Metrics

  • chrF/chrF+/chrF++: Character n-gram F-score variants
  • Edit Distance: Levenshtein distance (raw and normalized)
  • TER: Translation Error Rate

All metrics include:

  • Professional implementations using established libraries
  • Comprehensive linguistic processing
  • Batch processing capabilities
  • Clear error handling with fast failure

The metrics are universal and work with any language or script without requiring language-specific resources.
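
For batch processing, one simple pattern is to score each hypothesis/reference pair and average; a sketch using only the functions shown above (the module's own batch helpers may differ):

from statistics import mean
from metrics import chrF_plus, normalized_edit_distance

# Illustrative hypothesis/reference pairs
pairs = [
    ("The cat sits on the mat.", "The cat sat on the mat."),
    ("Love your neighbor.", "You shall love your neighbor."),
]

mean_chrf = mean(chrF_plus(h, r) for h, r in pairs)
mean_edit_sim = mean(1.0 - normalized_edit_distance(h, r) for h, r in pairs)
print(f"Mean chrF+: {mean_chrf:.4f}  Mean edit similarity: {mean_edit_sim:.4f}")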

Benchmark Command Reference

Common Options

All benchmarks support these standard options:

Option         Default         Description
--corpus-dir   ../Corpus       Directory containing corpus files
--source-file  eng-engULB.txt  Source language file
--model        gpt-4o          Model to use (OpenRouter format: provider/model-name; old format auto-converted)
--output       None            Save detailed results to a JSON file

Note: API keys are automatically loaded from your .env file - no need to specify them manually.

Biblical Recall Options

# Single model
python biblical_recall_benchmark.py \
  --num-tests 20 \
  --model openai/gpt-4o \
  --output recall_results.json

# Multi-model comparison  
python biblical_recall_benchmark.py \
  --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 \
  --num-tests 10 \
  --output multi_model_recall.json

Context Corrigibility Options

# Single model - tests ALL target languages
python context_corrigibility_benchmark.py \
  --model openai/gpt-4o \
  --query-method context \
  --num-tests 10 \
  --example-counts 0 3 5 \
  --output context_results.json

# Multi-model comparison (separate runs recommended)
python context_corrigibility_benchmark.py \
  --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 \
  --num-tests 5 \
  --example-counts 0 3 \
  --output multi_context_results.json

True Source Options

# Single model - tests ALL target languages
python true_source_benchmark.py \
  --num-tests 15 \
  --model openai/gpt-4o \
  --output source_results.json

# Multi-model comparison (separate runs recommended)
python true_source_benchmark.py \
  --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 \
  --num-tests 10 \
  --output multi_source_results.json

Power Prompt Options

# Single model - tests ALL target languages
python power_prompt_benchmark.py \
  --num-tests 12 \
  --model openai/gpt-4o \
  --output prompt_results.json

# Multi-model comparison (separate runs recommended)
python power_prompt_benchmark.py \
  --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 \
  --num-tests 8 \
  --output multi_prompt_results.json

LLM Provider Support

All benchmarks use OpenRouter via the OpenAI package for universal LLM access. OpenRouter provides a single API key that works with 100+ providers, eliminating the need for multiple API keys.

Supported Providers

Models are specified in OpenRouter format: provider/model-name

  • OpenAI: openai/gpt-4o, openai/gpt-4-turbo, openai/gpt-3.5-turbo
  • Anthropic: anthropic/claude-3-5-sonnet-20241022, anthropic/claude-3-haiku-20240307
  • Google: google/gemini-pro, google/gemini-pro-vision
  • Groq: groq/llama-3.1-70b-versatile, groq/mixtral-8x7b-32768
  • And 100+ more providers - See OpenRouter Models

Environment Setup

Create a .env file in your project root with your OpenRouter API key:

# Copy the template
cp benchmarks/env.template .env

# Edit .env with your OpenRouter API key
OPENROUTER_API_KEY=your-openrouter-api-key

Get your API key: OpenRouter API Keys

Note: Old format model names (e.g., gpt-4o) are automatically converted to OpenRouter format (openai/gpt-4o) for backward compatibility.
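
Under the hood, the OpenAI package is simply pointed at OpenRouter's endpoint. A minimal sketch of that setup (the benchmarks' own client code may differ in details):

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENROUTER_API_KEY from .env

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Quote GEN 1:1 inside <verse> tags."}],
)
print(response.choices[0].message.content)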

Multi-Model Comparison

The biblical recall benchmark supports comparing multiple models:

# Compare all recommended models on biblical recall
python biblical_recall_benchmark.py --models openai/gpt-3.5-turbo anthropic/claude-3-haiku-20240307 openai/gpt-4o-mini openai/gpt-4o anthropic/claude-3-5-sonnet-20240620 anthropic/claude-sonnet-4-20250514 --num-tests 20

Output includes:

  • Side-by-side performance comparison
  • Ranking with performance differences
  • Best model highlighted with ⭐
  • Statistical significance indicators

Example output:

BIBLICAL RECALL RESULTS - MODEL COMPARISON
==========================================

gpt-4o ⭐ (best)
  chrF+: 0.734±0.156
  Edit:  0.698±0.134
  High accuracy: 3/20 (15.0%)

claude-3-opus
  chrF+: 0.701±0.142
  Edit:  0.672±0.128
  High accuracy: 2/20 (10.0%)

gemini-pro
  chrF+: 0.623±0.189
  Edit:  0.587±0.156
  High accuracy: 1/20 (5.0%)

RANKING SUMMARY
--------------------------------
1. gpt-4o            chrF+: 0.734
2. claude-3-opus     chrF+: 0.701 (-0.033)
3. gemini-pro        chrF+: 0.623 (-0.111)

Project Structure

ebibletools/
├── query/                  # Query system
│   ├── __init__.py        # Unified Query interface
│   ├── base.py            # Base query class
│   ├── bm25_query.py      # BM25 implementation
│   ├── tfidf_query.py     # TF-IDF implementation
│   └── contextquery.py    # Context-aware search
├── metrics/               # Translation quality metrics
│   ├── __init__.py       # Metric imports
│   ├── chrf.py           # chrF variants
│   ├── edit_distance.py  # Levenshtein distance
│   ├── ter.py            # Translation Error Rate
│   └── README.md         # Detailed metric documentation
├── benchmarks/           # Benchmarking framework
│   ├── benchmark_utils.py            # Shared utilities
│   ├── biblical_recall_benchmark.py  # Biblical memory testing (multi-model)
│   ├── context_corrigibility_benchmark.py  # Context learning analysis
│   ├── true_source_benchmark.py      # Source text effect analysis
│   ├── power_prompt_benchmark.py     # Prompt optimization testing
│   ├── translation_benchmark.py     # Comprehensive translation benchmark
│   ├── data/
│   │   └── vref.txt      # Biblical verse references (41,900+ verses)
│   └── env.template      # Environment variable template
├── Corpus/               # Biblical text corpus
├── ebible_downloader.py  # Corpus download utility
└── requirements.txt      # Dependencies

Dependencies

All dependencies are required for the toolkit to function; install them with:

pip install -r requirements.txt

Core Requirements

  • requests - HTTP requests for downloading
  • openai - OpenAI package configured for OpenRouter API
  • scikit-learn - TF-IDF and similarity calculations
  • tqdm - Progress bars
  • numpy - Numerical computations
  • python-dotenv - Environment variable management

Translation Metrics

  • nltk - Natural language processing
  • torch - Required for some metrics
  • transformers - Transformer models

The system will fail clearly if any required dependencies are missing.

Usage Examples

Basic Benchmark Usage

from benchmarks.biblical_recall_benchmark import BiblicalRecallBenchmark

# Test biblical recall (single model)
benchmark = BiblicalRecallBenchmark(
    corpus_dir="Corpus",
    source_file="eng-engULB.txt",
    models=["openai/gpt-4o"]  # OpenRouter format: provider/model-name
)
benchmark.run_benchmark(num_tests=20, output_file="recall_results.json")

# Multi-model comparison
benchmark = BiblicalRecallBenchmark(
    corpus_dir="Corpus", 
    source_file="eng-engULB.txt",
    models=["gpt-3.5-turbo", "claude-3-haiku-20240307", "gpt-4o-mini", "gpt-4o", "claude-3-5-sonnet-20240620", "anthropic/claude-sonnet-4-20250514"]
)
benchmark.run_benchmark(num_tests=10, output_file="multi_model_recall.json")

Context Learning Analysis

from benchmarks.context_corrigibility_benchmark import ContextCorrigibilityBenchmark

# Test context learning
benchmark = ContextCorrigibilityBenchmark(
    corpus_dir="Corpus",
    source_file="eng-engULB.txt",
    model="gpt-4o"
)
benchmark.run_benchmark(
    num_tests=10,
    example_counts=[0, 3, 5, 10],
    output_file="context_comparison.json"
)

Translation Evaluation

from metrics import chrF_plus, normalized_edit_distance, ter_score

def evaluate_translation(hypothesis, reference):
    """Comprehensive translation evaluation (higher is better for every score)"""
    scores = {
        'chrF+': chrF_plus(hypothesis, reference),
        # Edit distance and TER are error rates, so invert them into similarity-style scores
        'Edit Distance': 1.0 - normalized_edit_distance(hypothesis, reference),
        'TER': max(0.0, 1.0 - ter_score(hypothesis, reference))
    }
    return scores

# Example evaluation
hypothesis = "Love your neighbor as yourself."
reference = "You shall love your neighbor as yourself."

scores = evaluate_translation(hypothesis, reference)
for metric, score in scores.items():
    print(f"{metric}: {score:.4f}")

Batch Processing

# Run all benchmarks for comprehensive analysis
from benchmarks.biblical_recall_benchmark import BiblicalRecallBenchmark
from benchmarks.context_corrigibility_benchmark import ContextCorrigibilityBenchmark
from benchmarks.power_prompt_benchmark import PowerPromptBenchmark

models = ["gpt-3.5-turbo", "claude-3-haiku-20240307", "gpt-4o-mini", "gpt-4o", "claude-3-5-sonnet-20240620", "anthropic/claude-sonnet-4-20250514"]

# Multi-model biblical recall
recall_benchmark = BiblicalRecallBenchmark("Corpus", "eng-engULB.txt", models)
recall_benchmark.run_benchmark(20, "multi_model_recall.json")

# Individual model testing on other benchmarks
for model in models:
    print(f"Testing {model}...")
    
    # Prompt power
    prompt_benchmark = PowerPromptBenchmark("Corpus", "eng-engULB.txt", model)
    prompt_benchmark.run_benchmark(12, f"prompt_{model.replace('/', '_')}.json")
    
    # Context corrigibility  
    context_benchmark = ContextCorrigibilityBenchmark("Corpus", "eng-engULB.txt", model)
    context_benchmark.run_benchmark(10, [0, 3, 5], f"context_{model.replace('/', '_')}.json")

XML Output Format

All benchmarks use standardized XML output formatting for consistent parsing across models:

  • Biblical Recall: <verse>biblical verse here</verse>
  • Translation Benchmarks: <translation>your translation here</translation>

This ensures reliable extraction of model responses regardless of the LLM provider.
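
Extraction can then be as simple as pulling the tagged span out of the raw response. A minimal regex sketch with a hypothetical helper (the benchmarks' own parsing may be more robust):

import re

def extract_tag(response: str, tag: str) -> str:
    """Return the text inside <tag>...</tag>, or the whole response if no tag is found."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_tag("<verse>In the beginning God created the heavens and the earth.</verse>", "verse"))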

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

[Add your license information here]
