This repository contains an implementation of the LexGuide framework, a proactive dialogue system for legal QA with:
- Retrieval-Augmented Generation (RAG)
- Hierarchical topic organization (BERTopic)
- System-driven follow-up generation
Also included are baselines for comparison:
- RAG-Basic
- RAG-MMR
- ConvRAG
Place your uploaded files and point to them when running scripts:
Corpus:
- ../path/non_pdf_urls.json
- ../path/pdf_urls.json
EUDial Dataset:
- ../path/conversations.jsonl
- ../path/conversations_turns.jsonl
python -m 2_lexguide.scripts.build_corpus \
--non_pdf_urls /path/non_pdf_urls.json \
--pdf_urls /path/pdf_urls.json \
--output_dir /path/outputs
python -m 2_lexguide.scripts.build_index \
--output_dir outputs \
--embed_model nlpaueb/bert-base-uncased-eurlex
Depth-2 (EUDial-style) hierarchy + BFS: Hierarchy mode = shallow
Agglomerative hierarchy (binary merges) for experiments: Hierarchy mode = agglomerative
python -m 2_lexguide.scripts.run_lexguide \
--dialogues_file ./data/eudial_test.jsonl \
--output_dir ./outputs \
--provider openai \
--model gpt-4o-mini \
--strategy BFS \
--tau 0.6
python -m 2_lexguide.scripts.run_lexguide \
--dialogues_file ./conversations.jsonl \
--output_dir ./outputs \
--provider groq \
--model llama-3.1-8b-instant \
--strategy BFS
python -m 2_lexguide.scripts.run_lexguide \
--dialogues_file ./conversations.jsonl \
--output_dir ./outputs \
--provider ollama \
--model gemma2:2b \
--strategy BFS
python -m 2_lexguide.scripts.run_experiments \
--dialogues_file ./test_subset.jsonl \
--output_dir ./test_outputs \
--strategy BFS \
--use_mmr
python -m 2_lexguide.scripts.run_experiments \
--dialogues_file ./conversations.jsonl \
--output_dir ./outputs
The evaluation scripts and the eval folder provide evaluation tooling for LexGuide experiment results. You can evaluate individual JSONL run files and generate detailed comparison tables and statistical analyses.
For the fastest evaluation of your runs directory:
python quick_eval.py /path/to/your/runs/directory results.csv <path>/conversations_normalized.jsonl
This will:
- Find all JSONL files in the directory
- Evaluate each one using your existing eval_pipeline.py
- Generate a summary table in the console
- Save detailed results to CSV
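The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in quick_eval.py: the function names `evaluate_directory` and `count_records` are hypothetical, and `count_records` is only a stand-in for whatever eval_pipeline.py actually computes.

```python
import csv
import json
from pathlib import Path
from typing import Callable, Dict, List


def count_records(run_file: Path) -> Dict[str, object]:
    """Hypothetical stand-in evaluator: counts the non-empty JSON lines in a run file."""
    n = 0
    with run_file.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                json.loads(line)  # raises ValueError on malformed JSON
                n += 1
    return {"num_records": n}


def evaluate_directory(
    runs_dir: str,
    output_csv: str,
    evaluate: Callable[[Path], Dict[str, object]] = count_records,
) -> List[Dict[str, object]]:
    """Find all *.jsonl files, evaluate each, print a summary, and save a CSV."""
    rows: List[Dict[str, object]] = []
    for run_file in sorted(Path(runs_dir).glob("*.jsonl")):
        row: Dict[str, object] = {"file": run_file.name}
        row.update(evaluate(run_file))
        rows.append(row)

    if rows:
        with open(output_csv, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)

    for row in rows:
        print(row)  # console summary, one line per run file
    return rows
```

Passing the evaluator as a callable keeps file discovery and CSV writing independent of the metric code, so a real metric function could be swapped in without touching the loop.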
evaluate_runs.py - Main Evaluation Script
The primary script for evaluating individual JSONL run files with full control.
Usage:
python evaluate_runs.py --runs_dir /path/to/runs --output evaluation_results.csv /path/to/conversations_normalized.jsonl
Key Features:
- Automatically detects method and model names from filenames
- Handles expected filename format: runs_{METHOD}__{MODEL}.jsonl
- Comprehensive error handling and logging
- Generates both CSV output and console summary
- Uses optimized balanced metrics computation
Parameters:
- --runs_dir: Directory containing JSONL run files (required)
- --output: Output CSV filename (default: evaluation_results.csv)
- --normalized_path: Path to the conversations_normalized.jsonl file
- --embed_model: Embedding model for evaluation (default: nlpaueb/bert-base-uncased-eurlex)
- --pattern: File pattern to match (default: *.jsonl)
- --use_balanced: Use balanced metrics computation (recommended, default: True)
- --json_output: Also save results as JSON
- --verbose: Enable verbose logging
batch_evaluator.py - Advanced Analysis
Advanced evaluation with statistical analysis and comparison features.
Usage:
python batch_evaluator.py --runs_dir /path/to/runs --output_dir analysis_results
Key Features:
- Statistical significance testing between methods
- Method vs. model performance comparisons
- Effect size calculations (Cohen's d)
- Comprehensive summary reports
- Multiple output formats (CSV, JSON, TXT)
Parameters:
- --comparison_mode: Choose 'method' or 'model' for primary comparison
- --metric: Primary metric for comparisons (default: groundedness_percent)
- --output_dir: Directory for comprehensive results
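For readers unfamiliar with the effect size mentioned above, here is a minimal sketch of Cohen's d using the pooled standard deviation. It is an illustration only: the function name `cohens_d` is hypothetical, batch_evaluator.py may use a different variant, and the input values are toy numbers rather than results from any run.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")

    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")

    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Toy per-dialogue scores for two hypothetical methods (not real results).
scores_method_a = [2.0, 4.0, 6.0]
scores_method_b = [1.0, 3.0, 5.0]
print(cohens_d(scores_method_a, scores_method_b))
```

A positive value means the first sample has the higher mean; swapping the arguments flips the sign.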
quick_eval.py - Fast Evaluation
Minimal setup script for quick results.
Usage:
python quick_eval.py /path/to/runs [output.csv] /path/to/conversations_normalized.jsonl
The scripts expect your run files to follow this naming convention:
runs_{METHOD}__{MODEL}.jsonl
Examples:
- runs_RAG_Basic__gpt-4o-mini.jsonl
- runs_LexGuide__llama-3.1-8b-instant.jsonl
- runs_ConvRAG__gemma-2b-it.jsonl
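As an illustration of how this naming convention can be parsed, the sketch below splits a filename on the double underscore. The regular expression and the function name `parse_run_filename` are assumptions made for this example; the detection logic in evaluate_runs.py may differ.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed pattern: "runs_", the method, a double underscore, the model, ".jsonl".
# The method is matched lazily so that single underscores (e.g. RAG_Basic) stay in it.
RUN_FILE_RE = re.compile(r"^runs_(?P<method>.+?)__(?P<model>.+)\.jsonl$")


def parse_run_filename(path: str) -> Optional[Tuple[str, str]]:
    """Return (method, model) for a run file, or None if the name does not match."""
    match = RUN_FILE_RE.match(Path(path).name)
    if match is None:
        return None
    return match.group("method"), match.group("model")


print(parse_run_filename("runs_RAG_Basic__gpt-4o-mini.jsonl"))
print(parse_run_filename("/path/to/runs/runs_ConvRAG__gemma-2b-it.jsonl"))
print(parse_run_filename("notes.jsonl"))
```

Files that do not match return None, so a caller can skip or log them instead of failing.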
The scripts evaluate the following metrics:
Answer Quality Metrics
- Completeness (ROUGE-L): Text overlap with gold responses
- Readability (FRE): Flesch Reading Ease score
- Groundedness (%): Percentage of generated content grounded in retrieved documents
- Legal Relevance (BERTScore): Semantic similarity to gold responses
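To make the completeness metric concrete, here is a simplified sketch of a ROUGE-L F1 score based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization with no stemming, and the function names are hypothetical; the repository's eval_pipeline.py may compute ROUGE-L differently, for example through a dedicated library.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


# Toy sentences for illustration only.
print(rouge_l_f1("the court may grant relief", "the court shall grant interim relief"))
```

Because the LCS respects word order but allows gaps, this score rewards answers that keep the gold response's sequence of key terms even when extra words are interleaved.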
Follow-up Question Metrics
- Relevance: Semantic relevance to gold follow-ups
- Diversity: Diversity among generated follow-ups
- Contextual Relevance: Relevance to conversation context
- Temporal Consistency: Consistency in conversation flow
- Topic Coverage (%): Percentage of gold topics covered
Console Summary
Formatted table showing key metrics:
📊 EVALUATION RESULTS SUMMARY
===============================================
Method Model Completeness Groundedness ...
RAG_Basic gpt-4o-mini 0.245 67.3 ...
LexGuide gpt-4o-mini 0.289 72.1 ...
...
📈 METHOD AVERAGES:
Method Completeness Groundedness ...
RAG_Basic 0.240 65.2 ...
LexGuide 0.285 70.8 ...