vanderbilt-data-science/study-eval


Study Evaluation Framework

A domain-agnostic framework for evaluating research papers against structured factor glossaries using AI. Originally built for neuroscience predictive coding research, the architecture generalizes to any research field — electronics, economics, materials science, and beyond.

What It Does

Given a collection of research papers (PDFs) and a domain-specific glossary of factors, the framework:

  1. Extracts text from PDFs
  2. Scores each factor across multiple evaluation contexts using AI models (Gemini, Claude, local LLMs)
  3. Stores results in a standardized benchmark table
  4. Visualizes agreement across studies and models with interactive 3D scatter plots, comparison charts, and distance matrices
  5. Compares how different AI models interpret the same literature

The system uses a structured 4-part prompt (Role, Logic, Constraints, Task) to produce consistent, reproducible evaluations with full reasoning logs.
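The 4-part prompt assembly might be sketched as follows. This is a minimal illustration, not the framework's actual code: the `build_prompt` helper, the section labels, and the sample prompt fields are all hypothetical, assuming the `prompts` dict mirrors the config schema shown later in this README.

```python
# Hypothetical sketch of combining the four prompt parts (Role, Logic,
# Constraints, Task) into one evaluation prompt. Names are illustrative.

def build_prompt(prompts: dict, paper_text: str, factor: str) -> str:
    """Concatenate the four prompt sections into a single evaluation prompt."""
    task = prompts["task_template"].format(factor=factor, paper_text=paper_text)
    return "\n\n".join([
        "ROLE: " + prompts["role"],
        "LOGIC: " + prompts["logic"],
        "CONSTRAINTS: " + prompts["constraints"],
        "TASK: " + task,
    ])

prompts = {
    "role": "You are a domain expert reviewer.",
    "logic": "Score evidence on a -1 to 1 scale.",
    "constraints": "Cite passages; log your reasoning.",
    "task_template": "Evaluate factor '{factor}' in:\n{paper_text}",
}
prompt = build_prompt(prompts, "Example paper text...", "SST inhibition")
```

Keeping the four sections fixed while only the Task varies per factor is what makes the evaluations reproducible across papers and models.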

Quick Start

from core.config import load_domain_config
from core.evaluation import get_study_eval, get_info_from_pdf
from core.dashboard import StudyBenchmarkDashboard

# Load a domain configuration
config = load_domain_config('domains/neuroscience_predictive_coding.json')

# Create an interactive dashboard
dashboard = StudyBenchmarkDashboard(config, model_save_path='./output')

# Evaluate a paper
text = get_info_from_pdf('path/to/paper.pdf')
result = get_study_eval(text, 'gemini-2.5-pro', config)

# result['evaluations'] = {'LO': {...scores...}, 'GO': {...scores...}}
# result['first_author'] = 'Smith'
# result['reasoning_log_text'] = '...'

Included Domains

Neuroscience: Predictive Coding (TcGLO)

The founding domain. Evaluates neuroscience papers for evidence of predictive coding mechanisms:

  • 36 factors organized into 3 hypothesis groups:
    • H1 (Suppression): How the brain minimizes surprise — SST inhibition, adaptation, gain control
    • H2 (Propagation): How error signals travel feedforward — AMPA/NMDA, gamma oscillations, laminar profiles
    • H3 (Ubiquitousness): How universal the mechanism is — across cortical areas, species, modalities
  • 2 contexts: Local Oddball (LO) and Global Oddball (GO)
  • 3 study types: Empirical, Theoretical, Computational

Electronics: Circuit Architecture (Example)

A starter domain demonstrating the framework's flexibility:

  • 18 factors across 3 groups: Power Architecture, Signal Integrity, Reliability
  • 3 contexts: High Frequency, Low Frequency, DC Steady State
  • 4 study types: Simulation, Bench Test, Field Data, Analytical

Architecture

study-eval/
├── TcGLO_HPC_Local.ipynb            # Original self-contained notebook (unchanged)
│
├── domains/                         # Domain configurations
│   ├── neuroscience_predictive_coding.json
│   └── electronics_architecture.json
│
├── core/                            # Generalized Python framework
│   ├── config.py                    # DomainConfig loader + validator
│   ├── columns.py                   # Column naming (context_prefix + factor)
│   ├── prompts.py                   # 4-part prompt template system
│   ├── evaluation.py                # AI evaluation pipeline (N contexts)
│   ├── dashboard.py                 # Interactive benchmark dashboard
│   └── visualization.py             # Scatter dispatch (1D/2D/3D/radar)
│
├── notebooks/
│   └── generic_eval_demo.ipynb      # Demo: both domains with same code
│
└── .claude/skills/                  # Agent Skills for Claude Code
    ├── study-eval/                  # Core framework docs
    ├── study-eval-neuro/            # Neuroscience evaluation skill
    ├── study-eval-electronics/      # Electronics evaluation skill
    ├── study-eval-glossary/         # Glossary management
    └── study-eval-compare/          # Model comparison

See OVERVIEW.md for detailed architecture documentation.

How a DomainConfig Works

Everything domain-specific lives in a single JSON file:

{
  "domain": {"id": "your_field", "name": "Your Research Field"},
  "contexts": [
    {"id": "CTX1", "name": "Context One", "column_prefix": "Context_One", "description": "..."}
  ],
  "theory_groups": [
    {"id": "G1", "name": "Group Name", "full_label": "G1 (Group Name)", "description": "..."}
  ],
  "study_types": [
    {"id": "experimental", "label": "Experimental", "color": "#27ae60", "symbol": "circle"}
  ],
  "scoring": {"scale_min": -1.0, "scale_max": 1.0, "scale_descriptions": [...]},
  "prompts": {"role": "...", "logic": "...", "constraints": "...", "task_template": "..."},
  "glossary": {
    "Factor Name": {"id": 1, "def": "...", "rel": [...], "tag": "Quantitative", "modes": ["CTX1"], "theory_group": "G1"}
  }
}

The framework automatically adapts column naming, prompts, dashboard tabs, and visualizations to your config.
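The column-naming rule (handled by `core/columns.py`, described above as "context_prefix + factor") can be illustrated with a sketch; the `column_name` function and the underscore-joining convention are assumptions, not the framework's verified implementation:

```python
# Hypothetical sketch: derive a benchmark-table column name from a context's
# column_prefix and a glossary factor name. The joining rule is assumed.

def column_name(context: dict, factor: str) -> str:
    """Join the context's column_prefix to a factor name."""
    return f"{context['column_prefix']}_{factor.replace(' ', '_')}"

ctx = {"id": "CTX1", "column_prefix": "Context_One"}
print(column_name(ctx, "Factor Name"))  # -> Context_One_Factor_Name
```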

Adding a New Domain

  1. Copy domains/electronics_architecture.json as a template
  2. Define your contexts, theory groups, study types, and glossary
  3. The framework validates cross-references (modes reference valid contexts, theory groups exist)
  4. Optionally create a Claude Code skill in .claude/skills/study-eval-<your-domain>/
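The cross-reference validation in step 3 might look roughly like this; the `validate_config` function and its error-message format are hypothetical, assuming the config shape shown in the JSON schema above:

```python
# Sketch (assumed, not the framework's actual validator): every factor's
# "modes" must name a declared context, and its "theory_group" a declared group.

def validate_config(config: dict) -> list[str]:
    errors = []
    context_ids = {c["id"] for c in config["contexts"]}
    group_ids = {g["id"] for g in config["theory_groups"]}
    for name, factor in config["glossary"].items():
        for mode in factor.get("modes", []):
            if mode not in context_ids:
                errors.append(f"{name}: unknown context '{mode}'")
        if factor.get("theory_group") not in group_ids:
            errors.append(f"{name}: unknown theory group")
    return errors

config = {
    "contexts": [{"id": "CTX1"}],
    "theory_groups": [{"id": "G1"}],
    "glossary": {
        "Good Factor": {"modes": ["CTX1"], "theory_group": "G1"},
        "Bad Factor": {"modes": ["CTX9"], "theory_group": "G1"},
    },
}
print(validate_config(config))  # one error, for Bad Factor's unknown context
```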

Visualization

The framework auto-selects visualization type based on your theory group count:

  Groups   Chart Type
  1        Strip plot
  2        2D scatter
  3        3D scatter with projection lines
  4+       Radar/spider chart

All plots are interactive (Plotly) and saved as HTML.
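The dispatch implied by the table above is straightforward; a minimal sketch (the `pick_chart_type` name and string labels are hypothetical, not the framework's API):

```python
# Hypothetical sketch of the chart-type dispatch keyed on theory group count.

def pick_chart_type(n_groups: int) -> str:
    if n_groups == 1:
        return "strip"          # 1D strip plot
    if n_groups == 2:
        return "scatter_2d"     # 2D scatter
    if n_groups == 3:
        return "scatter_3d"     # 3D scatter with projection lines
    return "radar"              # 4+ groups fall back to a radar/spider chart

print([pick_chart_type(n) for n in (1, 2, 3, 5)])
```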

Multi-Model Comparison

Run the same papers through different AI models, then compare:

  • Agent Comparison: Side-by-side scores per study, per theory group
  • Summary Statistics: Average scores per model across contexts and groups
  • Distance Matrix: Pairwise squared-difference heatmap showing model agreement
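The distance matrix can be sketched as a pairwise sum of squared score differences; this is an assumed reading of "pairwise squared-difference heatmap", with a hypothetical helper name, and it assumes each model's scores form a vector aligned on the same factors:

```python
# Sketch (assumed): pairwise squared-difference matrix between models' score
# vectors. Lower values mean closer agreement between two models.
import numpy as np

def model_distance_matrix(scores: dict) -> np.ndarray:
    names = list(scores)
    mat = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            diff = np.asarray(scores[a]) - np.asarray(scores[b])
            mat[i, j] = float(np.sum(diff ** 2))  # sum of squared differences
    return mat

scores = {"gemini": [0.5, -0.2], "claude": [0.4, 0.0]}
print(model_distance_matrix(scores))  # symmetric, zeros on the diagonal
```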

Requirements

  • Python 3.10+
  • pandas, numpy
  • plotly (for visualization)
  • PyPDF2 (for PDF extraction)
  • ipywidgets (for interactive dashboard, optional)
  • OpenAI client (for local model support, optional)

Background

This framework grew out of the TcGLO (Theory-comparison Glossary Local/Oddball) benchmark for evaluating predictive coding theories in neuroscience. The original notebook (TcGLO_HPC_Local.ipynb) remains fully self-contained and backward-compatible — the core/ modules and domains/ configs are a parallel extraction that generalizes the same approach for any field.

Contributing

  1. Fork the repo
  2. Create a feature branch
  3. To add a new research domain, follow the guide in .claude/skills/study-eval/domain-config-schema.md
  4. Submit a PR with your domain config + any skill files

License

TBD
