vanderbilt-data-science/study-eval


Study Evaluation Framework

A domain-agnostic framework for evaluating research papers against structured factor glossaries using AI. Originally built for neuroscience predictive coding research, the architecture generalizes to any research field — electronics, economics, materials science, and beyond.

What It Does

Given a collection of research papers (PDFs) and a domain-specific glossary of factors, the framework:

  1. Extracts text from PDFs
  2. Scores each factor across multiple evaluation contexts using AI models (Gemini, Claude, local LLMs)
  3. Stores results in a standardized benchmark table
  4. Visualizes agreement across studies and models with interactive 3D scatter plots, comparison charts, and distance matrices
  5. Compares how different AI models interpret the same literature

The system uses a structured 4-part prompt (Role, Logic, Constraints, Task) to produce consistent, reproducible evaluations with full reasoning logs.
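The 4-part prompt assembly might be sketched as follows. This is a minimal illustration, not the framework's actual code: the `build_prompt` helper, the section labels, and the sample prompt fields are all hypothetical, assuming the `prompts` dict mirrors the config schema shown later in this README.

```python
# Hypothetical sketch of combining the four prompt parts (Role, Logic,
# Constraints, Task) into one evaluation prompt. Names are illustrative.

def build_prompt(prompts: dict, paper_text: str, factor: str) -> str:
    """Concatenate the four prompt sections into a single evaluation prompt."""
    task = prompts["task_template"].format(factor=factor, paper_text=paper_text)
    return "\n\n".join([
        "ROLE: " + prompts["role"],
        "LOGIC: " + prompts["logic"],
        "CONSTRAINTS: " + prompts["constraints"],
        "TASK: " + task,
    ])

prompts = {
    "role": "You are a domain expert reviewer.",
    "logic": "Score evidence on a -1 to 1 scale.",
    "constraints": "Cite passages; log your reasoning.",
    "task_template": "Evaluate factor '{factor}' in:\n{paper_text}",
}
prompt = build_prompt(prompts, "Example paper text...", "SST inhibition")
```

Keeping the four sections fixed while only the Task varies per factor is what makes the evaluations reproducible across papers and models.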

Quick Start

from core.config import load_domain_config
from core.evaluation import get_study_eval, get_info_from_pdf
from core.dashboard import StudyBenchmarkDashboard

# Load a domain configuration
config = load_domain_config('domains/neuroscience_predictive_coding.json')

# Create an interactive dashboard
dashboard = StudyBenchmarkDashboard(config, model_save_path='./output')

# Evaluate a paper
text = get_info_from_pdf('path/to/paper.pdf')
result = get_study_eval(text, 'gemini-2.5-pro', config)

# result['evaluations'] = {'LO': {...scores...}, 'GO': {...scores...}}
# result['first_author'] = 'Smith'
# result['reasoning_log_text'] = '...'

Included Domains

Neuroscience: Predictive Coding (TcGLO)

The founding domain. Evaluates neuroscience papers for evidence of predictive coding mechanisms:

  • 36 factors organized into 3 hypothesis groups:
    • H1 (Suppression): How the brain minimizes surprise — SST inhibition, adaptation, gain control
    • H2 (Propagation): How error signals travel feedforward — AMPA/NMDA, gamma oscillations, laminar profiles
    • H3 (Ubiquitousness): How universal the mechanism is — across cortical areas, species, modalities
  • 2 contexts: Local Oddball (LO) and Global Oddball (GO)
  • 3 study types: Empirical, Theoretical, Computational

Electronics: Circuit Architecture (Example)

A starter domain demonstrating the framework's flexibility:

  • 18 factors across 3 groups: Power Architecture, Signal Integrity, Reliability
  • 3 contexts: High Frequency, Low Frequency, DC Steady State
  • 4 study types: Simulation, Bench Test, Field Data, Analytical

Architecture

study-eval/
├── TcGLO_HPC_Local.ipynb            # Original self-contained notebook (unchanged)
│
├── domains/                         # Domain configurations
│   ├── neuroscience_predictive_coding.json
│   └── electronics_architecture.json
│
├── core/                            # Generalized Python framework
│   ├── config.py                    # DomainConfig loader + validator
│   ├── columns.py                   # Column naming (context_prefix + factor)
│   ├── prompts.py                   # 4-part prompt template system
│   ├── evaluation.py                # AI evaluation pipeline (N contexts)
│   ├── dashboard.py                 # Interactive benchmark dashboard
│   └── visualization.py             # Scatter dispatch (1D/2D/3D/radar)
│
├── notebooks/
│   └── generic_eval_demo.ipynb      # Demo: both domains with same code
│
└── .claude/skills/                  # Agent Skills for Claude Code
    ├── study-eval/                  # Core framework docs
    ├── study-eval-neuro/            # Neuroscience evaluation skill
    ├── study-eval-electronics/      # Electronics evaluation skill
    ├── study-eval-glossary/         # Glossary management
    └── study-eval-compare/          # Model comparison

See OVERVIEW.md for detailed architecture documentation.

How a DomainConfig Works

Everything domain-specific lives in a single JSON file:

{
  "domain": {"id": "your_field", "name": "Your Research Field"},
  "contexts": [
    {"id": "CTX1", "name": "Context One", "column_prefix": "Context_One", "description": "..."}
  ],
  "theory_groups": [
    {"id": "G1", "name": "Group Name", "full_label": "G1 (Group Name)", "description": "..."}
  ],
  "study_types": [
    {"id": "experimental", "label": "Experimental", "color": "#27ae60", "symbol": "circle"}
  ],
  "scoring": {"scale_min": -1.0, "scale_max": 1.0, "scale_descriptions": [...]},
  "prompts": {"role": "...", "logic": "...", "constraints": "...", "task_template": "..."},
  "glossary": {
    "Factor Name": {"id": 1, "def": "...", "rel": [...], "tag": "Quantitative", "modes": ["CTX1"], "theory_group": "G1"}
  }
}

The framework automatically adapts column naming, prompts, dashboard tabs, and visualizations to your config.
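The column-naming rule (handled by `core/columns.py`, described above as "context_prefix + factor") can be illustrated with a sketch; the `column_name` function and the underscore-joining convention are assumptions, not the framework's verified implementation:

```python
# Hypothetical sketch: derive a benchmark-table column name from a context's
# column_prefix and a glossary factor name. The joining rule is assumed.

def column_name(context: dict, factor: str) -> str:
    """Join the context's column_prefix to a factor name."""
    return f"{context['column_prefix']}_{factor.replace(' ', '_')}"

ctx = {"id": "CTX1", "column_prefix": "Context_One"}
print(column_name(ctx, "Factor Name"))  # -> Context_One_Factor_Name
```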

Adding a New Domain

  1. Copy domains/electronics_architecture.json as a template
  2. Define your contexts, theory groups, study types, and glossary
  3. The framework validates cross-references (modes reference valid contexts, theory groups exist)
  4. Optionally create a Claude Code skill in .claude/skills/study-eval-<your-domain>/
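The cross-reference validation in step 3 might look roughly like this; the `validate_config` function and its error-message format are hypothetical, assuming the config shape shown in the JSON schema above:

```python
# Sketch (assumed, not the framework's actual validator): every factor's
# "modes" must name a declared context, and its "theory_group" a declared group.

def validate_config(config: dict) -> list[str]:
    errors = []
    context_ids = {c["id"] for c in config["contexts"]}
    group_ids = {g["id"] for g in config["theory_groups"]}
    for name, factor in config["glossary"].items():
        for mode in factor.get("modes", []):
            if mode not in context_ids:
                errors.append(f"{name}: unknown context '{mode}'")
        if factor.get("theory_group") not in group_ids:
            errors.append(f"{name}: unknown theory group")
    return errors

config = {
    "contexts": [{"id": "CTX1"}],
    "theory_groups": [{"id": "G1"}],
    "glossary": {
        "Good Factor": {"modes": ["CTX1"], "theory_group": "G1"},
        "Bad Factor": {"modes": ["CTX9"], "theory_group": "G1"},
    },
}
print(validate_config(config))  # one error, for Bad Factor's unknown context
```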

Visualization

The framework auto-selects visualization type based on your theory group count:

  Groups   Chart Type
  1        Strip plot
  2        2D scatter
  3        3D scatter with projection lines
  4+       Radar/spider chart

All plots are interactive (Plotly) and saved as HTML.
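The dispatch implied by the table above is straightforward; a minimal sketch (the `pick_chart_type` name and string labels are hypothetical, not the framework's API):

```python
# Hypothetical sketch of the chart-type dispatch keyed on theory group count.

def pick_chart_type(n_groups: int) -> str:
    if n_groups == 1:
        return "strip"          # 1D strip plot
    if n_groups == 2:
        return "scatter_2d"     # 2D scatter
    if n_groups == 3:
        return "scatter_3d"     # 3D scatter with projection lines
    return "radar"              # 4+ groups fall back to a radar/spider chart

print([pick_chart_type(n) for n in (1, 2, 3, 5)])
```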

Multi-Model Comparison

Run the same papers through different AI models, then compare:

  • Agent Comparison: Side-by-side scores per study, per theory group
  • Summary Statistics: Average scores per model across contexts and groups
  • Distance Matrix: Pairwise squared-difference heatmap showing model agreement
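The distance matrix can be sketched as a pairwise sum of squared score differences; this is an assumed reading of "pairwise squared-difference heatmap", with a hypothetical helper name, and it assumes each model's scores form a vector aligned on the same factors:

```python
# Sketch (assumed): pairwise squared-difference matrix between models' score
# vectors. Lower values mean closer agreement between two models.
import numpy as np

def model_distance_matrix(scores: dict) -> np.ndarray:
    names = list(scores)
    mat = np.zeros((len(names), len(names)))
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            diff = np.asarray(scores[a]) - np.asarray(scores[b])
            mat[i, j] = float(np.sum(diff ** 2))  # sum of squared differences
    return mat

scores = {"gemini": [0.5, -0.2], "claude": [0.4, 0.0]}
print(model_distance_matrix(scores))  # symmetric, zeros on the diagonal
```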

Requirements

  • Python 3.10+
  • pandas, numpy
  • plotly (for visualization)
  • PyPDF2 (for PDF extraction)
  • ipywidgets (for interactive dashboard, optional)
  • OpenAI client (for local model support, optional)

Background

This framework grew out of the TcGLO (Theory-comparison Glossary Local/Oddball) benchmark for evaluating predictive coding theories in neuroscience. The original notebook (TcGLO_HPC_Local.ipynb) remains fully self-contained and backward-compatible — the core/ modules and domains/ configs are a parallel extraction that generalizes the same approach for any field.

Contributing

  1. Fork the repo
  2. Create a feature branch
  3. To add a new research domain, follow the guide in .claude/skills/study-eval/domain-config-schema.md
  4. Submit a PR with your domain config + any skill files

License

TBD
