- Rationale & Vision
- Key Features
- Architecture Overview
- Prerequisites
- Installation & Setup
- Running the System
- How It Works
- Configuration
- Development Workflow
- Troubleshooting
- Additional Documentation
The scientific community faces a reproducibility crisis: many published research papers cannot be independently replicated due to:
- Missing or incomplete code repositories
- Unavailable or poorly documented datasets
- Ambiguous experimental protocols and hyperparameters
- Lack of detailed methodology descriptions
- Inconsistent reporting of statistical procedures
Manual evaluation of paper reproducibility is:
- Time-consuming: Requires expert reviewers to spend hours per paper
- Inconsistent: Different reviewers may apply different standards
- Not scalable: Impossible to evaluate thousands of papers at conferences
- Subjective: Human bias in interpretation of criteria
PaperSnitch automates the reproducibility assessment process using:
- Multi-Step Retrieval-Augmented Generation (RAG): Intelligently retrieves relevant paper sections for each evaluation criterion using semantic embeddings
- Structured LLM Analysis: Uses gpt-5 with Pydantic schemas for schema-constrained, reliably parseable outputs
- Programmatic Scoring: Combines LLM-based text analysis with rules-based scoring algorithms
- Code Repository Analysis: Automatically ingests, analyzes, and embeds source code to evaluate reproducibility artifacts
- Workflow Orchestration: LangGraph-based DAG execution with database persistence and fault tolerance
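As a rough illustration of what "structured outputs" buys here, each criterion verdict can be parsed into a typed record. The field names below are hypothetical (the project's actual schemas are Pydantic models passed to the OpenAI structured-output API); a stdlib dataclass stands in for the idea:

```python
from dataclasses import dataclass

# Hypothetical field names -- the real project defines Pydantic schemas
# and has the LLM fill them via structured outputs.
@dataclass
class CriterionEvaluation:
    criterion_id: str
    present: bool       # is the criterion satisfied in the paper?
    confidence: float   # 0.0 - 1.0
    evidence: str       # paper excerpt supporting the verdict

ev = CriterionEvaluation(
    criterion_id="mathematical_description",
    present=True,
    confidence=0.9,
    evidence="Section 3.2 defines the loss in Eq. (4).",
)
print(ev.present, ev.confidence)  # True 0.9
```

Because every verdict has the same shape, downstream scoring can be purely programmatic rather than parsing free-form LLM text.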
This system enables:
- Large-scale conference analysis: Process hundreds of papers efficiently
- Consistent evaluation standards: Same criteria applied uniformly
- Actionable feedback: Specific recommendations for improving reproducibility
- Quantifiable metrics: Numerical scores for comparison and benchmarking
- Research insights: Understanding reproducibility trends across domains
- Paper Type Classification: Automatically identifies papers as dataset/method/both/theoretical
- Adaptive Scoring: Weights criteria based on paper type (e.g., datasets more important for dataset papers)
- Code Intelligence: LLM-guided selection of reproducibility-critical files from repositories
- Multi-Criterion Evaluation: 20 reproducibility criteria + 10 dataset documentation criteria + 6 code analysis components
- Paper-Level Analysis: Evaluates mathematical descriptions, experimental protocols, statistical reporting
- Code-Level Analysis: Checks for training code, evaluation scripts, checkpoints, dependencies
- Dataset Documentation: Assesses data collection, annotation protocols, ethical compliance
- Evidence-Based Scoring: Links each evaluation to specific paper sections
- Distributed Execution: Celery workers for parallel paper processing
- Database-Backed: MySQL persistence for all workflow states and results
- Fault Tolerant: Automatic retries, error isolation, partial result aggregation
- Token Tracking: Fine-grained cost accounting per workflow node
- PDF Upload: Direct paper upload with automatic text extraction (GROBID)
- Conference Scraping: Batch import papers from conference websites (MICCAI, etc.)
- Analysis Dashboard: View results, scores, and detailed criterion evaluations
- User Management: Profile-based tracking of analysis history
┌─────────────────────────────────────────────────────────┐
│ NGINX (Reverse Proxy) │
│ Port 80/443 (SSL via Let's Encrypt) │
└────────────────────┬────────────────────────────────────┘
│
┌────────────┴───────────────┐
│ │
┌────▼──────┐ ┌──────▼─────┐
│ Django │ │ Static │
│ Web │◄─────────────┤ Files │
│ (ASGI) │ └────────────┘
└─────┬─────┘
│
┌────┴─────────────┬──────────────┐
│ │ │
┌───▼────┐ ┌──────▼──────┐ ┌───▼────────┐
│ Celery │ │ MySQL │ │ Redis │
│Workers │◄────►│ Database │ │ (Broker) │
│ (3-5) │ │ (InnoDB) │ └────────────┘
└───┬────┘ └─────────────┘
│
└──► GROBID Server (PDF → TEI-XML)
└──► LLM APIs (OpenAI/LiteLLM)
| Component | Technology | Purpose |
|---|---|---|
| Web Framework | Django 5.2.7 | HTTP server, ORM, admin interface |
| Workflow Engine | LangGraph 1.0.6 | DAG-based workflow orchestration |
| Task Queue | Celery 5.x | Distributed async task execution |
| Message Broker | Redis 7 | Celery task queue backend |
| Database | MariaDB 11.7 | Persistent storage (MySQL 8.0 compatible) |
| Document Processing | GROBID 0.8.0 | PDF → structured XML extraction |
| Web Scraping | Crawl4AI 0.7.6 | Conference website data extraction |
| Code Ingestion | GitIngest 0.3.1 | Repository cloning and file extraction |
| LLM Integration | OpenAI SDK 2.7.2 | gpt-5 API calls with structured outputs |
| Embeddings | text-embedding-3-small | 1536-dim semantic vectors for RAG |
The analysis pipeline consists of 8 nodes executed as a directed acyclic graph (DAG):
┌─────────────────────────────────┐
│ A. Paper Type Classification │
│ (dataset/method/both/ │
│ theoretical) │
└──────────────┬──────────────────┘
│
┌──────────────▼──────────────────┐
│ D. Section Embeddings │
│ (text-embedding-3-small) │
└──────────────┬──────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌───────────▼──────────┐ ┌───▼────────────┐ ┌─▼─────────────────┐
│ G. Reproducibility │ │ H. Dataset │ │ B. Code │
│ Checklist │ │ Docs Check │ │ Availability │
│ (20 criteria) │ │ (10 crit.) │ │ Check │
└───────────┬──────────┘ └───┬────────────┘ └─┬─────────────────┘
│ │ │
│ │ ┌────────────▼──────────────┐
│ │ │ F. Code Embedding │
│ │ │ (repo ingestion) │
│ │ └────────────┬───────────────┘
│ │ │
│ │ ┌────────────▼──────────────┐
│ │ │ C. Code Repository │
│ │ │ Analysis │
│ │ └────────────┬───────────────┘
│ │ │
└──────────────────┴──────────────────┴──────────────┐
│
┌──────────────────────────────▼─┐
│ I. Final Aggregation │
│ (weighted scoring + LLM) │
└─────────────────────────────────┘
Node Responsibilities:
- Node A: Classify paper type using title + abstract
- Node B: Search for code URLs in paper (GitHub, GitLab, etc.)
- Node C: Analyze code repository structure and reproducibility artifacts
- Node D: Generate embeddings for all paper sections
- Node F: Ingest code repository, select critical files, embed chunks
- Node G: Evaluate 20 reproducibility criteria via multi-step RAG
- Node H: Evaluate 10 dataset documentation criteria
- Node I: Aggregate scores, generate qualitative assessment
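The dependency structure above can be sketched with the stdlib graphlib module. Node letters match the diagram; this mirrors the DAG shape only, not the actual LangGraph wiring:

```python
from graphlib import TopologicalSorter

# Each node maps to its prerequisites; letters match the workflow diagram.
dag = {
    "A": set(),           # Paper Type Classification
    "D": {"A"},           # Section Embeddings
    "G": {"D"},           # Reproducibility Checklist
    "H": {"D"},           # Dataset Docs Check
    "B": {"D"},           # Code Availability Check
    "F": {"B"},           # Code Embedding
    "C": {"F"},           # Code Repository Analysis
    "I": {"G", "H", "C"}, # Final Aggregation
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # one valid topological order: starts at "A", ends at "I"
```

Nodes G, H, and the B→F→C chain share no edges with each other, which is what lets Celery run them in parallel before Node I joins the results.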
- Docker: 24.0+ with Docker Compose V2
- Git: 2.30+
- Linux/macOS: Tested on Ubuntu 22.04+ and macOS 13+
You'll need an OpenAI API key with access to:
- gpt-5: For structured analysis (gpt-5-2024-11-20 or later)
- text-embedding-3-small: For semantic embeddings
Minimum:
- 4 CPU cores
- 16 GB RAM
- 50 GB disk space
Recommended:
- 8 CPU cores
- 32 GB RAM
- 100 GB SSD storage
git clone https://github.com/yourusername/papersnitch.git
cd papersnitch

Create your local environment file:
cp .env.example .env.local

Edit .env.local with your settings:
# Database Configuration
MYSQL_ROOT_PASSWORD=your_secure_root_password
MYSQL_DATABASE=papersnitch
MYSQL_USER=papersnitch
MYSQL_PASSWORD=your_secure_password
# Django Configuration
DJANGO_SECRET_KEY=your_very_long_random_secret_key_here
DJANGO_DEBUG=True
DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1
# OpenAI API
OPENAI_API_KEY=sk-proj-your-api-key-here
# Celery Configuration
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_RESULT_BACKEND=redis://redis:6379/0
# GROBID Configuration (optional - uses public server by default)
GROBID_SERVER=https://cloud.science-miner.com/grobid
# Stack Configuration (auto-generated by create-dev-stack.sh)
STACK_SUFFIX=dev
HOST_PROJECT_PATH=/home/youruser/papersnitch

Security Note: Never commit .env.local to version control!
Use the provided script to start all services:
./create-dev-stack.sh up 8000 dev

What this does:
- Finds available ports (8000 for Django, 3307 for MySQL, 6380 for Redis, 8071 for GROBID)
- Creates stack-specific directories (mysql_dev, media_dev, static_dev)
- Generates .env.dev with port configuration
- Starts Docker Compose services:
  - django-web-dev: Django application server
  - mysql: MariaDB 11.7 database
  - redis: Redis message broker
  - celery-worker: Background task processor
  - celery-beat: Periodic task scheduler
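The port-probing step can be sketched with the stdlib socket module. This is a guess at the approach, not the script's actual code:

```python
import socket

def find_free_port(start: int, limit: int = 50) -> int:
    """Return the first TCP port >= start that can be bound locally."""
    for port in range(start, start + limit):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port  # bind succeeded, port is free
            except OSError:
                continue     # port in use, try the next one
    raise RuntimeError(f"no free port in [{start}, {start + limit})")

print(find_free_port(8000))
```

Probing from a base port like this is what lets multiple stacks (dev, feature-x, ...) coexist on one host without manual port bookkeeping.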
Expected output:
🚀 Starting development stack: dev
📍 Base port requested: 8000
✅ Available ports found:
Django: 8000
MySQL: 3307
Redis: 6380
GROBID: 8071
✅ Created .env.dev from .env.local with port config
🐳 Starting containers...
Wait for database to be healthy, then run migrations:
# Check if MySQL is ready
docker exec mysql-dev mariadb -u papersnitch -ppapersnitch -e "SELECT 1"
# Run Django migrations
docker exec django-web-dev python manage.py migrate
# Create superuser for admin access
docker exec -it django-web-dev python manage.py createsuperuser

Pre-compute embeddings for reproducibility criteria (one-time setup):
docker exec django-web-dev python manage.py initialize_criteria_embeddings

What this does:
- Creates embeddings for 20 reproducibility checklist criteria
- Creates embeddings for 10 dataset documentation criteria
- Stores in database for semantic retrieval during analysis
Access the web interface:
http://localhost:8000
Check service health:
# View all running containers
docker ps
# Check Django logs
docker logs django-web-dev
# Check Celery worker logs
docker logs celery-worker-dev

# Start the stack
./create-dev-stack.sh up 8000 dev
# Stop the stack (preserves data)
./create-dev-stack.sh stop 8000 dev
# Stop and remove containers (preserves data)
./create-dev-stack.sh down 8000 dev
# View logs (all services)
./create-dev-stack.sh logs 8000 dev
# View specific service logs
docker logs -f django-web-dev
docker logs -f celery-worker-dev

- Navigate to http://localhost:8000
- Log in with superuser credentials
- Upload a PDF or paste arXiv URL
- Click "Analyze Reproducibility"
- View results in real-time as workflow executes
- Navigate to http://localhost:8000/admin
- Go to Papers → Add Paper
- Upload PDF and fill metadata
- Go to Workflow Runs → Add Workflow Run
- Select paper and workflow definition
- Save to trigger analysis
docker exec -it django-web-dev python manage.py shell

from webApp.models import Paper, WorkflowDefinition
from webApp.services.workflow_orchestrator import WorkflowOrchestrator
# Get paper and workflow
paper = Paper.objects.first()
workflow_def = WorkflowDefinition.objects.get(
name="paper_processing_with_reproducibility",
version=8
)
# Create workflow run
orchestrator = WorkflowOrchestrator()
workflow_run = orchestrator.create_workflow_run(
workflow_definition=workflow_def,
paper=paper,
context_data={
"model": "gpt-5-2024-11-20",
"force_reprocess": False
}
)
print(f"Workflow run created: {workflow_run.id}")

View workflow progress in Django admin at:
http://localhost:8000/admin/workflow_engine/workflowrun/
Or query the database:
docker exec -it mysql-dev mariadb -u papersnitch -ppapersnitch papersnitch
# Check workflow status
SELECT id, status, started_at, completed_at
FROM workflow_runs
ORDER BY created_at DESC LIMIT 10;
# Check node status
SELECT node_id, status, duration_seconds, input_tokens, output_tokens
FROM workflow_nodes
WHERE workflow_run_id = 'your-workflow-run-id'
ORDER BY started_at;

PaperSnitch uses an 8-node DAG workflow to comprehensively evaluate research paper reproducibility:
- Paper Type Classification (Node A): Determines whether the paper is dataset/method/both/theoretical using LLM analysis of the title and abstract
- Section Embeddings (Node D): Generates semantic embeddings for all paper sections (abstract, intro, methods, results, etc.) using text-embedding-3-small
- Parallel Analysis:
  - Reproducibility Checklist (Node G): Evaluates 20 criteria using multi-step RAG (retrieves relevant sections per criterion, then analyzes with LLM)
  - Dataset Documentation (Node H): Evaluates 10 dataset-specific criteria
  - Code Workflow (Nodes B→F→C):
    - Node B: Searches for code repository URLs
    - Node F: Ingests the repo, LLM selects critical files, embeds all file chunks
    - Node C: Analyzes repository structure, artifacts, and reproducibility
- Final Aggregation (Node I): Combines all scores with adaptive weighting and generates a qualitative assessment
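The final aggregation reduces to a weighted mean over category scores; the numbers and weights below are hypothetical stand-ins (the real node also produces an LLM-written qualitative assessment):

```python
def aggregate_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-category scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(scores[c] * weights[c] for c in weights) / total

# Hypothetical numbers: a dataset paper up-weights dataset documentation.
scores  = {"checklist": 7.5, "dataset_docs": 9.0, "code": 6.0}
weights = {"checklist": 1.0, "dataset_docs": 2.0, "code": 1.0}
print(aggregate_score(scores, weights))  # 7.875
```

Adaptive weighting means the same category scores yield different overall scores depending on the paper type classified in Node A.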
Multi-Step RAG for Criterion Evaluation:
# For each criterion:
1. Retrieve top-3 most relevant paper sections via cosine similarity
2. Provide sections + criterion description to LLM
3. Get structured analysis (present/absent, confidence, evidence)
4. Aggregate 20 criterion analyses → category scores → overall score

Adaptive Code Scoring:
# Scoring adapts to research methodology
if methodology == "deep_learning":
# Requires: training code + checkpoints + datasets
max_score_components = {
"code_completeness": 3.0,
"artifacts": 2.5, # Checkpoints critical
"dataset_splits": 2.0
}
elif methodology == "theoretical":
# Requires: implementation code only
max_score_components = {
"code_completeness": 2.5,
"artifacts": 0.5, # Checkpoints not applicable
"dataset_splits": 0.5
}

LLM-Guided Code File Selection:
# Instead of embedding entire repository:
1. Extract README + file tree
2. LLM selects reproducibility-critical files (within 100k token budget)
3. Only embed selected files (20k char chunks)
4. Use embeddings for evidence-based component analysis

For detailed technical documentation, see TECHNICAL_DESCRIPTION_FOR_PAPER.md.
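The retrieval steps in both sketches above boil down to cosine-similarity ranking over embeddings. A toy version, with 3-dim vectors standing in for the 1536-dim text-embedding-3-small vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_sections(query_emb, section_embs, k=3):
    """Rank sections by cosine similarity to a criterion embedding."""
    ranked = sorted(section_embs,
                    key=lambda name: cosine(query_emb, section_embs[name]),
                    reverse=True)
    return ranked[:k]

# Toy 3-dim vectors stand in for real 1536-dim embeddings.
sections = {
    "methods":      [1.0, 0.0, 0.0],
    "results":      [0.9, 0.1, 0.0],
    "related_work": [0.0, 1.0, 0.0],
}
print(top_k_sections([1.0, 0.0, 0.0], sections, k=2))  # ['methods', 'results']
```

In the actual pipeline the query embedding is a pre-computed criterion embedding (see initialize_criteria_embeddings) and the candidates are paper sections or code chunks.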
Key variables in .env.local:
# OpenAI API
OPENAI_API_KEY=sk-proj-...
DEFAULT_LLM_MODEL=gpt-5-2024-11-20
EMBEDDING_MODEL=text-embedding-3-small
# Database
MYSQL_DATABASE=papersnitch
MYSQL_USER=papersnitch
MYSQL_PASSWORD=your_password
# Celery
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_CONCURRENCY=8 # Tasks per worker
CELERY_MAX_TASKS_PER_CHILD=1 # Restart after 1 task
# Security
DJANGO_SECRET_KEY=...
DJANGO_DEBUG=True
DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1

Modify criteria or scoring weights in Django admin or via shell:
from webApp.models import ReproducibilityChecklistCriterion
criterion = ReproducibilityChecklistCriterion.objects.get(
criterion_id="mathematical_description"
)
criterion.description = "Updated description..."
criterion.save()
# Regenerate embedding after modification
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
model="text-embedding-3-small",
input=f"{criterion.criterion_name}\n{criterion.description}"
)
criterion.embedding = response.data[0].embedding
criterion.save()

Support for parallel development environments:
# Main dev stack on port 8000
./create-dev-stack.sh up 8000 dev
# Feature branch on port 8001
./create-dev-stack.sh up 8001 feature-x
# Personal stack on port 8002
./create-dev-stack.sh up 8002 my-name
# Each stack has isolated database, media files, and Redis

Django auto-reloads on code changes via Docker Compose watch mode.
# Create migration
docker exec django-web-dev python manage.py makemigrations
# Apply migrations
docker exec django-web-dev python manage.py migrate
# Rollback
docker exec django-web-dev python manage.py migrate workflow_engine 0001

# All tests
docker exec django-web-dev python manage.py test
# Specific app
docker exec django-web-dev python manage.py test webApp.tests
# With coverage
docker exec django-web-dev coverage run manage.py test
docker exec django-web-dev coverage html

Port Already in Use:
# Use different port
./create-dev-stack.sh up 8001 dev

MySQL Connection Refused:
# Check MySQL health
docker exec mysql-dev mariadb -u papersnitch -ppapersnitch -e "SELECT 1"
# Restart MySQL
docker restart mysql-dev

Celery Workers Not Processing:
# Check worker status
docker exec django-web-dev celery -A web inspect active
# Restart workers
docker restart celery-worker-dev

OpenAI Rate Limits:
# Reduce concurrency in compose.dev.yml:
command: celery -A web worker --concurrency=2

Out of Memory:
# Increase Docker memory limit (Docker Desktop → Settings → Resources)
# Or reduce Celery concurrency
command: celery -A web worker --concurrency=2 --max-tasks-per-child=1

# Check retrieval for specific paper
python debug_aspect_retrieval.py --paper-id 123 --aspect methodology
# List papers with embeddings
python debug_aspect_retrieval.py --list-papers
# Verify workflow installation
python verify_workflow_installation.py

- TECHNICAL_DESCRIPTION_FOR_PAPER.md: Complete technical specification for academic paper
- WORKFLOW_ENGINE_DELIVERY.md: Workflow engine implementation details
- CODE_REPRODUCIBILITY_ANALYSIS.md: Code analysis node documentation
- DEPLOYMENT_CHECKLIST.md: Production deployment guide
- DOMAIN_SETUP_GUIDE.md: SSL and domain configuration
This project is licensed under the MIT License.
- GROBID: PDF text extraction
- LangGraph: Workflow orchestration
- OpenAI: LLM APIs
- Crawl4AI: Conference scraping
- GitIngest: Code repository ingestion
Built with ❤️ for the research community
Making reproducibility the norm, not the exception