opentests

A college research project on evaluating AI language models for education. We built an open-source platform that benchmarks how well different AI models perform on academic tasks — and lets the community vote on what "good" actually means.

Objective

The core question we wanted to answer: Which AI model should students and educators trust with their learning?

Static benchmarks don't tell the full story. A model might score well on a fixed test set but fail in real classroom scenarios. So we built two complementary evaluation systems:

- Golden Tests: Automated tests that evaluate models on two things that matter in education:
  - Raw Intelligence: Can the model find correct answers to academic questions (physics, chemistry, math)?
  - Visualization: Can the model represent information in helpful, visual ways (diagrams, structured explanations)?
- The Arena: A community-driven comparison system where users submit the same prompt to two AI models, see both responses side by side, and vote on which is better. These votes feed into an ELO rating system that ranks models based on real human preferences, not synthetic benchmarks.

Tech Stack

| Layer | Technology |
| --- | --- |
| Web App | Next.js 16, React 19, Tailwind CSS 4, Recharts, TanStack Form |
| CLI / TUI | OpenTUI (`@opentui/core`, `@opentui/react`), Bun |
| Backend | Convex (serverless), Vercel AI SDK, Groq |
| Tooling | TypeScript (strict), Bun workspaces, Biome |

Key Decisions

Why ELO?

We chose the ELO rating system — originally designed for chess — because it solves a fundamental problem in model evaluation: how do you fairly rank competitors when they don't all face the same opponents?

ELO works by predicting the expected outcome of a match between two players and adjusting their ratings based on whether the result was expected or surprising. If a lower-rated model beats a higher-rated one, it gains more points than if it beats an equally-rated model. Over many rounds, the ratings converge to reflect true relative skill.

For our use case, this means:

- Models don't need to be tested on the same prompts to be compared
- Upsets are rewarded more, which accelerates rating convergence
- The system is self-correcting: a single bad vote has minimal impact
- Everyone can see exactly how rankings are calculated (convex/votes.ts:61-69)

We use a K-factor of 32, a value commonly used for amateur chess ratings. It caps how far a single result can move a rating, and it provides a good balance between responsiveness and stability.
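
The rating update described above can be sketched as follows. This is a minimal illustration of the standard Elo formulas with K = 32, not a copy of the code in convex/votes.ts. The function names (`expectedScore`, `updateRatings`) and the example ratings are assumptions made for illustration.

```typescript
// Hypothetical sketch of a standard Elo update; not the code in convex/votes.ts.
const K_FACTOR = 32;

// Probability that a model rated `ratingA` beats a model rated `ratingB`.
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
}

// Returns the new ratings after `winner` beats `loser`.
// The winner gains exactly what the loser gives up, so the total is conserved.
function updateRatings(
  winner: number,
  loser: number,
): { winner: number; loser: number } {
  const expectedWin = expectedScore(winner, loser);
  const delta = K_FACTOR * (1 - expectedWin);
  return { winner: winner + delta, loser: loser - delta };
}

// Equal ratings: the result is a coin flip, so the winner gains K / 2 = 16.
const even = updateRatings(1000, 1000);

// Upset: a 1000-rated model beating a 1200-rated one gains more than 16.
const upset = updateRatings(1000, 1200);

console.log(even, upset);
```

Because the expected score is always strictly between 0 and 1, a single vote moves each rating by less than 32 points under these formulas.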

Why a TUI for Golden Tests?

The golden test runner runs dozens (sometimes hundreds) of model inferences in sequence — each one taking seconds. We built this as a Terminal UI instead of a web app for three reasons:

- Long-running processes belong in the terminal. Golden tests can take 30-40 minutes to complete across 8 models. The terminal is the natural environment for long-running batch jobs: it persists, it's scriptable, and it doesn't time out or lose state if you close a browser tab.
- Researchers live in the terminal. Our target users (educators, ML researchers, and students running evaluations) are already comfortable with CLI tools. A TUI gives them the speed of a CLI with the readability of a GUI.
- Lower overhead, faster iteration. No browser rendering, no HTTP polling, no WebSocket connections. The TUI streams results directly to the terminal in real time, making it easier to monitor progress, debug failures, and pipe output to files for later analysis.

The web app, on the other hand, is where users interactively compare models and vote — tasks that benefit from rich visual layouts, charts, and side-by-side text rendering.

Project Structure

opentests/
├── apps/
│   ├── web/                 # Next.js web application
│   │   ├── src/
│   │   │   ├── app/         # Pages: test/, visualization/, info/
│   │   │   ├── components/  # Reusable UI components
│   │   │   └── lib/         # Utilities, model catalog
│   │   └── package.json
│   │
│   └── cli/                 # Terminal UI application
│       ├── src/
│       │   ├── index.tsx    # CLI entry point
│       │   └── goldenTest/  # Automated test runner
│       └── package.json
│
├── convex/                  # Backend (serverless)
│   ├── schema.ts            # Database schema
│   ├── rounds.ts            # Model comparison rounds
│   ├── votes.ts             # Voting + ELO calculation
│   ├── stats.ts             # Leaderboard queries
│   └── model.ts             # Model configurations
│
├── src/
│   └── goldenTest/          # Shared golden test code
│
└── package.json             # Root workspace config

Setup Guide

Prerequisites

- Bun: JavaScript runtime and package manager. Install from bun.sh
- Git: For cloning the repository
- API Keys: You'll need inference API keys from providers like Groq

Step 1: Clone the Repository

git clone https://github.com/your-username/opentests.git
cd opentests

Step 2: Install Dependencies

bun install

This installs dependencies for all packages in the monorepo (web app, CLI, and shared code).

Step 3: Configure Environment Variables

Create a .env.local file in the root directory:

GROQ_API_KEY=your_groq_api_key_here

Add keys only for the providers you plan to use.
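
A missing or blank key usually only shows up as a failed inference later, so it can help to check up front. The sketch below is a hypothetical helper, not part of the repository: the name `missingKeys` and the sample environment object are assumptions for illustration.

```typescript
// Hypothetical helper: report which required variables are absent or blank.
function missingKeys(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Sample environment standing in for the real one (assumption for the demo).
const sampleEnv: Record<string, string | undefined> = {
  GROQ_API_KEY: "your_groq_api_key_here",
};

const missing = missingKeys(sampleEnv, ["GROQ_API_KEY"]);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
}
```

In a real script you would pass `process.env` in place of `sampleEnv`.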

Step 4: Start the Backend

In a separate terminal, start the Convex dev server:

npx convex dev

This syncs the functions in convex/ to your Convex dev deployment, watches them for changes, and generates types into convex/_generated/.

Step 5: Run the Web App

bun run dev:web

The web app will be available at http://localhost:3000. Use it to compare models side-by-side, vote on responses, and view ELO leaderboards.

Step 6: Run the CLI

bun run dev:cli

Use the CLI to run automated golden tests that evaluate models on raw intelligence and visualization capabilities.

Running Tests

# Full golden test suite (~30-40 min, 8 models)
bun run test:golden

# Quick validation test (~1 min, 2 models)
bun run test:validate

Adding a New Model

Edit convex/model.ts and add your model to the AVAILABLE_MODELS array:

{
  name: "Model Display Name",
  slug: "provider/model-slug",
}
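
For context, here is a minimal sketch of what such a list and a sanity check on it might look like. The `ModelConfig` type is inferred only from the two fields shown above, and the example entries and the `invalidSlugs` helper are hypothetical. The actual contents of convex/model.ts may differ.

```typescript
// Assumed shape of a model entry, based on the two fields shown above.
type ModelConfig = { name: string; slug: string };

// Hypothetical example entries; the real list lives in convex/model.ts.
const AVAILABLE_MODELS: ModelConfig[] = [
  { name: "Example Model A", slug: "example-provider/model-a" },
  { name: "Example Model B", slug: "example-provider/model-b" },
];

// Hypothetical check: every slug looks like "provider/model-slug" and is unique.
// Returns the slugs that fail either condition.
function invalidSlugs(models: ModelConfig[]): string[] {
  const seen = new Set<string>();
  const bad: string[] = [];
  for (const { slug } of models) {
    const wellFormed = /^[^/\s]+\/\S+$/.test(slug);
    if (!wellFormed || seen.has(slug)) bad.push(slug);
    seen.add(slug);
  }
  return bad;
}

console.log(invalidSlugs(AVAILABLE_MODELS));
```

A check like this catches a mistyped or duplicated slug before it reaches a 30-40 minute golden test run.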

License

This is a college research project. Use it freely for research, education, or your own evaluations.
