A college research project on evaluating AI language models for education. We built an open-source platform that benchmarks how well different AI models perform on academic tasks — and lets the community vote on what "good" actually means.
The core question we wanted to answer: Which AI model should students and educators trust with their learning?
Static benchmarks don't tell the full story. A model might score well on a fixed test set but fail in real classroom scenarios. So we built two complementary evaluation systems:
- Golden Tests: Automated tests that evaluate models on two things that matter in education:
  - Raw Intelligence: Can the model find correct answers to academic questions (physics, chemistry, math)?
  - Visualization: Can the model represent information in helpful, visual ways (diagrams, structured explanations)?
- The Arena: A community-driven comparison system where users submit the same prompt to two AI models, see both responses side-by-side, and vote on which is better. These votes feed into an ELO rating system that ranks models based on real human preferences, not synthetic benchmarks.
| Layer | Technology |
|---|---|
| Web App | Next.js 16, React 19, Tailwind CSS 4, Recharts, TanStack Form |
| CLI / TUI | OpenTUI (@opentui/core, @opentui/react), Bun |
| Backend | Convex (serverless), Vercel AI SDK, Groq |
| Tooling | TypeScript (strict), Bun workspaces, Biome |
We chose the ELO rating system — originally designed for chess — because it solves a fundamental problem in model evaluation: how do you fairly rank competitors when they don't all face the same opponents?
ELO works by predicting the expected outcome of a match between two players and adjusting their ratings based on how surprising the actual result was. If a lower-rated model beats a higher-rated one, it gains more points than if it beats an equally rated model. Over many rounds, the ratings converge toward each model's relative strength.
For our use case, this means:
- Models don't need to be tested on the same prompts to be compared
- Upsets are rewarded more, which accelerates rating convergence
- The system is self-correcting: a single bad vote moves a rating by at most K points, and later votes pull it back
- Everyone can see exactly how rankings are calculated (convex/votes.ts:61-69)
We use a K-factor of 32, a common choice for amateur chess. K caps how many points a rating can change in a single vote, and 32 gives a good balance between responsiveness and stability.
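The rating update described above can be sketched as follows. This is the textbook Elo formula with the K-factor of 32 mentioned here, not a copy of convex/votes.ts, which may differ in detail; the names `expectedScore` and `updateRatings` are illustrative assumptions.

```typescript
// Textbook Elo update (a sketch; the project's convex/votes.ts may differ).
const K_FACTOR = 32;

/** Predicted probability that a model rated `ratingA` beats one rated `ratingB`. */
function expectedScore(ratingA: number, ratingB: number): number {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

/** New ratings for both models after `winner` is voted better than `loser`. */
function updateRatings(
  winner: number,
  loser: number,
  k: number = K_FACTOR,
): { winner: number; loser: number } {
  // The less expected the win, the larger the transfer of points.
  const delta = k * (1 - expectedScore(winner, loser));
  return { winner: winner + delta, loser: loser - delta };
}

const evenMatch = updateRatings(1500, 1500); // winner gains exactly 16
const upset = updateRatings(1400, 1600);     // winner gains about 24.3
```

With equal ratings the winner gains 16 points; when a 1400-rated model beats a 1600-rated one it gains about 24.3, which is the larger reward for upsets. No single vote can move a rating by more than 32 points.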
The golden test runner runs dozens (sometimes hundreds) of model inferences in sequence — each one taking seconds. We built this as a Terminal UI instead of a web app for three reasons:
- Long-running processes belong in the terminal. Golden tests can take 30-40 minutes to complete across 8 models. The terminal is the natural environment for long-running batch jobs: it persists, it's scriptable, and it doesn't time out or lose state if you close a browser tab.
- Researchers live in the terminal. Our target users (educators, ML researchers, and students running evaluations) are already comfortable with CLI tools. A TUI gives them the speed of a CLI with the readability of a GUI.
- Lower overhead, faster iteration. No browser rendering, no HTTP polling, no WebSocket connections. The TUI streams results directly to the terminal in real time, making it easier to monitor progress, debug failures, and pipe output to files for later analysis.
The web app, on the other hand, is where users interactively compare models and vote — tasks that benefit from rich visual layouts, charts, and side-by-side text rendering.
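The sequential, streaming behaviour of the golden test runner can be illustrated with a minimal sketch. Everything here is hypothetical: the `TestCase` shape, the `Infer` callback, the exact-match grading, and the tab-separated output format are assumptions for illustration, not the code in the CLI's goldenTest/ directory.

```typescript
// Hypothetical types; the real runner's data model may look different.
type TestCase = { id: string; prompt: string; expected: string };
type Infer = (model: string, prompt: string) => Promise<string>;

/** Runs every test case against every model, one inference at a time. */
async function runSequentially(
  models: string[],
  cases: TestCase[],
  infer: Infer,
  log: (line: string) => void = console.log,
): Promise<Map<string, number>> {
  const passed = new Map<string, number>();
  for (const model of models) {
    passed.set(model, 0);
    for (const testCase of cases) {
      let ok = false;
      try {
        const answer = await infer(model, testCase.prompt);
        ok = answer.trim() === testCase.expected; // assumed exact-match grading
      } catch {
        ok = false; // a failed request counts as a failed test
      }
      if (ok) passed.set(model, (passed.get(model) ?? 0) + 1);
      // Report each result as soon as it is known, one line per inference.
      log(`${model}\t${testCase.id}\t${ok ? "pass" : "fail"}`);
    }
  }
  return passed;
}
```

Because each result is written as a plain line the moment it arrives, the output can be watched live or redirected to a file for later analysis.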
opentests/
├── apps/
│   ├── web/                  # Next.js web application
│   │   ├── src/
│   │   │   ├── app/          # Pages: test/, visualization/, info/
│   │   │   ├── components/   # Reusable UI components
│   │   │   └── lib/          # Utilities, model catalog
│   │   └── package.json
│   │
│   └── cli/                  # Terminal UI application
│       ├── src/
│       │   ├── index.tsx     # CLI entry point
│       │   └── goldenTest/   # Automated test runner
│       └── package.json
│
├── convex/                   # Backend (serverless)
│   ├── schema.ts             # Database schema
│   ├── rounds.ts             # Model comparison rounds
│   ├── votes.ts              # Voting + ELO calculation
│   ├── stats.ts              # Leaderboard queries
│   └── model.ts              # Model configurations
│
├── src/
│   └── goldenTest/           # Shared golden test code
│
└── package.json              # Root workspace config
- Bun — JavaScript runtime and package manager. Install from bun.sh
- Git — For cloning the repository
- API Keys — You'll need inference API keys from providers like Groq
git clone https://github.com/your-username/opentests.git
cd opentests
bun install

This installs dependencies for all packages in the monorepo (web app, CLI, and shared code).
Create a .env.local file in the root directory:
GROQ_API_KEY=your_groq_api_key_here

Add keys only for the providers you plan to use.
In a separate terminal, start the Convex dev server:
npx convex dev

This runs the backend locally and generates types into convex/_generated/.
bun run dev:web

The web app will be available at http://localhost:3000. Use it to compare models side-by-side, vote on responses, and view ELO leaderboards.
bun run dev:cli

Use the CLI to run automated golden tests that evaluate models on raw intelligence and visualization capabilities.
# Full golden test suite (~30-40 min, 8 models)
bun run test:golden
# Quick validation test (~1 min, 2 models)
bun run test:validate

Edit convex/model.ts and add your model to the AVAILABLE_MODELS array:
{
  name: "Model Display Name",
  slug: "provider/model-slug",
}

This is a college research project. Use it freely for research, education, or your own evaluations.