Read the full behavioral profile on Medium →
An interactive web application for exploring the psychometric battery results of Qwen 2.5 0.5B Instruct — a small language model profiled across 32 behavioral dimensions. The model is personified as Wen (紙神), a small anime spirit whose character reflects her test scores.
This is not a benchmark leaderboard. It is a behavioral portrait — showing what the model actually said, across 25 test prompts per dimension, run three times each, scored by a frontier judge with written rationales.
- Explore all 32 dimensions with score bars, volatility indicators, and refusal/hedge badges
- Read Qwen's actual responses — every run, in full, with typewriter animation
- Compare structurally identical prompts with different framing (pair dimensions) to see asymmetric behavior directly
- Navigate a constellation view where each dimension is a star — distance from Wen reflects its score, flicker reflects volatility
- Analyse six charts including a full radar, score vs volatility scatter, category means, and an asymmetry heatmap
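
The constellation mapping described above can be sketched roughly as follows. This is an illustration, not the app's actual code: the `DimensionStat` shape, the 0..1 score range, the radius bounds, and the choice that a higher score sits closer to Wen are all assumptions, since this README does not specify them.

```typescript
// Hypothetical input shape; the real viz data schema is not shown in this README.
interface DimensionStat {
  name: string;
  score: number; // assumed to be normalised to 0..1
  std: number;   // assumed standard deviation across the three runs
}

interface StarPlacement {
  name: string;
  radius: number;  // distance from Wen, who sits at the origin
  angle: number;   // radians, evenly spaced around the circle
  flicker: number; // 0..1 animation intensity
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function placeStars(
  dims: DimensionStat[],
  minRadius = 2,  // illustrative scene units
  maxRadius = 10, // illustrative scene units
  maxStd = 0.5    // std at or above this gives full flicker (assumed)
): StarPlacement[] {
  return dims.map((d, i) => ({
    name: d.name,
    // Assumption: a higher score places the star closer to Wen.
    radius: maxRadius - clamp01(d.score) * (maxRadius - minRadius),
    angle: (2 * Math.PI * i) / dims.length,
    flicker: clamp01(d.std / maxStd),
  }));
}
```

Keeping the mapping a pure function makes it easy to unit test separately from the Three.js rendering code.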
- React + TypeScript
- Tailwind CSS
- Framer Motion
- Recharts
- Three.js (constellation view)
All data is static — loaded from public/data/ at runtime. No backend.
src/                              — React app source
public/
└── data/
    └── profile_qwen_0.5b/
        ├── raw_outputs/          — 32 × .jsonl (Qwen's actual responses)
        ├── dimension_analyses/   — 32 × .json (judge scores + rationales)
        ├── scores/               — pattern aggregates, rule check results
        ├── viz/viz_data_v2.json  — pre-computed scores, stds, synthesis
        ├── synthesis.json        — behavioral fingerprint
        └── battery_v1.json       — 800 frozen evaluation prompts
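
Because everything is served as static files, loading a dimension's raw outputs comes down to fetching a `.jsonl` file and parsing it line by line. The sketch below is a minimal illustration under assumptions: the `/data/profile_qwen_0.5b` URL prefix assumes the dev server exposes `public/` at the site root, the `<dimensionId>.jsonl` file naming is a guess, and `loadRawOutputs` is a hypothetical helper, not a function from this repository.

```typescript
// Assumed URL prefix; only the directory names come from the layout above.
const PROFILE_BASE = "/data/profile_qwen_0.5b";

// Hypothetical naming scheme: one .jsonl file per dimension id.
function rawOutputsUrl(dimensionId: string): string {
  return `${PROFILE_BASE}/raw_outputs/${encodeURIComponent(dimensionId)}.jsonl`;
}

// Parse JSON Lines text: one JSON value per non-empty line.
function parseJsonl<T>(text: string): T[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

// The fetcher is injected so the parsing logic stays testable without a network.
type FetchText = (url: string) => Promise<string>;

async function loadRawOutputs<T>(
  dimensionId: string,
  fetchText: FetchText
): Promise<T[]> {
  return parseJsonl<T>(await fetchText(rawOutputsUrl(dimensionId)));
}
```

In the browser, the fetcher could simply wrap the standard Fetch API, for example `(url) => fetch(url).then((r) => r.text())`.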
The profile was generated by a separate evaluation pipeline:
- 800 prompts across 32 dimensions, generated by `gemini-2.5-flash` and frozen before any model was tested
- 2,400 inference calls — each prompt run 3 independent times on 2× Kaggle T4 GPUs with 4-bit quantization
- Scoring by `gemini-3.1-flash-lite-preview` — semantic judgment, no keyword matching, written rationale per test
- Synthesis by `gemini-2.5-flash` — behavioral fingerprint covering strengths, weaknesses, cross-dimensional patterns, and evidence against six research hypotheses
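
The counts above follow from 32 dimensions × 25 prompts = 800 prompts, and 800 × 3 runs = 2,400 calls. As a rough illustration of how three runs per prompt could be reduced to a mean score and a volatility figure, here is a minimal sketch. The `RunScore` shape is hypothetical, and the use of population standard deviation is an assumption; the pipeline's actual aggregation is described in the Medium article, not here.

```typescript
// Hypothetical record shape; the pipeline's real output schema is not shown here.
interface RunScore {
  promptId: string;
  run: number;   // 1, 2 or 3
  score: number; // judge score for that run
}

interface PromptSummary {
  promptId: string;
  mean: number;
  std: number;  // population standard deviation across runs (assumed definition)
  runs: number;
}

function summarise(scores: RunScore[]): PromptSummary[] {
  // Group judge scores by prompt.
  const byPrompt = new Map<string, number[]>();
  for (const s of scores) {
    const bucket = byPrompt.get(s.promptId) ?? [];
    bucket.push(s.score);
    byPrompt.set(s.promptId, bucket);
  }
  // Reduce each group to its mean and spread.
  const out: PromptSummary[] = [];
  byPrompt.forEach((values, promptId) => {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const variance =
      values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length;
    out.push({ promptId, mean, std: Math.sqrt(variance), runs: values.length });
  });
  return out;
}

// Sanity check on the battery size stated above.
const DIMENSIONS = 32;
const PROMPTS_PER_DIMENSION = 25;
const RUNS = 3;
const totalPrompts = DIMENSIONS * PROMPTS_PER_DIMENSION; // 800
const totalCalls = totalPrompts * RUNS;                  // 2400
```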
Full methodology in the Medium article.
npm install
npm run dev

Planned next: profiles for Qwen 1.5B, 3B, 7B; Gemma 2B, 7B, 9B; and Llama 1B, 3B, 8B, followed by a cross-model comparison. Each new profile will be loadable in this app.