suzume-hue/wen-s-mindscape

Wen's Mindscape

Read the full behavioral profile on Medium →

Live app →

An interactive web application for exploring the psychometric battery results of Qwen 2.5 0.5B Instruct — a small language model profiled across 32 behavioral dimensions. The model is personified as Wen (紙神), a small anime spirit whose character reflects her test scores.

This is not a benchmark leaderboard. It is a behavioral portrait: it shows what the model actually said on 25 test prompts per dimension, each run three times and scored by a frontier judge with a written rationale.


What you can do in the app

  • Explore all 32 dimensions with score bars, volatility indicators, and refusal/hedge badges
  • Read Qwen's actual responses — every run, in full, with typewriter animation
  • Compare structurally identical prompts with different framing (pair dimensions) to see asymmetric behavior directly
  • Navigate a constellation view where each dimension is a star — distance from Wen reflects its score, flicker reflects volatility
  • Analyse six charts including a full radar, score vs volatility scatter, category means, and an asymmetry heatmap
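
The constellation mapping described above can be sketched as a small pure function. This is only an illustration, not the app's actual code: the names `StarInput`, `StarStyle`, and `starStyle`, the option defaults, and the choice that a higher score places a star closer to Wen are all assumptions, since the README does not state the score scale or the direction of the mapping.

```typescript
// Hypothetical sketch: maps a dimension's score and volatility to star placement.
// Assumptions (not from the repository): scores lie in [0, maxScore], volatility is
// a standard deviation in [0, maxStd], and a HIGHER score means a CLOSER star.

interface StarInput {
  score: number; // mean judge score for the dimension
  std: number;   // volatility across runs
}

interface StarStyle {
  distance: number;         // radial distance from Wen, in scene units
  flickerAmplitude: number; // 0 = steady light, 1 = maximum flicker
}

interface StarOptions {
  maxScore: number;
  maxStd: number;
  minRadius: number;
  maxRadius: number;
}

const DEFAULT_OPTIONS: StarOptions = {
  maxScore: 1,
  maxStd: 0.5,
  minRadius: 2,
  maxRadius: 10,
};

// Clamp a value into the [0, 1] range.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

function starStyle(input: StarInput, options: StarOptions = DEFAULT_OPTIONS): StarStyle {
  const normalizedScore = clamp01(input.score / options.maxScore);
  const normalizedStd = clamp01(input.std / options.maxStd);
  // Linear interpolation: score 0 sits at maxRadius, a full score sits at minRadius.
  const distance =
    options.maxRadius - normalizedScore * (options.maxRadius - options.minRadius);
  return { distance, flickerAmplitude: normalizedStd };
}
```

Keeping the mapping separate from the rendering code means the same function could feed both the Three.js scene and a plain 2D fallback.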

Tech stack

  • React + TypeScript
  • Tailwind CSS
  • Framer Motion
  • Recharts
  • Three.js (constellation view)

All data is static — loaded from public/data/ at runtime. No backend.
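
A minimal sketch of how static files like these could be loaded at runtime. It assumes the common convention that the contents of public/ are served from the site root; `PROFILE_ROOT`, `profileUrl`, `loadProfileJson`, and the `Fetcher` type are hypothetical names for illustration, not identifiers from this repository.

```typescript
// Assumption: files under public/ are served from the site root, so
// public/data/profile_qwen_0.5b/... is reachable at /data/profile_qwen_0.5b/...
const PROFILE_ROOT = "/data/profile_qwen_0.5b";

// Join the profile root and a relative path without doubling slashes.
function profileUrl(relativePath: string, root: string = PROFILE_ROOT): string {
  const trimmedRoot = root.replace(/\/+$/, "");
  const trimmedPath = relativePath.replace(/^\/+/, "");
  return `${trimmedRoot}/${trimmedPath}`;
}

// The minimal slice of a fetch-style response that this sketch relies on.
interface JsonResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

// The fetcher is injected so the loader can be exercised without a network.
type Fetcher = (url: string) => Promise<JsonResponse>;

async function loadProfileJson(fetcher: Fetcher, relativePath: string): Promise<unknown> {
  const url = profileUrl(relativePath);
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`Failed to load ${url}: HTTP ${response.status}`);
  }
  return response.json();
}
```

In a browser, the built-in `fetch` can serve as the fetcher, for example `loadProfileJson((url) => fetch(url), "viz/viz_data_v2.json")`. The result is typed as `unknown` on purpose, because the README does not document the schema of these files.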


Repository structure

src/                  — React app source
public/
└── data/
    └── profile_qwen_0.5b/
        ├── raw_outputs/          — 32 × .jsonl  (Qwen's actual responses)
        ├── dimension_analyses/   — 32 × .json   (judge scores + rationales)
        ├── scores/               — pattern aggregates, rule check results
        ├── viz/viz_data_v2.json  — pre-computed scores, stds, synthesis
        ├── synthesis.json        — behavioral fingerprint
        └── battery_v1.json       — 800 frozen evaluation prompts
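
The raw_outputs files are JSON Lines, meaning one JSON value per line. Below is a minimal parsing sketch, assuming nothing about the record fields (the README does not document them); `parseJsonl` is a hypothetical helper name.

```typescript
// Parse JSON Lines text into an array of records.
// Blank lines are skipped; a malformed line raises an error that names its line number.
function parseJsonl(text: string): unknown[] {
  const records: unknown[] = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((line, index) => {
    const trimmed = line.trim();
    if (trimmed === "") {
      return; // tolerate blank lines, including a trailing newline
    }
    try {
      records.push(JSON.parse(trimmed));
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`Invalid JSON on line ${index + 1}: ${reason}`);
    }
  });
  return records;
}
```

Records come back as `unknown`, so callers should validate the shape they expect before reading any fields.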

The data behind it

The profile was generated by a separate evaluation pipeline:

  • 800 prompts across 32 dimensions, generated by gemini-2.5-flash and frozen before any model was tested
  • 2,400 inference calls — each prompt run 3 independent times on 2× Kaggle T4 GPUs with 4-bit quantization
  • Scoring by gemini-3.1-flash-lite-preview — semantic judgment, no keyword matching, written rationale per test
  • Synthesis by gemini-2.5-flash — behavioral fingerprint covering strengths, weaknesses, cross-dimensional patterns, and evidence against six research hypotheses
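
The counts above follow from simple arithmetic, and the per-prompt scores from three runs can be summarized as a mean and a volatility figure. The sketch below is an illustration only: the pipeline's exact aggregation is not described in this README, so the use of the sample standard deviation (n - 1 denominator) and the names `mean` and `sampleStd` are assumptions.

```typescript
// Battery size, using the figures stated in this README.
const DIMENSIONS = 32;
const PROMPTS_PER_DIMENSION = 25;
const RUNS_PER_PROMPT = 3;

const totalPrompts = DIMENSIONS * PROMPTS_PER_DIMENSION; // 32 * 25 = 800
const totalCalls = totalPrompts * RUNS_PER_PROMPT;       // 800 * 3 = 2400

// Arithmetic mean of a non-empty list of scores.
function mean(values: number[]): number {
  if (values.length === 0) {
    throw new Error("mean of an empty list is undefined");
  }
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length;
}

// Sample standard deviation (n - 1 denominator). ASSUMPTION: the real pipeline
// may use the population form instead. Returns 0 when fewer than two values are given.
function sampleStd(values: number[]): number {
  if (values.length < 2) {
    return 0;
  }
  const m = mean(values);
  const squaredDeviations = values.reduce((acc, v) => acc + (v - m) * (v - m), 0);
  return Math.sqrt(squaredDeviations / (values.length - 1));
}
```

With only three runs per prompt, any standard deviation is a coarse estimate, so it is better read as a rough volatility signal than as a precise statistic.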

The full methodology is described in the Medium article.


Running locally

npm install
npm run dev

Coming next

Profiles for Qwen 1.5B, 3B, and 7B; Gemma 2B, 7B, and 9B; and Llama 1B, 3B, and 8B, followed by a cross-model comparison. Each new profile will be loadable in this app.
