A misconception-aware simulated student for evaluating intelligent tutoring systems. Unlike standard BKT models that track binary mastery, SimStudent models how students learn (or fail to learn) depending on the quality of instruction they receive.
Built for the ed tutoring system evaluation pipeline.
Most simulated students treat learning as unconditional: give the student instruction, mastery goes up. Real students are not like that. If a tutor misidentifies a student's misconception and teaches the wrong thing, the student does not learn. They may actually get worse.
SimStudent implements three-branch conditional learning:
| Instruction Quality | Effect on p_know | Effect on Misconceptions | Grounding |
|---|---|---|---|
| Correct targeting | +2.5x learning bonus | 50% resolution per session | BKT + targeted remediation |
| Wrong targeting | No change | +20% reinforcement, confusion accumulates | Interference theory (Anderson, 1983) |
| Generic (no targeting) | +0.2x minimal learning | 1% passive resolution | Baseline BKT |
This means the simulated student can serve as a discrimination instrument: a tutoring system with a good misconception classifier will produce measurably better outcomes than one with a bad classifier.
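The three-branch rule above can be sketched as a single update step. This is a minimal illustration, not the package's actual implementation: the function name, signature, and the additive BKT-style update are assumptions; the multipliers (2.5x, 0.2x) and resolution rates (50%, 1%) come from the table.

```python
def apply_instruction(p_know, misconception_strength, base_learn_rate, targeting):
    """One instruction step under three-branch conditional learning (sketch).

    targeting: "correct" (classifier hit the student's real misconception),
               "wrong"   (classifier targeted a misconception the student lacks),
               "generic" (untargeted instruction).
    """
    if targeting == "correct":
        # 2.5x learning bonus plus 50% misconception resolution per session
        p_know = min(1.0, p_know + 2.5 * base_learn_rate * (1 - p_know))
        misconception_strength *= 0.5
    elif targeting == "wrong":
        # No mastery gain; the wrong misconception is reinforced by 20%
        misconception_strength = min(1.0, misconception_strength * 1.2)
    else:  # generic instruction
        # 0.2x minimal learning and 1% passive resolution
        p_know = min(1.0, p_know + 0.2 * base_learn_rate * (1 - p_know))
        misconception_strength *= 0.99
    return p_know, misconception_strength
```

Running the same student state through all three branches shows the discrimination property directly: correct targeting yields the largest mastery gain, generic a small one, and wrong targeting none at all while strengthening the misconception.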
We benchmarked against every published metric from the simulated student literature for which a meaningful comparison could be constructed. Full methodology and caveats are documented in experiments/sota_benchmarks/EXPERIMENT_NOTES.md.
| Benchmark | Our V3 | SOTA Reference | Source | Notes |
|---|---|---|---|---|
| Error Recurrence Rate | 52.9% | 86.2% | BEAGLE (Wang et al., 2026) | Different domain (Python vs algebra), stochastic vs deterministic |
| Accuracy Gap (High vs Low) | +43.6 pct pts | +40% | BEAGLE (Wang et al., 2026) | Exceeds BEAGLE on absolute performance differentiation |
| Learning Curve R^2 (p_know) | 0.999 | > 0.90 | Power Law of Practice | Near-perfect fit to established learning curves |
| Misconception Stability | 67.6% | ~variable | Scarlatos et al. (2026) | Stable misconceptions with stochastic activation (vs LLM chaos) |
| Response Prediction AUC | 0.641 | 0.63-0.72 | BKT literature | Directly in published BKT range |
| Sessions to Resolution | 5.0 (median) | 3-7 | Cognitive tutor literature | Dead center of established range |
| Instruction Sensitivity (d) | 2.15 | N/A | Our contribution | No published sim student measures this |
| Negative Transfer | Detected | Not modeled | Our contribution | No published sim student models this |
Two metrics have no precedent in the simulated student literature:
- Instruction sensitivity (Cohen's d = 2.15 between perfect and always-wrong tutoring). The simulated student produces measurably different learning outcomes depending on instruction quality. This is the core property that makes it useful as an evaluation instrument.
- Negative transfer (switch-design experiment). Wrong instruction actively harms the student: misconceptions get reinforced, confusion accumulates, and learning rate degrades. This is grounded in interference theory but has never been implemented in a simulated student.
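The instruction-sensitivity metric is a standard effect size. For reference, Cohen's d with pooled standard deviation can be computed as below; the groups shown are toy numbers, not the project's experimental data:

```python
from math import sqrt

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (sample variances, n - 1)."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Final p_know under perfect vs always-wrong tutoring (illustrative values)
perfect = [0.82, 0.78, 0.90, 0.85, 0.80]
wrong = [0.35, 0.30, 0.42, 0.28, 0.38]
d = cohens_d(perfect, wrong)  # well above the d = 0.8 "large" threshold
```

A d of 2.15 means the two outcome distributions barely overlap, which is exactly what an evaluation instrument needs in order to separate good classifiers from bad ones.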
20 algebra concepts across 7 prerequisite levels, with 56 documented misconceptions. Each misconception includes worked examples with expected wrong answers.
Source: Extended from the MaE Dataset (Otero et al., 2024) with standard algebra curriculum alignment. The misconceptions are drawn from documented student errors in the mathematics education literature.
120 problems (6 per concept: 2 easy, 2 medium, 2 hard) with correct answers. Schema:
```json
{
  "problem_id": "leq_1",
  "concept": "solving_linear_equations",
  "difficulty": "easy",
  "problem_text": "Solve: x + 5 = 12",
  "correct_answer": "x = 7"
}
```

Students are generated from 5 archetypes that reflect realistic classroom distributions:
| Archetype | Weight | Initial p_know | Misconceptions | Description |
|---|---|---|---|---|
| strong_overall | 15% | 0.55-0.80 | 1-2 | High performers with few gaps |
| strong_arith_weak_algebra | 25% | Varies by level | 3-6 | Good at arithmetic, weak at algebra |
| specific_gap | 20% | 0.40-0.70 | 2-5 | Generally capable with 2 concept gaps |
| weak_overall | 20% | 0.05-0.25 | 5-8 | Struggling across the board |
| random_mixed | 20% | 0.10-0.65 | 3-6 | Mixed abilities |
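The weighted archetype draw can be sketched as follows. The archetype names and weights come from the table above; the function name and internals are illustrative assumptions, not the actual internals of generate_students_v3:

```python
import random

# Archetype weights from the classroom distribution table
ARCHETYPES = {
    "strong_overall": 0.15,
    "strong_arith_weak_algebra": 0.25,
    "specific_gap": 0.20,
    "weak_overall": 0.20,
    "random_mixed": 0.20,
}

def sample_archetypes(n, seed=42):
    """Draw n archetype labels according to the weighted distribution."""
    rng = random.Random(seed)  # seeded for reproducible cohorts
    names = list(ARCHETYPES)
    weights = list(ARCHETYPES.values())
    return rng.choices(names, weights=weights, k=n)

cohort = sample_archetypes(1000)
```

With a fixed seed, a cohort of 1,000 students lands close to the target mix (about 250 strong_arith_weak_algebra students), so repeated evaluation runs are comparable.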
```shell
git clone https://github.com/viktor1223/simulated-student.git
cd simulated-student
pip install -e .
```

```python
from simstudent import (
    SimulatedStudentV3,
    KnowledgeGraph,
    StudentState,
    generate_students_v3,
    load_problem_bank_v2,
)

# Generate 100 students with varied profiles
students = generate_students_v3(n=100, seed=42)

# Load the problem bank
bank = load_problem_bank_v2()

# Pick a student and give them a problem
student = students[0]
problem = bank["solving_linear_equations"][0]

# Get response
response = student.respond(problem)
print(response)
# {'student_response': 'x = 17', 'correct': False, 'misconception_used': 'leq_reverse_operation'}

# Provide instruction targeting the detected misconception
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_reverse_operation",  # correct targeting
)

# Or provide wrong targeting (simulates a bad classifier)
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_divide_wrong_direction",  # wrong targeting
)
```

The discrimination test validates that the student differentiates between perfect, random, wrong, and no instruction. All 6 criteria must pass.
```shell
python experiments/discrimination/run.py
```

The SOTA benchmark suite runs all 8 benchmarks against published literature and generates comparison plots and a detailed results JSON.
```shell
python experiments/sota_benchmarks/run.py
```

Transparency about methodology is critical. Each benchmark comparison has different characteristics:
| Comparison | Type | Caveats |
|---|---|---|
| vs BEAGLE (Wang 2026) | Analogous metrics, different domain | BEAGLE uses Gemini LLM on Python programming tasks. We use deterministic rules on algebra. Error recurrence is not directly comparable because BEAGLE's BKT+EFI architecture deterministically blocks correct solutions, while our BKT is stochastic. |
| vs BKT literature | Same underlying model | Our student IS a BKT model, so BKT prediction working well is expected. The value is that AUC falls in the realistic range (not too high, not too low). |
| vs Cognitive tutor lit | Operationally analogous | Sessions to resolution maps to "opportunities to mastery" in cognitive tutor studies. Not identical (we use idealized targeting; real systems have imperfect classifiers). |
| vs Scarlatos (2026) | Qualitative comparison | Scarlatos shows LLMs produce variable error types between runs. Our model produces consistent error types with stochastic activation probability. Not numerically comparable. |
| vs Power Law of Practice | Direct fit comparison | We fit both power-law and exponential models. The p_know curve (R^2 = 0.999) is the correct comparison metric. The accuracy curve (R^2 = 0.264) is confounded by adaptive concept selection. |
| Instruction sensitivity | Novel (no precedent) | No published simulated student measures this. Our d = 2.15 establishes a baseline. |
| Negative transfer | Novel (no precedent) | No published simulated student models this. Grounded in interference theory (Anderson, 1983; Baddeley, 1976). |
For detailed per-benchmark methodology, see experiments/sota_benchmarks/EXPERIMENT_NOTES.md.
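The Power Law of Practice comparison can be reproduced in miniature: fit y = a * t^b by ordinary least squares in log-log space and report R^2. This is a self-contained sketch on synthetic error-rate data, not the project's actual fitting code:

```python
from math import exp, log

def fit_power_law(ts, ys):
    """Least-squares fit of y = a * t**b in log-log space; returns (a, b, r2)."""
    xs = [log(t) for t in ts]
    lv = [log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    ml = sum(lv) / n
    # Slope and intercept of the log-log regression line
    b = sum((x - mx) * (l - ml) for x, l in zip(xs, lv)) / sum((x - mx) ** 2 for x in xs)
    a = exp(ml - b * mx)
    # R^2 in the original (untransformed) space
    preds = [a * t ** b for t in ts]
    my = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Synthetic error rate falling with practice opportunities
trials = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [0.50 * t ** -0.7 for t in trials]
a, b, r2 = fit_power_law(trials, errors)
```

On data generated exactly from a power law the fit is perfect; the note in the table above explains why the project's accuracy curve (R^2 = 0.264) does not match as cleanly while the p_know curve (R^2 = 0.999) does.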
```
simulated-student/
  simstudent/                # Core Python package
    __init__.py
    student.py               # SimulatedStudentV3 (conditional learning, negative transfer)
    knowledge_graph.py       # KnowledgeGraph, BKT StudentState, next_action planner
    data/
      knowledge_graph.json   # 20 concepts, 56 misconceptions, 7 levels
      problem_bank.json      # 120 problems (6 per concept)
  experiments/
    discrimination/          # Discrimination test (4 conditions, Cohen's d = 1.60)
    sota_benchmarks/         # 8 SOTA benchmarks with detailed methodology notes
  docs/
    literature_survey.md     # 12-system literature survey
    research_roadmap.md      # What's needed for a full paper
  figures/                   # Generated figures for README
  tests/
```
| Component | Theory | Citation |
|---|---|---|
| Mastery tracking | Bayesian Knowledge Tracing | Corbett & Anderson (1995) |
| Misconception persistence | Procedural bug theory | Brown & Burton (1978) |
| Negative transfer | Interference theory | Anderson (1983); Baddeley (1976) |
| Architecture principles | Knowledge gating, conditional learning | BEAGLE (Wang et al., 2026) |
| Misconception taxonomy | Algebraic error patterns | Extended from MaE Dataset (Otero et al., 2024) |
This is a research prototype, not a validated instrument. Key limitations:
- No real student validation. All benchmarks compare against published structural properties. The critical missing piece is a pilot study with real students to measure how closely our simulated behaviors match actual student traces.
- Single domain. Currently limited to middle school algebra. Generalizability to other domains is untested.
- No forgetting model. The current BKT implementation does not model knowledge decay over time gaps.
- Idealized misconceptions. All misconceptions are pre-defined from the literature. Real students may exhibit novel error patterns not in our taxonomy.
See docs/research_roadmap.md for the path to a full publication.
MIT



