A misconception-aware simulated student for evaluating intelligent tutoring systems. Unlike standard BKT models that track binary mastery, SimStudent models how students learn (or fail to learn) depending on the quality of instruction they receive.
Built for the ed tutoring system evaluation pipeline.
Most simulated students treat learning as unconditional: give the student instruction, mastery goes up. Real students are not like that. If a tutor misidentifies a student's misconception and teaches the wrong thing, the student does not learn. They may actually get worse.
SimStudent implements three-branch conditional learning:
| Instruction Quality | Effect on p_know | Effect on Misconceptions | Grounding |
|---|---|---|---|
| Correct targeting | +2.5x learning bonus | 50% resolution per session | BKT + targeted remediation |
| Wrong targeting | No change | +20% reinforcement, confusion accumulates | Interference theory (Anderson, 1983) |
| Generic (no targeting) | +0.2x minimal learning | 1% passive resolution | Baseline BKT |
This means the simulated student can serve as a discrimination instrument: a tutoring system with a good misconception classifier will produce measurably better outcomes than one with a bad classifier.
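The three-branch rule above can be sketched as a single update step. This is a minimal illustration, not the package's actual implementation: the function name, signature, and the additive BKT-style update are assumptions; the multipliers (2.5x, 0.2x) and resolution rates (50%, 1%) come from the table.

```python
def apply_instruction(p_know, misconception_strength, base_learn_rate, targeting):
    """One instruction step under three-branch conditional learning (sketch).

    targeting: "correct" (classifier hit the student's real misconception),
               "wrong"   (classifier targeted a misconception the student lacks),
               "generic" (untargeted instruction).
    """
    if targeting == "correct":
        # 2.5x learning bonus plus 50% misconception resolution per session
        p_know = min(1.0, p_know + 2.5 * base_learn_rate * (1 - p_know))
        misconception_strength *= 0.5
    elif targeting == "wrong":
        # No mastery gain; the wrong misconception is reinforced by 20%
        misconception_strength = min(1.0, misconception_strength * 1.2)
    else:  # generic instruction
        # 0.2x minimal learning and 1% passive resolution
        p_know = min(1.0, p_know + 0.2 * base_learn_rate * (1 - p_know))
        misconception_strength *= 0.99
    return p_know, misconception_strength
```

Running the same student state through all three branches shows the discrimination property directly: correct targeting yields the largest mastery gain, generic a small one, and wrong targeting none at all while strengthening the misconception.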
We benchmarked against every published metric from the simulated student literature for which a meaningful comparison could be constructed. Full methodology and caveats are documented in experiments/sota_benchmarks/EXPERIMENT_NOTES.md.
| Benchmark | Our V3 | SOTA Reference | Source | Notes |
|---|---|---|---|---|
| Error Recurrence Rate | 52.9% | 86.2% | BEAGLE (Wang et al., 2026) | Different domain (Python vs algebra), stochastic vs deterministic |
| Accuracy Gap (High vs Low) | +43.6 pct pts | +40% | BEAGLE (Wang et al., 2026) | Exceeds BEAGLE on absolute performance differentiation |
| Learning Curve R^2 (p_know) | 0.999 | > 0.90 | Power Law of Practice | Near-perfect fit to established learning curves |
| Misconception Stability | 67.6% | ~variable | Scarlatos et al. (2026) | Stable misconceptions with stochastic activation (vs LLM chaos) |
| Response Prediction AUC | 0.641 | 0.63-0.72 | BKT literature | Directly in published BKT range |
| Sessions to Resolution | 5.0 (median) | 3-7 | Cognitive tutor literature | Dead center of established range |
| Instruction Sensitivity (d) | 2.15 | N/A | Our contribution | No published sim student measures this |
| Negative Transfer | Detected | Not modeled | Our contribution | No published sim student models this |
Two metrics have no precedent in the simulated student literature:
- Instruction sensitivity (Cohen's d = 2.15 between perfect and always-wrong tutoring). The simulated student produces measurably different learning outcomes depending on instruction quality. This is the core property that makes it useful as an evaluation instrument.
- Negative transfer (switch-design experiment). Wrong instruction actively harms the student: misconceptions get reinforced, confusion accumulates, and learning rate degrades. This is grounded in interference theory but has never been implemented in a simulated student.
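The instruction-sensitivity metric is a standard effect size. For reference, Cohen's d with pooled standard deviation can be computed as below; the groups shown are toy numbers, not the project's experimental data:

```python
from math import sqrt

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation (sample variances, n - 1)."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Final p_know under perfect vs always-wrong tutoring (illustrative values)
perfect = [0.82, 0.78, 0.90, 0.85, 0.80]
wrong = [0.35, 0.30, 0.42, 0.28, 0.38]
d = cohens_d(perfect, wrong)  # well above the d = 0.8 "large" threshold
```

A d of 2.15 means the two outcome distributions barely overlap, which is exactly what an evaluation instrument needs in order to separate good classifiers from bad ones.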
20 algebra concepts across 7 prerequisite levels, with 56 documented misconceptions. Each misconception includes worked examples with expected wrong answers.
Source: Extended from the MaE Dataset (Otero et al., 2024) with standard algebra curriculum alignment. The misconceptions are drawn from documented student errors in the mathematics education literature.
120 problems (6 per concept: 2 easy, 2 medium, 2 hard) with correct answers. Schema:
```json
{
  "problem_id": "leq_1",
  "concept": "solving_linear_equations",
  "difficulty": "easy",
  "problem_text": "Solve: x + 5 = 12",
  "correct_answer": "x = 7"
}
```

Students are generated from 5 archetypes that reflect realistic classroom distributions:
| Archetype | Weight | Initial p_know | Misconceptions | Description |
|---|---|---|---|---|
| strong_overall | 15% | 0.55-0.80 | 1-2 | High performers with few gaps |
| strong_arith_weak_algebra | 25% | Varies by level | 3-6 | Good at arithmetic, weak at algebra |
| specific_gap | 20% | 0.40-0.70 | 2-5 | Generally capable with 2 concept gaps |
| weak_overall | 20% | 0.05-0.25 | 5-8 | Struggling across the board |
| random_mixed | 20% | 0.10-0.65 | 3-6 | Mixed abilities |
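The weighted archetype draw can be sketched as follows. The archetype names and weights come from the table above; the function name and internals are illustrative assumptions, not the actual internals of generate_students_v3:

```python
import random

# Archetype weights from the classroom distribution table
ARCHETYPES = {
    "strong_overall": 0.15,
    "strong_arith_weak_algebra": 0.25,
    "specific_gap": 0.20,
    "weak_overall": 0.20,
    "random_mixed": 0.20,
}

def sample_archetypes(n, seed=42):
    """Draw n archetype labels according to the weighted distribution."""
    rng = random.Random(seed)  # seeded for reproducible cohorts
    names = list(ARCHETYPES)
    weights = list(ARCHETYPES.values())
    return rng.choices(names, weights=weights, k=n)

cohort = sample_archetypes(1000)
```

With a fixed seed, a cohort of 1,000 students lands close to the target mix (about 250 strong_arith_weak_algebra students), so repeated evaluation runs are comparable.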
```shell
git clone https://github.com/viktor1223/simulated-student.git
cd simulated-student
pip install -e .
```

```python
from simstudent import (
    SimulatedStudentV3,
    KnowledgeGraph,
    StudentState,
    generate_students_v3,
    load_problem_bank_v2,
)

# Generate 100 students with varied profiles
students = generate_students_v3(n=100, seed=42)

# Load the problem bank
bank = load_problem_bank_v2()

# Pick a student and give them a problem
student = students[0]
problem = bank["solving_linear_equations"][0]

# Get response
response = student.respond(problem)
print(response)
# {'student_response': 'x = 17', 'correct': False, 'misconception_used': 'leq_reverse_operation'}

# Provide instruction targeting the detected misconception
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_reverse_operation",  # correct targeting
)

# Or provide wrong targeting (simulates a bad classifier)
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_divide_wrong_direction",  # wrong targeting
)
```

The discrimination test validates that the student differentiates between perfect, random, wrong, and no instruction. All 6 criteria must pass.
```shell
python experiments/discrimination/run.py
```

The SOTA benchmark suite runs all 8 benchmarks against published literature and generates comparison plots and a detailed results JSON.
```shell
python experiments/sota_benchmarks/run.py
```

Transparency about methodology is critical. Each benchmark comparison has different characteristics:
| Comparison | Type | Caveats |
|---|---|---|
| vs BEAGLE (Wang 2026) | Analogous metrics, different domain | BEAGLE uses Gemini LLM on Python programming tasks. We use deterministic rules on algebra. Error recurrence is not directly comparable because BEAGLE's BKT+EFI architecture deterministically blocks correct solutions, while our BKT is stochastic. |
| vs BKT literature | Same underlying model | Our student IS a BKT model, so BKT prediction working well is expected. The value is that AUC falls in the realistic range (not too high, not too low). |
| vs Cognitive tutor lit | Operationally analogous | Sessions to resolution maps to "opportunities to mastery" in cognitive tutor studies. Not identical (we use idealized targeting; real systems have imperfect classifiers). |
| vs Scarlatos (2026) | Qualitative comparison | Scarlatos shows LLMs produce variable error types between runs. Our model produces consistent error types with stochastic activation probability. Not numerically comparable. |
| vs Power Law of Practice | Direct fit comparison | We fit both power-law and exponential models. The p_know curve (R^2 = 0.999) is the correct comparison metric. The accuracy curve (R^2 = 0.264) is confounded by adaptive concept selection. |
| Instruction sensitivity | Novel (no precedent) | No published simulated student measures this. Our d = 2.15 establishes a baseline. |
| Negative transfer | Novel (no precedent) | No published simulated student models this. Grounded in interference theory (Anderson, 1983; Baddeley, 1976). |
For detailed per-benchmark methodology, see experiments/sota_benchmarks/EXPERIMENT_NOTES.md.
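The Power Law of Practice comparison can be reproduced in miniature: fit y = a * t^b by ordinary least squares in log-log space and report R^2. This is a self-contained sketch on synthetic error-rate data, not the project's actual fitting code:

```python
from math import exp, log

def fit_power_law(ts, ys):
    """Least-squares fit of y = a * t**b in log-log space; returns (a, b, r2)."""
    xs = [log(t) for t in ts]
    lv = [log(y) for y in ys]
    n = len(xs)
    mx = sum(xs) / n
    ml = sum(lv) / n
    # Slope and intercept of the log-log regression line
    b = sum((x - mx) * (l - ml) for x, l in zip(xs, lv)) / sum((x - mx) ** 2 for x in xs)
    a = exp(ml - b * mx)
    # R^2 in the original (untransformed) space
    preds = [a * t ** b for t in ts]
    my = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Synthetic error rate falling with practice opportunities
trials = [1, 2, 3, 4, 5, 6, 7, 8]
errors = [0.50 * t ** -0.7 for t in trials]
a, b, r2 = fit_power_law(trials, errors)
```

On data generated exactly from a power law the fit is perfect; the note in the table above explains why the project's accuracy curve (R^2 = 0.264) does not match as cleanly while the p_know curve (R^2 = 0.999) does.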
```
simulated-student/
  simstudent/                # Core Python package
    __init__.py
    student.py               # SimulatedStudentV3 (conditional learning, negative transfer)
    knowledge_graph.py       # KnowledgeGraph, BKT StudentState, next_action planner
    data/
      knowledge_graph.json   # 20 concepts, 56 misconceptions, 7 levels
      problem_bank.json      # 120 problems (6 per concept)
  experiments/
    discrimination/          # Discrimination test (4 conditions, Cohen's d = 1.60)
    sota_benchmarks/         # 8 SOTA benchmarks with detailed methodology notes
  docs/
    literature_survey.md     # 12-system literature survey
    research_roadmap.md      # What's needed for a full paper
  figures/                   # Generated figures for README
  tests/
```
| Component | Theory | Citation |
|---|---|---|
| Mastery tracking | Bayesian Knowledge Tracing | Corbett & Anderson (1995) |
| Misconception persistence | Procedural bug theory | Brown & Burton (1978) |
| Negative transfer | Interference theory | Anderson (1983); Baddeley (1976) |
| Architecture principles | Knowledge gating, conditional learning | BEAGLE (Wang et al., 2026) |
| Misconception taxonomy | Algebraic error patterns | Extended from MaE Dataset (Otero et al., 2024) |
This is a research prototype, not a validated instrument. Key limitations:
- No real student validation. All benchmarks compare against published structural properties. The critical missing piece is a pilot study with real students to measure how closely our simulated behaviors match actual student traces.
- Single domain. Currently limited to middle school algebra. Generalizability to other domains is untested.
- No forgetting model. The current BKT implementation does not model knowledge decay over time gaps.
- Idealized misconceptions. All misconceptions are pre-defined from the literature. Real students may exhibit novel error patterns not in our taxonomy.
See docs/research_roadmap.md for the path to a full publication.
MIT



