SimStudent

A misconception-aware simulated student for evaluating intelligent tutoring systems. Unlike standard BKT models that track binary mastery, SimStudent models how students learn (or fail to learn) depending on the quality of instruction they receive.

Built for the ed tutoring system evaluation pipeline.

[Figure: SOTA benchmark comparison]

What makes this different

Most simulated students treat learning as unconditional: give the student instruction, mastery goes up. Real students are not like that. If a tutor misidentifies a student's misconception and teaches the wrong thing, the student does not learn. They may actually get worse.

SimStudent implements three-branch conditional learning:

Architecture

| Instruction Quality | Effect on p_know | Effect on Misconceptions | Grounding |
|---|---|---|---|
| Correct targeting | +2.5x learning bonus | 50% resolution per session | BKT + targeted remediation |
| Wrong targeting | No change | +20% reinforcement, confusion accumulates | Interference theory (Anderson, 1983) |
| Generic (no targeting) | +0.2x minimal learning | 1% passive resolution | Baseline BKT |

This means the simulated student can serve as a discrimination instrument: a tutoring system with a good misconception classifier will produce measurably better outcomes than one with a bad classifier.
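The branch logic in the table above can be sketched as a single update function. This is an illustrative sketch only, assuming the multipliers from the table; the function name and constants are hypothetical and do not come from the package's actual code.

```python
# Illustrative sketch of the three-branch conditional learning update.
# The constants mirror the table above; names are hypothetical, not the
# package's actual implementation.
BASE_LEARN_RATE = 0.1

def apply_instruction(p_know, misconception_strength, targeting):
    """Update mastery and misconception strength for one instruction event.

    targeting: "correct", "wrong", or "generic" (no targeting).
    Returns (new_p_know, new_misconception_strength).
    """
    if targeting == "correct":
        # Correct targeting: 2.5x learning bonus, 50% resolution per session
        p_know += 2.5 * BASE_LEARN_RATE * (1 - p_know)
        misconception_strength *= 0.5
    elif targeting == "wrong":
        # Wrong targeting: no mastery gain, misconception reinforced by 20%
        misconception_strength = min(1.0, misconception_strength * 1.2)
    else:
        # Generic instruction: minimal learning, 1% passive resolution
        p_know += 0.2 * BASE_LEARN_RATE * (1 - p_know)
        misconception_strength *= 0.99
    return min(p_know, 1.0), misconception_strength
```

Because only the "correct" branch resolves misconceptions at a meaningful rate, two tutoring systems that differ only in classifier accuracy diverge quickly under this update.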

SOTA benchmark results

We benchmarked against every published metric from the simulated-student literature for which a meaningful comparison can be constructed. Full methodology and caveats are documented in experiments/sota_benchmarks/EXPERIMENT_NOTES.md.

| Benchmark | Our V3 | SOTA Reference | Source | Notes |
|---|---|---|---|---|
| Error Recurrence Rate | 52.9% | 86.2% | BEAGLE (Wang et al., 2026) | Different domain (Python vs algebra), stochastic vs deterministic |
| Accuracy Gap (High vs Low) | +43.6 pct pts | +40% | BEAGLE (Wang et al., 2026) | Exceeds BEAGLE on absolute performance differentiation |
| Learning Curve R^2 (p_know) | 0.999 | > 0.90 | Power Law of Practice | Near-perfect fit to established learning curves |
| Misconception Stability | 67.6% | ~variable | Scarlatos et al. (2026) | Stable misconceptions with stochastic activation (vs LLM chaos) |
| Response Prediction AUC | 0.641 | 0.63-0.72 | BKT literature | Directly in published BKT range |
| Sessions to Resolution | 5.0 (median) | 3-7 | Cognitive tutor literature | Dead center of established range |
| Instruction Sensitivity (d) | 2.15 | N/A | Our contribution | No published sim student measures this |
| Negative Transfer | Detected | Not modeled | Our contribution | No published sim student models this |

Novel contributions

Two metrics have no precedent in the simulated student literature:

  1. Instruction sensitivity (Cohen's d = 2.15 between perfect and always-wrong tutoring). The simulated student produces measurably different learning outcomes depending on instruction quality. This is the core property that makes it useful as an evaluation instrument.

  2. Negative transfer (switch-design experiment). Wrong instruction actively harms the student: misconceptions get reinforced, confusion accumulates, and learning rate degrades. This is grounded in interference theory but has never been implemented in a simulated student.
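The effect size reported for instruction sensitivity is a standard Cohen's d over per-student outcomes. A minimal, self-contained sketch of the computation (pure stdlib; the sample data below is made up for illustration, not from the experiments):

```python
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: standardized mean difference using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    ma, mb = statistics.fmean(group_a), statistics.fmean(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (ma - mb) / pooled_sd

# Hypothetical final p_know values under perfect vs always-wrong tutoring;
# the real experiment uses full cohorts of simulated students.
perfect = [0.82, 0.91, 0.77, 0.88, 0.85]
wrong = [0.31, 0.28, 0.40, 0.25, 0.35]
```

With well-separated groups like these, d comes out large, which is the point of the discrimination property: instruction quality should dominate within-condition noise.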

Data

Knowledge graph

20 algebra concepts across 7 prerequisite levels, with 56 documented misconceptions. Each misconception includes worked examples with expected wrong answers.

[Figure: Knowledge graph]

Source: Extended from the MaE Dataset (Otero et al., 2024) with standard algebra curriculum alignment. The misconceptions are drawn from documented student errors in the mathematics education literature.

Problem bank

120 problems (6 per concept: 2 easy, 2 medium, 2 hard) with correct answers. Schema:

```json
{
    "problem_id": "leq_1",
    "concept": "solving_linear_equations",
    "difficulty": "easy",
    "problem_text": "Solve: x + 5 = 12",
    "correct_answer": "x = 7"
}
```
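A record against this schema can be checked with a few lines. This validator is a sketch, not part of the package; the field and difficulty names are taken from the schema shown above, while the function name is hypothetical.

```python
# Minimal schema check for problem-bank records (illustrative, not packaged).
REQUIRED_FIELDS = {"problem_id", "concept", "difficulty",
                   "problem_text", "correct_answer"}
DIFFICULTIES = {"easy", "medium", "hard"}

def validate_problem(record: dict) -> list:
    """Return a list of schema violations; an empty list means valid."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if record.get("difficulty") not in DIFFICULTIES:
        errors.append(f"bad difficulty: {record.get('difficulty')!r}")
    if not all(isinstance(record.get(f), str)
               for f in REQUIRED_FIELDS & record.keys()):
        errors.append("all fields must be strings")
    return errors
```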

Student archetypes

Students are generated from 5 archetypes that reflect realistic classroom distributions:

| Archetype | Weight | Initial p_know | Misconceptions | Description |
|---|---|---|---|---|
| strong_overall | 15% | 0.55-0.80 | 1-2 | High performers with few gaps |
| strong_arith_weak_algebra | 25% | Varies by level | 3-6 | Good at arithmetic, weak at algebra |
| specific_gap | 20% | 0.40-0.70 | 2-5 | Generally capable with 2 concept gaps |
| weak_overall | 20% | 0.05-0.25 | 5-8 | Struggling across the board |
| random_mixed | 20% | 0.10-0.65 | 3-6 | Mixed abilities |
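Drawing students from this distribution amounts to seeded weighted sampling over the archetype labels. A minimal sketch using the weights from the table (the function name is illustrative; the package's `generate_students_v3` additionally fills in p_know ranges and misconception counts):

```python
import random

# Archetype weights from the table above.
ARCHETYPES = {
    "strong_overall": 0.15,
    "strong_arith_weak_algebra": 0.25,
    "specific_gap": 0.20,
    "weak_overall": 0.20,
    "random_mixed": 0.20,
}

def sample_archetypes(n, seed=42):
    """Draw n archetype labels according to the classroom distribution."""
    rng = random.Random(seed)  # seeded for reproducible cohorts
    names = list(ARCHETYPES)
    weights = [ARCHETYPES[a] for a in names]
    return rng.choices(names, weights=weights, k=n)
```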

Installation

```shell
git clone https://github.com/viktor1223/simulated-student.git
cd simulated-student
pip install -e .
```

Quick start

```python
from simstudent import (
    SimulatedStudentV3,
    KnowledgeGraph,
    StudentState,
    generate_students_v3,
    load_problem_bank_v2,
)

# Generate 100 students with varied profiles
students = generate_students_v3(n=100, seed=42)

# Load the problem bank
bank = load_problem_bank_v2()

# Pick a student and give them a problem
student = students[0]
problem = bank["solving_linear_equations"][0]

# Get response
response = student.respond(problem)
print(response)
# {'student_response': 'x = 17', 'correct': False, 'misconception_used': 'leq_reverse_operation'}

# Provide instruction targeting the detected misconception
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_reverse_operation",  # correct targeting
)

# Or provide wrong targeting (simulates a bad classifier)
student.receive_instruction(
    concept_id="solving_linear_equations",
    targeted_misconception="leq_divide_wrong_direction",  # wrong targeting
)
```

Running the experiments

Discrimination test

Validates that the student differentiates between perfect, random, wrong, and no instruction. All 6 criteria must pass.

```shell
python experiments/discrimination/run.py
```

[Figure: Discrimination test results]

SOTA benchmarks

Runs all 8 benchmarks against published literature. Generates comparison plots and detailed results JSON.

```shell
python experiments/sota_benchmarks/run.py
```

How comparisons are made

Transparency about methodology is critical. Each benchmark comparison has different characteristics:

| Comparison | Type | Caveats |
|---|---|---|
| vs BEAGLE (Wang 2026) | Analogous metrics, different domain | BEAGLE uses Gemini LLM on Python programming tasks. We use deterministic rules on algebra. Error recurrence is not directly comparable because BEAGLE's BKT+EFI architecture deterministically blocks correct solutions, while our BKT is stochastic. |
| vs BKT literature | Same underlying model | Our student IS a BKT model, so BKT prediction working well is expected. The value is that AUC falls in the realistic range (not too high, not too low). |
| vs Cognitive tutor lit | Operationally analogous | Sessions to resolution maps to "opportunities to mastery" in cognitive tutor studies. Not identical (we use idealized targeting; real systems have imperfect classifiers). |
| vs Scarlatos (2026) | Qualitative comparison | Scarlatos shows LLMs produce variable error types between runs. Our model produces consistent error types with stochastic activation probability. Not numerically comparable. |
| vs Power Law of Practice | Direct fit comparison | We fit both power-law and exponential models. The p_know curve (R^2 = 0.999) is the correct comparison metric. The accuracy curve (R^2 = 0.264) is confounded by adaptive concept selection. |
| Instruction sensitivity | Novel (no precedent) | No published simulated student measures this. Our d = 2.15 establishes a baseline. |
| Negative transfer | Novel (no precedent) | No published simulated student models this. Grounded in interference theory (Anderson, 1983; Baddeley, 1976). |

For detailed per-benchmark methodology, see experiments/sota_benchmarks/EXPERIMENT_NOTES.md.

Repository structure

simulated-student/
  simstudent/               Core Python package
    __init__.py
    student.py              SimulatedStudentV3 (conditional learning, negative transfer)
    knowledge_graph.py      KnowledgeGraph, BKT StudentState, next_action planner
  data/
    knowledge_graph.json    20 concepts, 56 misconceptions, 7 levels
    problem_bank.json       120 problems (6 per concept)
  experiments/
    discrimination/         Discrimination test (4 conditions, Cohen's d = 1.60)
    sota_benchmarks/        8 SOTA benchmarks with detailed methodology notes
  docs/
    literature_survey.md    12-system literature survey
    research_roadmap.md     What's needed for a full paper
    figures/                Generated figures for README
  tests/

Theoretical grounding

| Component | Theory | Citation |
|---|---|---|
| Mastery tracking | Bayesian Knowledge Tracing | Corbett & Anderson (1995) |
| Misconception persistence | Procedural bug theory | Brown & Burton (1978) |
| Negative transfer | Interference theory | Anderson (1983); Baddeley (1976) |
| Architecture principles | Knowledge gating, conditional learning | BEAGLE (Wang et al., 2026) |
| Misconception taxonomy | Algebraic error patterns | Extended from MaE Dataset (Otero et al., 2024) |
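For reference, the standard BKT update from Corbett & Anderson (1995) that underlies the mastery-tracking component looks like this. The parameter values are illustrative defaults, not the package's fitted parameters.

```python
def bkt_update(p_know, correct, p_guess=0.2, p_slip=0.1, p_transit=0.15):
    """One standard BKT update: Bayesian posterior on mastery given the
    observed response, followed by a learning transition.

    Parameter values are illustrative, not fitted.
    """
    if correct:
        # P(known | correct): mastery explains the correct answer unless slipped
        posterior = (p_know * (1 - p_slip)) / (
            p_know * (1 - p_slip) + (1 - p_know) * p_guess)
    else:
        # P(known | incorrect): a known skill only fails via a slip
        posterior = (p_know * p_slip) / (
            p_know * p_slip + (1 - p_know) * (1 - p_guess))
    # Learning step: unmastered probability mass transitions at rate p_transit
    return posterior + (1 - posterior) * p_transit
```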

Current limitations

This is a research prototype, not a validated instrument. Key limitations:

  • No real student validation. All benchmarks compare against published structural properties. The critical missing piece is a pilot study with real students to measure how closely our simulated behaviors match actual student traces.
  • Single domain. Currently limited to middle school algebra. Generalizability to other domains is untested.
  • No forgetting model. The current BKT implementation does not model knowledge decay over time gaps.
  • Idealized misconceptions. All misconceptions are pre-defined from the literature. Real students may exhibit novel error patterns not in our taxonomy.

See docs/research_roadmap.md for the path to a full publication.

License

MIT
