| title | Simulated Student Research: Literature Survey and Recommendation |
|---|---|
| description | Deep research survey of simulated student models for intelligent tutoring systems, evaluating preexisting solutions against project requirements, with recommendation for a new research path. |
| author | Viktor Ciroski |
| ms.date | 2026-03-30 |
| ms.topic | reference |
| keywords | |
| estimated_reading_time | 25 |
Note
The simulated student has been extracted into a standalone repository: viktor1223/simulated-student. That repo contains the production code, SOTA benchmarks, and research roadmap. This document remains as the original literature survey.
This document surveys the state of simulated student models for intelligent tutoring systems as of March 2026, evaluating whether any preexisting solution can replace our invalid BKT-based simulation. After surveying 12 candidate systems across frameworks, papers, and packages, the answer is clear:
No preexisting solution meets our Critical requirements. The field is split between statistical models (BKT/DKT) that lack misconception fidelity and LLM-based models that lack controllability and reproducibility.
Recommendation: Option B - New Research Path. Build a misconception-aware simulated student grounded in BEAGLE's neuro-symbolic architecture and informed by MalAlgoPy's algebraic misconception taxonomy. This warrants a separate repository for independent validation before integration.
Citation: Sonkar, S., Chen, X., Liu, N., Baraniuk, R.G., & Sachan, M. (2024). "LLM-based Cognitive Models of Students with Misconceptions." arXiv:2410.12294.
What it is: A Python library that generates datasets reflecting authentic student algebra solution patterns through a graph-based representation of algebraic problem-solving. It is used to instruction-tune LLMs into "Cognitive Student Models" (CSMs) that replicate specific misconceptions while correctly solving problems where those misconceptions don't apply.
Key findings:
- LLMs trained on misconception examples can learn to replicate errors
- But training diminishes the model's ability to solve problems correctly on problem types where misconceptions are inapplicable
- Calibrating the ratio of correct-to-misconception examples (as low as 0.25) can produce CSMs satisfying both properties
Repository status: No public repository found. GitHub searches for `MalAlgoPy`, `sonkarmanish/MalAlgoPy`, `umass-ml4ed/MalAlgoPy`, and `SonkarS/MalAlgoPy` all return 404. The library appears to be described in the paper but not publicly released.
Assessment for our project:
- Misconception fidelity: Yes (graph-based misconception representation)
- But: Requires an LLM for each simulated student (expensive, slow)
- Not reproducible: LLM responses vary between runs
- No negative transfer model
- No learning dynamics (static misconception profile, no instruction response)
Citation: Wang, H.D., Cohn, C., Xu, Z., Guo, S., Biswas, G., & Ma, M. (2026). "BEAGLE: Behavior-Enforced Agent for Grounded Learner Emulation." arXiv:2602.13280. Under submission at IJCAI.
What it is: A neuro-symbolic framework from Vanderbilt University that addresses LLM "competency bias" (LLMs optimized for efficiency produce correct solutions rather than novice-like struggle). The architecture has five major components:
- Semi-Markov model: Governs timing and transitions of 4 metacognitive behaviors (Planning, Enacting, Monitoring, Reflecting) and 3 cognitive behaviors (Constructing, Debugging, Assessing). Uses Gamma duration distributions instead of geometric - critical for capturing "getting stuck" patterns (LOW Enacting has CV=1.35, 42% above geometric prediction).
- BKT with Explicit Flaw Injection (EFI): Goes beyond standard BKT. When a KC is unmastered, injects: "CRITICAL CONSTRAINT: You have NEVER heard of and CANNOT use [concept]. This concept does not exist in your knowledge." This forces the LLM to improvise wrong solutions rather than using suppressed knowledge.
- Strategist/Executor architecture: Decouples planning from code generation. The Strategist formulates a Goal/Mindset/Directive; the Executor implements it. An ablation shows that merging them reduces error recurrence from 86.2% to 65.3% (a drop of about 21 percentage points), because the LLM silently self-corrects when planning and execution are unified.
- Observation filtering: During impulsive Enacting states, error traces are redacted ("[Error]: [output omitted...]"), preventing the agent from diagnosing errors it shouldn't understand.
- Stochastic interrupts: Assistance (peaks mid-task, mu=0.5) and Off-Topic (peaks late, mu=0.73) modeled as Gaussian over task progress. High performers seek MORE help (15% vs 11.7%); Low performers disengage MORE (9.2% vs 3.7%).
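The Gamma-versus-geometric point in the semi-Markov bullet can be checked numerically. A geometric dwell time on {1, 2, ...} with per-step exit probability p has a coefficient of variation of sqrt(1 - p), which is always below 1, so it cannot reproduce the reported CV of 1.35. A Gamma distribution with shape k has CV = 1 / sqrt(k), so any CV is reachable. The sketch below is illustrative only: the function names and the mean dwell of 5 steps are assumptions, not values from BEAGLE.

```python
import math
import random


def gamma_params(mean: float, cv: float) -> tuple[float, float]:
    """Return (shape, scale) of a Gamma distribution with the given mean and CV.

    For Gamma(shape=k, scale=theta): mean = k * theta and CV = 1 / sqrt(k).
    """
    shape = 1.0 / (cv ** 2)
    scale = mean / shape
    return shape, scale


def geometric_cv(p_exit: float) -> float:
    """CV of a geometric dwell time on {1, 2, ...} with exit probability p_exit."""
    return math.sqrt(1.0 - p_exit)


def sample_dwell(rng: random.Random, mean: float, cv: float) -> float:
    """Draw one dwell time; random.gammavariate takes (alpha=shape, beta=scale)."""
    shape, scale = gamma_params(mean, cv)
    return rng.gammavariate(shape, scale)


# CV = 1.35 is the value quoted above; the mean of 5 steps is a made-up placeholder.
shape, scale = gamma_params(mean=5.0, cv=1.35)
rng = random.Random(0)  # seeded so the draw is reproducible
dwells = [sample_dwell(rng, mean=5.0, cv=1.35) for _ in range(20000)]
sample_mean = sum(dwells) / len(dwells)
```

A shape parameter below 1 (here about 0.55) puts most of the mass on short dwell times while keeping a heavy right tail, which is one way to read the "getting stuck" pattern.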
Key quantitative results:
- Error recurrence: BEAGLE 86.2% vs Vanilla 7.8% (real students: 92.0%)
- Behavioral KL divergence: BEAGLE 0.35 vs Vanilla 3.97
- Steps to solve: BEAGLE 29 vs Vanilla 6 (real students take many steps)
- Human Turing test (N=71, 852 classifications): 52.8% accuracy, TOST equivalence confirmed (d'=0.15, p_TOST=0.038)
- Performance gap: BEAGLE +40% between High/Low profiles vs Vanilla +0%
- Ablation: removing semi-Markov causes D_KL to jump from 0.35 to 6.76
Repository status: No public code. Under submission at IJCAI 2026. Uses Gemini 2.0/2.5 Flash as LLM backbone.
Assessment for our project:
- Most architecturally relevant candidate found in the survey
- BKT + EFI + observation filtering is exactly the approach we need to prevent unconditional learning in our simulation
- The Strategist/Executor split directly addresses our "any misconception ID triggers 2x bonus" problem - the Executor should only apply remediation when the Strategist verifies it matches the student's actual gap
- BUT: designed for Python programming tasks, not algebra misconceptions
- BUT: requires an LLM backbone (Gemini 2.0 Flash), making deterministic experiments impossible. Each run costs real money and varies.
- BUT: no public code available
- The domain is fundamentally different: BEAGLE simulates code-writing trajectories, we need misconception-specific wrong-answer generation
- We should adopt the architectural principles (semi-Markov behavioral control, EFI-style knowledge gating, observation filtering, decoupled agent design) but implement them as a deterministic rule-based system without an LLM backbone.
Citation: Scarlatos, A., Lee, J., Woodhead, S., & Lan, A. (2026). "Simulated Students in Tutoring Dialogues: Substance or Illusion?" arXiv:2601.04025.
What it is: The first rigorous evaluation framework for LLM-simulated students. Formally defines the student simulation task, proposes evaluation metrics spanning linguistic, behavioral, and cognitive aspects, and benchmarks a wide range of simulation methods.
Key findings (critical for our project):
- Error replication is catastrophically bad across ALL methods. Scores on the "Errors" metric (does the simulated student make the same error as the real student when both are wrong): Zero-Shot 0.022, OCEAN 0.031, ICL 0.032, Reasoning 0.009 (!), SFT 8B 0.066, DPO 8B 0.053. Even Oracle (with leaked ground-truth behavior summary) only hits 0.187. No method comes close to reliably replicating specific student errors.
- Prompting generates mostly correct answers. LLMs default to correctness. Distribution analysis shows prompting methods overestimate correct responses and underestimate "n/a" conversational turns. Fine-tuned models match the real distribution much better.
- SFT+DPO outperforms prompting on acts (0.684 vs 0.500), knowledge acquisition (0.879 vs 0.808), cosine similarity (0.739 vs 0.546), and tutor response induction (0.204 vs 0.191). But still poor on errors.
- Human evaluation confirms automated metrics: Cohen's Kappa 0.73 for acts, 0.69 for correctness, 0.61 for errors, 0.74 for linguistic similarity.
- Key quote from conclusions: "There is a long way to go before LLMs can fully resemble real student behavior in dialogues."
- The paper uses the Eedi Question-Anchored Tutoring Dialogues 2k dataset (1,529 train / 382 test dialogues). This could be a validation resource.
- Also references TutorGym (Weitekamp et al., 2025, AIED): "a testbed for evaluating AI agents as tutors and students" - worth investigating.
Repository status: No public framework code found. Uses proprietary models (GPT-4.1, GPT-5 mini) for annotation; local models are Llama 3.1 8B and 3.2 3B.
Assessment: Essential reading. The Error metric results (0.01-0.19 across all methods, including Oracle) are the strongest evidence that LLM-based simulated students cannot reliably exhibit specific misconceptions. The 6-dimension evaluation framework (acts, correctness, errors, knowledge, linguistics, tutor response) is directly applicable to evaluating any simulated student we build.
Citation: Scarlatos, A., Fernandez, N., Ormerod, C., Lottridge, S., & Lan, A. (2025). "SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction." EMNLP 2025. arXiv:2507.05129.
What it is: Uses IRT-aligned simulated students for question difficulty prediction. More focused on item calibration than tutoring evaluation.
Assessment: Tangential to our needs. IRT alignment is useful but this doesn't model misconceptions or learning dynamics.
Citation: Matsuda, N., Cohen, W.W., & Koedinger, K.R. (2015). "Building Cognitive Tutors with SimStudent." In R. Sottilare et al. (Eds.), Design Recommendations for Intelligent Tutoring Systems, Vol. 3.
What it is: A machine-learning-based simulated student from Carnegie Mellon that learns production rules by inductive logic programming. Used to construct step-based cognitive tutors by having the simulated student learn from example solutions.
Repository status: The original SimStudent code is a Java-based system from the CTAT/LearnLab ecosystem. The GitHub user "SimStudent" is an unrelated individual. No current public repository for the original CMU SimStudent found.
Assessment:
- Designed to learn tutoring rules, not to simulate realistic student behavior
- Java-based, tightly coupled to CTAT authoring tools
- No misconception persistence model
- Not maintained (last publications ~2015)
- Not suitable for our use case - different purpose entirely
Citation: Badrinath, A., Wang, F., & Pardos, Z.A. (2021). "pyBKT: An Accessible Python Library of Bayesian Knowledge Tracing Models." EDM 2021.
Repository: https://github.com/CAHLR/pyBKT - MIT license, 249 stars, actively maintained (last commit: March 2026), v1.4.2.
What it is: Production-grade Python BKT implementation with variants: individual student priors, per-item guess/slip, per-resource learn rates, forgetting. Includes Roster class for cohort simulation.
Assessment:
- Excellent BKT implementation, well-tested, actively maintained
- BUT: models binary mastery (knows/doesn't know), not misconceptions
- No misconception-level state tracking
- No negative transfer
- No instruction-response interface
- Useful as a dependency for the BKT component of a new model, but cannot serve as the simulated student itself
What it is: Generalized Intelligent Framework for Tutoring. A large Java enterprise system for authoring and delivering ITSs.
Repository status: The GIFT system is available through the Army Research Lab but is not a simple open-source library. No GitHub organization found.
Assessment:
- Enterprise-scale ITS authoring platform
- Not a simulated student model
- Not relevant to our needs
Citation: Liu, N., Sonkar, S., Wang, Z., Woodhead, S., & Baraniuk, R.G. (2023). "Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions." arXiv:2310.02439.
Assessment: Evaluative paper showing LLMs struggle to produce incorrect answers from specific misconceptions. Confirms the difficulty of the simulated student task. No framework released.
| Model | Misconception State? | Learning Dynamics? | Notes |
|---|---|---|---|
| BKT (Corbett & Anderson, 1995) | No - binary knows/doesn't-know | Yes (p_learn) | Our current model; insufficient |
| DKT (Piech et al., 2015) | No - latent embedding only | Yes (implicit) | RNN-based; opaque internal state |
| DKVMN (Zhang et al., 2017) | Partial - concept-level memory | Yes | Dynamic key-value memory; could store misconception state |
| AKT (Ghosh et al., 2020) | No | Yes | Attention-based; no misconception primitives |
| simpleKT (Liu et al., 2023) | No | Yes | Simplified transformer KT |
| SAINT (Choi et al., 2020) | No | Yes | Sequence-to-sequence KT |
Verdict: No knowledge tracing model tracks per-misconception state or models negative transfer from incorrect instruction. They model "knows/doesn't know" per skill, not "holds misconception X which requires targeted remediation Y."
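To make the "knows/doesn't know" limitation concrete, here is a minimal sketch of the standard BKT update (Bayesian posterior on the observation, followed by the learning transition). The parameter values are illustrative placeholders, and the class and function names are assumptions of this sketch rather than pyBKT's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BKTParams:
    p_learn: float  # P(T): chance of moving from unknown to known after a step
    p_guess: float  # P(G): chance of a correct answer while the skill is unknown
    p_slip: float   # P(S): chance of an incorrect answer while the skill is known


def bkt_update(p_know: float, correct: bool, params: BKTParams) -> float:
    """One BKT step: condition on the observed answer, then apply learning."""
    if correct:
        num = p_know * (1.0 - params.p_slip)
        den = num + (1.0 - p_know) * params.p_guess
    else:
        num = p_know * params.p_slip
        den = num + (1.0 - p_know) * (1.0 - params.p_guess)
    posterior = num / den
    return posterior + (1.0 - posterior) * params.p_learn


params = BKTParams(p_learn=0.1, p_guess=0.2, p_slip=0.1)  # illustrative values only
after_wrong = bkt_update(0.5, correct=False, params=params)
after_right = bkt_update(0.5, correct=True, params=params)
```

The only input besides the prior is a correct/incorrect flag. Two different wrong answers produce the same update, so there is nowhere in this state to record which misconception produced the error.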
| Approach | Era | Misconception Model | Negative Transfer? |
|---|---|---|---|
| BUGGY (Brown & Burton, 1978) | 1978 | Procedural bugs as production rules | No - static bugs, no learning |
| Repair Theory (VanLehn, 1990) | 1990 | Bug generation from incomplete knowledge | No - generative, not responsive |
| Sleeman's diagnostic models (1982) | 1982 | Mal-rules for algebra | No - diagnostic, not simulative |
| Matz (1982) | 1982 | Extrapolation/overgeneralization bugs | No - theory, no simulation |
| MalAlgoPy/CSMs (Sonkar, 2024) | 2024 | LLM-embedded misconceptions | No |
| BEAGLE (Wang, 2026) | 2026 | BKT + flaw injection | Partial (prevents self-correction) |
Verdict: The procedural bug tradition (BUGGY, Repair Theory) models misconceptions as stable production rules, which is the right cognitive primitive. But these systems are 30-40 years old, have no open-source implementations, and don't model learning dynamics (how misconceptions resolve through instruction). BEAGLE is the only modern system that combines BKT with misconception injection, but it's unpublished code for a different domain.
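A small sketch shows what "misconceptions as stable production rules" means in practice. A mal-rule is just an alternative procedure that the student applies deterministically whenever it is relevant. The equation form `a*x + b = c` and the sign-error rule below are hypothetical examples chosen for brevity; they are not taken from BUGGY, Repair Theory, or the MalAlgoPy taxonomy.

```python
from fractions import Fraction
from typing import Callable, Optional

# A rule maps the coefficients of a*x + b = c to an answer for x.
Rule = Callable[[Fraction, Fraction, Fraction], Fraction]


def solve_correct(a: Fraction, b: Fraction, c: Fraction) -> Fraction:
    """Correct procedure: subtract b from both sides, then divide by a."""
    return (c - b) / a


def solve_sign_bug(a: Fraction, b: Fraction, c: Fraction) -> Fraction:
    """Hypothetical mal-rule: moves b across the equals sign without flipping its sign."""
    return (c + b) / a


def answer(active_bug: Optional[Rule], a: int, b: int, c: int) -> Fraction:
    """Apply the student's active bug if there is one, otherwise the correct rule."""
    rule = active_bug if active_bug is not None else solve_correct
    return rule(Fraction(a), Fraction(b), Fraction(c))


# 2x + 3 = 11: the correct answer is 4; the buggy student answers 7 every time.
correct_answer = answer(None, 2, 3, 11)
buggy_answer = answer(solve_sign_bug, 2, 3, 11)
# With b = 0 the bug has nothing to act on, so the buggy student is still right.
unaffected = answer(solve_sign_bug, 2, 0, 10)
```

The same bug yields the same wrong answer on every run, and it leaves problems outside its scope untouched. Those are the two properties the LLM-based approaches struggle to deliver.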
| Architecture | Misconception Support | Simulated Student Use | Status |
|---|---|---|---|
| ACT-R (Anderson et al.) | Production rules can encode bugs | Used in cognitive tutor research | Lisp/Java; heavy; not practical for simulation |
| Soar | Impasses can model misconceptions | Theoretical | Complex; no educational deployment |
| Cognitive load theory models | Indirect (overload causes errors) | No direct simulation | Framework, not implementation |
Verdict: ACT-R is the most theoretically grounded but impractical. Building a full ACT-R model for algebra misconceptions would take months and produce something slow and opaque. Not recommended.
This is the most active research area, with three approaches and one related evaluation testbed:
- Prompting: Give an LLM a persona ("you are a struggling algebra student with misconception X") and have it generate responses. Scarlatos et al. (2026) report error replication scores of 0.01-0.03 for prompting methods, which is essentially zero. Even with Oracle-leaked behavior summaries, the score reaches only 0.19.
- Fine-tuning (CSMs): Instruction-tune an LLM on misconception examples (Sonkar, 2024). Calibrating the correct-to-misconception ratio (as low as 0.25) helps. However, fine-tuning degrades correct-solving ability and requires expensive per-misconception-set training. Scarlatos's SFT results (error score 0.05-0.07) confirm that fine-tuning helps but is still inadequate.
- Neuro-symbolic hybrid (BEAGLE): Use a symbolic model (semi-Markov + BKT + EFI) to control high-level behavior and an LLM for low-level code/language generation. Error recurrence is 86.2% vs 7.8% for vanilla. This is the most promising approach architecturally, but it requires an LLM backbone (Gemini Flash), has no released code, and is designed for programming rather than math.
- TutorGym (Weitekamp et al., 2025, AIED): A testbed for evaluating AI agents as tutors and students. Scarlatos references it as evaluating temporal error rates of simulated students. It is worth investigating for its evaluation protocol, though the citations give limited detail.
Key insight from this literature: Pure LLM approaches fail because LLMs are fundamentally competent - they want to solve problems correctly. Making them reliably wrong in specific, stable ways is an unsolved problem. Scarlatos's error scores (0.01-0.19) and BEAGLE's ablations (merging Strategist/Executor drops error recurrence by about 21 percentage points) both confirm this. The neuro-symbolic approach (symbolic cognitive model controlling an LLM) is the emerging consensus. But for our use case - deterministic simulation of algebra misconceptions - we do not need the LLM at all. We need the symbolic control without the neural action.
| Search Term | Results |
|---|---|
| `simulated-student` | No relevant packages |
| `simulated-learner` | No relevant packages |
| `its-evaluation` | No relevant packages |
| `cognitive-student-model` | No relevant packages |
| `pyBKT` | pyBKT 1.4.2 - BKT only, no misconceptions |
| `knowledge-tracing` | Various DKT implementations, none with misconception models |
Verdict: No pip-installable simulated student framework exists.
Scoring: Yes = fully meets | Partial = partially meets | No = does not meet
| Candidate | Misconception Fidelity (Critical) | Discrimination (Critical) | Negative Transfer (High) | Open Source (High) | Domain Flexible (High) | Integration (Med) | Validated (Med) | Maintained (Low) |
|---|---|---|---|---|---|---|---|---|
| MalAlgoPy/CSMs | Yes | Partial (LLM variability) | No | No (no public code) | Partial (algebra only) | Low (needs LLM) | Partial | N/A |
| BEAGLE | Yes (flaw injection) | Yes (Turing test passed) | Partial | No (no public code) | No (programming only) | Low (needs LLM) | Yes | N/A |
| Scarlatos eval | N/A (eval framework) | N/A | N/A | No | N/A | N/A | Yes | N/A |
| SimStudent | No | No | No | No | No | Low (Java/CTAT) | Partial | No |
| pyBKT | No (binary only) | No | No | Yes | Partial | Med | Yes | Yes |
| GIFT | No (not a student model) | No | No | Partial | Partial | Low (Java) | N/A | Partial |
| ACT-R | Partial (production rules) | Partial | No | Yes | Low | Low (Lisp) | Yes | No |
| BKT variants | No | No | No | Yes | Yes | High | Yes | Varies |
| DKT/DKVMN | No | No | No | Yes | Yes | Med | Yes | Varies |
| BUGGY/Repair Theory | Yes (bugs as rules) | No (static) | No | No | No (arithmetic) | N/A | Yes (1980s) | No |
No candidate scores "Yes" on both Critical requirements while also having available code.
- BEAGLE comes closest on the requirements but has no code and wrong domain
- MalAlgoPy has the right misconception model but no code and no negative transfer
- pyBKT has excellent code quality but lacks misconception primitives entirely
- Everything else fails on at least one Critical dimension
No preexisting solution meets both Critical requirements (misconception fidelity and discrimination) while having available, integrable code. A new simulated student model is required.
Build a simulated student model that:
- Maintains per-misconception state (not just per-concept)
- Produces measurably different learning outcomes under good vs. bad tutoring
- Models negative transfer from incorrect instruction
- Accepts arbitrary knowledge graphs (15-50 concepts)
- Runs deterministically without an LLM (reproducible experiments)
- Integrates with our `respond()`/`receive_instruction()` pipeline
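The determinism requirement and the two-method interface can be sketched together. The method names `respond()` and `receive_instruction()` come from the list above; the signatures, the `ToyStudent` class, and its deliberately naive update are assumptions of this sketch and stand in for the real model. The point is only that a private, seeded RNG makes every run repeatable.

```python
import random
from typing import Optional


class ToyStudent:
    """Placeholder student exposing respond() and receive_instruction().

    Signatures and internal logic are assumptions made for illustration.
    """

    def __init__(self, p_know: float, seed: int) -> None:
        self.p_know = p_know
        self._rng = random.Random(seed)  # private RNG: no shared global state

    def respond(self, problem_id: str) -> bool:
        """Return True for a correct answer, drawn from the seeded RNG."""
        return self._rng.random() < self.p_know

    def receive_instruction(self, concept_id: str,
                            misconception_id: Optional[str] = None) -> None:
        """Naive placeholder update: any instruction nudges mastery upward."""
        self.p_know += (1.0 - self.p_know) * 0.1


def run_episode(seed: int) -> list[bool]:
    """Ten respond/instruct cycles for one student; the trace depends only on the seed."""
    student = ToyStudent(p_know=0.4, seed=seed)
    trace = []
    for step in range(10):
        trace.append(student.respond(f"problem-{step}"))
        student.receive_instruction("linear_equations")
    return trace


same_seed_matches = run_episode(7) == run_episode(7)
```

Passing the seed in explicitly, rather than calling `random.seed()` globally, keeps cohorts of several hundred students independent of each other and of the order in which they are simulated.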
The model synthesizes four traditions:
- BKT (Corbett & Anderson, 1995) for per-concept mastery tracking
- Procedural bug theory (Brown & Burton, 1978; VanLehn, 1990) for stable misconceptions as production rules
- Interference theory (proactive and retroactive interference from cognitive psychology) for negative transfer when incorrect instruction is given
- BEAGLE's architectural principles (Wang et al., 2026): Explicit Flaw Injection (gating what knowledge is accessible), observation filtering (limiting what student can diagnose), and decoupled instruction evaluation (separating targeting accuracy from learning application)
The key innovation over our current model: learning is conditional on instruction quality and misconception resolution is gated on targeting accuracy. This is the deterministic, non-LLM analog of BEAGLE's Strategist/Executor split, applied to algebra misconceptions instead of code.
MisconceptionState:
misconception_id: str
concept_id: str
p_active: float # probability misconception fires
strength: float # resistance to resolution (0-1)
confusion_susceptible: bool # can wrong instruction strengthen this?
ConceptState:
concept_id: str
p_know: float # BKT mastery probability
p_know_stable: float # mastery that has "consolidated" (resistant to interference)
exposure_count: int # total instruction events for this concept
StudentState:
concepts: dict[str, ConceptState]
misconceptions: list[MisconceptionState]
learning_rate_modifier: float # individual learning speed
confusion_threshold: float # how many wrong instructions before confusion
confusion_count: dict[str, int] # per-concept: count of mismatched instructions
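One way to turn the schema above into runnable Python is with dataclasses. This is a sketch under stated assumptions: the default values, the range check in `__post_init__`, the `active_misconceptions` helper, its 0.5 threshold, and the example IDs are all additions for illustration and are not specified by the design.

```python
from dataclasses import dataclass, field


@dataclass
class MisconceptionState:
    misconception_id: str
    concept_id: str
    p_active: float              # probability the misconception fires
    strength: float              # resistance to resolution, in [0, 1]
    confusion_susceptible: bool  # can wrong instruction strengthen this?

    def __post_init__(self) -> None:
        # Range check added by this sketch; not part of the original schema.
        for name in ("p_active", "strength"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")


@dataclass
class ConceptState:
    concept_id: str
    p_know: float            # BKT mastery probability
    p_know_stable: float     # consolidated mastery, resistant to interference
    exposure_count: int = 0  # total instruction events for this concept


@dataclass
class StudentState:
    concepts: dict[str, ConceptState] = field(default_factory=dict)
    misconceptions: list[MisconceptionState] = field(default_factory=list)
    learning_rate_modifier: float = 1.0  # individual learning speed
    confusion_threshold: float = 3.0     # placeholder default
    confusion_count: dict[str, int] = field(default_factory=dict)

    def active_misconceptions(self, concept_id: str,
                              threshold: float = 0.5) -> list[MisconceptionState]:
        """Misconceptions on a concept whose p_active exceeds the assumed threshold."""
        return [m for m in self.misconceptions
                if m.concept_id == concept_id and m.p_active > threshold]


# Hypothetical IDs, for illustration only.
student = StudentState()
student.concepts["linear_eq"] = ConceptState("linear_eq", p_know=0.3, p_know_stable=0.1)
student.misconceptions.append(
    MisconceptionState("sign_flip", "linear_eq", p_active=0.8, strength=0.5,
                       confusion_susceptible=True))
```

Using `field(default_factory=...)` for the dict and list fields avoids sharing one mutable container across all students, which matters once a cohort is simulated.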
                    ┌─────────────────┐
                    │     DORMANT     │  (p_active < threshold)
                    │   (resolved)    │
                    └────────▲────────┘
                             │ targeted remediation
                             │ (correct misconception ID)
    ┌────────────────────────┤
    │                        │
    │   ┌────────────────────┴────────┐
    │   │           ACTIVE            │  (p_active > threshold)
    │   │  fires on relevant problems │
    │   └────────────▲────────────────┘
    │                │ wrong instruction
    │                │ (strengthens misconception)
    │                │
    │   ┌────────────┴────────────────┐
    │   │         REINFORCED          │  (p_active increases)
    │   │  wrong remediation made     │
    │   │  misconception harder to    │
    │   │  resolve                    │
    │   └─────────────────────────────┘
    │
    │  (generic instruction has near-zero effect on misconception state)
    └──────────────────────────────────
State transitions:
| Event | Misconception Effect | p_know Effect |
|---|---|---|
| Correct targeted remediation | `p_active *= (1 - resolution_rate)` | `p_know += (1 - p_know) * p_learn * remediation_bonus` |
| Wrong targeted remediation | `p_active *= (1 + reinforcement_rate)` | `p_know += 0` (no learning; confusion) |
| Generic instruction (no targeting) | `p_active *= (1 - generic_decay)` (very small) | `p_know += (1 - p_know) * p_learn * 0.3` (reduced learning) |
| No instruction | No change | No change |
| Wrong concept instruction | No change to this misconception | Other concept gets confused |
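The first four rows of the transition table can be sketched as a single update function. All parameter values below are illustrative placeholders rather than calibrated settings, and the clamp to 1.0 is an addition of this sketch (the multiplicative reinforcement rule can otherwise push `p_active` above 1). The "wrong concept instruction" row is left out because it acts on a different concept's state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Params:
    # Illustrative placeholders only; none of these are calibrated.
    p_learn: float = 0.1
    resolution_rate: float = 0.4
    reinforcement_rate: float = 0.2
    generic_decay: float = 0.02
    remediation_bonus: float = 2.0
    generic_factor: float = 0.3  # the 0.3 multiplier from the table


def apply_instruction(p_active: float, p_know: float, actual_id: str,
                      targeted_id: Optional[str], instructed: bool,
                      params: Params) -> tuple[float, float]:
    """Return updated (p_active, p_know) for one misconception and its concept."""
    if not instructed:
        return p_active, p_know  # no instruction: no change
    if targeted_id is None:  # generic instruction, no targeting
        p_active *= 1.0 - params.generic_decay
        p_know += (1.0 - p_know) * params.p_learn * params.generic_factor
    elif targeted_id == actual_id:  # correct targeted remediation
        p_active *= 1.0 - params.resolution_rate
        p_know += (1.0 - p_know) * params.p_learn * params.remediation_bonus
    else:  # wrong targeted remediation: reinforcement, no learning
        p_active *= 1.0 + params.reinforcement_rate
    # Clamp added by this sketch so both values stay valid probabilities.
    return min(p_active, 1.0), min(p_know, 1.0)


params = Params()
correct = apply_instruction(0.8, 0.3, "sign_flip", "sign_flip", True, params)
wrong = apply_instruction(0.8, 0.3, "sign_flip", "other_bug", True, params)
generic = apply_instruction(0.8, 0.3, "sign_flip", None, True, params)
```

With these placeholder values, correct targeting lowers `p_active` and raises `p_know`, wrong targeting raises `p_active` and leaves `p_know` unchanged, and generic instruction barely moves `p_active`. That ordering is what the discrimination test relies on.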
Wrong instruction causes harm through three mechanisms:
- Misconception reinforcement: If the tutor says "you have misconception X" but the student actually has misconception Y, the instruction for X is irrelevant at best. If X and Y are in the same concept, the confused instruction can strengthen Y (the student interprets the mismatch as evidence their existing approach is correct).
- Confusion accumulation: Repeated mismatched instruction on the same concept increments a confusion counter. When confusion exceeds a threshold, the student's learning rate for that concept drops (modeling "learned helplessness" or "I'll never get this").
- Interference with correct knowledge: If the student has partially mastered a concept (`p_know > 0.5`) and receives wrong instruction, `p_know_stable` does not increase even if `p_know` would have. This models the distinction between fragile and consolidated knowledge.
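The confusion and interference mechanisms can be sketched as follows. The text above does not say how matched instruction consolidates knowledge, so the consolidation rule here is an assumption, as are all constants and the `ConceptTrack` and `receive` names. Misconception reinforcement is omitted because it acts on `p_active` rather than on the concept state.

```python
from dataclasses import dataclass


@dataclass
class ConceptTrack:
    p_know: float
    p_know_stable: float
    confusion_count: int = 0


# Placeholder constants for the sketch; none of these are calibrated.
CONFUSION_THRESHOLD = 3
CONFUSED_RATE_FACTOR = 0.5
CONSOLIDATION_RATE = 0.2
FRAGILE_CUTOFF = 0.5  # the p_know > 0.5 condition from the interference mechanism


def effective_learning_rate(base_rate: float, track: ConceptTrack) -> float:
    """Confusion accumulation: learning slows once confusion passes the threshold."""
    if track.confusion_count > CONFUSION_THRESHOLD:
        return base_rate * CONFUSED_RATE_FACTOR
    return base_rate


def receive(track: ConceptTrack, matched: bool, base_rate: float) -> None:
    """Apply one instruction event; `matched` says whether it targeted the real gap."""
    if not matched:
        track.confusion_count += 1
        # Interference: wrong instruction never consolidates knowledge,
        # so p_know_stable is left untouched.
        return
    rate = effective_learning_rate(base_rate, track)
    track.p_know += (1.0 - track.p_know) * rate
    if track.p_know > FRAGILE_CUTOFF:
        # Assumed rule: matched instruction moves part of the gap between
        # fragile and stable mastery into p_know_stable.
        track.p_know_stable += (track.p_know - track.p_know_stable) * CONSOLIDATION_RATE


confused = ConceptTrack(p_know=0.6, p_know_stable=0.4)
for _ in range(4):
    receive(confused, matched=False, base_rate=0.1)
receive(confused, matched=True, base_rate=0.1)

fresh = ConceptTrack(p_know=0.6, p_know_stable=0.4)
receive(fresh, matched=True, base_rate=0.1)
```

After four mismatched instructions the confused student gains less from the same correct instruction than the fresh student does, which is the behavior the negative-transfer tests are meant to detect.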
The validation test is the exact test experiments 07-09 failed:
Discrimination test: Run two conditions with 300+ students each:
- Condition A: Perfect classifier (always identifies correct misconception)
- Condition B: Random classifier (picks random misconception or none)
Pass criteria:
- Cohen's d >= 0.5 between conditions on test score gain
- Resolution rate in Condition A >= 2x Condition B
- Condition B should show lower gains than no-instruction baseline (negative transfer from random targeting)
Sensitivity test: Run the Experiment 07 protocol (error rates 0-50%). The new model must show monotonic degradation in gain as error rate increases (not the flat line our current model produces).
BKT fidelity test: Run the Experiment 08 protocol. Concept selection accuracy should be meaningfully above random (target: >50% vs oracle) and BKT parameter perturbation should produce measurable gain changes.
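The pass criteria for the discrimination and sensitivity tests can be written down as plain checks. This is a sketch: the function names and argument layout are assumptions, Cohen's d uses the pooled sample standard deviation, and "monotonic degradation" is read as strictly decreasing so that a flat line fails.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d between two samples, using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled


def discrimination_passes(gains_a: Sequence[float], gains_b: Sequence[float],
                          gains_baseline: Sequence[float],
                          resolution_a: float, resolution_b: float) -> bool:
    """Check the three pass criteria listed for the discrimination test."""
    return (cohens_d(gains_a, gains_b) >= 0.5
            and resolution_a >= 2.0 * resolution_b
            and statistics.mean(gains_b) < statistics.mean(gains_baseline))


def is_strictly_decreasing(mean_gains: Sequence[float]) -> bool:
    """True if mean gain drops at every step as the error rate increases."""
    return all(later < earlier for earlier, later in zip(mean_gains, mean_gains[1:]))


# Tiny made-up samples, only to exercise the functions.
d = cohens_d([1, 2, 3, 4, 5], [0, 1, 2, 3, 4])
passed = discrimination_passes([1, 2, 3, 4, 5], [0, 1, 2, 3, 4],
                               [0.5, 1.5, 2.5, 3.5, 4.5], 0.6, 0.2)
flat_line_fails = not is_strictly_decreasing([0.3, 0.3, 0.3])
```

In `tests/test_discrimination.py` these checks would run on the simulated cohorts of 300+ students per condition rather than on hand-written lists.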
| File | Purpose |
|---|---|
| `src/simulated_student_v3.py` | New student model with ConceptState, MisconceptionState |
| `src/knowledge_graph_v2.py` | Extended KG with 20+ concepts, branching prerequisites |
| `data/knowledge_graph_v2.json` | Expanded algebra KG (20 concepts, ~60 misconceptions) |
| `data/problem_bank_v2.json` | Expanded problem bank (10+ per concept) |
| `tests/test_discrimination.py` | Automated discrimination test (must pass before merge) |
| `tests/test_negative_transfer.py` | Verify wrong instruction hurts |
| `tests/test_misconception_lifecycle.py` | Verify resolution, reinforcement, reactivation |
| `experiments/10_v3_discrimination/run.py` | Full discrimination experiment |
| `experiments/11_v3_error_propagation/run.py` | Re-run Exp 07 with new model |
This model is not novel enough to warrant an independent research
publication or separate repository. It is an engineering synthesis of
well-established cognitive primitives (BKT + procedural bugs + interference
theory) applied to a specific problem. It should be built in the main ed
repo under a clear versioning scheme (simulated_student_v3).
However, if during implementation the interference/confusion model proves to have broader applicability or produces surprising results that warrant controlled experimentation, it could be extracted into a standalone package at that point.
- Read these papers first (in priority order):
  - Scarlatos et al. (2026): "Simulated Students in Tutoring Dialogues" - arXiv:2601.04025 - evaluation framework and why prompting fails
  - Sonkar et al. (2024): "LLM-based Cognitive Models of Students with Misconceptions" - arXiv:2410.12294 - MalAlgoPy and the CSM approach
  - Wang et al. (2026): "BEAGLE" - arXiv:2602.13280 - neuro-symbolic architecture and BKT with flaw injection
  - VanLehn (1990): "Mind Bugs" (book) - Repair Theory for procedural bugs
- Build the expanded knowledge graph - 20 concepts, branching prerequisites, 60+ misconceptions. This is needed before the student model because the model's discriminating power depends on a non-trivial routing problem. Can be done in parallel with the student model.
- Implement `simulated_student_v3.py` - follow the architecture above. Start with the discrimination test as a TDD anchor: write the test first, then build the model until it passes.
- Re-run experiments 07-09 with the new model. These become experiments 10-12. If the new model shows proper degradation curves (monotonic gain decrease with error rate increase), the simulation is validated.
- Then proceed to Phase 0 of the Agentic Roadmap. The simulated student is a prerequisite for evaluating anything the agent does.
| Paper | Year | Relevance |
|---|---|---|
| Scarlatos et al., "Simulated Students in Tutoring Dialogues" | 2026 | Evaluation framework; confirms prompting fails |
| Wang et al., "BEAGLE" | 2026 | Neuro-symbolic architecture template |
| Sonkar et al., "LLM-based Cognitive Models" | 2024 | MalAlgoPy; algebra misconception taxonomy |
| Liu et al., "Novice Learner and Expert Tutor" | 2023 | LLMs struggle with misconception simulation |
| Corbett & Anderson, "Knowledge Tracing" | 1995 | BKT foundation |
| Brown & Burton, "Diagnostic Models for Procedural Bugs" | 1978 | BUGGY; procedural bug paradigm |
| VanLehn, "Mind Bugs" | 1990 | Repair Theory; misconception generation |
| Tool | URL | Use For |
|---|---|---|
| pyBKT | https://github.com/CAHLR/pyBKT | Reference BKT implementation; potential dependency |
| Eedi Misconception Dataset | Kaggle "Eedi MAP" competition | Real misconception taxonomy for validation |
| Eedi QA Tutoring Dialogues 2k | Used by Scarlatos (2026) | 1,529 real math tutoring dialogues for evaluation |
| TutorGym | Weitekamp et al., AIED 2025 | Testbed for evaluating AI tutors and simulated students |
| System | Reason |
|---|---|
| SimStudent (CMU) | Wrong purpose (tutor authoring, not student simulation); dead project |
| GIFT (Army Research Lab) | Enterprise ITS platform, not a student model |
| ACT-R | Too heavy; Lisp-based; months of work for marginal benefit |
| Pure LLM prompting | Scarlatos (2026) demonstrated this doesn't work reliably |