---
title: "Agentic Diagnostic System: Technical Design Document"
description: >-
  Architecture specification for a domain-agnostic agentic learning platform that
  transforms misconception detection into adaptive, closed-loop intervention
  intelligence. Written for handoff to implementation teams.
author: Viktor Ciroski
ms.date: 2026-03-29
ms.topic: concept
keywords:
estimated_reading_time: 45
---
The current system detects student misconceptions with 91.1% accuracy across 19 types and displays them on a teacher dashboard. Every intervention is a hardcoded one-liner. Every problem recommendation selects the three easiest items. The teacher sees the problem but receives no personalized guidance on how to resolve it.
The gap is detection without prescription.
This document specifies the architecture for evolving the static diagnostic display into a closed-loop agentic system that recommends interventions, learns which approaches work for which students, adaptively sequences problems, and coaches teachers with actionable intelligence. The architecture is designed to be domain-agnostic: the core reasoning engine is the same whether the subject is algebra, reading comprehension, or music theory. Only the domain configuration (knowledge graph, misconception taxonomy, intervention catalog, problem bank, and classifier) changes per subject.
Important
Every design decision in this document obeys one inviolable constraint: the teacher remains in the loop. The system recommends; the teacher decides. No auto-assigning, no auto-messaging parents, no decisions without teacher review.
The system separates into four architectural layers: a domain-agnostic agentic core, a per-subject domain configuration, an event-sourced data store, and institution-specific frontends.
graph TB
subgraph Frontend["Frontend (Institution-Specific)"]
TD[Teacher Dashboard]
SW[Student Workbench]
end
subgraph API["API Gateway (FastAPI)"]
AUTH[Auth Middleware]
REST[REST Endpoints]
end
subgraph Core["Agentic Core (Domain-Agnostic)"]
direction TB
DE[Diagnostic Engine]
IM[Intervention Manager]
LP[Learning Profiler]
PS[Problem Sequencer]
CA[Coaching Agent]
PA[Pattern Analyzer]
end
subgraph Domain["Domain Configuration (Per-Subject)"]
KG[Knowledge Graph]
MT[Misconception Taxonomy]
IC[Intervention Catalog]
PB[Problem Bank + IRT Params]
CL[Classifier Plugin]
end
subgraph Data["Event Store"]
EL[Event Log - Append Only]
MV[Materialized Views]
IL[Intervention Log]
end
TD --> REST
SW --> REST
REST --> AUTH
AUTH --> DE
AUTH --> IM
AUTH --> LP
AUTH --> PS
AUTH --> CA
AUTH --> PA
DE --> KG
DE --> MT
DE --> CL
IM --> IC
IM --> IL
PS --> PB
PS --> KG
CA --> LP
CA --> PA
PA --> EL
DE --> EL
IM --> EL
LP --> EL
EL --> MV
Each agentic core component is a Python module with a defined interface. The domain configuration is a set of JSON files and an optional classifier model, loaded at startup and hot-swappable per tenant. The event store is the single source of truth; all mutable state (mastery levels, escalation states, learning profiles) is materialized from the append-only event log.
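To make the startup contract concrete, here is a minimal sketch of what per-tenant configuration loading could look like. The `DomainConfig` dataclass, the `load_domain_config` helper, and the caching strategy are illustrative assumptions, not part of the specified interface; the file names match the domain configuration artifacts listed later in this document.

```python
# Minimal sketch of per-tenant domain-config loading. DomainConfig and
# load_domain_config are assumed names, not an existing API.
from __future__ import annotations

import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DomainConfig:
    knowledge_graph: dict
    taxonomy: dict
    interventions: dict
    problem_bank: list[dict]


_CONFIG_CACHE: dict[str, DomainConfig] = {}


def load_domain_config(domain_dir: str) -> DomainConfig:
    """Load (and cache) one tenant's domain configuration from JSON files."""
    if domain_dir not in _CONFIG_CACHE:
        base = Path(domain_dir)
        _CONFIG_CACHE[domain_dir] = DomainConfig(
            knowledge_graph=json.loads((base / "knowledge_graph.json").read_text()),
            taxonomy=json.loads((base / "taxonomy.json").read_text()),
            interventions=json.loads((base / "interventions.json").read_text()),
            problem_bank=json.loads((base / "problem_bank.json").read_text()),
        )
    return _CONFIG_CACHE[domain_dir]
```

Hot-swapping a tenant's configuration then amounts to invalidating that tenant's cache entry and reloading.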
These constraints apply to every component in every phase. Violating any of them is a blocking issue in code review.
- The teacher remains in the loop. The system recommends; the teacher decides. No auto-assigning, no auto-messaging parents, no decisions without teacher review.
- No student-facing AI. Students interact with problems and a text box. Intelligence surfaces through the teacher, never directly to the student.
- Privacy-first, research-enabled. All personally identifiable information (names, emails, school identifiers) stays on-premise or in the school's tenant. No PII leaves the institution. However, the system collects de-identified learning analytics (response patterns, mastery trajectories, intervention outcomes, misconception distributions) for research and system improvement. Schools opt in to this data sharing through an IRB-approved consent process. De-identified data is stripped of all direct and indirect identifiers before export, aggregated to cohort level where sample sizes permit, and used solely to improve diagnostic accuracy, intervention effectiveness, and the knowledge graph. No third-party advertising or commercial analytics. The research data pipeline is documented in the IRB protocol and the consent forms name exactly what is collected and why.
- Graceful degradation. If the LLM is unavailable, fall back to rule-based templates. Core diagnostic functions work without an internet connection.
- Institutional flexibility. The frontend is separate and institution-dependent. The agentic backend exposes APIs that any frontend can consume.
- Event sourcing. Every state change is an appended event. No destructive updates to learning data. Mastery, escalation state, and profiles are all materialized views.
- Domain agnosticism. The core engine never references specific misconception IDs, concept names, or subject matter. All domain knowledge lives in configuration files.
The current schema stores student_mastery as a mutable row overwritten on every update.
The responses table stores individual events, but mastery is a snapshot with no history.
Every agentic layer depends on temporal reasoning (when did mastery change, how fast does
it decay, what intervention was active when resolution occurred). Without an event log,
none of that is possible.
Every state change in the system becomes an immutable event. The event log is append-only.
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
event_type TEXT NOT NULL,
entity_type TEXT NOT NULL, -- 'student', 'classroom', 'assignment'
entity_id INTEGER NOT NULL,
payload TEXT NOT NULL, -- JSON blob, schema varies by event_type
created_at TEXT DEFAULT (datetime('now')),
created_by TEXT -- 'system', 'teacher:<id>', 'student:<id>'
);
CREATE INDEX idx_events_entity ON events(entity_type, entity_id, created_at);
CREATE INDEX idx_events_type ON events(event_type, created_at);

| event_type | entity_type | payload schema |
|---|---|---|
| `response.submitted` | student | `{problem_id, student_text, correct, misconception_id, confidence, concept_id, assignment_id, latency_ms}` |
| `mastery.updated` | student | `{concept_id, old_level, new_level, trigger_event_id}` |
| `intervention.assigned` | student | `{misconception_id, modality, intervention_text, escalation_level, selected_by}` |
| `intervention.outcome` | student | `{intervention_event_id, outcome, assessed_at, responses_since}` |
| `assignment.created` | classroom | `{title, problem_ids, assigned_by}` |
| `assignment.completed` | student | `{assignment_id, score, misconceptions_detected}` |
| `coaching.plan_generated` | classroom | `{plan_type, structured_plan_json, rendered_text}` |
| `coaching.plan_acknowledged` | classroom | `{plan_event_id, teacher_id, modifications}` |
| `profile.computed` | student | `{profile_json, computed_at}` |
| `alert.created` | student | `{alert_type, severity, message, recommendation}` |
| `alert.resolved` | student | `{alert_event_id, resolved_by}` |
graph LR
subgraph EventStore["Event Store (Append-Only)"]
RE[ResponseEvent]
MUE[MasteryUpdateEvent]
IAE[InterventionAssignedEvent]
IOE[InterventionOutcomeEvent]
AE[AssignmentEvent]
CPE[CoachingPlanEvent]
end
subgraph Views["Materialized Views (Rebuilt from Events)"]
SM[student_mastery_current]
ES[escalation_state_current]
LPV[learning_profile_cache]
IE[intervention_effectiveness]
MC[misconception_cooccurrence]
end
RE --> SM
RE --> LPV
RE --> MC
MUE --> SM
IAE --> ES
IAE --> IE
IOE --> ES
IOE --> IE
IOE --> LPV
AE --> SM
CPE --> Views
These are regular tables rebuilt from the event log. They can be dropped and recreated at any time without data loss. They exist purely for query performance.
-- Current mastery per student per concept (replaces the mutable student_mastery table)
CREATE TABLE student_mastery_current (
student_id INTEGER NOT NULL,
concept_id TEXT NOT NULL,
mastery_level REAL NOT NULL,
attempts INTEGER NOT NULL,
last_event_id INTEGER NOT NULL,
updated_at TEXT NOT NULL,
PRIMARY KEY (student_id, concept_id)
);
-- Current escalation state per student per misconception
CREATE TABLE escalation_state_current (
student_id INTEGER NOT NULL,
misconception_id TEXT NOT NULL,
state TEXT NOT NULL, -- detected, intervention_assigned, modality_switched,
-- prerequisite_check, prereq_remediation, escalated, resolved
modalities_tried TEXT NOT NULL, -- JSON array: ["visual", "concrete"]
attempt_count INTEGER NOT NULL,
last_event_id INTEGER NOT NULL,
updated_at TEXT NOT NULL,
PRIMARY KEY (student_id, misconception_id)
);
-- Cached learning profile (recomputed periodically)
CREATE TABLE learning_profile_cache (
student_id INTEGER PRIMARY KEY,
error_pattern TEXT NOT NULL, -- careless, systematic, foundational_gap, transfer_failure
avg_latency_ms REAL,
mastery_decay_rate REAL,
learning_velocity REAL,
modality_vector TEXT NOT NULL, -- JSON: {"visual": 0.68, "concrete": 0.74, ...}
transfer_success REAL,
computed_at TEXT NOT NULL
);
-- Intervention effectiveness (aggregated across students)
CREATE TABLE intervention_effectiveness (
misconception_id TEXT NOT NULL,
modality TEXT NOT NULL,
attempts INTEGER NOT NULL,
resolutions INTEGER NOT NULL,
resolution_rate REAL NOT NULL,
updated_at TEXT NOT NULL,
PRIMARY KEY (misconception_id, modality)
);

The migration must preserve all existing data. Write a script that:

1. Creates the `events` table
2. Reads every row from `responses` and inserts a corresponding `response.submitted` event
3. Reads every row from `student_mastery` and inserts a `mastery.updated` event
4. Reads every row from `alerts` and inserts an `alert.created` event
5. Creates all materialized view tables and populates them from the new events
6. Renames old tables to `_legacy_*` (do not drop them until the migration is verified)
7. Updates all API endpoints to read from materialized views and write to the event log
Create api/events.py with:
from __future__ import annotations
import json
from datetime import datetime, timezone
from api.database import get_db
def append_event(
event_type: str,
entity_type: str,
entity_id: int,
payload: dict,
created_by: str = "system",
) -> int:
"""Append an immutable event and return its ID."""
with get_db() as conn:
cursor = conn.execute(
"""INSERT INTO events (event_type, entity_type, entity_id, payload, created_by)
VALUES (?, ?, ?, ?, ?)""",
(event_type, entity_type, entity_id, json.dumps(payload), created_by),
)
return cursor.lastrowid
def get_events(
entity_type: str,
entity_id: int,
event_type: str | None = None,
since: str | None = None,
) -> list[dict]:
"""Query events for an entity, optionally filtered by type and time."""
with get_db() as conn:
query = "SELECT * FROM events WHERE entity_type = ? AND entity_id = ?"
params: list = [entity_type, entity_id]
if event_type:
query += " AND event_type = ?"
params.append(event_type)
if since:
query += " AND created_at >= ?"
params.append(since)
query += " ORDER BY created_at ASC"
rows = conn.execute(query, params).fetchall()
return [
{**dict(row), "payload": json.loads(row["payload"])}
for row in rows
]

Create api/migrate_to_events.py as a standalone migration script that performs steps 1-6 above. Run it once, verify, then update the API layer.
Tip
You can verify the migration is correct by asserting that the materialized
student_mastery_current view matches the old student_mastery table row-for-row
before cutting over.
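Because the views are derived entirely from events, that verification (and any later rebuild) is a straight replay of the log. The sketch below shows one way `student_mastery_current` could be repopulated from `mastery.updated` events; the `rebuild_mastery_view` name and the attempts-counting shortcut are assumptions.

```python
# Illustrative sketch: clear and repopulate student_mastery_current by replaying
# mastery.updated events in order. Assumes the api.database.get_db helper and the
# sqlite3.Row factory used by api/events.py; rebuild_mastery_view is hypothetical.
import json

from api.database import get_db


def rebuild_mastery_view() -> None:
    """Repopulate the mastery view from the append-only event log."""
    with get_db() as conn:
        conn.execute("DELETE FROM student_mastery_current")
        rows = conn.execute(
            """SELECT id, entity_id, payload, created_at FROM events
               WHERE event_type = 'mastery.updated' ORDER BY created_at ASC"""
        ).fetchall()
        for row in rows:
            payload = json.loads(row["payload"])
            conn.execute(
                """INSERT INTO student_mastery_current
                       (student_id, concept_id, mastery_level, attempts, last_event_id, updated_at)
                   VALUES (?, ?, ?, ?, ?, ?)
                   ON CONFLICT(student_id, concept_id) DO UPDATE SET
                       mastery_level = excluded.mastery_level,
                       -- attempts here counts mastery updates; a fuller rebuild
                       -- would also replay response.submitted events
                       attempts = student_mastery_current.attempts + 1,
                       last_event_id = excluded.last_event_id,
                       updated_at = excluded.updated_at""",
                (row["entity_id"], payload["concept_id"], payload["new_level"],
                 1, row["id"], row["created_at"]),
            )
```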
| File | Action | Description |
|---|---|---|
| `api/events.py` | Create | Event log append/query functions |
| `api/migrate_to_events.py` | Create | One-time migration script |
| `api/database.py` | Modify | Add events table and materialized views to `init_db()` |
Every student with a given misconception receives the same static string. The system ignores whether the student has already seen this intervention and it failed, whether prerequisite mastery is the real blocker, how many times this misconception has recurred, and what intervention approaches have worked for similar students.
The system supports five modalities per misconception. These are not hardcoded in Python
(as they are today in api/engine.py). They live in a domain configuration file.
| Modality | Description | When to use |
|---|---|---|
| visual | Diagrams, area models, number lines, graphs | First attempt; students with spatial reasoning strength |
| concrete | Numerical substitution, physical manipulatives, worked numeric examples | After visual fails; students who respond to numbers over symbols |
| pattern | Side-by-side worked examples, "what's the pattern?" activities | Inductive learners |
| verbal | Analogy, metaphor, narrative explanation | Students who respond to story and language |
| peer | Pair with a student who resolved this misconception | Social learners; when a resolved peer exists in the class |
Each domain provides an intervention catalog. The algebra instantiation looks like this:
{
"domain": "algebra_fundamentals",
"interventions": {
"dist_first_term_only": {
"visual": {
"text": "Area model: draw a rectangle with width = factor and length = (a + b).",
"materials": ["area_model_worksheet.pdf"],
"estimated_minutes": 5
},
"concrete": {
"text": "Numerical substitution: try 2(3+4) by distributing and by computing parentheses first. Compare.",
"materials": [],
"estimated_minutes": 3
},
"pattern": {
"text": "Show 3 worked examples side-by-side. Ask: what operation was applied to every term?",
"materials": ["worked_examples_dist.pdf"],
"estimated_minutes": 5
},
"verbal": {
"text": "The factor is like a delivery person visiting every house on the street, not just the first one.",
"materials": [],
"estimated_minutes": 2
},
"peer": {
"text": "Pair with a student who resolved this misconception for collaborative practice.",
"materials": [],
"estimated_minutes": 10,
"requires_resolved_peer": true
}
}
}
}

The escalation logic is not a numbered list. It is a per-student, per-misconception state machine with explicit states and transitions.
stateDiagram-v2
[*] --> Detected: Misconception classified
Detected --> InterventionAssigned: Select modality\n(highest prior success rate)
InterventionAssigned --> Assessed: Student re-assessed\n(within next 3 responses)
Assessed --> Resolved: Misconception absent\nin re-assessment
Assessed --> ModalitySwitched: Misconception persists\n(try next modality)
ModalitySwitched --> ReAssessed: Student re-assessed
ReAssessed --> Resolved
ReAssessed --> PrerequisiteCheck: Persists 2nd time
PrerequisiteCheck --> PrereqRemediation: Prereq mastery < 0.60
PrerequisiteCheck --> ModalitySwitched: Prereqs OK,\ntry 3rd modality
PrereqRemediation --> Assessed: After prereq\nremediation cycle
ModalitySwitched --> Escalated: All modalities\nexhausted (4th persist)
Escalated --> TeacherConference: Human intervention
TeacherConference --> Resolved
TeacherConference --> IEPReferral: Persistent across\nmultiple conferences
Resolved --> [*]
| Current state | Event | Condition | Next state |
|---|---|---|---|
| (none) | `response.submitted` with misconception | First detection | `detected` |
| `detected` | System selects modality | Always | `intervention_assigned` |
| `intervention_assigned` | 3 responses assessed | Misconception absent | `resolved` |
| `intervention_assigned` | 3 responses assessed | Misconception present | `modality_switched` |
| `modality_switched` | 3 responses assessed | Misconception absent | `resolved` |
| `modality_switched` | 3 responses assessed | Misconception present, attempt < 3 | `prerequisite_check` |
| `prerequisite_check` | Prereq mastery check | Any prereq < 0.60 | `prereq_remediation` |
| `prerequisite_check` | Prereq mastery check | All prereqs >= 0.60 | `modality_switched` |
| `prereq_remediation` | Prereq mastery >= 0.60 | Re-enter main flow | `intervention_assigned` |
| `modality_switched` | 3 responses assessed | Misconception present, all modalities tried | `escalated` |
| `escalated` | Teacher acknowledges | Always | `teacher_conference` |
| `teacher_conference` | Teacher marks resolved | Resolution confirmed | `resolved` |
| `teacher_conference` | Multiple conferences fail | Persistent issue | `iep_referral` |
Create api/intervention_manager.py:
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum
class EscalationState(str, Enum):
DETECTED = "detected"
INTERVENTION_ASSIGNED = "intervention_assigned"
MODALITY_SWITCHED = "modality_switched"
PREREQUISITE_CHECK = "prerequisite_check"
PREREQ_REMEDIATION = "prereq_remediation"
ESCALATED = "escalated"
TEACHER_CONFERENCE = "teacher_conference"
IEP_REFERRAL = "iep_referral"
RESOLVED = "resolved"
MODALITY_ORDER = ["visual", "concrete", "pattern", "verbal", "peer"]
@dataclass
class InterventionContext:
student_id: int
misconception_id: str
current_state: EscalationState
modalities_tried: list[str]
attempt_count: int
prerequisite_mastery: dict[str, float]
modality_vector: dict[str, float] # from learning profile
class_effectiveness: dict[str, float] # from cross-student analysis
def select_intervention(ctx: InterventionContext) -> tuple[str, EscalationState]:
"""Return (selected_modality, new_state) based on escalation logic.
Uses Thompson sampling over the modality vector to balance exploitation
(pick the historically best modality) with exploration (occasionally try
others to improve estimates).
"""
available = [m for m in MODALITY_ORDER if m not in ctx.modalities_tried]
if not available:
return ("teacher_conference", EscalationState.ESCALATED)
if ctx.attempt_count >= 2:
# Check prerequisites before trying more modalities
low_prereqs = {
c: m for c, m in ctx.prerequisite_mastery.items() if m < 0.60
}
if low_prereqs:
return ("prereq_remediation", EscalationState.PREREQ_REMEDIATION)
# Thompson sampling: draw from Beta(successes+1, failures+1) per modality
# For available modalities, use class-wide effectiveness as prior
import random
scores = {}
for modality in available:
alpha = ctx.class_effectiveness.get(modality, 0.5) * 10 + 1
beta_param = (1 - ctx.class_effectiveness.get(modality, 0.5)) * 10 + 1
# Blend with student-specific vector if available
student_weight = ctx.modality_vector.get(modality, 0.5)
alpha += student_weight * 5
beta_param += (1 - student_weight) * 5
scores[modality] = random.betavariate(alpha, beta_param)
selected = max(scores, key=scores.get)
new_state = (
EscalationState.INTERVENTION_ASSIGNED
if ctx.attempt_count == 0
else EscalationState.MODALITY_SWITCHED
)
return (selected, new_state)

Why not pick the modality with the highest success rate?
If the system always exploits (picks the best-known option), it never learns whether other modalities might work better. Consider: the "visual" modality was tried first for 22 students and resolved 68% of the time. The "pattern" modality was tried only 3 times (because the system always picks "visual" first) and happened to resolve 2/3 = 67%. The system would keep picking "visual" forever, never discovering that "pattern" might actually resolve 81% if it had more data.
Thompson sampling solves this by drawing a random sample from each modality's success distribution. Modalities with less data have wider distributions, so they occasionally "win" the draw and get selected, generating new data. Over time, the system converges on the true best modality while maintaining enough exploration to detect shifts.
The math:

- For each modality $m$, maintain counts $s_m$ (successes) and $f_m$ (failures)
- Draw $\theta_m \sim \text{Beta}(s_m + 1, f_m + 1)$
- Select $m^* = \arg\max_m \theta_m$
The +1 terms are the prior (uniform Beta(1,1)). In practice, we initialize with the
class-wide effectiveness rates as an informative prior, so a brand-new student benefits
from class-wide data immediately.
An intervention's outcome is assessed by checking whether the misconception appears in the student's next 3 responses on that concept.
def assess_intervention_outcome(
student_id: int,
misconception_id: str,
intervention_event_id: int,
) -> str:
"""Check the last 3 responses since intervention for misconception recurrence.
Returns: 'resolved', 'persisted', or 'not_assessed' (fewer than 3 responses).
"""
events = get_events(
entity_type="student",
entity_id=student_id,
event_type="response.submitted",
)
# Find responses after the intervention event
intervention_event = get_event_by_id(intervention_event_id)
post_responses = [
e for e in events
if e["created_at"] > intervention_event["created_at"]
and e["payload"]["concept_id"] == get_concept_for_misconception(misconception_id)
]
if len(post_responses) < 3:
return "not_assessed"
recent_three = post_responses[:3]
recurred = any(
r["payload"].get("misconception_id") == misconception_id
for r in recent_three
)
return "persisted" if recurred else "resolved"| Method | Path | Description |
|---|---|---|
GET |
/api/students/{id}/interventions |
Current and historical interventions for a student |
GET |
/api/students/{id}/interventions/active |
Currently assigned, unresolved interventions |
POST |
/api/students/{id}/interventions/assign |
Trigger the escalation state machine and assign an intervention |
PATCH |
/api/interventions/{id}/outcome |
Record an outcome (resolved/persisted) |
GET |
/api/classrooms/{id}/intervention-effectiveness |
Class-wide modality effectiveness rates |
| File | Action | Description |
|---|---|---|
| `api/intervention_manager.py` | Create | Escalation state machine + Thompson sampling |
| `data/interventions.json` | Create | Domain-specific intervention catalog |
| `api/main.py` | Modify | Add Phase 1 API endpoints |
| `api/database.py` | Modify | Add `escalation_state_current` view to `init_db()` |
| `api/engine.py` | Modify | Remove hardcoded HINTS/INTERVENTIONS dicts; load from `interventions.json` |
Question. Does Thompson sampling over intervention modalities produce higher misconception resolution rates than a greedy policy (always pick the modality with the highest historical success rate)?
Design. Monte Carlo simulation, 1 000 simulated students x 50 interactions each, comparing three policies:
| Policy | Logic |
|---|---|
| Thompson | Draw from Beta(successes+1, failures+1) per modality, pick max |
| Greedy | Always pick the modality with the highest observed success rate |
| Uniform random | Pick a modality uniformly at random |
Data generation. Each simulated student has a latent modality preference vector (5 elements, one per modality) drawn from Dirichlet(1,1,1,1,1). When the system assigns modality m, the intervention resolves with probability preference_vector[m]. This creates ground-truth heterogeneity: some students genuinely respond better to visual, others to concrete.
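A compact sketch of this generative model and the policy loop, assuming numpy and the simplification that every interaction is a single intervention attempt; the function name and harness structure are illustrative, not the actual experiment script.

```python
# Illustrative simulation loop for Experiment 04. Assumes numpy; the student and
# policy models are simplified (one intervention attempt per interaction).
import numpy as np

MODALITIES = ["visual", "concrete", "pattern", "verbal", "peer"]
rng = np.random.default_rng(42)


def run_policy(policy: str, n_students: int = 1000, n_steps: int = 50) -> float:
    """Return the cumulative resolution rate for one policy."""
    resolutions, attempts = 0, 0
    for _ in range(n_students):
        pref = rng.dirichlet(np.ones(len(MODALITIES)))  # latent ground truth
        succ = np.zeros(len(MODALITIES))
        fail = np.zeros(len(MODALITIES))
        for _ in range(n_steps):
            if policy == "thompson":
                m = int(np.argmax(rng.beta(succ + 1, fail + 1)))
            elif policy == "greedy":
                m = int(np.argmax((succ + 1) / (succ + fail + 2)))
            else:  # uniform random
                m = int(rng.integers(len(MODALITIES)))
            resolved = rng.random() < pref[m]
            succ[m] += resolved
            fail[m] += not resolved
            resolutions += resolved
            attempts += 1
    return resolutions / attempts


for p in ("thompson", "greedy", "uniform"):
    print(p, round(run_policy(p, n_students=100), 3))
```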
Primary metrics.
| Metric | Definition |
|---|---|
| Cumulative resolution rate | Fraction of interventions that resolve the misconception, measured at intervals of 10, 20, 30, 40, 50 interactions |
| Regret | Difference between the policy's resolution rate and the oracle policy (always picks the student's true best modality) |
| Convergence speed | Number of interactions until the policy's running best-modality estimate matches the student's true best |
Scalability axis. Vary number of modalities from 3 to 10 to test whether Thompson sampling degrades gracefully when the action space grows (relevant for domains with more intervention types).
Expected outcome. Thompson should match or beat greedy after ~15 interactions (exploration payoff) and should significantly beat uniform. Regret should decrease as interactions accumulate.
Artifacts.
- `experiments/04_thompson_vs_greedy/artifacts/results.json`
- `experiments/04_thompson_vs_greedy/artifacts/resolution_curves.png`
- `experiments/04_thompson_vs_greedy/artifacts/regret_curves.png`
- `experiments/04_thompson_vs_greedy/artifacts/scalability.png`
Reproducibility. python experiments/04_thompson_vs_greedy/run.py (seed=42).
Question. Under the Phase 1 escalation state machine, what fraction of misconceptions reach each terminal state (resolved, teacher_conference, iep_referral), and how sensitive is this to the misconception resolution probability?
Design. Absorbing Markov chain analysis + Monte Carlo validation.
Step 1 (analytical): model the escalation state machine as an absorbing Markov chain.
The transient states are detected, intervention_assigned, modality_switched,
prerequisite_check, and prereq_remediation. The absorbing states are resolved,
teacher_conference, and iep_referral. Compute the absorption probabilities and
expected number of steps analytically from the fundamental matrix $N = (I - Q)^{-1}$, where $Q$ is the transient-to-transient transition submatrix.
Step 2 (simulation): run 10 000 misconception episodes through the state machine with configurable resolution probability per modality attempt. Validate that empirical absorption fractions match the analytical result.
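A sketch of the analytical step with numpy appears below. It uses a simplified chain (four modality attempts, each resolving with probability p, the fourth failure absorbing into escalation) rather than the full Phase 1 transition table, so treat the numbers as a sanity check, not the experiment itself.

```python
# Sketch of the analytical step using the fundamental matrix N = (I - Q)^(-1).
# Simplified chain: four modality attempts, each resolving with probability p,
# otherwise moving to the next attempt; the 4th failure absorbs into escalated.
import numpy as np


def absorption_analysis(p: float) -> dict:
    # Transient states: attempt_1..attempt_4. Absorbing: resolved, escalated.
    Q = np.zeros((4, 4))
    R = np.zeros((4, 2))
    for i in range(4):
        R[i, 0] = p                      # resolve at this attempt
        if i < 3:
            Q[i, i + 1] = 1 - p          # persist -> next modality attempt
        else:
            R[i, 1] = 1 - p              # all modalities exhausted -> escalated
    N = np.linalg.inv(np.eye(4) - Q)     # fundamental matrix
    B = N @ R                            # absorption probabilities per start state
    steps = N.sum(axis=1)                # expected steps to absorption
    return {"P(resolved)": B[0, 0], "P(escalated)": B[0, 1], "E[steps]": steps[0]}


print(absorption_analysis(0.5))  # ~ {'P(resolved)': 0.9375, 'P(escalated)': 0.0625, 'E[steps]': 1.875}
```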
Sensitivity sweep. Vary per-attempt resolution probability from 0.10 to 0.90 in steps of 0.05. For each value, compute:
| Metric | Definition |
|---|---|
| P(resolved) | Fraction reaching the resolved state |
| P(teacher_conference) | Fraction reaching teacher conference |
| E[steps to absorption] | Expected interactions before a terminal state |
| E[modalities tried] | Expected number of distinct modalities used |
Design choice axis. Vary the number of modality attempts allowed before escalation (currently 4) from 2 to 8. Plot P(resolved) vs attempts-allowed for different resolution probabilities.
Expected outcome. At resolution probability 0.50 (the v2 RCT setting),
90% of misconceptions should resolve before teacher escalation. At 0.20, the system should escalate frequently, validating that the safety net works.
Artifacts.
- `experiments/06_escalation_convergence/artifacts/results.json`
- `experiments/06_escalation_convergence/artifacts/absorption_probabilities.png`
- `experiments/06_escalation_convergence/artifacts/expected_steps.png`
- `experiments/06_escalation_convergence/artifacts/attempts_vs_resolution.png`
Reproducibility. python experiments/06_escalation_convergence/run.py (seed=42).
Every student is treated as interchangeable. The system has no model of individual learning patterns, only per-concept mastery scores. Two students with identical mastery of 0.55 on "distributive property" may need completely different interventions: one has a systematic misconception that needs targeted remediation, the other has a foundational gap in integer signs that cascades forward.
Classify each student into one of four error patterns based on their response history. This is a deterministic classification, not a machine learning model.
| Pattern | Detection rule | Implication |
|---|---|---|
| Careless errors | Mastery >= 0.70 AND error rate < 20% AND errors show no consistent misconception | Attention/focus strategies, not re-teaching |
| Systematic misconception | Same misconception_id appears in >= 3 responses on a concept | Targeted conceptual intervention via escalation state machine |
| Foundational gap | Errors on concept X AND prerequisite concept Y has mastery < 0.60 | Stop current topic; remediate prerequisite first |
| Transfer failure | Mastery >= 0.80 on easy/medium problems AND < 0.50 on hard problems for same concept | Mixed-context practice, not more of the same |
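A sketch of how the detection rules in the table above could be applied to a student's per-concept summaries. The summary dict shape and helper names are assumptions; the thresholds and rule order come from the table and the profiler docstring below.

```python
# Illustrative rule application for the error-pattern table above. The per-concept
# summary shape is assumed; thresholds mirror the table.
from collections import Counter


def classify_concept(summary: dict) -> str:
    """Apply the detection rules to one concept summary (field names assumed)."""
    if summary["min_prereq_mastery"] < 0.60:
        return "foundational_gap"
    if summary["max_same_misconception"] >= 3:
        return "systematic"
    if summary["easy_medium_mastery"] >= 0.80 and summary["hard_accuracy"] < 0.50:
        return "transfer_failure"
    return "careless"  # otherwise: no consistent misconception -> careless errors


def dominant_pattern(concept_summaries: list[dict]) -> str:
    """Most frequent pattern across concepts, as in _classify_error_pattern below."""
    return Counter(classify_concept(s) for s in concept_summaries).most_common(1)[0][0]
```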
Three signals extracted from timestamped response data:
Response latency (already captured as latency_ms in the event payload):
- Fast-and-wrong (< 10s): likely guessing or careless. Intervention: slow-down prompts.
- Slow-and-wrong (> 60s): likely struggling. Intervention: scaffolding and worked examples.
- Slow-and-right (> 60s): learning is happening but effortful. Monitor, do not intervene.
Mastery decay rate: $\frac{m_{\text{old}} - m_{\text{new}}}{\Delta t_{\text{days}}}$, computed over all pairs of consecutive mastery snapshots for a concept, then averaged. High decay (> 0.05/day) indicates the student needs spaced repetition with shorter intervals.
Learning velocity: how quickly a student moves from mastery 0.35 to 0.85 on a concept, measured by the number of responses required (fewer responses = higher velocity). This normalizes for concept difficulty by comparing within-concept.
Higher velocity means fewer problems needed. Students with low velocity need more scaffolding, not just more volume.
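For the decay-rate signal, a sketch of the computation over consecutive `mastery.updated` events for one concept, matching the `(old_level - new_level) / days_between` description in the profiler stub below. The helper name is hypothetical; timestamps are assumed to be the ISO-style strings the event log writes.

```python
# Illustrative decay-rate computation over consecutive mastery.updated events for
# one concept. Assumes event dicts as returned by get_events; helper name is
# hypothetical.
from datetime import datetime


def decay_rate_for_concept(mastery_events: list[dict]) -> float:
    """Average mastery change per day across consecutive snapshot pairs (0 if none)."""
    rates = []
    for prev, curr in zip(mastery_events, mastery_events[1:]):
        days = (
            datetime.fromisoformat(curr["created_at"])
            - datetime.fromisoformat(prev["created_at"])
        ).total_seconds() / 86400
        if days <= 0:
            continue
        old = prev["payload"]["new_level"]
        new = curr["payload"]["new_level"]
        rates.append((old - new) / days)  # positive = decay, negative = growth
    return sum(rates) / len(rates) if rates else 0.0
```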
A 5-element vector computed from the student's intervention history:

$$[v_{\text{visual}}, v_{\text{concrete}}, v_{\text{pattern}}, v_{\text{verbal}}, v_{\text{peer}}]$$

Where each element is:

$$v_m = \frac{\text{resolutions}_m + 1}{\text{attempts}_m + 2}$$

The +1/+2 is Laplace smoothing (equivalent to a Beta(1,1) prior), so modalities with
zero data default to 0.50 rather than undefined. This vector feeds into the Thompson
sampling in Phase 1.
Create api/learning_profiler.py:
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone
from api.events import get_events
MODALITY_ORDER = ["visual", "concrete", "pattern", "verbal", "peer"]
@dataclass
class LearningProfile:
student_id: int
error_pattern: str # careless, systematic, foundational_gap, transfer_failure
avg_latency_ms: float
mastery_decay_rate: float # avg mastery loss per day across concepts
learning_velocity: float # avg responses to reach mastery
modality_vector: dict[str, float] # {visual: 0.68, concrete: 0.74, ...}
transfer_success_rate: float
computed_at: str
def compute_learning_profile(student_id: int) -> LearningProfile:
"""Derive learning characteristics from event history.
Query the event log for all response.submitted and intervention.outcome
events for this student. Compute each metric from the raw data.
Cache the result in learning_profile_cache.
"""
responses = get_events("student", student_id, "response.submitted")
interventions = get_events("student", student_id, "intervention.outcome")
return LearningProfile(
student_id=student_id,
error_pattern=_classify_error_pattern(responses),
avg_latency_ms=_compute_avg_latency(responses),
mastery_decay_rate=_compute_decay_rate(student_id),
learning_velocity=_compute_velocity(student_id),
modality_vector=_compute_modality_vector(interventions),
transfer_success_rate=_compute_transfer_rate(responses),
computed_at=datetime.now(timezone.utc).isoformat(),
)
def _classify_error_pattern(responses: list[dict]) -> str:
"""Classify the student's dominant error pattern.
Logic:
1. Group responses by concept_id
2. For each concept, check prerequisite mastery -> foundational_gap
3. Check if same misconception_id repeats 3+ times -> systematic
4. Check if mastery is high but hard problems fail -> transfer_failure
5. Otherwise -> careless
Return the most frequent pattern across concepts.
"""
...
def _compute_avg_latency(responses: list[dict]) -> float:
"""Mean response latency in milliseconds."""
latencies = [r["payload"].get("latency_ms", 0) for r in responses]
return sum(latencies) / len(latencies) if latencies else 0.0
def _compute_decay_rate(student_id: int) -> float:
"""Average mastery loss per day across all concepts.
Query mastery.updated events, compute (old_level - new_level) / days_between
for consecutive pairs, return the mean.
"""
...
def _compute_velocity(student_id: int) -> float:
"""Average number of responses to go from mastery 0.35 to 0.85 per concept.
Query mastery.updated events per concept, find the response count between
the first event with level >= 0.35 and the first event with level >= 0.85.
"""
...
def _compute_modality_vector(interventions: list[dict]) -> dict[str, float]:
"""Compute Laplace-smoothed effectiveness per modality."""
counts = {m: {"success": 0, "total": 0} for m in MODALITY_ORDER}
for iv in interventions:
modality = iv["payload"].get("modality")
if modality not in counts:
continue
counts[modality]["total"] += 1
if iv["payload"].get("outcome") == "resolved":
counts[modality]["success"] += 1
return {
m: (c["success"] + 1) / (c["total"] + 2)
for m, c in counts.items()
}
def _compute_transfer_rate(responses: list[dict]) -> float:
"""Fraction of hard-difficulty problems answered correctly among
concepts where the student has mastery >= 0.80 on easy/medium."""
...

| Method | Path | Description |
|---|---|---|
| GET | `/api/students/{id}/profile` | Current learning profile (cached or recomputed) |
| GET | `/api/classrooms/{id}/profiles` | All student profiles in a classroom |
| File | Action | Description |
|---|---|---|
| `api/learning_profiler.py` | Create | Profile computation and caching |
| `api/main.py` | Modify | Add profile endpoints |
The problem bank has 28 problems. The recommend_problems() function picks the 3 easiest
for a concept. No awareness of problems already seen, optimal difficulty for current
mastery, spaced repetition, or misconception-targeted selection.
Replace categorical difficulty labels (easy/medium/hard) with Item Response Theory parameters on every problem. Use the Rasch model (1-parameter logistic, 1PL) to start:

$$P(\text{correct} \mid \theta, b) = \frac{1}{1 + e^{-(\theta - b)}}$$

Where:

- $\theta$ = student ability (derived from current mastery: $\theta = \ln\frac{m}{1-m}$)
- $b$ = item difficulty (estimated from response data or set manually)

The optimal learning zone targets items where $P(\text{correct}) \approx 0.70$. For $P = 0.70$: $\theta - b = \ln\frac{0.70}{0.30} \approx 0.85$. So the system selects problems with difficulty $b \approx \theta - 0.85$.
Add IRT parameters and misconception diagnostic tags to each problem:
{
"problem_id": "dist_01",
"concept": "distributive_property",
"problem_text": "Expand: 3(x + 4)",
"correct_answer": "3x + 12",
"difficulty": "easy",
"irt_b": -1.2,
"irt_discrimination": 1.0,
"diagnostic_for": ["dist_first_term_only", "dist_drop_parens"],
"template_id": "dist_expand_simple",
"template_params": {"factor": 3, "var": "x", "constant": 4}
}

The diagnostic_for field indicates which misconceptions this problem is designed to
surface. When the system needs to verify whether dist_first_term_only has resolved, it
selects a problem tagged with that misconception.
The template_id and template_params fields enable parameterized generation: the system
can create novel instances of the same structural problem with different coefficients, so
students never see the exact same problem twice.
For new problems without response data, set initial $b$ values from the categorical difficulty label:
| Category | Initial |
|---|---|
| easy | -1.5 |
| medium | 0.0 |
| hard | 1.5 |
After 30+ responses, re-estimate $b$ from the observed response data by maximum likelihood:
import math
from scipy.optimize import minimize_scalar
def estimate_difficulty(
responses: list[dict],
student_abilities: dict[int, float],
) -> float:
"""Estimate item difficulty b from observed responses using MLE.
responses: list of {student_id, correct} dicts
student_abilities: {student_id: theta} mapping
"""
def neg_log_likelihood(b: float) -> float:
nll = 0.0
for r in responses:
theta = student_abilities[r["student_id"]]
p = 1.0 / (1.0 + math.exp(-(theta - b)))
p = max(min(p, 0.999), 0.001) # clamp for numerical stability
if r["correct"]:
nll -= math.log(p)
else:
nll -= math.log(1 - p)
return nll
result = minimize_scalar(
neg_log_likelihood, bounds=(-4.0, 4.0), method="bounded"
)
return round(result.x, 3)

import math
def generate_assignment(
student_id: int,
target_concepts: list[str],
count: int = 5,
) -> list[str]:
"""Generate a personalized problem set using IRT-based selection."""
profile = get_cached_profile(student_id)
seen = get_seen_problem_ids(student_id)
mastery = get_current_mastery(student_id)
problems: list[str] = []
# Step 1: Prerequisite remediation
# If any prerequisite has mastery < 0.60, include one easy remediation problem
for concept in target_concepts:
for prereq in get_prerequisites(concept):
if mastery.get(prereq, 0) < 0.60:
problems.extend(
select_by_irt(prereq, student_id, seen, target_p=0.80, n=1)
)
# Step 2: Target concept problems at optimal difficulty
# Select items where P(correct) ~ 0.70 for the student's current ability
for concept in target_concepts:
# If there's an active unresolved misconception, prefer diagnostic problems
active_misconceptions = get_active_misconceptions(student_id, concept)
if active_misconceptions:
problems.extend(
select_diagnostic(concept, active_misconceptions[0], seen, n=1)
)
problems.extend(
select_by_irt(concept, student_id, seen, target_p=0.70, n=1)
)
# Step 3: Spaced repetition
# Include one problem from a previously mastered concept not seen in 7+ days
stale = get_stale_mastered_concepts(student_id, days=7)
if stale:
problems.extend(
select_by_irt(stale[0], student_id, seen, target_p=0.80, n=1)
)
return problems[:count]
def select_by_irt(
concept: str,
student_id: int,
seen: set[str],
target_p: float,
n: int,
) -> list[str]:
"""Select n unseen problems closest to the target P(correct) for this student."""
mastery = get_current_mastery_for_concept(student_id, concept)
theta = math.log(max(mastery, 0.01) / max(1 - mastery, 0.01))
target_b = theta - math.log(target_p / (1 - target_p))
candidates = [
p for p in get_problems_for_concept(concept)
if p["problem_id"] not in seen
]
# Sort by distance from target difficulty
candidates.sort(key=lambda p: abs(p.get("irt_b", 0.0) - target_b))
return [p["problem_id"] for p in candidates[:n]]
def select_diagnostic(
concept: str,
misconception_id: str,
seen: set[str],
n: int,
) -> list[str]:
"""Select unseen problems tagged as diagnostic for a specific misconception."""
candidates = [
p for p in get_problems_for_concept(concept)
if p["problem_id"] not in seen
and misconception_id in p.get("diagnostic_for", [])
]
return [p["problem_id"] for p in candidates[:n]]To prevent item exhaustion (28 problems are not enough for adaptive sequencing), create parameterized templates:
import random
TEMPLATES = {
"dist_expand_simple": {
"pattern": "{factor}({var} + {constant})",
"answer_pattern": "{factor}{var} + {product}",
"concept": "distributive_property",
"diagnostic_for": ["dist_first_term_only", "dist_drop_parens"],
"irt_b_base": -1.2,
"param_ranges": {
"factor": (2, 9),
"constant": (1, 12),
},
},
}
def generate_from_template(template_id: str) -> dict:
"""Generate a novel problem instance from a template."""
tmpl = TEMPLATES[template_id]
factor = random.randint(*tmpl["param_ranges"]["factor"])
constant = random.randint(*tmpl["param_ranges"]["constant"])
product = factor * constant
var = random.choice(["x", "y", "n", "a"])
return {
"problem_id": f"{template_id}_{factor}_{var}_{constant}",
"concept": tmpl["concept"],
"problem_text": f"Expand: {factor}({var} + {constant})",
"correct_answer": f"{factor}{var} + {product}",
"irt_b": tmpl["irt_b_base"],
"diagnostic_for": tmpl["diagnostic_for"],
"generated": True,
}

| Method | Path | Description |
|---|---|---|
| POST | `/api/students/{id}/generate-assignment` | Generate a personalized problem set |
| GET | `/api/problems/{id}/irt` | Get IRT parameters for a problem |
| POST | `/api/problems/estimate-difficulty` | Re-estimate IRT params from accumulated data |
| File | Action | Description |
|---|---|---|
| `api/problem_sequencer.py` | Create | IRT selection, spaced repetition, template engine |
| `data/problem_bank.json` | Modify | Add `irt_b`, `diagnostic_for`, `template_id` fields to every problem |
| `api/engine.py` | Modify | Replace `recommend_problems()` with call to problem_sequencer |
| `api/main.py` | Modify | Add personalized assignment generation endpoint |
Question. Does IRT-based problem selection (target P(correct) ~ 0.70) produce higher learning gains than the current categorical approach (pick the 3 easiest problems for a concept)?
Design. Simulated RCT, 500 students per condition, 40 interactions each.
Reuses the existing SimulatedStudent and generate_students infrastructure
with the held-out test evaluation from Experiment 02.
| Condition | Problem selection logic |
|---|---|
| IRT-targeted | Select the problem whose IRT difficulty $b$ is closest to the target where P(correct) ≈ 0.70 for the student |
| Categorical-easy | Always pick the easiest unseen problem (current system behavior) |
| Categorical-hard | Always pick the hardest unseen problem (adversarial baseline) |
| Random | Pick a random unseen problem |
IRT parameter assignment. Assign each problem an IRT difficulty $b$ using the categorical-to-$b$ mapping defined in Phase 3 (easy = -1.5, medium = 0.0, hard = 1.5).
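One way the simulated response model could tie the conditions together: convert each simulated student's concept mastery to ability θ, map the problem's category to b, and sample correctness from the Rasch probability. The mapping comes from the Phase 3 table; the function names and clamping are assumptions of the experiment harness, not the existing `SimulatedStudent` code.

```python
# Sketch of a simulated response model for Experiment 05 under the Rasch
# assumption. Category-to-b mapping comes from the Phase 3 table; the rest
# (function names, mastery clamping) is illustrative harness code.
import math
import random

CATEGORY_TO_B = {"easy": -1.5, "medium": 0.0, "hard": 1.5}


def p_correct(mastery: float, difficulty_category: str) -> float:
    """Rasch probability of a correct response given current mastery."""
    m = min(max(mastery, 0.01), 0.99)      # clamp to keep the logit finite
    theta = math.log(m / (1 - m))          # ability from mastery
    b = CATEGORY_TO_B[difficulty_category]
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def simulate_response(mastery: float, difficulty_category: str) -> bool:
    return random.random() < p_correct(mastery, difficulty_category)
```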
Primary metrics.
| Metric | Definition |
|---|---|
| Post-test score gain | Held-out test score (post) minus (pre), same decoupled assessment as Experiment 02 |
| Desirable difficulty hit rate | Fraction of assigned problems where the student's actual accuracy falls in [0.55, 0.85] |
| Concepts mastered | Number of concepts reaching mastery threshold |
| Efficiency | Test score gain per interaction |
Scalability axis. Vary problem bank size from 15 to 100 (using the template engine to generate additional problems) and measure whether IRT's advantage grows with a richer item pool.
Expected outcome. IRT-targeted should beat categorical-easy by d ~ 0.15-0.30. Categorical-hard should perform worst (frustration). The advantage of IRT should grow with larger problem banks because it has more items near the optimal difficulty to choose from.
Artifacts.
- `experiments/05_irt_vs_categorical/artifacts/results.json`
- `experiments/05_irt_vs_categorical/artifacts/learning_curves.png`
- `experiments/05_irt_vs_categorical/artifacts/difficulty_targeting.png`
- `experiments/05_irt_vs_categorical/artifacts/bank_size_scaling.png`
Reproducibility. python experiments/05_irt_vs_categorical/run.py (seed=42).
The system analyzes students individually. It does not learn from the class as a whole. This means the system cannot discover that two misconceptions always co-occur, that one intervention modality works better for third period than second period, or that a prerequisite relationship exists that the hand-authored knowledge graph missed.
Build a matrix $C$ where $C[i][j] = P(\text{misconception } j \mid \text{misconception } i)$: among students who exhibit misconception $i$, the fraction who also exhibit misconception $j$.

High values of $C[i][j]$ indicate misconceptions that consistently co-occur, suggesting a shared underlying gap that can be addressed together.
from collections import Counter
def compute_cooccurrence_matrix(classroom_id: int) -> dict[str, dict[str, float]]:
"""Compute misconception co-occurrence rates across all students in a classroom."""
students = get_students_in_classroom(classroom_id)
# For each student, get the set of misconceptions ever detected
student_misconceptions: dict[int, set[str]] = {}
for student in students:
events = get_events("student", student["id"], "response.submitted")
student_misconceptions[student["id"]] = {
e["payload"]["misconception_id"]
for e in events
if e["payload"].get("misconception_id")
}
# Compute C[i][j] = P(j | i)
all_misconceptions: set[str] = set()
for ms in student_misconceptions.values():
all_misconceptions.update(ms)
matrix: dict[str, dict[str, float]] = {}
for mi in all_misconceptions:
matrix[mi] = {}
students_with_mi = [
s for s, ms in student_misconceptions.items() if mi in ms
]
for mj in all_misconceptions:
if mi == mj:
matrix[mi][mj] = 1.0
continue
if not students_with_mi:
matrix[mi][mj] = 0.0
continue
co_count = sum(
1 for s in students_with_mi
if mj in student_misconceptions[s]
)
matrix[mi][mj] = round(co_count / len(students_with_mi), 3)
return matrix

Aggregate intervention outcomes across all students in a classroom:
dist_first_term_only:
visual: 68% resolution (n=22, 95% CI: [48%, 84%])
concrete: 74% resolution (n=19, 95% CI: [51%, 90%])
pattern: 81% resolution (n=16, 95% CI: [54%, 96%])
verbal: 55% resolution (n=11, 95% CI: [23%, 83%])
peer: [insufficient data, n=3]
Confidence intervals matter. With small samples, the difference between 68% and 81% is not statistically significant. The system reports confidence intervals alongside point estimates so teachers (and the Thompson sampling algorithm) make calibrated decisions.
Use the Wilson score interval for binomial proportions:

$$\frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$$

Where $\hat{p}$ is the observed resolution rate, $n$ is the number of attempts, and $z = 1.96$ for a 95% interval.
import math
def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
"""Compute Wilson score confidence interval for a binomial proportion."""
if total == 0:
return (0.0, 1.0)
p_hat = successes / total
denominator = 1 + z**2 / total
center = (p_hat + z**2 / (2 * total)) / denominator
spread = z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denominator
return (max(0.0, round(center - spread, 3)), min(1.0, round(center + spread, 3)))

Compare intervention effectiveness between classrooms:
def compare_cohorts(
classroom_ids: list[int],
misconception_id: str,
) -> list[dict]:
"""Compare resolution rates across classrooms for a given misconception."""
results = []
for cid in classroom_ids:
effectiveness = get_intervention_effectiveness(cid, misconception_id)
results.append({
"classroom_id": cid,
"by_modality": effectiveness,
"avg_resolution_days": compute_avg_resolution_time(cid, misconception_id),
})
return results

When one classroom's resolution rate for a modality is clearly higher than another's (non-overlapping confidence intervals), the comparison surfaces that finding for teachers to investigate; it never triggers an automatic change.
| Method | Path | Description |
|---|---|---|
| GET | `/api/classrooms/{id}/misconception-cooccurrence` | Co-occurrence matrix for a classroom |
| GET | `/api/classrooms/{id}/intervention-effectiveness` | Aggregated effectiveness with confidence intervals |
| GET | `/api/analytics/cohort-comparison` | Cross-classroom comparison for a misconception |
| GET | `/api/analytics/discovered-prerequisites` | Proposed knowledge graph edges from data |
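The discovered-prerequisites endpoint needs some edge-proposal heuristic; one plausible approach (an assumption, not a specified algorithm) is to propose an edge A → B when students essentially never reach mastery of B before A. The threshold, helper name, and input shape below are all illustrative.

```python
# One possible (assumed) heuristic behind /api/analytics/discovered-prerequisites:
# propose an edge A -> B when, among students who mastered both concepts, A was
# mastered first in at least `threshold` of cases. Names are hypothetical.
from itertools import permutations


def propose_prerequisite_edges(
    mastery_dates: dict[int, dict[str, str]],  # {student_id: {concept_id: iso date mastered}}
    threshold: float = 0.95,
    min_students: int = 10,
) -> list[tuple[str, str, float]]:
    concepts = {c for per_student in mastery_dates.values() for c in per_student}
    edges = []
    for a, b in permutations(sorted(concepts), 2):
        both = [s for s in mastery_dates.values() if a in s and b in s]
        if len(both) < min_students:
            continue
        a_first = sum(1 for s in both if s[a] < s[b]) / len(both)
        if a_first >= threshold:
            edges.append((a, b, round(a_first, 3)))
    # Surfaced as proposals for the knowledge-graph author to review, never auto-applied.
    return edges
```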
| File | Action | Description |
|---|---|---|
| `api/pattern_analyzer.py` | Create | Co-occurrence matrix, effectiveness aggregation, cohort comparison |
| `api/main.py` | Modify | Add analytics endpoints |
The dashboard shows data. Teachers must interpret data and decide what to do. This is the cognitive load the system should reduce. The coaching agent is the highest-impact feature, but it depends on having real data from Phases 1-4 to generate meaningful recommendations.
The coaching agent has two distinct layers:
- A deterministic planning layer that queries diagnostic state, scores priorities, and builds a structured action plan (JSON). This layer is testable, auditable, and works without an LLM.
- A rendering layer that converts the structured plan into natural language (via templates or LLM). This layer is swappable: a mobile app might consume the JSON directly without any narrative.
sequenceDiagram
participant CRON as Scheduler (Pre-Class)
participant CA as Coaching Agent
participant PA as Pattern Analyzer
participant LP as Learning Profiler
participant EL as Event Log
participant LLM as LLM (Optional)
participant T as Teacher Dashboard
CRON->>CA: generate_preclass_plan(classroom_id)
CA->>EL: Query all unresolved alerts
CA->>LP: Batch get_profile() for all students
CA->>PA: Get class-wide misconception distribution
CA->>CA: Priority scoring algorithm
Note over CA: Score = w_s*S + w_r*R + w_i*I + w_t*(1-T)
CA->>CA: Build structured plan (JSON)
alt Template rendering
CA->>CA: Fill Jinja2 template with plan data
else LLM rendering
CA->>LLM: Structured context + system prompt
LLM-->>CA: Natural language narrative
end
CA->>EL: Append CoachingPlanEvent
CA-->>T: Rendered plan with action items
T->>CA: Teacher acknowledges/modifies plan
CA->>EL: Append PlanAcknowledgedEvent
Each potential action item receives a priority score:

$$\text{score} = w_s S + w_r R + w_i I + w_t (1 - T)$$

Where:

- $S$ = severity (0-1): how critical is the issue? An escalated misconception scores 1.0; a near-mastery push scores 0.3.
- $R$ = recency (0-1): $e^{-\lambda \Delta t}$ where $\Delta t$ = days since last relevant event and $\lambda = 0.1$.
- $I$ = impact radius (0-1): fraction of the class affected by this issue.
- $T$ = teacher effort (0-1, inverted in scoring): lower-effort actions score higher. A whole-class warm-up (effort = 0.2) scores higher than a one-on-one conference (effort = 0.8).
- Weights: $w_s = 0.4$, $w_r = 0.2$, $w_i = 0.2$, $w_t = 0.2$.
import math
from datetime import datetime
def compute_priority_score(
severity: float,
days_since_event: float,
class_fraction_affected: float,
teacher_effort: float,
weights: tuple[float, ...] = (0.4, 0.2, 0.2, 0.2),
) -> float:
"""Compute a priority score for a potential coaching action item.
severity: 0-1 (1 = most critical)
days_since_event: float >= 0
class_fraction_affected: 0-1
teacher_effort: 0-1 (1 = most effort, inverted in scoring)
"""
w_s, w_r, w_i, w_t = weights
recency = math.exp(-0.1 * days_since_event)
effort_inverted = 1.0 - teacher_effort
return round(
w_s * severity + w_r * recency + w_i * class_fraction_affected + w_t * effort_inverted,
3,
)

{
"classroom_id": 1,
"plan_type": "pre_class",
"generated_at": "2026-03-29T07:30:00Z",
"action_items": [
{
"priority": 1,
"score": 0.92,
"action_type": "pull_out_group",
"students": [
{"id": 12, "name": "James", "misconception": "sign_neg_times_neg", "escalation_level": 3}
],
"recommendation": "Use the number line activity from last Tuesday.",
"evidence": "This intervention resolved sign_neg_times_neg for 4/5 students who tried it.",
"estimated_minutes": 5
},
{
"priority": 2,
"score": 0.71,
"action_type": "whole_class_warmup",
"students": [
{"id": 5, "name": "Sofia", "mastery": 0.82, "concept": "order_of_operations"},
{"id": 8, "name": "Liam", "mastery": 0.78, "concept": "order_of_operations"}
],
"recommendation": "Assign problems ooo_3 and ooo_4 as a warm-up.",
"evidence": "6 students are within 0.03-0.07 of the mastery threshold.",
"estimated_minutes": 3
}
]
}

The planning agent generates three types of output. Each starts from the same structured plan; only the rendering differs.
Pre-class action plan (generated before class starts, or the night before):
Today's Plan for Period 2 - Algebra I
Priority 1: Pull-out group (5 min)
James, Sofia, and Liam need integer sign remediation.
Use the number line activity from last Tuesday (it resolved
this misconception for 4/5 students who tried it).
Priority 2: Whole-class warm-up (3 min)
6 students are close to mastering Order of Operations (0.78-0.84).
Assign problems ooo_3 and ooo_4 as a warm-up to push them over threshold.
Priority 3: Monitor Aiden
Aiden dropped from 0.72 to 0.58 on Distributive Property
since last session. Check if yesterday's absence caused regression.
Post-assignment debrief (generated after students complete an assignment):
Assignment "Diagnostic Check #2" Results
New findings:
- 3 new students showing dist_first_term_only (Maria, Chen, David)
- The area model worksheet resolved this for 4/5 previously affected students
- Recommendation: reuse it for the new group
Concerning:
- James got 0/2 on sign problems despite remediation last week
- His misconception (sign_neg_times_neg) has persisted 6 sessions
- Escalation: recommend one-on-one conference with concrete manipulatives
Positive:
- 8 students crossed the 0.85 mastery threshold on Order of Operations
- Class readiness for Distributive Property improved from 54% to 72%
Parent conference narrative (generated on demand for a specific student):
Student Summary: Maria Rodriguez
Period 2 - Algebra I | Generated March 29, 2026
Maria has strong foundational skills: she mastered integer operations
(92%) and order of operations (88%) within the first two weeks.
She is currently working on the distributive property, where she
consistently applies the multiplication to only the first term inside
parentheses. For example, she writes 2(x+3) = 2x+3 instead of 2x+6.
This is the most common error pattern we see at this stage.
We have tried visual scaffolding (area model diagrams) which has not
yet resolved the issue after 4 attempts. Next step: concrete numerical
examples where she can verify by substitution, which has been effective
for similar students.
Maria's work ethic is strong: she completes assignments promptly and
her non-conceptual errors are rare. The distributive property gap is
the single blocker preventing her from moving to combining like terms.
| Output type | Frequency | Rendering method | Latency budget | Cost |
|---|---|---|---|---|
| Pre-class plan | 1x/day/classroom | Jinja2 template | Pre-computed overnight | $0 |
| Post-assignment debrief | 1x/assignment | Jinja2 template | < 2s (synchronous) | $0 |
| Parent narrative | On demand | LLM with structured context | < 10s | ~$0.01 |
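The pre-class plan and debrief rows above are template-rendered. Since `api/templates/pre_class.jinja2` is listed as a deliverable but not shown, here is a hedged sketch of what that rendering could look like; the template text and variable names are assumptions over the structured-plan JSON shown above.

```python
# Illustrative Jinja2 rendering of the structured pre-class plan. The template
# string and variable names are assumptions over the plan JSON shown above.
from jinja2 import Template

PRE_CLASS_TEMPLATE = Template(
    """Today's Plan for {{ classroom_name }}
{% for item in plan.action_items %}
Priority {{ item.priority }}: {{ item.action_type.replace('_', ' ') | title }} ({{ item.estimated_minutes }} min)
  {{ item.recommendation }}
  Evidence: {{ item.evidence }}
{% endfor %}"""
)


def render_pre_class_plan(plan: dict, classroom_name: str) -> str:
    """Convert the structured plan JSON into the teacher-facing narrative."""
    return PRE_CLASS_TEMPLATE.render(plan=plan, classroom_name=classroom_name)
```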
For LLM rendering, the system constructs a structured context document and passes it with a constraining system prompt. The teacher reviews the output before sharing; the system never auto-sends anything to parents.
PARENT_NARRATIVE_SYSTEM_PROMPT = """You are a teaching assistant writing a parent
conference summary. Write in plain language at an 8th-grade reading level. Be specific
about what the student can do, what they're working on, and what comes next. Never use
jargon. Never speculate about causes outside the data. Do not mention AI, algorithms,
or system internals. Keep the tone warm, factual, and encouraging."""
def render_parent_narrative(student_id: int, llm_client) -> str:
"""Generate a parent conference narrative using LLM."""
profile = get_cached_profile(student_id)
mastery = get_all_mastery(student_id)
interventions = get_intervention_history(student_id)
alerts = get_active_alerts(student_id)
context = build_structured_context(profile, mastery, interventions, alerts)
response = llm_client.chat(
system=PARENT_NARRATIVE_SYSTEM_PROMPT,
user=f"Write a parent conference summary for this student:\n\n{context}",
max_tokens=500,
temperature=0.3,
)
return response.text

| Method | Path | Description |
|---|---|---|
| GET | `/api/classrooms/{id}/coaching/pre-class` | Pre-class action plan (cached or generated) |
| GET | `/api/classrooms/{id}/coaching/debrief/{assignment_id}` | Post-assignment debrief |
| POST | `/api/students/{id}/coaching/parent-narrative` | Generate parent conference narrative |
| PATCH | `/api/coaching/plans/{id}/acknowledge` | Teacher acknowledges/modifies a plan |
| File | Action | Description |
|---|---|---|
| `api/coaching_agent.py` | Create | Priority scoring, structured plan builder, rendering |
| `api/templates/` | Create | Directory for Jinja2 templates |
| `api/templates/pre_class.jinja2` | Create | Template for pre-class action plan |
| `api/templates/debrief.jinja2` | Create | Template for post-assignment debrief |
| `api/main.py` | Modify | Add coaching endpoints |
This sequence diagram shows the complete flow when a student submits a response, tying together all five phases.
sequenceDiagram
participant S as Student
participant API as API Gateway
participant DE as Diagnostic Engine
participant IM as Intervention Manager
participant LP as Learning Profiler
participant PS as Problem Sequencer
participant EL as Event Log
S->>API: Submit response (problem_id, answer_text)
API->>DE: classify_response(problem, answer)
DE->>DE: Run classifier plugin
DE->>EL: Append ResponseEvent
DE-->>API: {correct, misconception_id, confidence}
API->>DE: bkt_update(concept, mastery, correct)
DE->>EL: Append MasteryUpdateEvent
DE-->>API: {new_mastery}
alt Misconception detected
API->>LP: get_profile(student_id)
LP->>EL: Query response + intervention history
LP-->>API: {modality_vector, error_pattern, decay_rate}
API->>IM: select_intervention(student_id, misconception_id, profile)
IM->>IM: Query escalation state machine
IM->>IM: Thompson sample over modality_vector
IM->>EL: Append InterventionAssignedEvent
IM-->>API: {modality, intervention_text, escalation_level}
end
API->>PS: next_problems(student_id, concept, count=3)
PS->>PS: IRT item selection (target P_correct ~ 0.70)
PS->>EL: Append AssignmentEvent
PS-->>API: [problem_ids]
API-->>S: {feedback, intervention, next_problems}
The architecture is domain-agnostic, but every deployment needs a domain configuration. This section specifies the pipeline for onboarding a new subject (e.g., reading comprehension, chemistry, music theory).
graph TB
subgraph Onboarding["Domain Onboarding Pipeline"]
direction TB
A1[1. Author Knowledge Graph\nConcepts + Prerequisites + BKT Params] --> A2[2. Define Misconception Taxonomy\nPer-concept diagnostic labels]
A2 --> A3[3. Write Intervention Catalog\n5 modalities per misconception]
A3 --> A4[4. Build Problem Bank\nWith IRT parameters + misconception tags]
A4 --> A5[5. Train or Configure Classifier\nFine-tuned model OR rule-based OR rubric]
A5 --> A6[6. Validate Domain Config\nAutomated checks for completeness]
A6 --> A7[7. Seed with Pilot Data\nOptional: bootstrap from existing grades]
end
subgraph Artifacts["Domain Configuration Artifacts"]
KG[knowledge_graph.json]
TX[taxonomy.json]
IC[interventions.json]
PB[problem_bank.json]
CL[classifier model or rules]
CF[domain_config.json]
end
A1 --> KG
A2 --> TX
A3 --> IC
A4 --> PB
A5 --> CL
A6 --> CF
Define concepts, prerequisite relationships, levels, and BKT parameters. The format matches
the existing data/knowledge_graph.json schema:
{
"metadata": {
"domain": "reading_comprehension",
"version": "1.0.0",
"mastery_threshold": 0.85,
"mastery_initial": 0.5
},
"concepts": [
{
"id": "main_idea",
"name": "Identifying Main Idea",
"level": 1,
"prerequisites": [],
"bkt_params": {"p_init": 0.20, "p_learn": 0.12, "p_guess": 0.25, "p_slip": 0.10}
}
]
}

Tip
BKT parameters vary significantly by domain. Math has low guess rates (~0.10) because numeric answers are hard to guess. Reading comprehension with multiple choice has higher guess rates (~0.25). Calibrate these from pilot data when possible.
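The `bkt_params` above feed the `bkt_update(concept, mastery, correct)` call shown in the end-to-end sequence diagram. Its internals are not specified in this document, so the sketch below shows the standard Bayesian Knowledge Tracing posterior-then-learn update as a reference formulation; the existing engine's implementation may differ.

```python
# Standard Bayesian Knowledge Tracing update using the bkt_params fields above.
# Reference formulation only; the existing engine's bkt_update may differ.
def bkt_update(p_mastery: float, correct: bool, params: dict) -> float:
    """Return the updated P(mastered) after observing one response."""
    slip, guess, learn = params["p_slip"], params["p_guess"], params["p_learn"]
    if correct:
        evidence = p_mastery * (1 - slip)
        posterior = evidence / (evidence + (1 - p_mastery) * guess)
    else:
        evidence = p_mastery * slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - guess))
    return posterior + (1 - posterior) * learn


# Example with the reading-comprehension parameters above:
# bkt_update(0.20, True, {"p_slip": 0.10, "p_guess": 0.25, "p_learn": 0.12}) ≈ 0.54
```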
Each concept has a set of misconceptions with IDs, labels, descriptions, and examples:
{
"domain": "reading_comprehension",
"misconceptions": {
"main_idea": [
{
"id": "mi_first_sentence",
"label": "Assumes main idea is always the first sentence",
"description": "Student selects the first sentence of the passage regardless of content.",
"examples": []
}
]
}
}

Five modalities per misconception, stored in interventions.json (schema defined in
Phase 1 above).
Every problem needs: problem_id, concept, problem_text, correct_answer, irt_b,
diagnostic_for, and optionally template_id + template_params.
Minimum 5 problems per concept. Target 15+ for domains where adaptive sequencing will be active.
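Steps 1-4 produce the JSON artifacts that Step 6 later validates for completeness. The document lists what must exist but not the checker itself, so the sketch below is an assumed shape for those automated checks; the constants mirror the requirements stated above.

```python
# Assumed shape for the Step 6 domain-config validator: completeness checks over
# the artifacts authored in Steps 1-4. Function name and return shape are
# illustrative.
MODALITIES = {"visual", "concrete", "pattern", "verbal", "peer"}
MIN_PROBLEMS_PER_CONCEPT = 5


def validate_domain_config(kg: dict, taxonomy: dict, interventions: dict, problems: list[dict]) -> list[str]:
    errors = []
    concept_ids = {c["id"] for c in kg["concepts"]}
    # Every prerequisite must reference a known concept
    for c in kg["concepts"]:
        for prereq in c["prerequisites"]:
            if prereq not in concept_ids:
                errors.append(f"{c['id']}: unknown prerequisite {prereq}")
    # Every misconception needs all five intervention modalities
    for concept_misconceptions in taxonomy["misconceptions"].values():
        for m in concept_misconceptions:
            provided = set(interventions["interventions"].get(m["id"], {}))
            missing = MODALITIES - provided
            if missing:
                errors.append(f"{m['id']}: missing modalities {sorted(missing)}")
    # Every concept needs a minimum problem count
    for cid in concept_ids:
        count = sum(1 for p in problems if p["concept"] == cid)
        if count < MIN_PROBLEMS_PER_CONCEPT:
            errors.append(f"{cid}: only {count} problems (minimum {MIN_PROBLEMS_PER_CONCEPT})")
    return errors
```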
The classifier is a plugin with four implementation options. The recommended default for new domains is the LLM-catalog classifier, which requires zero training data.
The current algebra classifier (DistilBERT, 91.1% accuracy) was trained on 687 labeled examples across 19 misconception types plus 'correct'. Adding a new domain (e.g., logic, chemistry) would require collecting 500+ labeled examples per domain, retraining, and expanding the label space. Every domain restarts the data flywheel. At 10 domains with 50 misconceptions each, you need a single model that handles 500+ classes (which degrades accuracy) or 10 separate models (which multiply maintenance cost).
The key insight is reframing misconception detection from a classification task (pick one of N labels from a flat space) to a reading comprehension task (given this student's error and a catalog of known misconceptions for this concept, which one explains the error?).
The misconception descriptions and examples already in knowledge_graph.json are exactly
the context an LLM needs. No training data required. Adding a new domain means writing
misconception descriptions in the knowledge graph, the same authoring work a subject
matter expert does at Step 2 anyway.
def detect_misconception_llm(
    student_response: str,
    problem_text: str,
    correct_answer: str,
    concept: dict,
    llm_client,
) -> dict:
    """Detect misconceptions using LLM + knowledge graph catalog.

    The LLM only sees misconceptions for the current concept (3-6 options),
    not the entire taxonomy. This keeps the task tractable regardless of how
    many total misconceptions exist across all domains.
    """
    catalog = ""
    for m in concept["misconceptions"]:
        catalog += f"\n{m['id']}: {m['description']}\n"
        for ex in m.get("examples", [])[:2]:
            catalog += f"  Example: {ex['problem']} -> wrong: {ex['wrong']}, correct: {ex['correct']}\n"

    prompt = f"""A student answered a {concept['name']} problem incorrectly.
Problem: {problem_text}
Correct answer: {correct_answer}
Student's answer: {student_response}
Known misconceptions for this concept:
{catalog}
Which misconception best explains the student's error?
If none match, respond with "unknown".
Respond with ONLY the misconception ID."""

    result = llm_client.generate(prompt)
    return {"label": result.strip(), "confidence": 0.85}

Why this scales:
| Dimension | Fine-tuned model | LLM-catalog |
|---|---|---|
| New domain | Collect 500+ examples, retrain | Write descriptions in knowledge graph |
| 500+ misconception types | Single model accuracy degrades | LLM only sees 3-6 options per concept |
| Wrong-answer templates | Hand-author per misconception | Not needed; LLM reasons from descriptions |
| Maintenance per domain | Retrain when misconceptions change | Edit JSON |
The optimal long-term architecture uses both approaches:
- Day 1 (new domain): LLM-only detection. Teacher adds concept + misconception descriptions. System works immediately.
- Data accumulates: every LLM classification is reviewed by the teacher through the dashboard. Teacher confirms or corrects. This creates labeled data organically.
- Threshold reached (~200-300 verified examples): fine-tune a lightweight domain-specific model for speed. Use it as the primary classifier. LLM becomes the fallback for new or rare misconceptions the fine-tuned model hasn't seen.
The teacher review loop already exists (they see diagnostics in the dashboard). That review doubles as training data curation at no extra effort.
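A minimal sketch of how that review could be captured as training data; the file layout, threshold constant, and function name below are assumptions for illustration, not existing interfaces.

```python
import json
from pathlib import Path

FINE_TUNE_THRESHOLD = 250  # within the ~200-300 range suggested above


def record_verified_label(domain: str, problem_text: str, student_text: str,
                          llm_label: str, teacher_label: str,
                          out_dir: str = "data/verified_labels") -> int:
    """Append one teacher-verified example to a per-domain JSONL file.

    teacher_label is the label after review: either a confirmation of the
    LLM's prediction or the teacher's correction. Returns the number of
    verified examples accumulated for the domain.
    """
    path = Path(out_dir) / f"{domain}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps({
            "problem_text": problem_text,
            "student_text": student_text,
            "llm_label": llm_label,
            "label": teacher_label,
        }) + "\n")
    count = sum(1 for _ in path.open())
    if count >= FINE_TUNE_THRESHOLD:
        print(f"{domain}: {count} verified examples; eligible for fine-tuning")
    return count
```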
All four approaches implement the same interface:
from abc import ABC, abstractmethod


class ClassifierPlugin(ABC):
    """Abstract base for all diagnostic classifiers.

    The engine calls predict() for every student response and uses
    supports_concept() to route responses to the correct classifier
    when multiple are registered.
    """

    @abstractmethod
    def predict(self, problem_text: str, student_text: str, concept: dict | None = None) -> dict:
        """Return {label: str, confidence: float}.

        label is a misconception_id from the domain taxonomy, or 'correct'.
        confidence is a float in [0, 1].
        concept is the full concept dict from the knowledge graph (required for
        LLM-catalog and rubric classifiers; ignored by fine-tuned models).
        """
        ...

    @abstractmethod
    def supports_concept(self, concept_id: str) -> bool:
        """Whether this classifier handles the given concept."""
        ...

| Approach | Training data | Latency | Cost/call | Accuracy | When to use |
|---|---|---|---|---|---|
| LLM-catalog | 0 examples | ~1-2s | ~$0.001 | Good (see Experiment 03) | Default for new domains |
| Fine-tuned transformer | 500+ labeled examples | ~10ms | $0 (local) | Best for well-represented domains | After data flywheel produces training data |
| Rule-based | 0 examples | < 1ms | $0 | High for structured responses | Multiple choice, numeric answers |
| Rubric-based (LLM) | 0 examples | ~2-3s | ~$0.002 | Good for open-ended | Essays, explanations |
class HybridClassifier(ClassifierPlugin):
    """Routes to fine-tuned model when available, LLM-catalog otherwise."""

    def __init__(self, fine_tuned: dict[str, ClassifierPlugin], llm_fallback: ClassifierPlugin):
        self._fine_tuned = fine_tuned  # {domain_id: classifier}
        self._llm = llm_fallback

    def predict(self, problem_text: str, student_text: str, concept: dict | None = None) -> dict:
        domain = concept.get("domain", "") if concept else ""
        if domain in self._fine_tuned:
            return self._fine_tuned[domain].predict(problem_text, student_text)
        return self._llm.predict(problem_text, student_text, concept=concept)

    def supports_concept(self, concept_id: str) -> bool:
        return True  # LLM fallback handles everything

Run an automated completeness check before deployment:
import json
from collections import Counter


def load_json(path: str):
    """Small helper assumed for this sketch; the production loader may differ."""
    with open(path) as f:
        return json.load(f)


def validate_domain_config(domain_dir: str) -> list[str]:
    """Return a list of validation errors (empty = valid)."""
    errors = []
    kg = load_json(f"{domain_dir}/knowledge_graph.json")
    taxonomy = load_json(f"{domain_dir}/taxonomy.json")
    interventions = load_json(f"{domain_dir}/interventions.json")
    problems = load_json(f"{domain_dir}/problem_bank.json")

    # Every concept in the KG must have misconceptions in the taxonomy
    for concept in kg["concepts"]:
        if concept["id"] not in taxonomy["misconceptions"]:
            errors.append(f"Concept {concept['id']} has no misconceptions in taxonomy")

    # Every misconception must have interventions for all 5 modalities
    for concept_id, misconceptions in taxonomy["misconceptions"].items():
        for m in misconceptions:
            if m["id"] not in interventions["interventions"]:
                errors.append(f"Misconception {m['id']} has no interventions")
            else:
                modalities = set(interventions["interventions"][m["id"]].keys())
                missing = {"visual", "concrete", "pattern", "verbal", "peer"} - modalities
                if missing:
                    errors.append(f"Misconception {m['id']} missing modalities: {missing}")

    # Every concept must have at least 5 problems
    problem_counts = Counter(p["concept"] for p in problems)
    for concept in kg["concepts"]:
        if problem_counts.get(concept["id"], 0) < 5:
            errors.append(f"Concept {concept['id']} has < 5 problems ({problem_counts.get(concept['id'], 0)} found)")

    # Every problem must have irt_b and diagnostic_for
    for p in problems:
        if "irt_b" not in p:
            errors.append(f"Problem {p['problem_id']} missing irt_b")
        if "diagnostic_for" not in p:
            errors.append(f"Problem {p['problem_id']} missing diagnostic_for")

    return errors

If the school has existing grade data, import it as response.submitted events to bootstrap
mastery estimates and intervention effectiveness priors. This gives the Thompson sampling
algorithm a warm start instead of exploring blindly.
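A sketch of what that warm start could look like: historical outcomes per (misconception, modality) pair become Beta priors for the Thompson sampler. The record fields and function name here are assumptions for illustration.

```python
from collections import defaultdict


def warm_start_priors(historical_outcomes: list[dict]) -> dict:
    """Build Beta(alpha, beta) priors per (misconception_id, modality).

    Each historical record is assumed to look like
    {"misconception_id": ..., "modality": ..., "resolved": bool}.
    Starting from Beta(1, 1) (uniform) and adding one pseudo-count per outcome
    lets Thompson sampling begin near the historical resolution rates instead
    of exploring blindly.
    """
    priors = defaultdict(lambda: {"alpha": 1.0, "beta": 1.0})
    for rec in historical_outcomes:
        key = (rec["misconception_id"], rec["modality"])
        if rec["resolved"]:
            priors[key]["alpha"] += 1.0
        else:
            priors[key]["beta"] += 1.0
    return dict(priors)
```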
RETRACTED (Experiments 07-09). This experiment's conclusions are invalid for two independent reasons. First, the comparison is apples-to-oranges: the fine-tuned model predicts concept-level labels (20 classes), while the catalog classifier predicts misconception IDs (a harder, finer-grained task). Second, experiments 07-09 demonstrated that the simulated student model cannot discriminate between high-quality and low-quality classification because learning gains are dominated by interaction count, not intervention quality. The raw accuracy numbers are correct but the implied conclusion ("use an LLM for new domains") is not supported by the evidence. See experiments/07_classifier_error_propagation/notes.md for root cause analysis and experiments/09_end_to_end_stress/notes.md for the definitive verdict.
Question. How does a heuristic catalog-based classifier (simulating the LLM-catalog approach) compare against the fine-tuned DistilBERT model on the same test set?
Design. Head-to-head evaluation on the existing test set
(data/dataset/test.json, 107 examples, 20 classes including "correct").
Since we cannot call a production LLM in an offline experiment, we simulate the LLM-catalog approach with a heuristic catalog classifier that:
- Loads the knowledge graph misconception descriptions and examples.
- For each test example, compares the student's incorrect answer against each misconception's example wrong answers for the same concept.
- Uses string similarity (normalized Levenshtein distance) between the student answer and each misconception's known wrong answers.
- Returns the misconception with the highest similarity, or "correct" if the student answer matches the correct answer.
This is a conservative lower bound on LLM-catalog performance: a real LLM would reason about the error semantically, not just lexically.
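A minimal sketch of the heuristic catalog classifier described above, using difflib.SequenceMatcher as a stand-in for normalized Levenshtein similarity; the actual experiment script may differ in detail.

```python
from difflib import SequenceMatcher


def heuristic_catalog_predict(student_answer: str, correct_answer: str, concept: dict) -> dict:
    """Lexical proxy for the LLM-catalog classifier.

    Compares the student's answer against each misconception's known wrong
    answers for the same concept and returns the best lexical match.
    """
    answer = student_answer.strip().lower()
    if answer == correct_answer.strip().lower():
        return {"label": "correct", "confidence": 1.0}
    best_label, best_score = "unknown", 0.0
    for m in concept["misconceptions"]:
        for ex in m.get("examples", []):
            score = SequenceMatcher(None, answer, ex["wrong"].strip().lower()).ratio()
            if score > best_score:
                best_label, best_score = m["id"], score
    return {"label": best_label, "confidence": best_score}
```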
| Classifier | Description |
|---|---|
| Fine-tuned DistilBERT | models/classifier/best/, 91.1% accuracy, 20 classes |
| Heuristic catalog | String similarity against knowledge graph examples (proxy for LLM-catalog) |
| Majority baseline | Always predict the most common class |
| Random baseline | Predict uniformly from concept-appropriate misconceptions |
Primary metrics.
| Metric | Definition |
|---|---|
| Top-1 accuracy | Fraction of exact label matches |
| Concept-level accuracy | Correct concept identification (aggregated from misconception) |
| Per-class F1 | Weighted and macro F1 scores |
| Confidence calibration | Expected Calibration Error (ECE) across confidence bins |
| Latency | Mean inference time per example |
Performance axis. Break down accuracy by: (a) concept, (b) number of training examples available for each misconception class, (c) whether the test example's misconception has examples in the knowledge graph.
Scalability axis. Vary the number of misconception descriptions available to the catalog classifier (from 1 example per misconception to all available) and plot accuracy vs catalog richness.
Expected outcome. The fine-tuned model should win on top-1 accuracy (~91% vs ~50-65% for heuristic catalog). But the catalog classifier should achieve reasonable concept-level accuracy (~70-80%) with zero training data, validating the LLM-catalog premise that a real LLM would bridge the gap.
Artifacts.
- experiments/03_catalog_vs_finetuned/artifacts/results.json
- experiments/03_catalog_vs_finetuned/artifacts/accuracy_comparison.png
- experiments/03_catalog_vs_finetuned/artifacts/per_class_f1.png
- experiments/03_catalog_vs_finetuned/artifacts/catalog_scaling.png
Reproducibility. python experiments/03_catalog_vs_finetuned/run.py (seed=42).
Experiments 01-06 tested components in isolation. Experiments 07-09 ask the systems-level questions: how do subsystem errors propagate through the full pipeline, how accurate are internal state estimates, and what happens when multiple subsystems degrade simultaneously?
Question. How do classifier errors at controlled rates (0-50%) propagate through the tutoring pipeline to degrade learning?
Design. Inject four error types (misidentification, false negative, false positive, concept misroute) at nine rate levels. Run 300 students through 40 interactions each. Measure test score gain, BKT estimation error, misconception resolution rate, and wasted interventions.
Key finding. The system is almost completely insensitive to classifier errors. At 50% misidentification, gain drops only 9%. False positives actually improve gains (+35%). This is a simulation validity problem, not a system robustness finding. See notes.md for root cause analysis.
Artifacts. experiments/07_classifier_error_propagation/
Question. How accurately does BKT track the student's true p_know, and how does parameter misspecification affect concept selection?
Design. Part A tracks BKT vs true p_know over time. Part B perturbs each BKT parameter from 0.25x to 3.0x. Part C perturbs all parameters simultaneously.
Key finding. Concept selection accuracy is 26.4% (random = 20%). BKT RMSE barely improves over 40 interactions (0.376 to 0.358). 3x parameter perturbation changes gains by <2 percentage points. BKT is decorative in this configuration.
Artifacts. experiments/08_bkt_estimation_fidelity/
Question. When classifier error, BKT misspecification, and concept selection noise degrade simultaneously, do errors compound?
Design. Factorial sweep: 5 classifier error rates x 4 BKT scales x 4 concept noise rates = 80 conditions. 200 students per condition.
Key finding. All 80 conditions are within 12% of baseline. Higher error rates sometimes improve outcomes. The system cannot fail because the simulation cannot differentiate good from bad tutoring.
Artifacts. experiments/09_end_to_end_stress/
Experiments 07-09 converge on a single conclusion: the simulated student model is not a valid test bed for evaluating tutoring system quality.
Root causes:
- Oversaturated interaction budget. 40 interactions for 5 concepts lets even random routing reach mastery.
- Unconditional learning. receive_instruction() always increases p_know regardless of instruction quality.
- Test score insensitivity. The test measures p_know (which always rises), not misconception resolution.
- Trivially small knowledge graph. 5 linear concepts make routing nearly impossible to do wrong.
- Misaligned incentives. Wrong classification gives a free 2x learning bonus because only the presence of a misconception ID triggers it, not its correctness.
Fixes required before the simulator can discriminate good from bad tutoring:
- Conditional learning: check whether the targeted misconception matches one of the student's active misconceptions. If mismatched, apply a confusion penalty (sketched after this list).
- Tighter budget: 10-15 total interactions, not 40.
- Misconception-aware testing: test items that probe specific misconceptions.
- Larger concept graph: 15-25 concepts with branching prerequisites.
- Negative transfer: wrong instruction should sometimes strengthen misconceptions.
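A sketch of the conditional-learning fix from the first bullet, under the assumption that the simulated student exposes active_misconceptions and p_know dictionaries; the names and gain/penalty magnitudes are illustrative, not taken from the existing simulator.

```python
import random


def receive_instruction(student, targeted_misconception_id: str, concept_id: str) -> None:
    """Conditional learning: gains only when the intervention targets a real error.

    Matched target: normal learning gain plus a chance of resolving the misconception.
    Mismatched target: a small confusion penalty instead of a free gain.
    """
    active = student.active_misconceptions.get(concept_id, set())
    if targeted_misconception_id in active:
        student.p_know[concept_id] = min(1.0, student.p_know[concept_id] + 0.15)
        if random.random() < 0.4:  # chance the targeted misconception resolves
            active.discard(targeted_misconception_id)
    else:
        # Wrong diagnosis: instruction aimed at an error the student does not have.
        student.p_know[concept_id] = max(0.0, student.p_know[concept_id] - 0.03)
```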
The current simulated RCT evaluates a system without agentic layers. Adding them changes what needs to be measured:
| Current metric | Limitation | Replacement metric |
|---|---|---|
| Mastery gain (BKT) | Circular: BKT evaluates BKT | Pre/post aligned assessment scores |
| Misconception resolution rate (12.4%) | Only tracks detection, not intervention | Resolution rate per intervention modality |
| Effect size d=0.33-0.48 | Simulated students, no real learning | Effect size on human pre/post test |
| Concepts mastered count | Quantity over quality | Transfer test: apply concepts in novel contexts |
Each intervention decision is a treatment assignment. The system must support:
Within-student randomization for modality selection. The Thompson sampling already provides this: it occasionally explores non-optimal modalities, generating natural randomization. Track the "counterfactual" by logging what the greedy policy would have selected alongside what was actually selected.
Between-group randomization for sequencing strategies. Assign classrooms to sequencing conditions (IRT-based vs. current "easiest first") and compare mastery gains.
When randomization is not possible (common in classroom settings), use off-policy estimation:
Inverse Propensity Scoring (IPS):

$$\hat{V}_{\text{IPS}}(\pi) = \frac{1}{N}\sum_{i=1}^{N} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \, r_i$$

where $\pi$ is the candidate policy being evaluated, $\mu$ is the logging policy (the Thompson sampling probability recorded at assignment time), $a_i$ is the modality actually assigned in student context $x_i$, and $r_i$ is the observed outcome (e.g., whether the misconception resolved).
This lets you answer "what would have happened if we'd used a different modality selection strategy?" from observational data collected during normal operation.
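A sketch of the IPS estimate computed from the intervention log; the record fields (context, logged_propensity, resolved) are assumptions about what the log captures, not an existing schema.

```python
def ips_value(logged: list[dict], target_policy) -> float:
    """Off-policy estimate of a candidate modality-selection policy.

    Each log record is assumed to contain the student context, the modality
    actually assigned, the Thompson sampling probability of that assignment
    (logged_propensity), and the observed outcome (resolved: bool).
    target_policy(context, modality) returns the candidate policy's probability
    of choosing that modality in that context.
    """
    total = 0.0
    for rec in logged:
        weight = target_policy(rec["context"], rec["modality"]) / rec["logged_propensity"]
        total += weight * (1.0 if rec["resolved"] else 0.0)
    return total / len(logged)
```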
| Phase | Metric | Target |
|---|---|---|
| Phase 1 | Misconception resolution rate by modality | Track; no target (baseline year) |
| Phase 1 | Avg. escalation level at resolution | < 2.0 (most resolve within 2 attempts) |
| Phase 2 | Profile classification accuracy | Validate against teacher manual classification on 50 students |
| Phase 3 | Problem appropriateness (% of assigned problems near the target difficulty, P_correct ~ 0.70) | > 70% |
| Phase 4 | Co-occurrence prediction accuracy | Validate discovered prerequisites against teacher judgment |
| Phase 5 | Teacher plan adoption rate | > 60% of plans acknowledged |
| Phase 5 | Teacher-reported usefulness (Likert 1-5) | >= 4.0 |
gantt
title Implementation Phases
dateFormat YYYY-MM-DD
axisFormat %b %d
section Phase 0 - Event Store Migration
Event log schema + migration script :p0a, 2026-04-01, 5d
Materialized view builders :p0b, after p0a, 4d
Backfill existing responses :p0c, after p0b, 2d
section Phase 1 - Intervention Intelligence
intervention_log table + API endpoints :p1a, after p0c, 4d
Escalation state machine :p1b, after p1a, 5d
Thompson sampling modality selector :p1c, after p1b, 4d
Intervention outcome tracking :p1d, after p1c, 3d
section Phase 2 - Learning Profiler
Error pattern classifier :p2a, after p1b, 4d
Temporal pattern extraction :p2b, after p2a, 4d
Modality effectiveness vector :p2c, after p1d, 3d
Profile API endpoint :p2d, after p2c, 2d
section Phase 3 - Adaptive Sequencing
IRT parameter estimation pipeline :p3a, after p1d, 5d
Adaptive item selection algorithm :p3b, after p3a, 5d
Problem template engine :p3c, after p3b, 4d
Spaced repetition scheduler :p3d, after p3c, 3d
section Phase 4 - Cross-Student Analysis
Co-occurrence matrix computation :p4a, after p2d, 4d
Intervention effectiveness dashboard :p4b, after p4a, 3d
Cohort comparison engine :p4c, after p4b, 3d
section Phase 5 - Coaching Agent
Priority scoring algorithm :p5a, after p4b, 4d
Structured plan builder :p5b, after p5a, 4d
Template renderer (Jinja2) :p5c, after p5b, 3d
LLM narrative renderer (optional) :p5d, after p5c, 4d
Pre-class & debrief endpoints :p5e, after p5d, 3d
Phase 0 (event store) is a hard prerequisite for everything. After Phase 0:
- Phases 1 and 3 can start in parallel (intervention intelligence and adaptive sequencing share no code dependencies, only the event store).
- Phase 2 depends on Phase 1 (needs intervention outcome data for modality vector).
- Phase 4 depends on Phases 1 and 2 (needs accumulated intervention + profile data).
- Phase 5 depends on all prior phases (it is the integration point).
Each phase section lists a "Files to create or modify" table. Pick a file from any unlocked phase and implement it. Every file has a clear interface (function signatures, return types, SQL schemas). Write tests first: the event store means you can test in isolation by seeding the events table with synthetic data and asserting the component produces correct output.
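A sketch of that test-first pattern, with hypothetical helper names (connect, append_event, build_mastery_view) standing in for whatever interface api/events.py ends up exposing:

```python
def test_mastery_view_from_synthetic_events(tmp_path):
    """Seed the event log with synthetic responses and assert on the materialized view.

    The api.events functions used here are placeholders for the Phase 0 interface.
    """
    from api import events  # hypothetical module per the Phase 0 file list

    db = events.connect(tmp_path / "events.db")
    for correct in [True, True, False, True]:
        events.append_event(db, "response.submitted", {
            "student_id": "s1", "concept": "main_idea", "correct": correct,
        })
    view = events.build_mastery_view(db, student_id="s1")
    assert 0.0 <= view["main_idea"] <= 1.0  # mastery is materialized, never stored directly
```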
Suggested starting points by complexity:
| Task | Complexity | Good first task for |
|---|---|---|
| Phase 0: `api/events.py` | Low | Learning the codebase |
| Phase 0: Migration script | Medium | Understanding the current schema |
| Phase 1: `data/interventions.json` | Low | Content authoring, no code |
| Phase 1: Escalation state machine | Medium | State machine design, algorithms |
| Phase 1: Thompson sampling selector | Medium | Probability, Beta distributions |
| Phase 2: Error pattern classifier | Medium | Logic, data analysis |
| Phase 3: IRT parameter estimation | High | Statistical modeling, scipy |
| Phase 3: Problem template engine | Low | String templating, randomization |
| Phase 5: Jinja2 template renderer | Low | Templating, frontend-adjacent |
| Phase 5: Priority scoring algorithm | Medium | Weighted scoring, systems design |
| File | Phase | Purpose |
|---|---|---|
| `api/events.py` | 0 | Event log append/query functions |
| `api/migrate_to_events.py` | 0 | One-time migration from mutable tables to event sourcing |
| `api/intervention_manager.py` | 1 | Escalation state machine, modality selection, Thompson sampling |
| `data/interventions.json` | 1 | Domain-specific intervention catalog (5 modalities per misconception) |
| `api/learning_profiler.py` | 2 | Error pattern classification, temporal analysis, modality vector |
| `api/problem_sequencer.py` | 3 | IRT-based item selection, spaced repetition, template engine |
| `api/pattern_analyzer.py` | 4 | Co-occurrence matrix, effectiveness aggregation, cohort comparison |
| `api/coaching_agent.py` | 5 | Priority scoring, structured plan builder, rendering |
| `api/templates/pre_class.jinja2` | 5 | Template for pre-class action plan |
| `api/templates/debrief.jinja2` | 5 | Template for post-assignment debrief |
| Phase | Method | Path | Description |
|---|---|---|---|
| 1 | GET | `/api/students/{id}/interventions` | Intervention history |
| 1 | GET | `/api/students/{id}/interventions/active` | Currently active interventions |
| 1 | POST | `/api/students/{id}/interventions/assign` | Trigger escalation state machine |
| 1 | PATCH | `/api/interventions/{id}/outcome` | Record outcome |
| 1 | GET | `/api/classrooms/{id}/intervention-effectiveness` | Class-wide effectiveness rates |
| 2 | GET | `/api/students/{id}/profile` | Learning profile |
| 2 | GET | `/api/classrooms/{id}/profiles` | All profiles in a classroom |
| 3 | POST | `/api/students/{id}/generate-assignment` | Personalized problem set |
| 3 | GET | `/api/problems/{id}/irt` | IRT parameters |
| 3 | POST | `/api/problems/estimate-difficulty` | Re-estimate IRT params |
| 4 | GET | `/api/classrooms/{id}/misconception-cooccurrence` | Co-occurrence matrix |
| 4 | GET | `/api/analytics/cohort-comparison` | Cross-classroom comparison |
| 4 | GET | `/api/analytics/discovered-prerequisites` | Proposed knowledge graph edges |
| 5 | GET | `/api/classrooms/{id}/coaching/pre-class` | Pre-class action plan |
| 5 | GET | `/api/classrooms/{id}/coaching/debrief/{assignment_id}` | Post-assignment debrief |
| 5 | POST | `/api/students/{id}/coaching/parent-narrative` | Parent conference narrative |
| 5 | PATCH | `/api/coaching/plans/{id}/acknowledge` | Acknowledge/modify plan |