TwinBench Scenarios

TwinBench v1 defines five reference scenarios. Each scenario is intended to be realistic enough to expose persistent behavior, structured enough to reproduce, and distinct enough to support interpretable scoring.

1. Return After Delay

Objective

Test whether the system retains key memory and active task state after a defined delay.

Setup

establish stable user facts
establish one durable preference
define one active task with a clear next step
pause interaction for a recorded delay

Sequence

Introduce the user facts and preference.
Start an active task and confirm the next step.
End the session.
Resume after the recorded delay and ask for both recall of the facts and continuation of the task.

Expected behavior

stable facts are recalled correctly
the stored preference is applied without restatement
the active task resumes from the prior state

Failure modes

forgotten facts
stale or conflicting recall
task restart instead of resumption
fabricated details not supported by prior interaction

Scoring notes

primary metrics: MR, TC
delay length should be recorded explicitly
note whether successful recall required hints or leading prompts
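
The scoring notes above ask for the delay and any hinting to be recorded explicitly. A minimal sketch of one way to capture both in an evidence record follows. The `DelayRunRecord` class, its field names, and the example timestamps are hypothetical illustrations, not part of TwinBench.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DelayRunRecord:
    """Hypothetical evidence record for one Return After Delay run."""
    session_ended_at: datetime  # when the first session was closed
    resumed_at: datetime        # when the evaluator returned
    hints_used: bool            # True if recall needed hints or leading prompts

    @property
    def delay(self) -> timedelta:
        # Derive the delay from the two timestamps instead of entering it by hand.
        return self.resumed_at - self.session_ended_at


# Illustrative values only.
record = DelayRunRecord(
    session_ended_at=datetime(2025, 1, 6, 17, 0),
    resumed_at=datetime(2025, 1, 9, 9, 30),
    hints_used=False,
)
print(record.delay, record.hints_used)
```

Deriving the delay from stored timestamps keeps the recorded value consistent with the rest of the evidence log.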

2. Longitudinal Task Progression

Objective

Test whether the system can advance a multi-step task across multiple checkpoints rather than treating each session as a reset.

Setup

define one task with at least three milestones
introduce unrelated interaction between milestones
preserve dated checkpoints and milestone outcomes

Sequence

Start the task and define milestone one.
Confirm progress on the first milestone.
Interrupt with unrelated work.
Return later and request the next milestone.
Complete the task in a later session.

Expected behavior

milestone history is preserved
already completed work is not repeated
the next action remains consistent with prior progress

Failure modes

milestone loss
duplicate work
fabricated completion
degraded plan structure after interruption

Scoring notes

primary metric: TC
secondary metrics: CCC, IC
the task should be substantive enough that a reset is visible
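
The dated checkpoints from the setup can be checked mechanically for some of the failure modes listed above. The sketch below assumes a hypothetical `Checkpoint` record and a `find_issues` helper; neither is defined by TwinBench, and the rules are deliberately simple.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical dated checkpoint entry; field names are illustrative."""
    on: date
    milestone: int  # 1-based milestone number
    outcome: str    # "completed" or "in_progress"


def find_issues(log: list[Checkpoint]) -> list[str]:
    """Flag repeated work and skipped milestones in a checkpoint log."""
    issues = []
    completed = set()
    for cp in sorted(log, key=lambda c: c.on):
        # Touching a milestone that is already completed counts as duplicate work.
        if cp.milestone in completed:
            issues.append(f"duplicate work on milestone {cp.milestone}")
        # Reaching a milestone before its predecessor is done suggests
        # milestone loss or fabricated completion.
        if cp.milestone > 1 and (cp.milestone - 1) not in completed:
            issues.append(
                f"milestone {cp.milestone} reached before {cp.milestone - 1} completed"
            )
        if cp.outcome == "completed":
            completed.add(cp.milestone)
    return issues


# Illustrative log: milestone 1 is repeated and milestone 2 is skipped.
log = [
    Checkpoint(date(2025, 1, 6), 1, "completed"),
    Checkpoint(date(2025, 1, 8), 1, "completed"),
    Checkpoint(date(2025, 1, 10), 3, "completed"),
]
issues = find_issues(log)
print(issues)
```

A check like this only covers what the log records; degraded plan structure after an interruption still needs evaluator judgment.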

3. Multi-Context Transfer

Objective

Test whether the system can transfer relevant state across contexts such as chat, email, workspace, or other distinct interaction surfaces.

Setup

choose at least two contexts
establish work or memory in context A
resume in context B with an explicit transfer point

Sequence

Start a task or memory-bearing interaction in context A.
Introduce a decision, update, or summary-worthy change.
Switch to context B and request continuation or summary.
Optionally verify alignment in a third context or by returning to context A.

Expected behavior

the task frame remains consistent across contexts
the latest state transfers correctly
summaries and next steps remain aligned

Failure modes

contradictory summaries
stale state after context switch
missing decisions or updates
context-specific fragmentation of the same task

Scoring notes

primary metric: CCC
record whether transfer is native or evaluator-mediated
context boundaries should be explicit in the evidence log
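
One way to make context boundaries explicit in the evidence log is to label every entry with its context and to record the transfer mode at each switch. The sketch below assumes a hypothetical `LogEntry` record and a `stale_after_switch` helper, and it compares task state by exact string match, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical evidence-log entry with an explicit context label."""
    step: int
    context: str        # e.g. "chat" or "email"
    task_state: str     # state established or reported at this step
    transfer: str = ""  # "native" or "evaluator-mediated", set at a context switch


def stale_after_switch(log: list[LogEntry]) -> list[int]:
    """Return steps where a new context reports a state other than the latest one."""
    stale = []
    ordered = sorted(log, key=lambda e: e.step)
    for prev, cur in zip(ordered, ordered[1:]):
        # Only context boundaries are checked; updates within one context are expected.
        if cur.context != prev.context and cur.task_state != prev.task_state:
            stale.append(cur.step)
    return stale


# Illustrative log: the email context reports an outdated draft state.
log = [
    LogEntry(1, "chat", "draft v1"),
    LogEntry(2, "chat", "draft v2 approved"),
    LogEntry(3, "email", "draft v1", transfer="native"),
]
print(stale_after_switch(log))
```

In practice the state reported in context B is free text, so an evaluator would normalize or judge it before a comparison like this is meaningful.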

4. Preference Learning

Objective

Test whether the system becomes more useful after learning stable user preferences.

Setup

select several durable preferences such as tone, formatting, prioritization, or schedule defaults
run one baseline task before learning
run a comparable task after the preferences are established

Sequence

Run a baseline task without relying on stored preferences.
Provide explicit preferences and corrective feedback.
End the session.
Return later and request a comparable task without repeating the preference.

Expected behavior

learned preferences are applied later without restatement
repeated corrections decrease
later outputs are more user-specific and operationally useful

Failure modes

preferences not applied
inconsistent or partial application
regression after an apparent improvement
overgeneralization of a narrow preference

Scoring notes

primary metric: PG
secondary metric: MR
before-and-after tasks should be comparable enough to support a fair gain estimate
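
The expectation that repeated corrections decrease can be illustrated with a simple before-and-after comparison. The function below is a hypothetical sketch and is not the definition of the PG metric; it assumes the evaluator has counted corrections on the baseline task and on the comparable later task.

```python
def correction_reduction(baseline_corrections: int, later_corrections: int) -> float:
    """Fraction of baseline corrections no longer needed on the later task."""
    if baseline_corrections <= 0:
        # With no baseline corrections there is nothing to reduce.
        raise ValueError("baseline must need at least one correction")
    return (baseline_corrections - later_corrections) / baseline_corrections


# Illustrative counts: four corrections before learning, one after.
gain = correction_reduction(4, 1)
# A negative value indicates regression after an apparent improvement.
regression = correction_reduction(2, 3)
print(gain, regression)
```

The comparison is only fair when the two tasks are similar in scope, which is why the scenario requires comparable before-and-after tasks.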

5. Identity Stability Over Time

Objective

Test whether the system preserves a stable user model and stable collaboration role across repeated interaction.

Setup

define user identity markers
define standing collaboration norms or commitments
revisit the system across multiple dated checkpoints

Sequence

Establish user identity and working norms.
Interact across multiple sessions over time.
Probe the system's understanding of the user, its own role, and its standing commitments.

Expected behavior

the same user profile is preserved
the system role remains stable unless explicitly changed
standing collaboration norms remain intact

Failure modes

profile confusion
unexplained role drift
contradiction of prior commitments
reset of the working relationship without cause

Scoring notes

primary metric: IC
secondary metrics: MR, CCC
evaluators should distinguish style variation from genuine identity failure
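
Structured identity markers can be compared across dated checkpoints to surface unexplained drift. The sketch below assumes each checkpoint is summarized as a dictionary of marker values and that explicit changes are recorded as `(checkpoint_index, field)` pairs; the function and all names are hypothetical.

```python
def unexplained_changes(
    checkpoints: list[dict[str, str]],
    explicit_changes: set[tuple[int, str]],
) -> list[tuple[int, str]]:
    """Return (checkpoint index, field) pairs that changed without an explicit change."""
    drift = []
    for i in range(1, len(checkpoints)):
        prev, cur = checkpoints[i - 1], checkpoints[i]
        for field in sorted(prev.keys() | cur.keys()):
            changed = prev.get(field) != cur.get(field)
            # A change is acceptable only if it was explicitly requested.
            if changed and (i, field) not in explicit_changes:
                drift.append((i, field))
    return drift


# Illustrative checkpoints: the system role change was requested, the user role change was not.
checkpoints = [
    {"user_role": "product manager", "system_role": "planning assistant"},
    {"user_role": "product manager", "system_role": "planning assistant"},
    {"user_role": "designer", "system_role": "code reviewer"},
]
drift = unexplained_changes(checkpoints, explicit_changes={(2, "system_role")})
print(drift)
```

Because the comparison covers only structured markers, variation in tone or phrasing never registers as drift, which keeps style variation separate from identity failure.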
