Hawthorne Protocol

Can AI systems detect when they're being evaluated?

A research paper and reference implementation exploring the Hawthorne Effect for AI -- the risk that AI systems behave differently during benchmarks, red-team tests, and safety evaluations than they do in production.

The Problem

The Hawthorne Effect is the tendency of subjects to change their behavior when they know they are being observed. AI systems receive enough contextual signals (prompt formatting, timing patterns, sandbox indicators, evaluation harness artifacts) to distinguish testing from production with non-trivial accuracy. If a system can detect that it is being evaluated, the evaluation results may not reflect its real-world behavior.

This undermines:

Red-team testing (a model refuses attacks during tests but not in production)
Benchmark scores (performance gaming or sandbagging)
Safety evaluations (deceptive alignment)
Regulatory compliance (EU AI Act testing requirements)

Paper

Read the full paper: paper/hawthorne-protocol.md

Contents

Introduction -- The Hawthorne Effect applied to AI systems
Evaluation Detection Indicators -- 22 signals across 5 categories
Threat Model -- Performance gaming, sandbagging, deceptive alignment
The Hawthorne Protocol -- 5 principles for evaluation integrity
Implications -- Red-teaming, benchmarks, regulation, alignment research

Reference Implementation

Detector

Analyzes interaction sequences for evaluation context signals:

import { HawthorneDetector } from '@15rl/hawthorne-protocol';

const detector = new HawthorneDetector();

const result = detector.detect([
  { id: '1', timestamp: 1000, prompt: 'Q: Capital of France?\n(A) London\n(B) Paris' },
  { id: '2', timestamp: 1200, prompt: 'Q: Capital of Germany?\n(A) London\n(B) Berlin' },
  { id: '3', timestamp: 1400, prompt: 'Q: Capital of Spain?\n(A) London\n(B) Madrid' },
]);

console.log(result.score);    // 0.82
console.log(result.verdict);  // 'likely_evaluation'
console.log(result.summary);
// Evaluation probability: 82% (likely evaluation)
// Top signals:
//   - [PROMPT-001] Systematic variation: 3 prompts with 89% average similarity
//   - [PROMPT-002] Benchmark formatting: 3/3 prompts use benchmark-style formatting
//   - [TEMP-001] Rapid sequential queries: Average interval: 200ms
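
To make the signals in the summary above more concrete, here is a minimal sketch of how two of them could be measured from the same three interactions. This is not the package's implementation: the `Interaction` interface, `averageIntervalMs`, and `countBenchmarkStyle` are hypothetical names introduced for illustration, and the sketch assumes timestamps are in milliseconds and that "benchmark-style formatting" means at least two lettered answer options on their own lines.

```typescript
// Hypothetical shape mirroring the objects passed to detect() above.
interface Interaction {
  id: string;
  timestamp: number; // assumed to be milliseconds
  prompt: string;
}

// Mean gap between consecutive timestamps; 0 when fewer than two interactions.
function averageIntervalMs(interactions: Interaction[]): number {
  if (interactions.length < 2) return 0;
  const sorted = [...interactions].sort((a, b) => a.timestamp - b.timestamp);
  const span = sorted[sorted.length - 1].timestamp - sorted[0].timestamp;
  return span / (sorted.length - 1);
}

// Counts prompts containing at least two lettered options such as "(A) ...",
// each starting its own line. This is an assumed heuristic, not the real rule.
function countBenchmarkStyle(interactions: Interaction[]): number {
  const option = /^\([A-Z]\)\s/gm;
  return interactions.filter(
    (i) => (i.prompt.match(option) ?? []).length >= 2,
  ).length;
}

const sample: Interaction[] = [
  { id: '1', timestamp: 1000, prompt: 'Q: Capital of France?\n(A) London\n(B) Paris' },
  { id: '2', timestamp: 1200, prompt: 'Q: Capital of Germany?\n(A) London\n(B) Berlin' },
  { id: '3', timestamp: 1400, prompt: 'Q: Capital of Spain?\n(A) London\n(B) Madrid' },
];

console.log(averageIntervalMs(sample));   // 200
console.log(countBenchmarkStyle(sample)); // 3
```

Both values line up with the sample summary (a 200ms average interval and 3/3 benchmark-style prompts), but how the detector weights such signals into a score is not shown here.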

Mitigations

Countermeasures for evaluation designers to make tests indistinguishable from production:

import { applyAllMitigations } from '@15rl/hawthorne-protocol';

const hardened = applyAllMitigations(evaluationInteractions);
// Randomized formatting, realistic timing, benign traffic mixed in,
// novel payload variants, evaluation artifacts stripped
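
As an illustration of what "realistic timing" might involve, the sketch below re-spaces a batch of interactions so that consecutive gaps are no longer machine-speed. It is an assumption-laden example rather than what `applyAllMitigations` does: `respaceTimestamps`, its gap bounds, and the `Interaction` interface are hypothetical, and drawing gaps from a uniform range is a simplification chosen for brevity.

```typescript
// Hypothetical shape mirroring the interactions used in the detector example.
interface Interaction {
  id: string;
  timestamp: number; // assumed to be milliseconds
  prompt: string;
}

// Returns copies of the interactions, in their original order, with each gap
// after the first drawn uniformly from [minGapMs, maxGapMs).
// The random source is injectable so the result can be made reproducible.
function respaceTimestamps(
  interactions: Interaction[],
  minGapMs: number = 2000,
  maxGapMs: number = 60000,
  random: () => number = Math.random,
): Interaction[] {
  let current = interactions.length > 0 ? interactions[0].timestamp : 0;
  return interactions.map((interaction, index) => {
    if (index > 0) {
      current += minGapMs + random() * (maxGapMs - minGapMs);
    }
    return { ...interaction, timestamp: Math.round(current) };
  });
}

const batch: Interaction[] = [
  { id: '1', timestamp: 1000, prompt: 'first' },
  { id: '2', timestamp: 1200, prompt: 'second' },
  { id: '3', timestamp: 1400, prompt: 'third' },
];

// With a fixed random source of 0.5, every gap is 31000ms.
const respaced = respaceTimestamps(batch, 2000, 60000, () => 0.5);
console.log(respaced.map((i) => i.timestamp)); // [ 1000, 32000, 63000 ]
```

The input array is left unmodified, which keeps the original evaluation log available for comparison against the hardened one.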

Indicator Catalog

22 indicators across 5 categories:

Category	Count	Examples
Prompt	5	Systematic variation, benchmark formatting, adversarial probing
Temporal	4	Machine-speed arrival, uniform timing, batch patterns
Content	4	Known payloads, escalation patterns, synthetic personas
Context	4	Sandbox detection, reduced tools, fresh session state
Meta	4	Harness artifacts, scoring rubrics, comparison framing
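
The "uniform timing" example in the Temporal category can be sketched as a coefficient of variation over inter-arrival gaps: values near zero mean the gaps are almost identical, which is more typical of a scripted harness than of human traffic. This is one plausible way to quantify the indicator, not the catalog's definition; `intervalCoefficientOfVariation` is a hypothetical helper and any threshold applied to its result would be a further assumption.

```typescript
// Coefficient of variation (standard deviation / mean) of the gaps between
// consecutive timestamps. Returns null when there are fewer than two gaps
// or when the mean gap is zero, since the ratio is undefined in those cases.
function intervalCoefficientOfVariation(timestamps: number[]): number | null {
  if (timestamps.length < 3) return null;
  const sorted = [...timestamps].sort((a, b) => a - b);
  const gaps = sorted.slice(1).map((t, i) => t - sorted[i]);
  const mean = gaps.reduce((sum, g) => sum + g, 0) / gaps.length;
  if (mean === 0) return null;
  const variance =
    gaps.reduce((sum, g) => sum + (g - mean) ** 2, 0) / gaps.length;
  return Math.sqrt(variance) / mean;
}

// Perfectly regular arrivals, as in the detector example: gaps of 200 and 200.
console.log(intervalCoefficientOfVariation([1000, 1200, 1400])); // 0

// Irregular arrivals: gaps of 100 and 900.
console.log(intervalCoefficientOfVariation([0, 100, 1000])); // 0.8
```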

Part of the Authensor Ecosystem

This project is part of the Authensor open-source AI safety ecosystem, built by 15 Research Lab.

Project	Description
Authensor	The open-source safety stack for AI agents
Prompt Injection Benchmark	Standardized benchmark for safety scanners
AI SecLists	Security wordlists and payloads for AI/LLM testing
ATT&CK ↔ Alignment Rosetta	Maps MITRE ATT&CK to AI alignment concepts
Agent Forensics	Post-incident analysis for receipt chains
Behavioral Fingerprinting	Statistical behavioral drift detection
Attack Surface Mapper	Map the attack surface of your AI agents

License

MIT
