Hawthorne Protocol

Can AI systems detect when they're being evaluated?

A research paper and reference implementation exploring the Hawthorne Effect for AI -- the risk that AI systems behave differently during benchmarks, red-team tests, and safety evaluations than they do in production.

The Problem

The Hawthorne Effect: subjects change behavior when they know they're being observed. AI systems receive enough contextual signals (prompt formatting, timing patterns, sandbox indicators, evaluation harness artifacts) to distinguish testing from production with non-trivial accuracy. If they can detect evaluations, evaluation results may not reflect real-world behavior.

This undermines:

  • Red-team testing (a model refuses attacks during tests but not in production; see the timing sketch after this list)
  • Benchmark scores (performance gaming or sandbagging)
  • Safety evaluations (deceptive alignment)
  • Regulatory compliance (EU AI Act testing requirements)
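
To make the signal idea concrete, here is a minimal sketch of a single temporal indicator, machine-paced arrival. It is illustrative only and does not reproduce the package's internals; the Interaction shape mirrors the detector example below.

// Illustrative sketch of one temporal signal: near-uniform, machine-speed
// arrival of prompts. Not the package's actual implementation.
interface Interaction {
  id: string;
  timestamp: number; // milliseconds
  prompt: string;
}

function looksMachinePaced(interactions: Interaction[]): boolean {
  if (interactions.length < 3) return false;
  const gaps = interactions
    .slice(1)
    .map((curr, i) => curr.timestamp - interactions[i].timestamp);
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
  const stdDev = Math.sqrt(
    gaps.reduce((a, b) => a + (b - mean) ** 2, 0) / gaps.length
  );
  // Sub-second gaps with little jitter suggest a harness, not a human.
  return mean < 1000 && stdDev < 0.25 * mean;
}

Run against the three interactions in the detector example below (uniform 200ms gaps), this returns true.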

Paper

Read the full paper: paper/hawthorne-protocol.md

Contents

  1. Introduction -- The Hawthorne Effect applied to AI systems
  2. Evaluation Detection Indicators -- 22 signals across 5 categories
  3. Threat Model -- Performance gaming, sandbagging, deceptive alignment
  4. The Hawthorne Protocol -- 5 principles for evaluation integrity
  5. Implications -- Red-teaming, benchmarks, regulation, alignment research

Reference Implementation

Detector

Analyzes interaction sequences for evaluation context signals:

import { HawthorneDetector } from '@15rl/hawthorne-protocol';

const detector = new HawthorneDetector();

const result = detector.detect([
  { id: '1', timestamp: 1000, prompt: 'Q: Capital of France?\n(A) London\n(B) Paris' },
  { id: '2', timestamp: 1200, prompt: 'Q: Capital of Germany?\n(A) London\n(B) Berlin' },
  { id: '3', timestamp: 1400, prompt: 'Q: Capital of Spain?\n(A) London\n(B) Madrid' },
]);

console.log(result.score);    // 0.82
console.log(result.verdict);  // 'likely_evaluation'
console.log(result.summary);
// Evaluation probability: 82% (likely evaluation)
// Top signals:
//   - [PROMPT-001] Systematic variation: 3 prompts with 89% average similarity
//   - [PROMPT-002] Benchmark formatting: 3/3 prompts use benchmark-style formatting
//   - [TEMP-001] Rapid sequential queries: Average interval: 200ms
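
In practice the verdict can gate how much weight to give a sequence; a brief usage sketch (the 0.7 threshold is an arbitrary choice for illustration, not a value from the package):

// Treat high-scoring sequences as probable evaluation traffic.
if (result.verdict === 'likely_evaluation' || result.score > 0.7) {
  console.warn('Sequence shows evaluation-harness characteristics.');
}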

Mitigations

Countermeasures for evaluation designers to make tests indistinguishable from production:

import { applyAllMitigations } from '@15rl/hawthorne-protocol';

const hardened = applyAllMitigations(evaluationInteractions);
// Randomized formatting, realistic timing, benign traffic mixed in,
// novel payload variants, evaluation artifacts stripped
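
One way to sanity-check the hardening is to score the mitigated interactions with the detector itself and confirm the probability drops; a sketch assuming applyAllMitigations returns interactions in the shape detect accepts:

// Re-score the hardened sequence; after mitigation it should no longer
// read as 'likely_evaluation'. The expected values are illustrative.
const recheck = new HawthorneDetector().detect(hardened);
console.log(recheck.score);   // expect well below the 0.82 seen earlier
console.log(recheck.verdict); // ideally not 'likely_evaluation'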

Indicator Catalog

22 indicators across 5 categories:

Category   Count  Examples
Prompt     5      Systematic variation, benchmark formatting, adversarial probing
Temporal   4      Machine-speed arrival, uniform timing, batch patterns
Content    4      Known payloads, escalation patterns, synthetic personas
Context    4      Sandbox detection, reduced tools, fresh session state
Meta       4      Harness artifacts, scoring rubrics, comparison framing
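
The IDs in the detector output ([PROMPT-001], [TEMP-001]) imply category-scoped identifiers. A hypothetical shape for one catalog entry (field names are assumptions, not the package's exported types):

// Hypothetical structure for one catalog entry.
type IndicatorCategory = 'prompt' | 'temporal' | 'content' | 'context' | 'meta';

interface Indicator {
  id: string;            // e.g. 'PROMPT-001'
  category: IndicatorCategory;
  description: string;   // e.g. 'Systematic variation across prompts'
  weight: number;        // relative contribution to the overall score
}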

Part of the Authensor Ecosystem

This project is part of the Authensor open-source AI safety ecosystem, built by 15 Research Lab.

Project                       Description
Authensor                     The open-source safety stack for AI agents
Prompt Injection Benchmark    Standardized benchmark for safety scanners
AI SecLists                   Security wordlists and payloads for AI/LLM testing
ATT&CK ↔ Alignment Rosetta    Maps MITRE ATT&CK to AI alignment concepts
Agent Forensics               Post-incident analysis for receipt chains
Behavioral Fingerprinting     Statistical behavioral drift detection
Attack Surface Mapper         Map the attack surface of your AI agents

License

MIT
