Lossless semantic compression for LLM context windows.
The Haiku Protocol is a Controlled Natural Language (CNL) compression system that transforms verbose technical documentation into dense, machine-optimized strings while preserving 100% of the semantic meaning. It works like "minification" for natural language: just as developers minify JavaScript to make websites faster, Haiku Protocol minifies prose to make AI context denser.
Original (23 tokens):

```text
"To restart the server, you must first ensure that the configuration
file is saved, and then you can execute the reboot command."
```

Haiku Protocol (10 tokens):

```text
Action:Restart_Server REQUIRES State:Config_Saved -> EXEC:Reboot_Cmd
```
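
Exact counts depend on the tokenizer, so the 23-vs-10 figures are best treated as illustrative. A quick sanity check might look like this (the choice of tiktoken's cl100k_base encoding is an assumption, not necessarily the project's tokenizer):

```python
# Illustrative token count check; the cl100k_base encoding is an assumed
# tokenizer choice, and exact counts will vary with the tokenizer used.
import tiktoken

original = (
    "To restart the server, you must first ensure that the configuration "
    "file is saved, and then you can execute the reboot command."
)
compressed = "Action:Restart_Server REQUIRES State:Config_Saved -> EXEC:Reboot_Cmd"

enc = tiktoken.get_encoding("cl100k_base")
n_orig = len(enc.encode(original))
n_cnl = len(enc.encode(compressed))

print(f"original: {n_orig} tokens, CNL: {n_cnl} tokens")
print(f"token reduction: {1 - n_cnl / n_orig:.0%}")
```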
LLM context windows are expensive and finite. A 128k-token window isn't actually 128k tokens of knowledge — it's closer to 70k tokens of knowledge wrapped in 58k tokens of human-readable packaging: articles, transitions, filler phrases, and polite grammar. Technical documentation wastes roughly 40% of tokens on structural "fluff" that machines don't need.
Haiku Protocol is a two-stage compression pipeline: it encodes human-readable documentation into a structured, grammar-defined shorthand, then decodes it back on demand.
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Human Docs    │ ──▶ │  Haiku Encoder  │ ──▶ │   CNL Storage   │
│    (Verbose)    │     │  (Compression)  │     │     (Dense)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Human Answer   │ ◀── │  Haiku Decoder  │ ◀── │  LLM + Context  │
│   (Readable)    │     │   (Expansion)   │     │     (Query)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
The Encoder breaks documents into semantic chunks, extracts key entities and relationships, and synthesizes them into CNL statements governed by a formal grammar. The Decoder teaches the LLM to interpret CNL and expand it back into natural language answers.
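
As a rough sketch of that two-stage shape, the skeleton below uses trivial rule-based stand-ins for each step; every function name and all of the toy logic are hypothetical, since the real encoder and decoder delegate extraction and expansion to an LLM guided by the CNL grammar:

```python
# Toy skeleton of the encode/decode pipeline. All names and the rule-based
# logic are hypothetical stand-ins; the real encoder/decoder delegate
# extraction and expansion to an LLM guided by the CNL grammar.
import re

def chunk_document(text: str) -> list[str]:
    """Sentence-boundary splitting as a crude stand-in for semantic chunking."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def encode_chunk(chunk: str) -> str:
    """Naive 'synthesis': keep content-bearing words as a CNL-like statement."""
    words = [w.strip(".,") for w in chunk.split()]
    keywords = [w.title() for w in words if len(w) > 4]
    return "Stmt:" + "_".join(keywords[:4])

def encode(text: str) -> list[str]:
    return [encode_chunk(c) for c in chunk_document(text)]

def decode(statements: list[str]) -> str:
    """Naive 'expansion': the real decoder prompts an LLM with the grammar."""
    return " ".join(s.removeprefix("Stmt:").replace("_", " ") + "." for s in statements)

doc = "Restart the server. Ensure the configuration file is saved first."
cnl = encode(doc)
print(cnl)          # dense CNL-like statements
print(decode(cnl))  # toy round-trip back to prose
```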
- Custom CNL grammar with formally defined operators, syntax rules, and validation — designed by a Technical Writer, not just an algorithm
- Multi-strategy document chunking that respects semantic boundaries rather than arbitrary token limits
- LLM-assisted entity extraction to identify the nouns, verbs, and relationships that carry actual meaning
- Compression quality metrics including token reduction ratio, semantic similarity scoring, and information retention measurement (see the metrics sketch after this list)
- Benchmark comparison against LLMLingua and other compression baselines to prove effectiveness
- Interactive web demo for live compression with a visual metrics dashboard
- Round-trip fidelity — compressed CNL can be decoded back to human-readable text without information loss
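
For the metrics bullet above, one plausible implementation is embedding-based similarity plus a token count ratio. The sketch assumes the sentence-transformers library, an illustrative MiniLM embedding model, and tiktoken; the project's actual stack may differ:

```python
# Sketch of the quality metrics; sentence-transformers, the MiniLM model,
# and tiktoken's cl100k_base encoding are illustrative assumptions.
import tiktoken
from sentence_transformers import SentenceTransformer, util

enc = tiktoken.get_encoding("cl100k_base")
model = SentenceTransformer("all-MiniLM-L6-v2")

def token_reduction(original: str, compressed: str) -> float:
    """Fraction of tokens removed by compression."""
    return 1 - len(enc.encode(compressed)) / len(enc.encode(original))

def semantic_similarity(original: str, decoded: str) -> float:
    """Cosine similarity between embeddings of the original and the
    decoded (round-tripped) text; higher means better retention."""
    a, b = model.encode([original, decoded], convert_to_tensor=True)
    return util.cos_sim(a, b).item()

original = "To restart the server, first save the configuration file."
compressed = "Action:Restart_Server REQUIRES State:Config_Saved"
decoded = "Save the configuration file, then restart the server."

print(f"token reduction:     {token_reduction(original, compressed):.0%}")
print(f"semantic similarity: {semantic_similarity(original, decoded):.2f}")
```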
The project follows a phased, spec-first development methodology:
| Phase | Focus | Description |
|---|---|---|
| Research | CNL Design | Literature review, grammar specification, benchmarking strategy |
| Environment | Foundation | Development environment, dependencies, API configuration, project scaffolding |
| Encoder | Core Engine | Document processing pipeline — chunking, extraction, synthesis, validation |
| Demo | Integration | Web interface, comprehensive test suite, benchmark integration |
| Release | Polish | Documentation, architecture write-up, public release |
- 50%+ token reduction on procedural technical documentation
- Semantic similarity above 0.85 between original and decoded output
- Competitive or superior results vs. the LLMLingua baseline (see the harness sketch below)
- Working web demo with live compression and metrics dashboard
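
One way to frame such a benchmark is a shared harness over interchangeable compressors. The interface, corpus, and toy stopword baseline below are hypothetical stand-ins; a real run would wrap Haiku Protocol and LLMLingua behind the same callable:

```python
# Hypothetical benchmark harness; the Compressor interface, the corpus, and
# the stopword baseline are illustrative, not the project's actual code.
from typing import Callable

Compressor = Callable[[str], str]  # plain text in -> compressed text out

def reduction(original: str, compressed: str) -> float:
    """Approximate token reduction using whitespace tokens for simplicity."""
    return 1 - len(compressed.split()) / len(original.split())

def benchmark(name: str, compress: Compressor, corpus: list[str]) -> None:
    scores = [reduction(doc, compress(doc)) for doc in corpus]
    print(f"{name}: mean reduction {sum(scores) / len(scores):.0%}")

# Toy baseline standing in for LLMLingua or the Haiku encoder; a real run
# would register each system behind the same Compressor callable.
STOPWORDS = {"the", "a", "an", "to", "you", "must", "that", "can", "and", "is"}

def stopword_baseline(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in STOPWORDS)

corpus = [
    "To restart the server, you must first ensure that the configuration "
    "file is saved, and then you can execute the reboot command."
]
benchmark("stopword-baseline", stopword_baseline, corpus)
```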
MIT