Research Log

Everything we tried, measured, and learned while building this library.

For the current compact browser-accuracy / benchmark snapshot, see STATUS.md. For the current compact corpus / sweep snapshot, see corpora/STATUS.md. For the shared mismatch vocabulary, see corpora/TAXONOMY.md.

Current steering summary

This log is historical. The current practical steering picture is:

Japanese has two real canaries (羅生門, 蜘蛛の糸), both clean at anchor widths and both still exposing a small positive one-line field on broader Chrome sweeps.
Chinese has two long-form canaries (祝福, 故鄉) showing the same broad Chrome-positive / Safari-clean split, with real font sensitivity between Songti SC and PingFang SC.
Myanmar still has two real canaries with residual Chrome/Safari disagreement around quote/follower-style classes, so it remains the main unresolved Southeast Asian frontier.
Urdu has a real Nastaliq/Naskh canary (چغد) with the same narrow-width negative field in Chrome and Safari, so it is clearly a shaping/context class rather than dirty data or a browser-only quirk. It remains parked rather than actively tuned.
Arabic coarse corpora are clean; the remaining work there is mostly a fine-width edge-fit class, not the old preprocessing/corpus-hygiene problems.
Mixed app text still matters because it catches product-shaped classes that books miss, especially soft-hyphen and extractor-sensitive cases.

The problem: DOM measurement interleaving

When UI components independently measure text heights with DOM reads like getBoundingClientRect(), each read can force synchronous layout. If those reads interleave with writes, the browser can end up re-laying out the whole document repeatedly.

The goal here was always the same:

do the expensive text work once in prepare()
keep layout() arithmetic-only
make resize-driven relayout cheap and coordination-free

Approach 1: Canvas measureText + word-width caching

Canvas measureText() avoids DOM layout. It goes straight to the browser's font engine.

That led to the basic two-phase model:

prepare(text, font) — segment text, measure segments, cache widths
layout(prepared, maxWidth, lineHeight) — walk cached widths with pure arithmetic

That architecture held up. The broad browser sweeps are now clean in Chrome, Safari, and Firefox, and the hot layout() path is still the core product win.
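
A minimal sketch of the two-phase split, to make the shape concrete. Everything here is an assumption for illustration: the measuring function is injected (in a browser it could wrap canvas measureText()), the segmentation is a naive space/non-space split, and the names and return shapes are hypothetical rather than the library's real API.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the library's real API.
type Measure = (segment: string) => number;

interface PreparedSegment {
  width: number;    // cached width from the measuring backend
  isSpace: boolean; // collapsible whitespace that may hang at line end
}

// Expensive phase: segment and measure once, reusing widths for repeated segments.
function prepare(text: string, measure: Measure): PreparedSegment[] {
  // Naive segmentation (assumption): alternating runs of spaces and non-spaces.
  const pieces: string[] = text.match(/ +|[^ ]+/g) ?? [];
  const widthCache = new Map<string, number>();
  return pieces.map((piece) => {
    let width = widthCache.get(piece);
    if (width === undefined) {
      width = measure(piece);
      widthCache.set(piece, width);
    }
    return { width, isSpace: piece.startsWith(" ") };
  });
}

// Cheap phase: arithmetic over cached widths only; no measurement, no DOM.
function layout(prepared: PreparedSegment[], maxWidth: number, lineHeight: number) {
  let lineCount = 0;
  let lineWidth = 0;
  let lineHasWord = false;
  for (const seg of prepared) {
    if (seg.isSpace) {
      // Spaces never force a break; leading spaces on a line are dropped.
      if (lineHasWord) lineWidth += seg.width;
      continue;
    }
    if (lineHasWord && lineWidth + seg.width > maxWidth) {
      lineCount++;           // close the current line
      lineWidth = seg.width; // this word starts the next line
    } else {
      lineWidth += seg.width;
    }
    lineHasWord = true;
  }
  if (lineHasWord) lineCount++;
  return { lineCount, height: lineCount * lineHeight };
}
```

On resize, only layout() needs to run again with the new maxWidth; the prepared segments are reused as-is.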

Rejected: DOM-based or string-reconstruction measurement in the hot path

Several alternatives were tried and rejected:

measuring full candidate lines as strings during layout()
moving measurement into hidden DOM elements during prepare()
using SVG getComputedTextLength()

The pattern was consistent:

they either reintroduced DOM reads
or they were slower than the current two-phase model
or they looked cleaner locally but regressed the actual benchmark path

The important keep was architectural, not algorithmic:

layout() stayed arithmetic-only on cached widths

Discovery: system-ui font resolution mismatch

Canvas and DOM resolve system-ui to different font variants on macOS at certain sizes.

Machine-readable scan:

research-data/system-ui-size-scan.json

In the recorded scan, mismatches clustered at 10-12px, 14px, and 26px, while 13px, 15-25px, and 27-28px were exact.

macOS uses SF Pro Text at smaller sizes and SF Pro Display at larger sizes. Canvas and DOM switch between them at different thresholds.

Practical conclusion:

use a named font if accuracy matters
keep system-ui documented as unsafe
if we ever support it properly, the believable path is a narrow prepare-time DOM fallback for detected bad tuples

What did not look trustworthy enough:

lookup tables
naive scaling
guessed resolved-font substitution

Discovery: word-by-word sum accuracy

Canvas is internally consistent enough that summing measured segments works very well, but not perfectly. Over a full paragraph, tiny adjacency differences can accumulate into a line-edge error.

The keeps were small and semantic:

merge punctuation into the preceding word before measuring
let trailing collapsible spaces hang instead of forcing a break

What did not survive:

full-string verification in layout()
uniform rescaling
generic pair-level correction models

The broad lesson was that local semantic preprocessing paid off more than clever runtime correction.
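
As a rough illustration of the first keep, here is a sketch of merging punctuation into the preceding word before measuring, so that the word and its punctuation are measured as one string. The punctuation set and the function names are assumptions for this example; the real rules are broader and script-aware.

```typescript
// Assumption: a small ASCII set for illustration only.
const TRAILING_PUNCTUATION = new Set([".", ",", "!", "?", ";", ":", ")", "]", "}"]);

function isPunctuationOnly(segment: string): boolean {
  if (segment.length === 0) return false;
  for (const ch of segment) {
    if (!TRAILING_PUNCTUATION.has(ch)) return false;
  }
  return true;
}

// Glue punctuation-only segments onto the preceding non-space segment.
function mergeTrailingPunctuation(segments: string[]): string[] {
  const merged: string[] = [];
  for (const seg of segments) {
    const last = merged.length - 1;
    const prev = merged[last];
    if (prev !== undefined && prev.trim() !== "" && isPunctuationOnly(seg)) {
      merged[last] = prev + seg; // "Hello" + "," is measured as "Hello,"
    } else {
      merged.push(seg);
    }
  }
  return merged;
}
```

Measuring "Hello," once avoids the small adjacency error that summing "Hello" and "," separately can introduce, and it also keeps the punctuation from starting a line on its own.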

Discovery: text-shaper is a useful reference, not a runtime replacement

text-shaper was useful reference material, especially for Unicode coverage and bidi ideas, but not a replacement for the current browser-facing model.

What was worth taking:

broader Unicode coverage, e.g. missing CJK extension blocks

What was not worth taking:

its segmentation as a runtime replacement for Intl.Segmenter
its paragraph breaker as a substitute for browser-parity layout

Bottom line:

good reference material
wrong runtime center of gravity for this repo

Discovery: preserving ordinary spaces, hard breaks, and numeric tab stops is viable

The smallest honest second whitespace mode turned out to be:

preserve ordinary spaces
preserve \n hard breaks
preserve tabs with default browser-style tab stops
leave the other wrapping defaults alone

That became:

{ whiteSpace: 'pre-wrap' }

What mattered:

preserved spaces still hang at line end
consecutive hard breaks keep empty lines
a trailing final hard break does not invent an extra empty line
tabs advance to the next default browser tab stop from the current line start

The mode now covers the textarea-like cases we cared about, and the broad browser sweeps plus the dedicated pre-wrap oracle are green.
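
Two of the rules above reduce to very small arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the tab-stop interval is passed in as a pixel width (with the CSS default tab-size of 8, that would be eight times the width of a space in the current font).

```typescript
// Next tab stop strictly beyond x, where x is measured from the line start.
function advanceToTabStop(x: number, tabStopWidth: number): number {
  return (Math.floor(x / tabStopWidth) + 1) * tabStopWidth;
}

// Split on hard breaks. Consecutive breaks keep their empty lines,
// but a single trailing break does not add one more empty line.
function splitHardBreaks(text: string): string[] {
  const lines = text.split("\n");
  if (lines.length > 1 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines;
}
```

For example, with a 32px tab-stop interval a tab at x = 40 advances to 64, and a tab that already sits exactly on a stop still moves to the next one.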

One important tooling lesson also came out of this:

keep a small permanent oracle suite
justify it once with a broader brute-force validation pass
do not keep the brute-force pass forever once it has done its job

Discovery: emoji canvas/DOM width discrepancy

Chrome and Firefox on macOS can measure emoji wider in canvas than in DOM at small sizes. Safari does not share the same discrepancy.

What held up:

detect the discrepancy by comparing canvas emoji width against actual DOM emoji width per font
cache that correction
keep it outside the hot layout path

This is now one of the small browser-profile shims that is actually justified.
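
A sketch of how such a shim could be structured, under stated assumptions: the two probes are injected functions (in a browser one would measure a sample emoji via canvas and the other via a real DOM element), and applying the correction per emoji is a guess at the mechanics, not a description of the actual implementation. All names are hypothetical.

```typescript
// Each probe returns the width of one sample emoji in the given font.
type WidthProbe = (font: string) => number;

// Builds a per-font correction lookup; probes run at most once per font.
function createEmojiCorrection(canvasEmojiWidth: WidthProbe, domEmojiWidth: WidthProbe) {
  const cache = new Map<string, number>();
  return function correctionFor(font: string): number {
    let correction = cache.get(font);
    if (correction === undefined) {
      // Positive when canvas reports the emoji wider than the DOM renders it.
      correction = canvasEmojiWidth(font) - domEmojiWidth(font);
      cache.set(font, correction);
    }
    return correction;
  };
}

// Applied at prepare time (assumption), so layout only ever sees corrected widths.
function correctedWidth(canvasWidth: number, emojiCount: number, correctionPerEmoji: number): number {
  return canvasWidth - emojiCount * correctionPerEmoji;
}
```

Because the correction is cached per font and folded into the segment width during preparation, the hot layout path does not need to know the shim exists.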

Retired HarfBuzz probe path

We briefly kept a headless HarfBuzz backend in the repo for server-side measurement probes.

What it taught us:

it was useful for research and algorithm probes
it was not close enough to our active browser-grounded path to justify keeping it in the main repo
isolated Arabic words in that probe path needed explicit LTR direction to avoid misleading widths

So if HarfBuzz comes up again later, treat it as explored territory:

useful as a research reference
not the runtime direction for Pretext
not a substitute for browser-oracle or browser-canvas validation

Final browser sweep closure

The last browser mismatches were not fixed by moving more work into layout(). That regressed the hot path and was reverted.

What actually held up:

better preprocessing in prepare()
better browser diagnostics pages and scripts
a tiny browser-specific line-fit tolerance

What did not change:

layout() stayed arithmetic-only

That remains the right center of gravity for the project.
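
For reference, a line-fit tolerance can stay a single comparison, which is why it does not threaten the arithmetic-only property. This is a hypothetical sketch: the function name is made up, and the tolerance value is a parameter because the real per-browser constants are not recorded in this log.

```typescript
// `tolerance` is a small per-browser constant in CSS pixels (value not specified here).
function fitsOnLine(lineWidth: number, maxWidth: number, tolerance: number): boolean {
  return lineWidth <= maxWidth + tolerance;
}
```

With a zero tolerance this degrades to the plain comparison, so a browser profile that needs no slack pays nothing for the shim.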

Arabic frontier

Arabic took several passes, but the pattern is clearer now.

What survived:

merge no-space Arabic punctuation clusters during prepare(), e.g. فيقول:وعليك, همزةٌ،ما
treat Arabic punctuation-plus-mark clusters like ،ٍ as left-sticky too
split " " + combining marks into plain space plus marks attached to the following word
use normalized slices and the exact corpus font during probe work
trust the better RTL diagnostics path instead of reconstructing offsets from rendered line text
clean obvious corpus/source artifacts instead of inventing new engine rules for them
allow a tiny non-Safari line-fit tolerance bump for the remaining positive fine-width field

What did not survive:

pair correction models at segment boundaries
larger Arabic run-slice width models
broad phrase-level heuristics derived from one good-looking probe

Those failed for the same reason at different scales:

pair corrections were too local to move the real misses
run-slice widths were much heavier and still did not move the hard widths enough
both made prepare() or layout() materially worse without buying a clean Arabic field

So the useful guardrail is:

if an Arabic idea starts by adding more shaping-aware width caches inside the current segment-sum architecture, be skeptical early
the Arabic keeps so far have been preprocessing, corpus cleanup, diagnostics, and tiny tolerance shims, not richer width-cache models

Current read:

Arabic coarse corpora are healthy
the remaining work is much narrower now
the unresolved class looks like a mix of fine-width edge-fit and shaping/context, not another obvious preprocessing hole

Long-form corpus canaries

Once the main browser sweep became a regression gate, the long-form corpora became the real steering canaries.

Mixed app text

This is the most product-shaped canary.

What it has been good for:

URL/query-string handling
escaped quote clusters
numeric expressions like २४×७
time ranges like 7:00-9:00
emoji ZWJ runs
manual soft hyphens

Important keep:

model URL/query strings as narrow structured units, not one giant breakable blob

Current status:

almost entirely clean
one remaining extractor-sensitive soft-hyphen miss around 710px still looks paragraph-scale or accumulation-sensitive rather than like a neat local bug

Thai

Thai exposed a product-shaped ASCII quote issue more than a dictionary-segmentation failure.

The keep:

contextual ASCII quote glue during preprocessing

Result:

two Thai prose corpora are healthy at anchor widths
maintained step10 sweeps stayed clean enough that Thai now looks broader than one lucky story

Khmer

Khmer broadened the Southeast Asian class without immediately demanding new engine work.

The keep:

preserve explicit zero-width separators from the source text

Result:

anchor widths and the maintained step10 sweep were clean enough to keep Khmer as a real canary

Lao (rejected)

The Lao corpus attempt was a source problem, not an engine problem.

The raw text was wrapped print/legal text, which made it a dirty white-space: normal canary. We rejected it instead of normalizing nonsense into the repo.

Myanmar

Myanmar is still the main unresolved Southeast Asian frontier.

What survived:

treat ၊ / ။ / ၍ / ၌ / ၏ as left-sticky during preprocessing
treat ၏ as medial glue in clusters like ကျွန်ုပ်၏လက်မ

What did not survive:

broad Myanmar grapheme breaking in ordinary wrapping
quote-follower glue like closing-quote + ဟု

Current read:

there are real recurring classes here
but the obvious tempting heuristics improved one browser and hurt another
that makes Myanmar a canary, not a license for more instinctive glue rules

Japanese

Japanese gave us one real semantic keep:

kana iteration marks like ゝ / ゞ / ヽ / ヾ should be treated as CJK line-start-prohibited
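
One way to picture that rule: when CJK text is broken per character, a line-start-prohibited character is attached to the unit before it, so no break opportunity exists in front of it. The sketch below is an assumption about the mechanics, with a deliberately tiny prohibited set and a hypothetical function name.

```typescript
// Illustrative subset: common closing punctuation plus the four kana iteration marks.
const LINE_START_PROHIBITED = new Set(["、", "。", "」", "』", "ゝ", "ゞ", "ヽ", "ヾ"]);

// Turn a CJK run into break units; a break is only allowed between units.
function toBreakUnits(text: string): string[] {
  const units: string[] = [];
  for (const ch of text) {
    const last = units.length - 1;
    if (last >= 0 && LINE_START_PROHIBITED.has(ch)) {
      units[last] = units[last] + ch; // never let this character start a line
    } else {
      units.push(ch);
    }
  }
  return units;
}
```

With this, こゝろ becomes the units こゝ and ろ, so a line can end after ゝ but never immediately before it.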

What remains:

a small context-width class around punctuation/quote compression
good evidence for the exactness ceiling of a width-independent grapheme-sum model in proportional Japanese fonts

So Japanese stays as a canary, not as a place to keep stacking narrow punctuation rules.

Chinese

Chinese is now the clearest active CJK canary.

What we learned:

Safari is clean on the maintained step10 sweep
Chrome keeps a broader narrow-width positive field
the field changes with font choice (Songti SC vs PingFang SC)

What did not survive:

carrying closing punctuation forward
coalescing repeated punctuation runs like —— or ……

Current read:

the remaining Chinese field is real
it is not another obvious punctuation bug
it is best treated as a canary for the model’s current exactness ceiling

Sampled cross-font corpus matrix

The first cross-font pass was reassuring:

Korean, Thai, Khmer, Hindi, Arabic, and Hebrew all stayed exact across the sampled Chrome matrix on this machine

That does not mean font fragility is gone. It just means the next likely surprises are:

new scripts
finer width sweeps
or product-shaped mixed text

Segment metrics cache

The cache used to store just widths. It now stores richer per-segment metrics and computes the more expensive derived facts lazily.

Current useful cached facts include:

width
containsCJK
lazily computed emoji count
lazily computed grapheme widths

That improved repeated prepare() work without moving any live measurement back into layout().
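
A sketch of what "richer metrics, computed lazily" can look like. The class and field names are hypothetical, the measuring function is injected, the CJK test is a rough character-range check, and splitting by code point stands in for real grapheme segmentation; none of this is claimed to match the actual cache.

```typescript
type Measure = (text: string) => number;

// Rough CJK test (assumption): kana plus the main unified ideograph block only.
const CJK_PATTERN = /[\u3040-\u30ff\u4e00-\u9fff]/;

class SegmentMetrics {
  readonly width: number;
  readonly containsCJK: boolean;
  private graphemeWidthsCache: number[] | undefined;

  constructor(private readonly text: string, private readonly measure: Measure) {
    this.width = measure(text);                // eager: every segment needs it
    this.containsCJK = CJK_PATTERN.test(text); // eager: cheap to compute
  }

  // Lazy: only segments that must be broken mid-word ever pay for this.
  get graphemeWidths(): number[] {
    if (this.graphemeWidthsCache === undefined) {
      // Array.from splits by code point, a stand-in for grapheme segmentation.
      this.graphemeWidthsCache = Array.from(this.text, (g) => this.measure(g));
    }
    return this.graphemeWidthsCache;
  }
}

// One cache per font (assumption), keyed by segment text.
class SegmentMetricsCache {
  private readonly bySegment = new Map<string, SegmentMetrics>();
  constructor(private readonly measure: Measure) {}

  get(segment: string): SegmentMetrics {
    let metrics = this.bySegment.get(segment);
    if (metrics === undefined) {
      metrics = new SegmentMetrics(segment, this.measure);
      this.bySegment.set(segment, metrics);
    }
    return metrics;
  }
}
```

The point of the lazy getter is that most segments fit on a line whole, so their per-grapheme widths are never requested and never measured.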

Soft hyphen support

Soft hyphen became a real internal break kind instead of ordinary text.

What that bought us:

unbroken lines keep it invisible
broken lines can expose a visible trailing -
rich APIs stay aligned with the actual break choice

This was a genuine model improvement, not just a cosmetic API change.
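
To illustrate the visible behavior only (not the internal break model), here is a small sketch of how a line's displayed text could be derived once the break kind is known. The type and function names are hypothetical.

```typescript
const SOFT_HYPHEN = "\u00AD";

// How the line ended: no break, an ordinary break, or a break at a soft hyphen.
type BreakKind = "none" | "ordinary" | "soft-hyphen";

// Soft hyphens inside the line stay invisible; only a chosen
// soft-hyphen break surfaces as a visible trailing "-".
function visibleLineText(rawLine: string, endBreak: BreakKind): string {
  const withoutSoftHyphens = rawLine.split(SOFT_HYPHEN).join("");
  return endBreak === "soft-hyphen" ? withoutSoftHyphens + "-" : withoutSoftHyphens;
}
```

So "extra\u00ADordinary" on one line reads as "extraordinary", while the same word broken at the soft hyphen yields a first line of "extra-".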

What Sebastian already knew

Sebastian’s original prototype already had the right overall instinct:

words/runs as the unit of caching
browser-grounded measurement
streamed greedy line breaking

What changed here was mostly engineering discipline:

caching
a clean prepare() / layout() split
preprocessing
browser diagnostics
and a willingness to keep the hot path simple
