Research Log

Everything we tried, measured, and learned while building this library.

For the current compact browser-accuracy / benchmark snapshot, see STATUS.md. For the current compact corpus / sweep snapshot, see corpora/STATUS.md. For the shared mismatch vocabulary, see corpora/TAXONOMY.md.

Current steering summary

This log is historical. The current practical steering picture is:

  • Japanese has two real canaries (羅生門, 蜘蛛の糸), both clean at anchor widths and both still exposing a small positive one-line field on broader Chrome sweeps.
  • Chinese has two long-form canaries (祝福, 故鄉) showing the same broad Chrome-positive / Safari-clean split, with real font sensitivity between Songti SC and PingFang SC.
  • Myanmar still has two real canaries with residual Chrome/Safari disagreement around quote/follower-style classes, so it remains the main unresolved Southeast Asian frontier.
  • Urdu has a real Nastaliq/Naskh canary (چغد) with the same narrow-width negative field in Chrome and Safari, so it is clearly a shaping/context class rather than dirty data or a browser-only quirk. It remains parked rather than actively tuned.
  • Arabic coarse corpora are clean; the remaining work there is mostly a fine-width edge-fit class, not the old preprocessing/corpus-hygiene problems.
  • Mixed app text still matters because it catches product-shaped classes that books miss, especially soft-hyphen and extractor-sensitive cases.

The problem: DOM measurement interleaving

When UI components independently measure text heights with DOM reads like getBoundingClientRect(), each read can force synchronous layout. If those reads interleave with DOM writes, the browser can end up recomputing layout for the whole document repeatedly.

The goal here was always the same:

  • do the expensive text work once in prepare()
  • keep layout() arithmetic-only
  • make resize-driven relayout cheap and coordination-free

Approach 1: Canvas measureText + word-width caching

Canvas measureText() avoids DOM layout. It goes straight to the browser's font engine.

That led to the basic two-phase model:

  • prepare(text, font) — segment text, measure segments, cache widths
  • layout(prepared, maxWidth, lineHeight) — walk cached widths with pure arithmetic

That architecture held up. The broad browser sweeps are now clean in Chrome, Safari, and Firefox, and the hot layout() path is still the core product win.
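
A minimal sketch of the two-phase split described above, not the library's actual implementation. The names prepareSketch, layoutSketch, PreparedText, and MeasureFn are hypothetical, and the measure callback stands in for a canvas measureText() call so the example runs without a browser. Whitespace splitting is a simplification; the real segmentation is richer.

```typescript
// Hypothetical stand-in for a canvas measureText() wrapper.
type MeasureFn = (segment: string, font: string) => number;

type PreparedText = {
  widths: number[];   // cached width of each segment
  isSpace: boolean[]; // true when the segment is a collapsible space
};

// Phase 1: segment and measure once. This is the only place `measure` runs.
function prepareSketch(text: string, font: string, measure: MeasureFn): PreparedText {
  const cache = new Map<string, number>();
  // Assumption: collapse whitespace runs and split on single spaces.
  const segments = text.replace(/\s+/g, " ").split(/( )/).filter((s) => s.length > 0);
  const widths: number[] = [];
  const isSpace: boolean[] = [];
  for (const segment of segments) {
    let width = cache.get(segment);
    if (width === undefined) {
      width = measure(segment, font);
      cache.set(segment, width);
    }
    widths.push(width);
    isSpace.push(segment === " ");
  }
  return { widths, isSpace };
}

// Phase 2: greedy line breaking over cached widths. Arithmetic only.
function layoutSketch(prepared: PreparedText, maxWidth: number, lineHeight: number) {
  let lineCount = 0;
  let lineWidth = 0;
  let pendingSpace = 0; // space width that only counts if another word follows
  let hasContent = false;
  for (let i = 0; i < prepared.widths.length; i++) {
    const width = prepared.widths[i];
    if (prepared.isSpace[i]) {
      if (hasContent) pendingSpace += width;
      continue;
    }
    if (hasContent && lineWidth + pendingSpace + width > maxWidth) {
      lineCount++;       // close the current line; the pending space hangs
      lineWidth = width; // the word starts the next line
    } else {
      lineWidth += pendingSpace + width;
    }
    pendingSpace = 0;
    hasContent = true;
  }
  if (hasContent) lineCount++;
  return { lineCount, height: lineCount * lineHeight };
}
```

On resize, only layoutSketch needs to run again with the new maxWidth; the prepared widths are reused as-is.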

Rejected: DOM-based or string-reconstruction measurement in the hot path

Several alternatives were tried and rejected:

  • measuring full candidate lines as strings during layout()
  • moving measurement into hidden DOM elements during prepare()
  • using SVG getComputedTextLength()

The pattern was consistent:

  • they either reintroduced DOM reads
  • or they were slower than the current two-phase model
  • or they looked cleaner locally but regressed the actual benchmark path

The important keep was architectural, not algorithmic:

  • layout() stayed arithmetic-only on cached widths

Discovery: system-ui font resolution mismatch

Canvas and DOM resolve system-ui to different font variants on macOS at certain sizes.

In the recorded machine-readable scan, mismatches clustered at 10-12px, 14px, and 26px; 13px, 15-25px, and 27-28px were exact.

macOS uses SF Pro Text at smaller sizes and SF Pro Display at larger sizes. Canvas and DOM switch between them at different thresholds.

Practical conclusion:

  • use a named font if accuracy matters
  • keep system-ui documented as unsafe
  • if we ever support it properly, the believable path is a narrow prepare-time DOM fallback for detected bad tuples

What did not look trustworthy enough:

  • lookup tables
  • naive scaling
  • guessed resolved-font substitution
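
One way to keep system-ui "documented as unsafe" in practice is a small guard that flags font strings naming the generic family, so callers can be warned before measuring. This is a sketch only; usesSystemUi is a hypothetical helper, not part of the library, and it deliberately does not try to predict which sizes mismatch.

```typescript
// Hypothetical guard: true when a CSS font shorthand uses the generic
// system-ui family (unquoted). A quoted "system-ui" is a literal family
// name and is not flagged.
function usesSystemUi(font: string): boolean {
  return /(^|[\s,])system-ui\s*(,|$)/i.test(font);
}
```

A caller could log a warning or switch to a named font when this returns true.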

Discovery: word-by-word sum accuracy

Canvas is internally consistent enough that summing measured segments works very well, but not perfectly. Over a full paragraph, tiny adjacency differences can accumulate into a line-edge error.

The keeps were small and semantic:

  • merge punctuation into the preceding word before measuring
  • let trailing collapsible spaces hang instead of forcing a break

What did not survive:

  • full-string verification in layout()
  • uniform rescaling
  • generic pair-level correction models

The broad lesson was that local semantic preprocessing paid off more than clever runtime correction.
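
The punctuation keep can be sketched as a preprocessing pass that runs before measurement, so that a word and its trailing punctuation are measured as one string. The function name and the punctuation set are assumptions for illustration; the real list is likely broader.

```typescript
// Assumed, non-exhaustive set of trailing punctuation.
const TRAILING_PUNCTUATION = /^[.,;:!?)\]}'"]+$/;

// Merge punctuation-only segments into the preceding word so the pair is
// measured together instead of as two independently summed widths.
function mergeTrailingPunctuation(segments: string[]): string[] {
  const merged: string[] = [];
  for (const segment of segments) {
    const last = merged.length - 1;
    const previousIsWord = last >= 0 && !/\s$/.test(merged[last]);
    if (previousIsWord && TRAILING_PUNCTUATION.test(segment)) {
      merged[last] += segment;
    } else {
      merged.push(segment);
    }
  }
  return merged;
}
```

For example, the segments "Hello", ",", " ", "world", "!" become "Hello,", " ", "world!".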

Discovery: text-shaper is a useful reference, not a runtime replacement

text-shaper was useful reference material, especially for Unicode coverage and bidi ideas, but not a replacement for the current browser-facing model.

What was worth taking:

  • broader Unicode coverage, e.g. missing CJK extension blocks

What was not worth taking:

  • its segmentation as a runtime replacement for Intl.Segmenter
  • its paragraph breaker as a substitute for browser-parity layout

Bottom line:

  • good reference material
  • wrong runtime center of gravity for this repo

Discovery: preserving ordinary spaces, hard breaks, and numeric tab stops is viable

The smallest honest second whitespace mode turned out to be:

  • preserve ordinary spaces
  • preserve \n hard breaks
  • preserve tabs with default browser-style tab stops
  • leave the other wrapping defaults alone

That became:

  • { whiteSpace: 'pre-wrap' }

What mattered:

  • preserved spaces still hang at line end
  • consecutive hard breaks keep empty lines
  • a trailing final hard break does not invent an extra empty line
  • tabs advance to the next default browser tab stop from the current line start

The mode now covers the textarea-like cases we cared about, and the broad browser sweeps plus the dedicated pre-wrap oracle are green.
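
Two of the rules above reduce to small pieces of arithmetic. The sketch below uses hypothetical helper names; the tab-stop interval is passed in as a parameter (with the CSS tab-size initial value of 8, that would be eight space advances in the current font).

```typescript
// Advance to the next tab stop, measured from the line start at x = 0.
// A tab that begins exactly on a stop moves to the following stop.
function nextTabStop(x: number, tabStopWidth: number): number {
  return (Math.floor(x / tabStopWidth) + 1) * tabStopWidth;
}

// Split on hard breaks. Consecutive breaks keep their empty lines, but a
// single trailing break does not add an extra empty line at the end.
function splitHardBreaks(text: string): string[] {
  const lines = text.split("\n");
  if (lines.length > 1 && lines[lines.length - 1] === "") {
    lines.pop();
  }
  return lines;
}
```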

One important tooling lesson also came out of this:

  • keep a small permanent oracle suite
  • justify it once with a broader brute-force validation pass
  • do not keep the brute-force pass forever once it has done its job

Discovery: emoji canvas/DOM width discrepancy

Chrome and Firefox on macOS can measure emoji wider in canvas than in DOM at small sizes. Safari does not share the same discrepancy.

What held up:

  • detect the discrepancy by comparing canvas emoji width against actual DOM emoji width per font
  • cache that correction
  • keep it outside the hot layout path

This is now one of the small browser-profile shims that is actually justified.
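
A sketch of the detect-and-cache shape described above. The two probe callbacks are assumptions that stand in for a canvas measurement and a one-off DOM measurement of the same emoji; applying the correction per emoji is also an assumption about how the cached value would be used.

```typescript
// Hypothetical probes: width of one reference emoji in the given font.
type EmojiWidthProbe = (font: string) => number;

// Returns a per-font lookup that runs both probes at most once per font.
function createEmojiCorrection(canvasProbe: EmojiWidthProbe, domProbe: EmojiWidthProbe) {
  const cache = new Map<string, number>();
  return function correctionFor(font: string): number {
    let correction = cache.get(font);
    if (correction === undefined) {
      correction = canvasProbe(font) - domProbe(font);
      cache.set(font, correction);
    }
    return correction;
  };
}

// Applied at prepare time, never in layout: subtract the per-emoji
// overshoot from a canvas-measured segment width.
function correctedSegmentWidth(canvasWidth: number, emojiCount: number, correction: number): number {
  return canvasWidth - emojiCount * correction;
}
```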

Retired HarfBuzz probe path

We briefly kept a headless HarfBuzz backend in the repo for server-side measurement probes.

What it taught us:

  • it was useful for research and algorithm probes
  • it was not close enough to our active browser-grounded path to justify keeping it in the main repo
  • isolated Arabic words in that probe path needed explicit LTR direction to avoid misleading widths

So if HarfBuzz comes up again later, treat it as explored territory:

  • useful as a research reference
  • not the runtime direction for Pretext
  • not a substitute for browser-oracle or browser-canvas validation

Final browser sweep closure

The last browser mismatches were not fixed by moving more work into layout(). That regressed the hot path and was reverted.

What actually held up:

  • better preprocessing in prepare()
  • better browser diagnostics pages and scripts
  • a tiny browser-specific line-fit tolerance

What did not change:

  • layout() stayed arithmetic-only

That remains the right center of gravity for the project.
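
The line-fit tolerance is small enough to sketch in one function. The name fitsLine is hypothetical, and the tolerance value is left as a parameter because the log does not record the actual per-browser numbers.

```typescript
// A line fits if its summed width does not exceed the available width by
// more than a small tolerance. The check stays pure arithmetic, so it can
// live in layout() without adding any measurement.
function fitsLine(lineWidth: number, maxWidth: number, tolerance: number): boolean {
  return lineWidth <= maxWidth + tolerance;
}
```

Without a tolerance, a sum of cached widths such as 0.1 + 0.2 already fails against a maxWidth of 0.3 purely from floating-point rounding, which is the kind of edge a tiny tolerance absorbs.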

Arabic frontier

Arabic took several passes, but the pattern is clearer now.

What survived:

  • merge no-space Arabic punctuation clusters during prepare()
    • e.g. فيقول:وعليك, همزةٌ،ما
  • treat Arabic punctuation-plus-mark clusters like ،ٍ as left-sticky too
  • split " " + combining marks into plain space plus marks attached to the following word
  • use normalized slices and the exact corpus font during probe work
  • trust the better RTL diagnostics path instead of reconstructing offsets from rendered line text
  • clean obvious corpus/source artifacts instead of inventing new engine rules for them
  • allow a tiny non-Safari line-fit tolerance bump for the remaining positive fine-width field
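
As an illustration of the space-plus-marks rule, the sketch below rewrites a segment list so that a segment made of spaces followed by Arabic combining marks becomes a plain space, with the marks moved onto the following word. The function name is hypothetical and the mark range (U+064B to U+065F, plus U+0670) is an assumed, non-exhaustive subset.

```typescript
// Spaces followed only by Arabic combining marks (assumed subset).
const SPACE_THEN_MARKS = /^( +)([\u064B-\u065F\u0670]+)$/;

// Split " " + marks into a plain space and attach the marks to the next
// segment. The concatenated text is unchanged; only boundaries move.
function reattachMarks(segments: string[]): string[] {
  const out: string[] = [];
  let carried = "";
  for (const segment of segments) {
    const match = SPACE_THEN_MARKS.exec(segment);
    if (match) {
      if (carried !== "") {
        out.push(carried); // no following word appeared; keep marks in order
      }
      out.push(match[1]);
      carried = match[2];
    } else {
      out.push(carried + segment);
      carried = "";
    }
  }
  if (carried !== "") out.push(carried);
  return out;
}
```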

What did not survive:

  • pair correction models at segment boundaries
  • larger Arabic run-slice width models
  • broad phrase-level heuristics derived from one good-looking probe

Those failed for the same reason at different scales:

  • pair corrections were too local to move the real misses
  • run-slice widths were much heavier and still did not move the hard widths enough
  • both made prepare() or layout() materially worse without buying a clean Arabic field

So the useful guardrail is:

  • if an Arabic idea starts by adding more shaping-aware width caches inside the current segment-sum architecture, be skeptical early
  • the Arabic keeps so far have been preprocessing, corpus cleanup, diagnostics, and tiny tolerance shims, not richer width-cache models

Current read:

  • Arabic coarse corpora are healthy
  • the remaining work is much narrower now
  • the unresolved class looks like a mix of fine-width edge-fit and shaping/context, not another obvious preprocessing hole

Long-form corpus canaries

Once the main browser sweep became a regression gate, the long-form corpora became the real steering canaries.

Mixed app text

This is the most product-shaped canary.

What it has been good for:

  • URL/query-string handling
  • escaped quote clusters
  • numeric expressions like २४×७
  • time ranges like 7:00-9:00
  • emoji ZWJ runs
  • manual soft hyphens

Important keep:

  • model URL/query strings as narrow structured units, not one giant breakable blob

Current status:

  • almost entirely clean
  • one remaining extractor-sensitive soft-hyphen miss around 710px still looks paragraph-scale or accumulation-sensitive rather than like a neat local bug

Thai

Thai exposed a product-shaped ASCII quote issue more than a dictionary-segmentation failure.

The keep:

  • contextual ASCII quote glue during preprocessing

Result:

  • two Thai prose corpora are healthy at anchor widths
  • maintained step10 sweeps stayed clean enough that Thai now looks broader than one lucky story

Khmer

Khmer broadened the Southeast Asian class without immediately demanding new engine work.

The keep:

  • preserve explicit zero-width separators from the source text

Result:

  • anchor widths and the maintained step10 sweep were clean enough to keep Khmer as a real canary
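
One reading of "preserve explicit zero-width separators" is to treat each U+200B ZERO WIDTH SPACE in the source as a zero-width break opportunity rather than stripping it during normalization. The sketch below follows that reading; the BreakUnit type and function name are hypothetical.

```typescript
const ZERO_WIDTH_SPACE = "\u200B";

type BreakUnit = {
  text: string;        // visible text of the unit
  breakAfter: boolean; // true when the source had a ZWSP after this unit
};

// Split at ZWSP, recording a break opportunity instead of discarding it.
function splitAtZeroWidthSpaces(text: string): BreakUnit[] {
  const parts = text.split(ZERO_WIDTH_SPACE);
  const units: BreakUnit[] = [];
  for (let i = 0; i < parts.length; i++) {
    if (parts[i].length === 0) continue; // leading, trailing, or doubled ZWSP
    units.push({ text: parts[i], breakAfter: i < parts.length - 1 });
  }
  return units;
}
```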

Lao (rejected)

The Lao corpus attempt was a source problem, not an engine problem.

The raw text was wrapped print/legal text, which made it a dirty white-space: normal canary. We rejected it instead of normalizing nonsense into the repo.

Myanmar

Myanmar is still the main unresolved Southeast Asian frontier.

What survived:

  • treat / / / / as left-sticky during preprocessing
  • treat ၏ as medial glue in clusters like ကျွန်ုပ်၏လက်မ

What did not survive:

  • broad Myanmar grapheme breaking in ordinary wrapping
  • quote-follower glue like closing-quote + ဟု

Current read:

  • there are real recurring classes here
  • but the obvious tempting heuristics improved one browser and hurt another
  • that makes Myanmar a canary, not a license for more instinctive glue rules

Japanese

Japanese gave us one real semantic keep:

  • kana iteration marks like ゝ / ゞ / ヽ / ヾ should be treated as CJK line-start-prohibited
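
A sketch of what line-start-prohibited means for a unit list: a unit that begins with a prohibited character is glued to the preceding unit, so no break can fall in front of it. The set below contains the four kana iteration marks (U+309D, U+309E, U+30FD, U+30FE) plus a few common closing marks as an assumed, non-exhaustive example; the function name is hypothetical.

```typescript
// Assumed subset of characters that must not start a line.
const LINE_START_PROHIBITED = new Set<string>([
  "\u309D", "\u309E", "\u30FD", "\u30FE", // kana iteration marks
  "\u3001", "\u3002", "\u300D",           // ideographic comma, full stop, closing bracket
]);

// Glue any unit that starts with a prohibited character onto the unit
// before it, removing the break opportunity between them.
function glueLineStartProhibited(units: string[]): string[] {
  const out: string[] = [];
  for (const unit of units) {
    if (out.length > 0 && LINE_START_PROHIBITED.has(unit.charAt(0))) {
      out[out.length - 1] += unit;
    } else {
      out.push(unit);
    }
  }
  return out;
}
```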

What remains:

  • a small context-width class around punctuation/quote compression
  • good evidence for the exactness ceiling of a width-independent grapheme-sum model in proportional Japanese fonts

So Japanese stays as a canary, not as a place to keep stacking narrow punctuation rules.

Chinese

Chinese is now the clearest active CJK canary.

What we learned:

  • Safari is clean on the maintained step10 sweep
  • Chrome keeps a broader narrow-width positive field
  • the field changes with font choice (Songti SC vs PingFang SC)

What did not survive:

  • carrying closing punctuation forward
  • coalescing repeated punctuation runs like —— or ……

Current read:

  • the remaining Chinese field is real
  • it is not another obvious punctuation bug
  • it is best treated as a canary for the model’s current exactness ceiling

Sampled cross-font corpus matrix

The first cross-font pass was reassuring:

  • Korean, Thai, Khmer, Hindi, Arabic, and Hebrew all stayed exact across the sampled Chrome matrix on this machine

That does not mean font fragility is gone. It just means the next likely surprises are:

  • new scripts
  • finer width sweeps
  • or product-shaped mixed text

Segment metrics cache

The cache used to store just widths. It now stores richer per-segment metrics and computes the more expensive derived facts lazily.

Current useful cached facts include:

  • width
  • containsCJK
  • lazily computed emoji count
  • lazily computed grapheme widths

That improved repeated prepare() work without moving any live measurement back into layout().
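
The eager-versus-lazy split can be sketched with a closure. All names here are hypothetical; the CJK test is a simplified range check, and Array.from splits by code point rather than by true grapheme cluster, which is a shortcut for the sake of a short example.

```typescript
type SegmentMetricsSketch = {
  width: number;                 // eager: always needed by layout
  containsCJK: boolean;          // eager: cheap to compute
  graphemeWidths: () => number[]; // lazy: computed on first request
};

function createSegmentMetrics(
  text: string,
  width: number,
  measureGrapheme: (grapheme: string) => number,
): SegmentMetricsSketch {
  let cachedGraphemeWidths: number[] | undefined;
  return {
    width,
    // Simplified check: kana plus the main CJK Unified Ideographs block.
    containsCJK: /[\u3040-\u30FF\u4E00-\u9FFF]/.test(text),
    graphemeWidths: () => {
      if (cachedGraphemeWidths === undefined) {
        cachedGraphemeWidths = Array.from(text).map((g) => measureGrapheme(g));
      }
      return cachedGraphemeWidths;
    },
  };
}
```

Segments that never need per-grapheme breaking never pay for the per-grapheme measurement.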

Soft hyphen support

Soft hyphen became a real internal break kind instead of ordinary text.

What that bought us:

  • unbroken lines keep it invisible
  • broken lines can expose a visible trailing -
  • rich APIs stay aligned with the actual break choice

This was a genuine model improvement, not just a cosmetic API change.
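
The behaviour above can be sketched for a single word containing soft hyphens (U+00AD). This is an illustration with hypothetical names, not the library's breaker: it breaks greedily and, as a simplification, reserves room for a hyphen after every non-final piece, which can break slightly earlier than a browser would.

```typescript
const SOFT_HYPHEN = "\u00AD";

// Break one word at its soft hyphens. Unbroken pieces join invisibly;
// each chosen break exposes a visible trailing "-".
function breakAtSoftHyphens(
  word: string,
  maxWidth: number,
  measure: (s: string) => number,
): string[] {
  const pieces = word.split(SOFT_HYPHEN);
  const hyphenWidth = measure("-");
  const lines: string[] = [];
  let current = "";
  let currentWidth = 0;
  for (let i = 0; i < pieces.length; i++) {
    const pieceWidth = measure(pieces[i]);
    const isLast = i === pieces.length - 1;
    // Simplification: keep room for a hyphen unless this is the last piece.
    const needed = currentWidth + pieceWidth + (isLast ? 0 : hyphenWidth);
    if (current !== "" && needed > maxWidth) {
      lines.push(current + "-"); // the break makes the hyphen visible
      current = pieces[i];
      currentWidth = pieceWidth;
    } else {
      current += pieces[i];      // no break: the soft hyphen stays invisible
      currentWidth += pieceWidth;
    }
  }
  lines.push(current);
  return lines;
}
```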

What Sebastian already knew

Sebastian’s original prototype already had the right overall instinct:

  • words/runs as the unit of caching
  • browser-grounded measurement
  • streamed greedy line breaking

What changed here was mostly engineering discipline:

  • caching
  • a clean prepare() / layout() split
  • preprocessing
  • browser diagnostics
  • and a willingness to keep the hot path simple