-
Notifications
You must be signed in to change notification settings - Fork 283
feat(core): improve stdin-buffer emoji/IME handling with Kitty protocol #483
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances emoji and IME input handling in the stdin-buffer module, particularly for terminals using the Kitty keyboard protocol. The implementation adds sophisticated grapheme cluster handling to properly process complex emoji sequences like ZWJ families, flags, and skin-tone variants.
Key Changes:
- Introduces grapheme-segmenter module with
Intl.Segmenterpolyfill support for proper Unicode grapheme cluster handling - Adds Kitty keyboard protocol support that buffers and reassembles emoji codepoints sent as separate escape sequences
- Refactors sequence completion helpers for better code reuse
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
packages/core/src/lib/stdin-buffer.ts |
Core changes: adds Kitty protocol emoji buffering, grapheme cluster extraction, and helper functions for Unicode codepoint classification |
packages/core/src/lib/grapheme-segmenter.ts |
New module providing grapheme segmentation utilities with Intl.Segmenter support and performance optimizations |
packages/core/src/lib/parse.keypress.ts |
Updates keypress parsing to use grapheme-aware checks instead of simple length checks |
packages/core/src/lib/stdin-buffer.test.ts |
Comprehensive test coverage for grapheme handling and Kitty protocol emoji reassembly |
packages/core/package.json |
Adds @formatjs/intl-segmenter dependency for polyfill support |
bun.lock |
Lock file updates for new dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators | ||
| (codepoint >= 0x1f300 && codepoint <= 0x1faff) || // Emoji ranges (simplified) | ||
| codepoint === 0x1f3f4 || // Black Flag | ||
| (codepoint >= 0x23 && codepoint <= 0x39) || // #, *, 0-9 for keycaps |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The range check for keycap characters (0x23 to 0x39) includes some ASCII characters that shouldn't start grapheme clusters. The range 0x23-0x39 includes:
-
(0x23) ✓ valid for keycaps
- $ (0x24) ✗ not used for keycaps
- % through ) (0x25-0x29) ✗ not used for keycaps
-
- (0x2a) ✓ valid for keycaps
-
- , - . / (0x2b-0x2f) ✗ not used for keycaps
- 0-9 (0x30-0x39) ✓ valid for keycaps
This overly broad range could cause unintended buffering of characters like $, %, &, etc. Consider using a more precise check: (codepoint >= 0x30 && codepoint <= 0x39) || codepoint === 0x23 || codepoint === 0x2a
| (codepoint >= 0x23 && codepoint <= 0x39) || // #, *, 0-9 for keycaps | |
| ((codepoint >= 0x30 && codepoint <= 0x39) || codepoint === 0x23 || codepoint === 0x2a) || // #, *, 0-9 for keycaps |
| let segmenter: Intl.Segmenter | null = null | ||
|
|
||
| if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") { | ||
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top-level await is used here which can cause issues in certain module systems and bundlers. This is executed at module load time, which could:
- Block the entire module graph from loading if the polyfill import is slow
- Cause problems in environments that don't support top-level await (though increasingly rare)
- The error is silently caught, which means there's no feedback if the polyfill fails to load
Consider moving this to a lazy initialization pattern inside getGraphemeSegmenter() or make the polyfill loading explicit and required upfront during application initialization.
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) | |
| import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) |
| codepoint === 0x200d || // ZWJ | ||
| (codepoint >= 0xfe00 && codepoint <= 0xfe0f) || // Variation Selectors | ||
| (codepoint >= 0x1f3fb && codepoint <= 0x1f3ff) || // Skin Tone Modifiers | ||
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regional Indicator codepoints (0x1f1e6-0x1f1ff) are listed in both isGraphemeExtender and canStartGraphemeCluster. While this works because the grapheme segmenter ultimately handles the splitting correctly, the classification is semantically confusing. Regional Indicators don't "extend" a grapheme - they combine in pairs to form flag emoji. Consider clarifying the logic or renaming the functions to better reflect their purpose in the Kitty buffering strategy (e.g., shouldBufferForCombining instead of isGraphemeExtender).
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators |
| export function firstGrapheme(str: string): string { | ||
| if (str.length === 0) return "" | ||
| const firstCode = str.charCodeAt(0) | ||
| if (firstCode < 128) { | ||
| if (str.length === 1 || str.charCodeAt(1) < 128) return str[0]! | ||
| } | ||
|
|
||
| const segments = getGraphemeSegmenter().segment(str) | ||
| const first = segments[Symbol.iterator]().next() | ||
| return first.done ? "" : first.value.segment |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential issue with surrogate pairs handling. If firstCode < 128 is false but str.length === 1, the code will try to get the first grapheme from the segmenter. However, if the string contains a single unpaired surrogate (which is technically invalid UTF-16), this could cause issues. While unpaired surrogates are rare in practice, consider adding a check or documenting that the input is expected to be valid UTF-16.
| if (remaining.length > 0) this.process(remaining) | ||
| return true |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tryCompletePaste() method can trigger recursive calls through this.process(remaining) at line 312. If the pasted content contains multiple nested bracketed paste sequences or a large amount of remaining data, this could potentially cause a deep call stack. While unlikely in practice, consider whether this recursion is bounded or if an iterative approach would be safer for handling remaining content after paste completion.
| } | ||
| if (this.kittyBuffer.length === 0) return [] | ||
|
|
||
| const text = String.fromCodePoint(...this.kittyBuffer) |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the spread operator with String.fromCodePoint(...this.kittyBuffer) could cause issues if the kittyBuffer becomes very large. JavaScript has a maximum call stack size, and spreading a large array as function arguments can exceed this limit (typically around 65,536 arguments, but varies by engine).
While emoji sequences are typically short (< 20 codepoints), consider adding a safeguard or chunking mechanism if there's any possibility of the buffer growing unexpectedly large. Alternatively, document the expected maximum size of the kittyBuffer.
| const text = String.fromCodePoint(...this.kittyBuffer) | |
| let text = "" | |
| for (let i = 0; i < this.kittyBuffer.length; i++) { | |
| text += String.fromCodePoint(this.kittyBuffer[i]!) | |
| } |
| if (this.kittyTimeout) { | ||
| clearTimeout(this.kittyTimeout) | ||
| this.kittyTimeout = null | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clear() method now clears kittyTimeout in addition to the regular timeout, but there's no test coverage for this scenario. Consider adding a test that:
- Processes a Kitty sequence that would trigger buffering (e.g., a Regional Indicator)
- Calls
clear()before the kittyTimeout fires - Waits beyond the timeout duration
- Verifies that nothing is emitted
This would ensure that the kittyTimeout is properly cleaned up and doesn't cause spurious emissions after clear() is called.
| } | ||
| this.emitSequences(extractCompleteSequences(this.buffer.slice(0, startIndex)).sequences) | ||
| } | ||
| this.emitKittyBuffer() |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When entering bracketed paste mode, emitKittyBuffer() is called to flush any pending Kitty codepoints. However, there's no test coverage for the scenario where:
- A Kitty emoji sequence is partially buffered (e.g., first part of a skin-tone emoji)
- Bracketed paste mode is entered
- The buffered Kitty sequence should be emitted before processing the pasted content
Consider adding a test to verify this edge case is handled correctly.
| function extractKittyCodepoint(sequence: string): number | null { | ||
| const match = KITTY_UNICODE_RE.exec(sequence) | ||
| if (!match) return null | ||
| const cp = parseInt(match[1]!, 10) |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The non-null assertion match[1]! is safe because the regex requires at least one \d+ group, so match[1] will always be defined when the regex matches. However, for code clarity and to avoid potential confusion, consider documenting this assumption or restructuring the code to make it more explicit.
| const cp = parseInt(match[1]!, 10) | |
| const [, group] = match | |
| if (group === undefined) return null | |
| const cp = parseInt(group, 10) |
| function isGraphemeExtender(codepoint: number): boolean { | ||
| return ( | ||
| codepoint === 0x200d || // ZWJ | ||
| (codepoint >= 0xfe00 && codepoint <= 0xfe0f) || // Variation Selectors | ||
| (codepoint >= 0x1f3fb && codepoint <= 0x1f3ff) || // Skin Tone Modifiers | ||
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators | ||
| codepoint === 0x20e3 || // Combining Enclosing Keycap | ||
| (codepoint >= 0xe0020 && codepoint <= 0xe007f) // Tag characters | ||
| ) | ||
| } | ||
|
|
||
| function canStartGraphemeCluster(codepoint: number): boolean { | ||
| return ( | ||
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators | ||
| (codepoint >= 0x1f300 && codepoint <= 0x1faff) || // Emoji ranges (simplified) | ||
| codepoint === 0x1f3f4 || // Black Flag | ||
| (codepoint >= 0x23 && codepoint <= 0x39) || // #, *, 0-9 for keycaps | ||
| (codepoint >= 0x2600 && codepoint <= 0x27bf) // Misc Symbols & Dingbats | ||
| ) | ||
| } | ||
|
|
||
| function extractKittyCodepoint(sequence: string): number | null { | ||
| const match = KITTY_UNICODE_RE.exec(sequence) | ||
| if (!match) return null | ||
| const cp = parseInt(match[1]!, 10) | ||
| return cp >= 0 && cp <= 0x10ffff ? cp : null | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The helper functions isGraphemeExtender, canStartGraphemeCluster, and extractKittyCodepoint lack documentation comments. While the inline comments provide some context, these functions would benefit from JSDoc comments explaining:
- Their purpose in the Kitty keyboard protocol handling
- Parameter descriptions
- Return value meanings
- Example usage or edge cases
This would improve code maintainability, especially given the complexity of grapheme cluster handling.
85e4eec to
f76840a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| const CHUNK_SIZE = 8192 | ||
| for (let i = 0; i < buffer.length; i += CHUNK_SIZE) { | ||
| const slice = buffer.slice(i, i + CHUNK_SIZE) | ||
| chunks.push(String.fromCodePoint.apply(null, slice)) |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using String.fromCodePoint.apply(null, slice) can throw a "Maximum call stack size exceeded" error if the slice array is very large, even though CHUNK_SIZE is set to 8192 to mitigate this. Different JavaScript engines have different limits (typically 65536-100000 arguments).
Consider using the spread operator with a safer approach or a loop:
chunks.push(String.fromCodePoint(...slice))Or reduce CHUNK_SIZE to a more conservative value like 1000-5000 to ensure compatibility across all engines.
| const CHUNK_SIZE = 8192 | |
| for (let i = 0; i < buffer.length; i += CHUNK_SIZE) { | |
| const slice = buffer.slice(i, i + CHUNK_SIZE) | |
| chunks.push(String.fromCodePoint.apply(null, slice)) | |
| const CHUNK_SIZE = 4096 | |
| for (let i = 0; i < buffer.length; i += CHUNK_SIZE) { | |
| const slice = buffer.slice(i, i + CHUNK_SIZE) | |
| chunks.push(String.fromCodePoint(...slice)) |
| } else if (s.length === 1 && s >= "A" && s <= "Z") { | ||
| // shift+letter | ||
| key.name = s.toLowerCase() | ||
| key.shift = true |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call to isSingleGrapheme(s) at line 302 invokes the Intl.Segmenter for every keypress that isn't matched by earlier conditions. This happens after checking for single ASCII characters (lines 291-301), but the isSingleGrapheme function itself also checks for ASCII at lines 54-58 in grapheme-segmenter.ts, resulting in duplicate checks.
For better performance, the ASCII check could be moved earlier in this function to avoid calling isSingleGrapheme for common ASCII characters. However, the current structure might be intentional for code clarity.
| key.shift = true | |
| key.shift = true | |
| } else if (s.length === 1 && s >= " " && s <= "\x7f") { | |
| // remaining single printable ASCII characters (punctuation, symbols, etc.) | |
| key.name = s |
| function isGraphemeExtender(codepoint: number): boolean { | ||
| return ( | ||
| codepoint === 0x200d || // ZWJ | ||
| (codepoint >= 0xfe00 && codepoint <= 0xfe0f) || // Variation Selectors | ||
| (codepoint >= 0x1f3fb && codepoint <= 0x1f3ff) || // Skin Tone Modifiers | ||
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators | ||
| codepoint === 0x20e3 || // Combining Enclosing Keycap | ||
| (codepoint >= 0xe0020 && codepoint <= 0xe007f) // Tag characters | ||
| ) | ||
| } | ||
|
|
||
| function canStartGraphemeCluster(codepoint: number): boolean { | ||
| return ( | ||
| (codepoint >= 0x1f1e6 && codepoint <= 0x1f1ff) || // Regional Indicators | ||
| (codepoint >= 0x1f300 && codepoint <= 0x1faff) || // Emoji ranges (simplified) | ||
| codepoint === 0x1f3f4 || // Black Flag | ||
| codepoint === 0x23 || // # for keycap | ||
| codepoint === 0x2a || // * for keycap | ||
| (codepoint >= 0x30 && codepoint <= 0x39) || // 0-9 for keycaps | ||
| (codepoint >= 0x2600 && codepoint <= 0x27bf) // Misc Symbols & Dingbats | ||
| ) | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The helper functions isGraphemeExtender and canStartGraphemeCluster lack documentation explaining their purpose and the Unicode ranges they cover. Given the complexity of Unicode grapheme cluster rules, these functions should have comments explaining:
- What each Unicode range represents (the inline comments are good but could be in function-level docs)
- Why these specific ranges were chosen
- That this is a heuristic for buffering, not a complete grapheme cluster algorithm (which is delegated to Intl.Segmenter)
| if (str.length === 1) return str[0]! | ||
| const secondCode = str.charCodeAt(1) | ||
| if (secondCode < 128) return str[0]! | ||
| } else if (str.length === 1) { |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The firstGrapheme function has an optimization path that returns str[0]! for single-character strings (line 74), but this doesn't handle the edge case where the string might be a single invalid surrogate. While unlikely in practice since the function is called on stdin input, it would be more robust to check firstCode >= 0xd800 && firstCode <= 0xdfff and fall through to the segmenter for safety.
| } else if (str.length === 1) { | |
| } else if (str.length === 1 && (firstCode < 0xd800 || firstCode > 0xdfff)) { |
| export function getGraphemeSegmenter(): Intl.Segmenter { | ||
| if (segmenter) return segmenter | ||
|
|
||
| if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") { | ||
| segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" }) | ||
| return segmenter | ||
| } | ||
|
|
||
| if (initError) { | ||
| throw initError | ||
| } | ||
|
|
||
| throw new Error( | ||
| "Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter", | ||
| ) | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The polyfill initialization is asynchronous but called at module load time (line 23). However, getGraphemeSegmenter() can be called synchronously before the polyfill finishes loading. If the native Intl.Segmenter is not available and the polyfill hasn't loaded yet, the function will throw an error instead of waiting for initialization to complete.
This could cause runtime errors when the stdin-buffer is used immediately after import in environments without native Intl.Segmenter support. Consider either:
- Making
getGraphemeSegmenter()async and awaiting the initialization - Checking if
initPromiseis still pending and waiting for it - Documenting that users must await initialization before using the module
| await import("@formatjs/intl-segmenter/polyfill-force.js") | ||
| } catch (e) { | ||
| initError = new Error( | ||
| "Failed to load Intl.Segmenter polyfill. Please ensure @formatjs/intl-segmenter is installed or use a runtime that supports Intl.Segmenter natively.", |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error from the polyfill import is caught but not stored correctly. The catch block at line 12 receives the error as parameter e, but then creates a new generic Error message without preserving the original error information. This makes debugging polyfill loading issues difficult.
The initError should either wrap the original error or include its message: initError = new Error("Failed to load Intl.Segmenter polyfill: " + (e instanceof Error ? e.message : String(e)))
| "Failed to load Intl.Segmenter polyfill. Please ensure @formatjs/intl-segmenter is installed or use a runtime that supports Intl.Segmenter natively.", | |
| "Failed to load Intl.Segmenter polyfill: " + | |
| (e instanceof Error ? e.message : String(e)) + | |
| " Please ensure @formatjs/intl-segmenter is installed or use a runtime that supports Intl.Segmenter natively.", |
…n StdinBuffer - Add grapheme-segmenter.ts using Intl.Segmenter for Unicode-correct text segmentation - Update StdinBuffer to emit complete grapheme clusters instead of single characters - Preserve Kitty keyboard protocol sequences unchanged for downstream parsing - Add comprehensive tests for grapheme cluster handling
f76840a to
6dbb9fa
Compare
Buffer Kitty keyboard protocol emoji sequences and reassemble them into complete grapheme clusters before emitting keypress events. - Add emoji codepoint detection helpers (isGraphemeExtender, canStartGraphemeCluster) - Buffer Kitty sequences that form multi-codepoint emoji (ZWJ, flags, skin tones, keycaps) - Flush buffer on timeout or when non-emoji input arrives - Preserve all raw sequences in the emitted KeyEvent.raw field - Add comprehensive tests for emoji reassembly scenarios
…yco#51) to emoji helpers
- Reject lone surrogates in firstGrapheme fast path (fall through to segmenter) - Preserve original error message when polyfill loading fails
|
Interesting, I hadn't thought about it from that perspective. That means full grapheme parsing for input though, which can get out of hand really fast. The stdin buffer was meant to handle incomplete sequences, split at byte level, which would break some behaviours. For input in the Textarea for example, even if the bytes arrive chunked, they would end up forming a joined grapheme at the end. Is this meant to have full grapheme character key events? |
…JK detection Update parse.keypress.ts to use isSingleGrapheme() instead of s.length === 1 for proper handling of multi-codepoint characters like emoji and CJK.
|
The grapheme-aware input is already on my PR branch with ASCII fast paths, so performance impact is minimal for typical input.
The issue I hit was specifically with Rime IME emoji input - it sends input through some keystroke-based mechanism that differs from:
These might be using the same underlying mechanism, but I haven't dug deeper. |
|
It feels like something is more fundamentally wrong with the stdin buffer if it causes such a complexity explosion. Grapheme detection is horribly fragile at the moment and already has a huge complexity in the native part to render them correctly. Doing something similar for input seems like beating a symptom of wrong underlying assumptions in the stdin buffer. |
|
zig/native side grapheme is used for rendering, the stdin buffer is for input parsing, i must say they are two different things. the stdin buffer itself is simple. these complexity is brought by the terminal protocol and Unicode itself (see libghostty, utf8 everywhere etc.), anyway not a design issue. it's not correct to break UTF-8 by bytes or to break emoji by codepoints if not according to grapheme. |
|
Okay, I think I understand better now. This is specifically a Kitty Keyboard issue, which emits separate input events for a ZWJ emoji for example, right? And even though inserting them one by one it would be rendered correctly, but not emitted as single input event, but separate ones for each byte? |
|
Yeah exactly. Kitty protocol emits separate escape sequences per codepoint, not per byte. So 👨👩👧 (family emoji, 7 codepoints joined by ZWJ) comes as 7 separate |
Summary
Add proper grapheme segmentation and Kitty emoji reassembly for correct handling of emoji, CJK characters, and other multi-codepoint sequences.
Changes
StdinBuffer (grapheme segmentation)
grapheme-segmenter.tsusingIntl.Segmenterfor Unicode-correct text segmentationStdinBufferto emit complete grapheme clusters for non-escape inputKeyHandler (Kitty emoji reassembly)
isGraphemeExtender,canStartGraphemeCluster)KeyEvent.rawfieldArchitecture
Key insight: StdinBuffer handles grapheme segmentation for raw UTF-8 input, while KeyHandler handles emoji reassembly for Kitty protocol sequences. This separation preserves the
rawfield correctly.Supported Emoji Types
Testing