
Commit e0d596e

Fix markdown generation crash on LLM special tokens (#4154)
The gpt-tokenizer encode() call rejects special tokens like <|im_start|> that appear in documentation code examples. Pass allowedSpecial: 'all' since these are content, not control tokens.
1 parent aeafa91 commit e0d596e

1 file changed

Lines changed: 1 addition & 1 deletion

scripts/generate-markdown.mjs

@@ -226,7 +226,7 @@ const convertPage = async (htmlPath) => {

 const title = extractTitle(dom.window.document);
 const markdown = toMarkdown(articleEl, dom);
-const tokens = encode(markdown).length;
+const tokens = encode(markdown, { allowedSpecial: 'all' }).length;
 const content = buildFrontmatter(title, tokens) + markdown + '\n';
 const mdPath = htmlPath.replace(/\.html$/, '.md');
