Do not linktrail if following text is not [a-z]? #414

Merged
kristian-clausal merged 3 commits into main from linktrailing
Mar 9, 2026
@kristian-clausal
Collaborator

See wiktextract issue #1604 (tatuylonen/wiktextract#1604) and
https://en.wikipedia.org/wiki/Help:Wikitext#Blend_link

This should not be merged as is, because it will create problems in other extractors that might rely on different behavior.

In the best-case scenario, there might be two different camps:
1) Languages that use spaces that want to do linktrailing
2) Languages without spaces that can't do linktrailing

If this is the case, we might be able to get away with a kludge that checks whether the script of the last character in the link matches the script of the first character after the link.
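
A minimal sketch of that script-matching kludge, not the code in this PR. Python's standard library has no Unicode script property, so this approximates the script by the first word of the character's Unicode name (`LATIN`, `CJK`, `HIRAGANA`, ...). The helper names `rough_script` and `should_linktrail` are hypothetical.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    This is an assumption-laden stand-in for a real script lookup.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Unnamed code points (e.g. control characters) have no script here.
        return ""


def should_linktrail(link_text: str, following: str) -> bool:
    """Return True if the text after the link looks like a trail for the link."""
    if not link_text or not following:
        return False
    first = following[0]
    if not first.isalpha():
        # Spaces, punctuation and digits never start a trail in this sketch.
        return False
    return rough_script(link_text[-1]) == rough_script(first)
```

With this check, `should_linktrail("englishword", "漢字")` is false because `LATIN` and `CJK` differ, while `should_linktrail("help", "ing")` is true. Same-script text after a link in a language without spaces would still be treated as a trail, which is why this is only a kludge.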

This adds a new attribute to Wtp that contains a `re.Pattern`
object used for pattern-matching these kinds of suffixed links.

Modify `Wtp.linktrailing_re` to change the behavior based
on how the parsed Wikimedia project handles linktrailing.

English uses `[a-z]+`.
Our default implementation uses `\w+`, which should be fine
most of the time.
Languages without spaces seem to use the English `[a-z]+`,
which makes sense: in `[[englishword]]KANJI` the kanji
characters should not be consumed, but `\w+` consumes them.
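
The difference between the two patterns can be seen with plain `re`. This is an illustrative sketch only: `split_trail` and the two constants are hypothetical names, and how the parser actually applies `Wtp.linktrailing_re` is not shown in this PR text.

```python
import re

# The two candidate patterns discussed above.
ENGLISH_TRAIL = re.compile(r"[a-z]+")
DEFAULT_TRAIL = re.compile(r"\w+")


def split_trail(trail_re: re.Pattern, following: str) -> tuple[str, str]:
    """Split the text right after a link into (trail, rest).

    The trail is whatever the pattern matches at the very start of
    `following`; if nothing matches there, the trail is empty.
    """
    m = trail_re.match(following)
    if m is None:
        return "", following
    return m.group(0), following[m.end():]


# `[[cat]]s are` -> the "s" joins the link under either pattern.
print(split_trail(ENGLISH_TRAIL, "s are"))
# `[[englishword]]漢字です` -> `[a-z]+` leaves the Japanese text alone...
print(split_trail(ENGLISH_TRAIL, "漢字です"))
# ...while `\w+` swallows it, since `\w` matches Unicode word characters.
print(split_trail(DEFAULT_TRAIL, "漢字です"))
```

An extractor for a project that follows the English behavior would, under this reading, assign something like `re.compile(r"[a-z]+")` to `Wtp.linktrailing_re`.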

We have a `NAMESPACEE` field in `parserfns` (`{{{NAMESPACEE}}}`,
which is unimplemented) that trips the linter for some
reason.

kristian-clausal merged commit 9d9a410 into main on Mar 9, 2026
12 checks passed