dingsdax/zabon
zabon.ruby 🍊

A Ruby gem / Rails helper for dealing with Japanese line-breaking logic. It is basically a port of mikan.js, which implements a regular-expression-based algorithm to segment text into semantic chunks. No machine learning needed 🤖☺️. In addition, the resulting text segments can be wrapped in a configurable HTML tag. All praise 👏👏👏 for the algorithm goes to trkbt10.

Usage

# split this sentence
Zabon.split('この文を分割する')
 => ["この", "文を", "分割する"]

Configuration

Configuration controls the HTML tag that the result segments are wrapped in. It makes heavy use of Rails tag helpers. E.g. put this in an initializer in your Rails app.

Zabon.configure do |config|
  config.tag = :div # default: :span
  config.tag_options = { class: 'zabon_trara', style: 'font-size: 5em' } # default: { class: 'zabon', style: 'display: inline-block' }
  config.strip_tags = false # default: true
end
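Conceptually, the wrapping step combines Zabon.split with the configured tag and options. A minimal plain-Ruby sketch of what the generated markup looks like with the defaults — `wrap_segments` is a hypothetical helper for illustration, not the gem's API (the gem builds tags via Rails helpers):

```ruby
# Hypothetical sketch: wrap each segment from Zabon.split in the
# configured tag. Illustrates the output shape only; the real gem
# uses Rails tag helpers and the Zabon configuration object.
def wrap_segments(segments, tag: :span,
                  tag_options: { class: 'zabon', style: 'display: inline-block' })
  attrs = tag_options.map { |k, v| %(#{k}="#{v}") }.join(' ')
  segments.map { |s| "<#{tag} #{attrs}>#{s}</#{tag}>" }.join
end

html = wrap_segments(["この", "文を", "分割する"])
html.start_with?('<span class="zabon"')  # => true
```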

Rails

The gem ships a Railtie that automatically includes Zabon::Helper into ActionView::Base, so zabon_translate is available in all views without any further setup.

Call zabon_translate directly in views for strings that need segmentation:

<%= zabon_translate("page.title") %>

To replace the standard t() globally, add the following to an initializer. Note that this affects all ActionView translation calls; prefer explicit zabon_translate for finer control.

# config/initializers/zabon.rb
module ActionView
  module Helpers
    module TranslationHelper
      alias_method :translate_without_zabon, :translate

      def translate(key, **options)
        zabon_translate(key, orig_translate: :translate_without_zabon, **options)
      end

      alias t translate
    end
  end
end

Japanese grammar 🇯🇵

Just enough Japanese to understand the algorithm :)

Writing system ✍️

The Japanese writing system uses four different components: Kanji (logographic characters adopted from Chinese), the Hiragana and Katakana syllabaries, and Latin script (Rōmaji).

Particles

Joshi (助詞), Japanese particles written in Hiragana, are suffixes or short words that follow a modified noun, verb, adjective, or sentence. Their grammatical range can indicate various meanings and functions:

  • case markers
  • parallel markers
  • sentence ending particles
  • interjectory particles
  • adverbial particles
  • binding particles
  • conjunctive particles
  • phrasal particles

Line breaking

Certain characters in Japanese should not come at the end of a line, certain characters should not come at the start of a line, and some characters should never be split up across two lines. These rules are called Kinsoku Shori 禁則処理:

simplified:

| Class | Can't begin a line | Can't finish a line |
| --- | --- | --- |
| small kana | ぁぃぅぇぉっ... | |
| parentheses | ）〉》】... | （〈《【... |
| quotations | 』」”... | 「『“... |
| punctuation | 、。・！？... | |
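The table above can be turned into a small lookup. A sketch in plain Ruby — the character sets are abbreviated and this is not the gem's actual implementation:

```ruby
# Simplified kinsoku shori check based on the table above.
# Character sets abbreviated; illustration only.
CANNOT_BEGIN_LINE  = %w[ぁ ぃ ぅ ぇ ぉ っ ） 〉 》 】 』 」 ” 、 。 ・ ！ ？].freeze
CANNOT_FINISH_LINE = %w[（ 〈 《 【 「 『 “].freeze

# A line break between two characters is allowed only if the first
# may end a line and the second may start one.
def can_break_between?(before_char, after_char)
  !CANNOT_FINISH_LINE.include?(before_char) &&
    !CANNOT_BEGIN_LINE.include?(after_char)
end

can_break_between?('す', '。')  # => false: 。 can't begin a line
can_break_between?('（', 'こ')  # => false: （ can't finish a line
can_break_between?('を', '分')  # => true
```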

Text segmentation

Written Japanese uses no spaces and little punctuation to delimit words. Readers instead depend on grammatical cues (e.g. particles and verb endings), the relative frequency of character combinations, and semantic context in order to determine what words have been written. This is a non-trivial problem which is often solved by applying machine learning algorithms. Without a careful approach, breaks can occur randomly and usually in the middle of a word. This is an issue with typography on the web and results in a degradation of readability.

Zabon ???

I made a couple of assumptions when choosing the name:

  1. ๐ŸŠ The original algorithm name Mikan might be transscription of ่œœๆŸ‘, a Japanese citrus fruit (Mandarin, Satsuma)
  2. There already is a gem called mikan, and I didn't want to go for mikan_ruby or similar because of autoloading
  3. ๐Ÿ‡ My guess is the original author chose this name, b/c he was searching for something simpler then Google's Budou (่‘ก่„)
  4. 🔪 Both fruits have in common that they can easily be split apart into segments
  5. So I was searching for another fruit that can be easily split apart, and what splits apart better than a Pomelo (文旦, ぶんたん)? Zabon (derived from Portuguese: zamboa)

Who knows if that's how it was 🤷🏻‍♂️😂.

The Algorithm

This algorithm does NOT find the most minimal segmentation of unbreakable text segments and will probably have problems if a text is written solely in one alphabet. It also does not support Furigana (yet). It does basic text segmentation and then stitches the minimal segments back together into chunks which can be made unbreakable. We achieve unbreakability by wrapping each chunk in a tag with certain CSS rules.

Splitting

  1. Split the text across the different scripts used: split it into parts that are written in Kanji, Hiragana, Katakana, or Latin characters (incl. double-width characters). The assumption here is that parts written in the same script belong together.

  2. Then split up each element further by splitting off particles or sequences that might be used as particles. The original author of the algorithm has identified the following list (でなければ, について, かしら, くらい, けれど, なのか, ばかり, ながら, ことよ, こそ, こと, さえ, しか, した, たり, だけ, だに, だの, つつ, ても, てよ, でも, とも, から, など, なり, ので, のに, ほど, まで, もの, やら, より, って, で, と, な, に, ね, の, も, は, ば, へ, や, わ, を, か, が, さ, し, ぞ, て). To me that looks about right, but maybe some are missing.

  3. Split further along brackets and quotations: ([, 〈, 《, 「, 『, ｢, 【, 〔, 〚, 〖, 〘, ❮, ❬, ❪, ❨, (, <, {, ❲, ❰, ｛, ❴) plus the matching closing brackets and quotation marks.
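The splitting passes can be sketched with a few regular expressions. The sketch below uses only a small subset of the particle and bracket lists and is not the gem's exact implementation:

```ruby
# Rough sketch of the three splitting passes (subsets only).
SCRIPTS   = /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|[a-zA-Zａ-ｚＡ-Ｚ]+)/
PARTICLES = /(を|は|が|の|に|へ|で|と)/   # small subset of the full list
BRACKETS  = /([「『（【]|[」』）】])/      # small subset

def rough_split(text)
  text.scan(SCRIPTS).flatten                                         # pass 1: split across scripts
      .flat_map { |part| part.split(PARTICLES).reject(&:empty?) }    # pass 2: split off particles
      .flat_map { |part| part.split(BRACKETS).reject(&:empty?) }     # pass 3: split off brackets
end

rough_split('この文を分割する')
# => ["こ", "の", "文", "を", "分割", "する"]
```

These are the minimal segments; the stitching phase below glues them back together into unbreakable chunks.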

Stitching

  1. Now we have a list of minimal segments and try to stitch them back together into a result set so that they fulfil the Japanese line-breaking rules. We look at tuples from left to right, considering the current segment and the previous segment.

  2. If the current segment is an opening bracket or quotation, we have a definite start of an unbreakable segment and look ahead at the next segment.

  3. If the current segment is a closing bracket or quotation, we append it to the last entry of the result set and stop looking back; we've reached the end of a segment and start a new one in the next iteration.

  4. If the previous segment is an opening bracket, we stitch it together with the current segment to form a new segment. In the next iteration we don't need to look at the previous segment anymore and continue.

  5. If the current segment is a particle or a punctuation mark and we are not looking back (see step 7), we append the current segment to the last entry of the result set.

  6. If the current segment is a particle or a punctuation mark, or if the current segment is in Hiragana and the previous segment is not a bracket, quotation, punctuation mark, or conjunctive particle (と, の, に); we append to the last entry of the result set.

  7. If no condition from the stitching steps above matches, we can safely add the current segment to the result set.
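The stitching loop above can be sketched in a few lines. This simplified version handles only particles, punctuation, brackets, and the Hiragana rule, with abbreviated character sets — it is an illustration of the idea, not the gem's full logic:

```ruby
# Simplified stitching pass: particles, punctuation, closing brackets,
# and Hiragana runs are glued onto the previous chunk; a pending
# opening bracket is glued onto the following segment.
PARTICLES_SET = %w[を は が の に へ で と].freeze   # subset
PUNCTUATION   = %w[、 。 ・ ！ ？].freeze
OPENERS       = %w[「 『 （ 【].freeze
CLOSERS       = %w[」 』 ） 】].freeze

def stitch(segments)
  result = []
  segments.each do |seg|
    if result.any? && OPENERS.include?(result[-1])
      result[-1] += seg   # step 4: previous segment is an opening bracket
    elsif result.any? && (PARTICLES_SET.include?(seg) ||
                          PUNCTUATION.include?(seg) ||
                          CLOSERS.include?(seg) ||
                          seg.match?(/\A\p{Hiragana}+\z/))
      result[-1] += seg   # steps 3, 5, 6: glue onto the last chunk
    else
      result << seg       # step 7: start a new chunk
    end
  end
  result
end

stitch(["こ", "の", "文", "を", "分割", "する"])
# => ["この", "文を", "分割する"]
```

Note how the minimal segments from the splitting phase come back out as the chunks shown in the Usage example.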

Other solutions

Budou is a Python library which uses word segmenters to analyze input sentences. It can concatenate words into meaningful chunks utilizing part-of-speech tagging and other syntactic information. Processed chunks are wrapped in a SPAN tag. Depending on the text segmentation backend used, it also has support for Chinese & Korean. Since this library is written in Python, it cannot simply be used in Ruby, PHP, or Node.js.

Text segmenter backends

You can choose different segmenter backends depending on the needs of your environment. Currently, the segmenters below are supported.

  • Google Cloud Natural Language API: external API calls, can be costly
  • MeCab: Japanese POS tagger & morphological analyzer with lots of language bindings, e.g. also used in Google Japanese Input and Japanese Input on Mac OS X
  • TinySegmenter: extremely compact word-separation algorithm in JavaScript which produces MeCab-compatible word separation without depending on external APIs or dictionaries

TinySegmenter is an extremely compact word-separation algorithm in JavaScript which produces MeCab-compatible word separation without depending on external APIs. It classifies the input by using entities like characters, n-grams, Hiragana, and Katakana (Japanese phonetic lettering systems / syllabaries) and their combinations as features to determine whether a character is preceded by a word boundary. A [Naive Bayes](https://towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0) model was trained on the RWCP corpus, and to make that model even more compact, boosting with L1-norm regularization was used. Basically, it compresses the model and gets rid of redundant features as much as possible.
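The feature idea behind TinySegmenter can be illustrated with a toy classifier: label each character by script type and guess a boundary wherever the type changes. The real model scores many n-gram features with trained weights; this sketch only shows the feature-extraction idea, not TinySegmenter itself:

```ruby
# Toy illustration of TinySegmenter-style features: classify each
# character by script type, then guess a word boundary wherever the
# type changes between adjacent characters.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else                     :other
  end
end

# Returns the character indices at which a boundary is guessed.
def naive_boundaries(text)
  text.each_char.each_cons(2).with_index(1)
      .select { |(a, b), _i| char_type(a) != char_type(b) }
      .map { |_pair, i| i }
end

naive_boundaries('この文を分割する')
# => [2, 3, 4, 6]  (この|文|を|分割|する)
```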

CSS line-break: strict

Worth knowing: CSS has had native support for some Kinsoku Shori rules for a while now.

p {
  line-break: strict;
  overflow-wrap: break-word;
}

line-break: strict applies character-level Unicode line-breaking rules, which covers small kana, prolonged sound marks, and common punctuation cases. Browser implementations are not perfectly consistent and the spec intentionally leaves the precise rule set up to the user agent, so you may see subtle differences across browsers.

Zabon takes a different approach. Instead of telling the browser where not to break, it wraps each segment in a display: inline-block element that the browser cannot split internally. This gives you semantic grouping that works the same way in every browser, and also opens up per-segment styling like hover effects, search highlighting, or animations. The trade-off is server-side processing and extra markup.

If basic punctuation and small-kana rules are all you need, line-break: strict is enough and has zero runtime cost.
