dingsdax/zabon
zabon.ruby 🍊

A Ruby gem / Rails helper for dealing with Japanese line-breaking logic. It is basically a port of mikan.js, which implements a regular-expression-based algorithm to segment text into semantic chunks. No machine learning needed 🤖☺️. In addition, the resulting text segments can be wrapped in a configurable HTML tag. All praise 👏👏👏 for the algorithm goes to trkbt10.

Usage

# split this sentence
Zabon.split('この文を分割する')
 => ["この", "文を", "分割する"]

Configuration

Configuration controls the HTML tag that the result segments are wrapped in. It makes heavy use of Rails tag helpers. E.g. put this in an initializer in your Rails app.

Zabon.configure do |config|
  config.tag = :div # default: :span
  config.tag_options = { class: 'zabon_trara', style: 'font-size: 5em' } # default: { class: 'zabon', style: 'display: inline-block' }
  config.strip_tags = false # default: true
end
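Conceptually, the wrapping step combines Zabon.split with the configured tag and options. A minimal plain-Ruby sketch of what the generated markup looks like with the defaults — `wrap_segments` is a hypothetical helper for illustration, not the gem's API (the gem builds tags via Rails helpers):

```ruby
# Hypothetical sketch: wrap each segment from Zabon.split in the
# configured tag. Illustrates the output shape only; the real gem
# uses Rails tag helpers and the Zabon configuration object.
def wrap_segments(segments, tag: :span,
                  tag_options: { class: 'zabon', style: 'display: inline-block' })
  attrs = tag_options.map { |k, v| %(#{k}="#{v}") }.join(' ')
  segments.map { |s| "<#{tag} #{attrs}>#{s}</#{tag}>" }.join
end

html = wrap_segments(["この", "文を", "分割する"])
html.start_with?('<span class="zabon"')  # => true
```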

Rails

The gem ships a Railtie that automatically includes Zabon::Helper into ActionView::Base, so zabon_translate is available in all views without any further setup.

Call zabon_translate directly in views for strings that need segmentation:

<%= zabon_translate("page.title") %>

To replace the standard t() globally, add the following to an initializer. Note that this affects all ActionView translation calls; prefer explicit zabon_translate for finer control.

# config/initializers/zabon.rb
module ActionView
  module Helpers
    module TranslationHelper
      alias_method :translate_without_zabon, :translate

      def translate(key, **options)
        zabon_translate(key, orig_translate: :translate_without_zabon, **options)
      end

      alias t translate
    end
  end
end

Japanese grammar 🇯🇵

Just enough Japanese to understand the algorithm :)

Writing system ✍️

The Japanese writing system uses four different components: Kanji (logographic characters adopted from Chinese), the Hiragana and Katakana syllabaries, and Latin script (Rōmaji).

Particles

Joshi (助詞), Japanese particles written in Hiragana, are suffixes or short words that follow a modified noun, verb, adjective, or sentence. Their grammatical range can indicate various meanings and functions:

  • case markers
  • parallel markers
  • sentence ending particles
  • interjectory particles
  • adverbial particles
  • binding particles
  • conjunctive particles
  • phrasal particles

Line breaking

Certain characters in Japanese should not come at the end of a line, certain characters should not come at the start of a line, and some characters should never be split up across two lines. These rules are called Kinsoku Shori 禁則処理:

simplified:

| Class | Can't begin a line | Can't finish a line |
| --- | --- | --- |
| small kana | ぁぃぅぇぉっ... | |
| parentheses | ）〉》】... | （〈《【... |
| quotations | 』」”... | 「『“... |
| punctuation | 、。・！？... | |
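The table above can be turned into a small lookup. A sketch in plain Ruby — the character sets are abbreviated and this is not the gem's actual implementation:

```ruby
# Simplified kinsoku shori check based on the table above.
# Character sets abbreviated; illustration only.
CANNOT_BEGIN_LINE  = %w[ぁ ぃ ぅ ぇ ぉ っ ） 〉 》 】 』 」 ” 、 。 ・ ！ ？].freeze
CANNOT_FINISH_LINE = %w[（ 〈 《 【 「 『 “].freeze

# A line break between two characters is allowed only if the first
# may end a line and the second may start one.
def can_break_between?(before_char, after_char)
  !CANNOT_FINISH_LINE.include?(before_char) &&
    !CANNOT_BEGIN_LINE.include?(after_char)
end

can_break_between?('す', '。')  # => false: 。 can't begin a line
can_break_between?('（', 'こ')  # => false: （ can't finish a line
can_break_between?('を', '分')  # => true
```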

Text segmentation

Written Japanese uses no spaces and little punctuation to delimit words. Readers instead depend on grammatical cues (e.g. particles and verb endings), the relative frequency of character combinations, and semantic context in order to determine what words have been written. This is a non-trivial problem which is often solved by applying machine learning algorithms. Without a careful approach, breaks can occur randomly and usually in the middle of a word. This is an issue with typography on the web and results in a degradation of readability.

Zabon ???

I made a couple of assumptions when choosing the name:

  1. ๐ŸŠ The original algorithm name Mikan might be transscription of ่œœๆŸ‘, a Japanese citrus fruit (Mandarin, Satsuma)
  2. There already is a gem called mikan, and I didn't want to go for mikan_ruby or similar because of autoloading
  3. ๐Ÿ‡ My guess is the original author chose this name, b/c he was searching for something simpler then Google's Budou (่‘ก่„)
  4. 🔪 Both fruits have in common that they can easily be split apart into segments
  5. So I was searching for another fruit that can be easily split apart, and what splits apart better than a Pomelo (文旦, ぶんたん)? Zabon (derived from Portuguese: zamboa)

Who knows if that's how it was 🤷🏻‍♂️😂.

The Algorithm

This algorithm does NOT find the most minimal segmentation of unbreakable text segments and will probably have problems if a text is written solely in one alphabet. It also does not support Furigana (yet). It does basic text segmentation and then stitches the minimal segments back together into chunks which can be made unbreakable. We achieve unbreakability by wrapping each chunk in a tag with certain CSS rules.

Splitting

  1. Split the text across the different scripts used: split it into parts that are written in Kanji, Hiragana, Katakana, or Latin characters (incl. double-width characters). The assumption here is that parts written in the same script belong together.

  2. Then split up each element further by splitting off particles or sequences that might be used as particles. The original author of the algorithm has identified the following list (でなければ, について, かしら, くらい, けれど, なのか, ばかり, ながら, ことよ, こそ, こと, さえ, しか, した, たり, だけ, だに, だの, つつ, ても, てよ, でも, とも, から, など, なり, ので, のに, ほど, まで, もの, やら, より, って, で, と, な, に, ね, の, も, は, ば, へ, や, わ, を, か, が, さ, し, ぞ, て). To me that looks about right, but maybe some are missing.

  3. Split further along brackets and quotations: ([, 〈, 《, 「, 『, ｢, 【, 〔, 〚, 〖, 〘, ❮, ❬, ❪, ❨, (, <, {, ❲, ❰, ｛, ❴) plus the matching closing brackets and quotation marks.
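The splitting passes can be sketched with a few regular expressions. The sketch below uses only a small subset of the particle and bracket lists and is not the gem's exact implementation:

```ruby
# Rough sketch of the three splitting passes (subsets only).
SCRIPTS   = /(\p{Han}+|\p{Hiragana}+|\p{Katakana}+|[a-zA-Zａ-ｚＡ-Ｚ]+)/
PARTICLES = /(を|は|が|の|に|へ|で|と)/   # small subset of the full list
BRACKETS  = /([「『（【]|[」』）】])/      # small subset

def rough_split(text)
  text.scan(SCRIPTS).flatten                                         # pass 1: split across scripts
      .flat_map { |part| part.split(PARTICLES).reject(&:empty?) }    # pass 2: split off particles
      .flat_map { |part| part.split(BRACKETS).reject(&:empty?) }     # pass 3: split off brackets
end

rough_split('この文を分割する')
# => ["こ", "の", "文", "を", "分割", "する"]
```

These are the minimal segments; the stitching phase below glues them back together into unbreakable chunks.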

Stitching

  1. Now we have a list of minimal segments and try to stitch them back together into a result set so that they fulfil the Japanese line-breaking rules. We look at tuples from left to right, considering the current segment and the previous segment.

  2. If the current segment is an opening bracket or quotation, we have a definite start of an unbreakable segment and look ahead at the next segment.

  3. If the current segment is a closing bracket or quotation, we append it to the last entry of the result set and stop looking back; we've reached the end of a segment and start a new one in the next iteration.

  4. If the previous segment is an opening bracket, we stitch it together with the current segment to form a new segment. In the next iteration we don't need to look at the previous segment anymore and continue.

  5. If the current segment is a particle or a punctuation mark and we are not looking back (see step 7), we append the current segment to the last entry of the result set.

  6. If the current segment is a particle or a punctuation mark, or if the current segment is in Hiragana and the previous segment is not a bracket, quotation, punctuation mark, or conjunctive particle (と, の, に); we append to the last entry of the result set.

  7. If no condition from the stitching steps above matches, we can safely add the current segment to the result set.
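The stitching loop above can be sketched in a few lines. This simplified version handles only particles, punctuation, brackets, and the Hiragana rule, with abbreviated character sets — it is an illustration of the idea, not the gem's full logic:

```ruby
# Simplified stitching pass: particles, punctuation, closing brackets,
# and Hiragana runs are glued onto the previous chunk; a pending
# opening bracket is glued onto the following segment.
PARTICLES_SET = %w[を は が の に へ で と].freeze   # subset
PUNCTUATION   = %w[、 。 ・ ！ ？].freeze
OPENERS       = %w[「 『 （ 【].freeze
CLOSERS       = %w[」 』 ） 】].freeze

def stitch(segments)
  result = []
  segments.each do |seg|
    if result.any? && OPENERS.include?(result[-1])
      result[-1] += seg   # step 4: previous segment is an opening bracket
    elsif result.any? && (PARTICLES_SET.include?(seg) ||
                          PUNCTUATION.include?(seg) ||
                          CLOSERS.include?(seg) ||
                          seg.match?(/\A\p{Hiragana}+\z/))
      result[-1] += seg   # steps 3, 5, 6: glue onto the last chunk
    else
      result << seg       # step 7: start a new chunk
    end
  end
  result
end

stitch(["こ", "の", "文", "を", "分割", "する"])
# => ["この", "文を", "分割する"]
```

Note how the minimal segments from the splitting phase come back out as the chunks shown in the Usage example.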

Other solutions

Budou is a Python library which uses word segmenters to analyze input sentences. It can concatenate words into meaningful chunks utilizing part-of-speech tagging and other syntactic information. Processed chunks are wrapped in a SPAN tag. Depending on the text segmentation backend used, it also has support for Chinese & Korean. Since this library is written in Python, it cannot simply be used in Ruby, PHP, or Node.js.

Text segmenter backends

You can choose different segmenter backends depending on the needs of your environment. Currently, the segmenters below are supported.

  • Google Cloud Natural Language API: external API calls, can be costly
  • MeCab: Japanese POS tagger & morphological analyzer with lots of language bindings, e.g. also used in Google Japanese Input and Japanese Input on Mac OS X
  • TinySegmenter: extremely compact word-separation algorithm in JavaScript which produces MeCab-compatible word separation without depending on external APIs or dictionaries

TinySegmenter is an extremely compact word-separation algorithm in JavaScript which produces MeCab-compatible word separation without depending on external APIs. It classifies the input by using entities like characters, n-grams, Hiragana, and Katakana (Japanese phonetic lettering systems / syllabaries) and their combinations as features to determine whether a character is preceded by a word boundary. A [Naive Bayes](https://towardsdatascience.com/naive-bayes-explained-9d2b96f4a9c0) model was trained on the RWCP corpus, and to make that model even more compact, boosting with L1-norm regularization was used. Basically, it compresses the model and gets rid of redundant features as much as possible.
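The feature idea behind TinySegmenter can be illustrated with a toy classifier: label each character by script type and guess a boundary wherever the type changes. The real model scores many n-gram features with trained weights; this sketch only shows the feature-extraction idea, not TinySegmenter itself:

```ruby
# Toy illustration of TinySegmenter-style features: classify each
# character by script type, then guess a word boundary wherever the
# type changes between adjacent characters.
def char_type(ch)
  case ch
  when /\p{Han}/      then :kanji
  when /\p{Hiragana}/ then :hiragana
  when /\p{Katakana}/ then :katakana
  else                     :other
  end
end

# Returns the character indices at which a boundary is guessed.
def naive_boundaries(text)
  text.each_char.each_cons(2).with_index(1)
      .select { |(a, b), _i| char_type(a) != char_type(b) }
      .map { |_pair, i| i }
end

naive_boundaries('この文を分割する')
# => [2, 3, 4, 6]  (この|文|を|分割|する)
```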

CSS line-break: strict

Worth knowing: CSS has had native support for some Kinsoku Shori rules for a while now.

p {
  line-break: strict;
  overflow-wrap: break-word;
}

line-break: strict applies character-level Unicode line-breaking rules, which covers small kana, prolonged sound marks, and common punctuation cases. Browser implementations are not perfectly consistent and the spec intentionally leaves the precise rule set up to the user agent, so you may see subtle differences across browsers.

Zabon takes a different approach. Instead of telling the browser where not to break, it wraps each segment in a display: inline-block element that the browser cannot split internally. This gives you semantic grouping that works the same way in every browser, and also opens up per-segment styling like hover effects, search highlighting, or animations. The trade-off is server-side processing and extra markup.

If basic punctuation and small-kana rules are all you need, line-break: strict is enough and has zero runtime cost.
