ioDiacritics

Offline Swift and C++ library for Bosnian, Croatian, and Serbian diacritics restoration.

ioDiacritics restores stripped Latin diacritics in Bosnian, Croatian, and Serbian text:

Drzava takodjer moze.  ->  Država također može.
nasa drzava            ->  naša država
Drzava takodje moze.   ->  Država takođe može.

It is a small, deterministic, AI-free diacritic restoration engine for Swift/SwiftPM apps and portable C++17 projects. It runs fully offline, ships bundled dictionaries, and needs no server, Python, or machine-learning model; no user text ever leaves the device.

Designed for:

  • macOS and iOS apps that need local diacritics restoration
  • Windows and Linux applications that need a linkable C++ library
  • keyboard/input-method workflows where latency and trust matter
  • clipboard/text-cleanup tools
  • Bosnian/Croatian vraćanje dijakritike and Serbian dešišavanje
  • restoring BCS/Bosnian-Croatian-Serbian/Serbo-Croatian Latin text written without č, ć, š, ž, đ, or dž

Also known as: diacritics restoration, accent restoration, diacritization, rediacritization, dešišavanje, ošišana latinica, restoring Bosnian diacritics, restoring Croatian diacritics, and restoring Serbian Latin diacritics.

Try it — ready-to-run demo app

Want to see the library in action without writing any code? Download the ready-made macOS demo — signed and notarized, so it opens on any Mac with no warnings (just double-click):

➡️ Download ioDiacriticsDemo.dmg — ~6 MB, macOS 13+ (hosted on the demos repo's Releases)

Paste ošišana Bosnian/Croatian/Serbian text, get it restored (and optionally transliterated to Serbian Cyrillic), and copy the result with one button. The full source, plus a cross-platform C++ / Dear ImGui build for Windows, macOS and Linux, lives in ilya000/ioDiacritics-Demos.

⌨️ Or type with diacritics anywhere with Šišana, a macOS input method (IME): pick it from the keyboard menu like a layout and type bald Latin in any app. Diacritics appear live (citaj → čitaj), and ambiguous words open the system candidate window (the same mechanism Chinese and Japanese IMEs use), for example casa → časa · čaša · ćasa, picked with a number key. No window, no copy-paste, no Accessibility permission, fully offline.

➡️ Download Šišana installer (signed & notarized .pkg) — macOS 13+ · run the installer (choose all users or just me), then add Šišana under System Settings → Keyboard → Input Sources (it's listed under Serbian (Latin)). Source: Swift-macOS-InputMethod.

The package treats Bosnian, Croatian, and Serbian as closely related standard varieties under the Serbo-Croatian / BCS macrolanguage umbrella. ISO references:

  • Serbo-Croatian macrolanguage: ISO 639-3 hbs
  • Bosnian: ISO 639-1 bs, ISO 639-3 bos
  • Croatian: ISO 639-1 hr, ISO 639-3 hrv
  • Montenegrin: ISO 639-3 cnr
  • Serbian: ISO 639-1 sr, ISO 639-3 srp
  • Legacy Serbo-Croatian ISO 639-1 sh exists historically but is deprecated

Author: iLya Os (legal name: Ilya V. Osipov)
GitHub: https://github.com/ilya000
Home page: https://ctrl8.com/iodiacritics.html

Features

  • Swift-native: pure Swift package, no AppKit/UI dependency in the core engine.
  • C++17 port: portable CMake build for Windows, Linux, and non-Swift integrations.
  • Offline and private: no network calls, no cloud inference, no external service.
  • AI-free and deterministic: dictionary + frequency prior + conservative guards.
  • Precision-first: when uncertain, the library leaves the word unchanged.
  • Language packs: ship only the languages your app uses.
  • Validated dictionaries: published recall/edit-precision stats live with the package.
  • Keyboard-ready: one word lookup, streaming mode, left-context numeric guard.
  • Prepared-text mode: full-text restore can also use right context.
  • BCS-specific handling: đ/dj/d, dž, č/ć, numeric guard for sto, and Bosnian/Croatian/Serbian profile differences.

Demo Applications

[Screenshot: ioDiacritics macOS demo restoring stripped Croatian/Serbian text]

Demo source code lives in a separate GitHub repository: github.com/ilya000/ioDiacritics-Demos.

Demo | Platforms | Source
SwiftUI desktop demo | macOS | Swift-macOS
C++ Dear ImGui demo | Windows, macOS, Linux | Cpp-Windows-macOS-Linux

Both demos use the same shipped dictionaries and show the main workflow: paste or type ošišana text, restore Bosnian/Croatian/Serbian diacritics, highlight changed words, and copy the result. The Swift demo exercises the SwiftPM API used by macOS/iOS apps; the C++ demo exercises the portable C++17 port for Windows/Linux/native desktop integrations.

Supported Languages

Language | ISO | SwiftPM product | Dictionary keys | Headline recall | Notes
Bosnian | bs / bos | ioDiacriticsBosnian | 114,092 | 94.9% | Ijekavian-safe; validated on clean and UGC corpora.
Croatian | hr / hrv | ioDiacriticsCroatian | 156,922 | 96.1% | First-tier pack; validated on clean and real-register corpora.
Montenegrin | cnr | ioDiacriticsSerbian | Serbian pack | Limited | Shared BCS/Serbian forms restore; Montenegrin-specific ś/ź letters are preserved, not normalized away.
Serbian | sr / srp | ioDiacriticsSerbian | 156,931 | 93.7% | First-tier pack; validated on real Serbian UGC/forum text.

The important trust metric is edit precision: of the edits the library makes, how many are right. The current BCS packs measure around 99.5-100% edit precision depending on language/register. See RESEARCH.md and research/ for details.
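The relationship between the two metrics can be sketched as follows. This is a minimal illustration, assuming recall is counted over fixable words (words whose original form carries a diacritic) and edit precision over the edits actually made; the struct and function names are hypothetical and not part of the library.

```cpp
#include <cstddef>

// Hypothetical tally from one validation run; names are illustrative only.
struct EditCounts {
    std::size_t fixable;        // stripped words whose original had a diacritic
    std::size_t correct_edits;  // edits that reproduced the original word
    std::size_t wrong_edits;    // edits that produced a different word
};

// Recall: share of fixable words that were restored correctly.
inline double recall(const EditCounts& c) {
    return c.fixable == 0 ? 0.0
                          : static_cast<double>(c.correct_edits) / c.fixable;
}

// Edit precision: share of the edits made that were right.
inline double edit_precision(const EditCounts& c) {
    const std::size_t edits = c.correct_edits + c.wrong_edits;
    return edits == 0 ? 1.0 : static_cast<double>(c.correct_edits) / edits;
}
```

Under these definitions a skipped word lowers recall only, while a wrong edit lowers edit precision. That asymmetry is why a precision-first engine can leave uncertain words alone without hurting its trust metric.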

Language neutrality matters for Bosnian, Croatian, and Serbian alike — none is singled out. All three are parallel first-tier language packs: separate dictionaries, separate profiles, separate resolve tables, and separate validation metrics, listed alphabetically throughout. No pack is the canonical/base language for the others. (Serbian additionally has two alphabets — Cyrillic and Latin — but that is a per-language note, not a privileged status.) Montenegrin is currently supported in a limited compatibility mode through the Serbian pack.

Research Basis

ioDiacritics is the software artifact of a larger empirical study on restoring stripped diacritics in the Serbo-Croatian / BCS macrolanguage (ISO 639-3 hbs). The package is not a toy word map: the shipped dictionaries and guard rules were built through repeated corpus collection, dictionary growth, ablation, error analysis, and independent validation.

Current shipped data volume and validation scope:

  • 427,945 reverse-index keys across the three bundled language packs.
  • 2,455 confident ambiguity-resolution rules: Bosnian 481, Croatian 578, Serbian 1,396 (Serbian trimmed from 5,150 in v0.8.2 — only the dead tail; recall unchanged).
  • Croatian was built to 171,938 candidate keys before being frequency-trimmed to 156,922 shipped keys; Bosnian was grown from 11,350 to 114,092 shipped keys.
  • The shipped reliability passports cover 10,562 documents/posts and 21,185 fixable word instances across Bosnian, Croatian, and Serbian.
  • A balanced language-register matrix covers Bosnian, Croatian, and Serbian across wiki, news, and Tatoeba cells: 11,310 documents and 30,573 fixable word instances under one comparable protocol.
  • Validation uses document-level bootstrap confidence intervals (B=2000, seed 12345), resampling documents rather than individual words.
  • Raw corpora are intentionally not bundled; reports and reproducibility notes live in research/ and tools/.
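A document-level bootstrap of the kind described above might look like the sketch below. The replicate count (2000), the seed (12345), and the choice to resample whole documents come from this README; the percentile method, the pooled-recall statistic, the std::mt19937 generator, and all names are assumptions made for illustration, not the project's actual tooling.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Per-document tally: fixable words and how many were restored correctly.
struct DocTally {
    std::size_t fixable;
    std::size_t restored;
};

// Percentile bootstrap over documents: each replicate redraws whole documents
// with replacement and recomputes pooled recall.
// Returns the (2.5%, 97.5%) bounds of the replicate distribution.
inline std::pair<double, double> bootstrap_recall_ci(
        const std::vector<DocTally>& docs,
        std::size_t replicates = 2000,
        unsigned seed = 12345) {
    if (docs.empty() || replicates == 0) return {0.0, 0.0};
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, docs.size() - 1);
    std::vector<double> recalls;
    recalls.reserve(replicates);
    for (std::size_t b = 0; b < replicates; ++b) {
        std::size_t fixable = 0, restored = 0;
        for (std::size_t i = 0; i < docs.size(); ++i) {
            const DocTally& d = docs[pick(rng)];  // draw a whole document
            fixable += d.fixable;
            restored += d.restored;
        }
        recalls.push_back(fixable == 0
                              ? 0.0
                              : static_cast<double>(restored) / fixable);
    }
    std::sort(recalls.begin(), recalls.end());
    const std::size_t lo = static_cast<std::size_t>(0.025 * (replicates - 1));
    const std::size_t hi = static_cast<std::size_t>(0.975 * (replicates - 1));
    return {recalls[lo], recalls[hi]};
}
```

Resampling documents rather than words keeps errors that cluster inside one document together, so the interval is not artificially narrowed by treating correlated words as independent.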

A scientific paper describing the method, datasets, validation protocol, and results is in preparation. A link/citation will be added here after publication.

Statistical Results

Shipped Product Passport

These are per-pack deployment numbers. They summarize each shipped language pack on its own validation corpus, so they are useful for product confidence but should not be read as a language ranking.

Pack | Dictionary keys | Resolve rules | Validation sample | Recall | Edit precision | OOV
Bosnian | 114,092 | 481 | 943 clean Tatoeba docs / 1,332 fixable words | 94.9% [93.7, 96.1] | ≥99.8%* | 0.9%
Croatian | 156,922 | 578 | 3,614 clean Tatoeba docs / 5,510 fixable words | 96.1% [95.6, 96.6] | ≥99.9% (99.98%) | 0.5%
Serbian | 156,931 | 1,396 | 6,005 real UGC/forum posts / 14,343 fixable words | 93.7% | ~99.6% | measured in validation reports

* Bosnian: 0 wrong edits observed on 1,332 fixable words — a tight upper bound (rule-of-three ≤~0.2% at 95% CI), not a literal 100%-precision guarantee.
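The arithmetic behind that footnote can be checked with a one-line helper. This is only a sketch of the standard rule of three, using the fixable-word count as the number of trials as the footnote does; the function name is hypothetical.

```cpp
#include <cstddef>

// Rule of three: after n independent trials with zero observed failures,
// the approximate 95% upper confidence bound on the failure rate is 3 / n.
inline double rule_of_three_upper_bound(std::size_t trials) {
    return trials == 0 ? 1.0 : 3.0 / static_cast<double>(trials);
}
```

For 1,332 trials this gives 3 / 1332 ≈ 0.0023, so the wrong-edit rate is bounded near 0.2% and edit precision is reported as at least about 99.8% instead of 100%.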

Comparable Cross-Language Matrix

For a fair Bosnian/Croatian/Serbian comparison, use the symmetric matrix: one protocol, matched registers, resolve off, independent corpora, document-level bootstrap CI. This is the main neutrality check.

Register | Bosnian recall | Croatian recall | Serbian recall | Edit precision
Wiki | 86.9% [85.7, 87.9] | 87.4% [86.2, 88.5] | 84.1% [82.9, 85.3] | 99.8-100%
News | 88.0% [87.0, 89.0] | 89.4% [88.5, 90.3] | 82.2% [81.1, 83.3] | 99.4-99.9%
Tatoeba | 86.0% [83.9, 87.9] | 86.6% [84.7, 88.5] | 86.7% [84.6, 88.7] | 99.9-100%

Main conclusion: when register, protocol, and evaluation setup are held fixed, the engine is not Serbian-first or Croatian-first. The differences are mostly dictionary coverage/register effects, while edit precision stays near 100%.

Register And Resolve Effects

Experiment | Before | After | Meaning
Bosnian wiki dictionary growth | 60.4% at 11,350 keys | 86.9% at 114,092 keys | Bosnian was coverage-limited, not algorithm-limited.
Croatian wiki dictionary growth | 82.6% at 78,562 keys | 87.4% at 156,922 keys | More validated coverage improved recall without hurting precision.
Croatian chat resolve table | 80.0% resolve off | 89.5% resolve on | Confident ambiguity rules add about +9.5pp recall.
Serbian chat resolve table | 81.9% resolve off | 91.4% resolve on | Same +9.5pp effect under the same protocol.
Serbian real human stripped text (SentiComments.SR) | 82.8% resolve off | 93.9% resolve on | Synthetic shaving does not overstate Serbian real-error performance.

Baselines

A naive most-frequent baseline can raise recall but makes many more wrong edits. On Serbian chat it measured 85.0% recall but only 89.5% edit precision; ioDiacritics with resolve measured 91.4% recall and 99.7% edit precision on the same cell. This is why the package optimizes for edit precision rather than raw recall alone.

Swift Installation

Requirements:

  • Swift 5.9+
  • macOS 13+ / iOS 15+
  • Swift Package Manager

Add the package to your Package.swift:

dependencies: [
    .package(url: "https://github.com/ilya000/ioDiacritics.git", from: "0.9.3")
]

Then depend on only the language packs you need:

.target(
    name: "YourApp",
    dependencies: [
        .product(name: "ioDiacriticsBosnian", package: "ioDiacritics"),
        .product(name: "ioDiacriticsCroatian", package: "ioDiacritics"),
        .product(name: "ioDiacriticsSerbian", package: "ioDiacritics")
    ]
)

For local development:

.package(path: "../ioDiacritics")

C++ / Windows / Linux

The C++ port lives in cpp/. It is a dependency-free C++17 library with a CMake build. It uses the same shipped JSON dictionaries as the Swift package, so the linguistic data and quality numbers stay shared.

Build on Linux/macOS:

cmake -S cpp -B build-cpp -DIODIACRITICS_CPP_BUILD_TESTS=ON
cmake --build build-cpp
ctest --test-dir build-cpp

Build on Windows with Visual Studio:

cmake -S cpp -B build-cpp -G "Visual Studio 17 2022" -DIODIACRITICS_CPP_BUILD_TESTS=ON
cmake --build build-cpp --config Release
ctest --test-dir build-cpp -C Release

Minimal C++ usage:

#include <iodiacritics/iodiacritics.hpp>

using iodiacritics::Restorer;

auto sr = Restorer::load_file(
    "Sources/ioDiacriticsSerbian/Resources/deshishana_sr.json",
    iodiacritics::serbian_profile());

auto fixed = sr.restore_prepared_text("Drzava takodje moze.");
// Država takođe može.

The C++ API mirrors the Swift engine:

  • restore(...) for one token;
  • restore_text(...) for streaming/live text with left-context numeric guard;
  • restore_prepared_text(...) for completed text with left and right numeric context;
  • is_language(...) when loading the optional invariant word set.

Quick Start

import ioDiacriticsBosnian

let text = "Drzava takodjer moze."
let restored = Bosnian.shared?.restorePreparedText(text)

print(restored ?? text)
// Država također može.

Croatian:

import ioDiacriticsCroatian

let text = "nasa drzava"
let restored = Croatian.shared?.restorePreparedText(text)

print(restored ?? text)
// naša država

Serbian:

import ioDiacriticsSerbian

let text = "Drzava takodje moze."
let restored = Serbian.shared?.restorePreparedText(text)

print(restored ?? text)
// Država takođe može.

API

Each language pack exposes a shared lazy restorer and measured stats:

Bosnian.shared
Croatian.shared
Serbian.shared

Bosnian.stats.summary
Croatian.stats.summary
Serbian.stats.summary

Prepared Text

Use restorePreparedText(_:) when the whole text is available, such as clipboard content, imported text, files, notes, or a text area after editing.

let fixed = Serbian.shared?.restorePreparedText("sto evra")
// "sto evra" because `sto` can mean 100 and `evra` is visible on the right

Prepared-text mode can look both left and right for numeric/measure context.

Streaming Text

Use restoreText(_:) for streaming/live transforms where the future is not available yet.

let fixed = Serbian.shared?.restoreText("100 sto")
// "100 sto"

let live = Serbian.shared?.restoreText("sto evra")
// "što evra" because live mode intentionally cannot see right context

This is useful for keyboard-like workflows and preserves the original live-input behavior.
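The difference between the two modes can be pictured with a toy guard for the single homograph sto (100) versus što (what). This is a simplified sketch that only reproduces the three examples shown above; the measure-word list, the function names, and the default-to-što rule are assumptions, and the library's real guard is not documented here.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// True when the token consists only of ASCII digits, e.g. "100".
inline bool is_number(const std::string& token) {
    if (token.empty()) return false;
    for (unsigned char ch : token) {
        if (!std::isdigit(ch)) return false;
    }
    return true;
}

// Hypothetical measure/currency words; the real guard list is not shown here.
inline bool is_measure_word(const std::string& token) {
    static const std::unordered_set<std::string> words{"evra", "dinara", "kuna"};
    return words.count(token) != 0;
}

// Toy decision for "sto": keep it when numeric context is visible,
// otherwise restore it to "što".
// use_right_context = false models streaming mode (the next word is unknown),
// use_right_context = true models prepared-text mode.
inline std::string resolve_sto(const std::string& prev,
                               const std::string& next,
                               bool use_right_context) {
    const bool numeric_left = is_number(prev);
    const bool numeric_right =
        use_right_context && (is_number(next) || is_measure_word(next));
    return (numeric_left || numeric_right) ? "sto" : "što";
}
```

With this sketch, resolve_sto("100", "", false) keeps "sto" in both modes, while resolve_sto("", "evra", ...) keeps "sto" only when the right context is allowed.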

One Word

Use restore(_:prevWord:nextWord:isForeignWord:) when your app already tokenizes text.

let word = Serbian.shared?.restore("zelim")
// "želim"

let guarded = Serbian.shared?.restore("sto", prevWord: "100")
// nil: pass through unchanged

nil means "leave the token unchanged." That makes call sites simple and keeps the engine precision-first.
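For readers on the C++ side, the same pass-through convention can be modeled with std::optional. The sketch below uses a toy stand-in restorer; it is not the library's C++ API, whose exact return type is not shown in this README, and both function names are hypothetical.

```cpp
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a one-word restorer: std::nullopt plays the role of nil.
inline std::optional<std::string> toy_restore(const std::string& word) {
    if (word == "zelim") return std::string("želim");
    return std::nullopt;  // unknown or unsafe: leave the token unchanged
}

// Call-site pattern: fall back to the original token when no edit is returned.
inline std::vector<std::string> restore_tokens(
        const std::vector<std::string>& tokens) {
    std::vector<std::string> out;
    out.reserve(tokens.size());
    for (const auto& token : tokens) {
        out.push_back(toy_restore(token).value_or(token));
    }
    return out;
}
```

The caller never needs a separate "was it changed?" flag: an empty result and an unchanged token are the same case.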

Why Not Just Use AI?

Neural systems such as BERT or ByT5 can reach very high accuracy for diacritics restoration, especially on clean benchmark corpora. ioDiacritics solves a different product problem:

  • it is tiny compared with an ML model
  • it loads locally inside a Swift app
  • it does not need GPU/CPU-heavy inference
  • it is deterministic and testable
  • it is easy to ship in keyboard and clipboard utilities
  • it prefers missed restorations over wrong edits

For end-user input tools, a conservative wrong-edit rate is often more important than squeezing out the last point of recall.

How It Works

The engine builds a reverse index from a stripped, "bald" surface form to valid diacritic candidates:

Class | Example | Behavior
Invariant | telefon | Leave unchanged.
Deterministic | zelim -> želim | Restore directly.
Ambiguous | casa -> časa/čaša/... | Use frequency order when safe.
Valid bald homograph | sto vs što | Usually hold, unless a confident resolve rule applies.

Important details:

  • đ can be typed as dj or d, so dictionary keys are expanded at build time.
  • dž is handled as a digraph.
  • č and ć both collapse to c, so ambiguity is real.
  • Numeric guard protects words like sto near numbers and measure words.
  • The runtime hot path is a dictionary lookup plus small rule checks.
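The reverse index and the four classes above can be sketched in a few dozen lines. This is an illustrative toy, not the shipped engine: it handles lowercase letters only, assumes UTF-8 source and execution encoding, expects the word list in descending frequency order, and adds an Unknown class for out-of-vocabulary input. All names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` with `to` (byte-wise; safe for UTF-8
// because each pattern is a complete code-point sequence).
inline std::string replace_all(std::string text, const std::string& from,
                               const std::string& to) {
    if (from.empty()) return text;
    for (std::size_t pos = text.find(from); pos != std::string::npos;
         pos = text.find(from, pos + to.size())) {
        text.replace(pos, from.size(), to);
    }
    return text;
}

// Strip lowercase BCS diacritics; `d_bar` is what đ becomes ("d" or "dj").
inline std::string strip(std::string word, const std::string& d_bar) {
    static const std::pair<std::string, std::string> rules[] = {
        {"č", "c"}, {"ć", "c"}, {"š", "s"}, {"ž", "z"}};
    for (const auto& rule : rules) {
        word = replace_all(std::move(word), rule.first, rule.second);
    }
    return replace_all(std::move(word), "đ", d_bar);
}

using ReverseIndex = std::map<std::string, std::vector<std::string>>;

// Build-time expansion: every word is indexed under both đ spellings.
inline ReverseIndex build_index(const std::vector<std::string>& words) {
    ReverseIndex index;
    for (const auto& word : words) {
        for (const char* d_bar : {"d", "dj"}) {
            auto& candidates = index[strip(word, d_bar)];
            if (std::find(candidates.begin(), candidates.end(), word) ==
                candidates.end()) {
                candidates.push_back(word);
            }
        }
    }
    return index;
}

enum class WordClass { Unknown, Invariant, Deterministic, Ambiguous, BaldHomograph };

// Classify a stripped ("bald") token by its candidate list.
inline WordClass classify(const ReverseIndex& index, const std::string& bald) {
    const auto it = index.find(bald);
    if (it == index.end()) return WordClass::Unknown;
    const auto& c = it->second;
    const bool bald_is_valid = std::find(c.begin(), c.end(), bald) != c.end();
    if (c.size() == 1) {
        return bald_is_valid ? WordClass::Invariant : WordClass::Deterministic;
    }
    return bald_is_valid ? WordClass::BaldHomograph : WordClass::Ambiguous;
}
```

At runtime only classify-style lookups happen; the strip rules and the đ expansion are paid once when the index is built, which matches the "dictionary lookup plus small rule checks" hot path described above.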

Architecture

Target | Contents
ioDiacritics | Generic engine: Restorer, LanguageProfile, LangStats, version.
ioDiacriticsBosnian | Bosnian profile, bundled dictionary, stats, Bosnian.shared.
ioDiacriticsCroatian | Croatian profile, bundled dictionary, stats, Croatian.shared.
ioDiacriticsSerbian | Serbian profile, bundled dictionary, stats, Serbian.shared.

The core target has no resources and no UI. A consumer depends only on the language packs it ships, so an app does not bundle dictionaries it does not use.

Reliability Passport

Each language pack includes a LangStats value:

print(Serbian.stats.summary)

This gives apps an About-panel friendly reliability summary: dictionary size, recall, wrong-edit rate, edit precision, and validation corpus size. Drift tests ensure the published dictionary key counts stay in sync with the bundled JSON resources.

Adding Another Language

Most Latin-script languages need data plus a small profile:

  1. Build a reverse dictionary with tools/build_deshishana.py.
  2. Add language-specific strip rules in LanguageProfile.
  3. Add a thin ioDiacritics<Language> target with bundled deshishana_<code>.json.
  4. Validate by shaving accented text, restoring it, and comparing against the original.
  5. Publish LangStats with recall/edit-error numbers.
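Step 4 can be sketched as a small shave-restore-compare loop. This is a word-level illustration under stated assumptions: the shave and restore callbacks are placeholders supplied by the caller, the tally names are hypothetical, and the project's actual validation tooling in tools/ may work differently.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Counts collected while replaying gold (accented) words through the engine.
struct ValidationTally {
    std::size_t fixable = 0;        // gold words that lose a diacritic when shaved
    std::size_t correct_edits = 0;  // edits that reproduce the gold word
    std::size_t wrong_edits = 0;    // edits that differ from the gold word
};

using Transform = std::function<std::string(const std::string&)>;

// Shave each gold word, restore the bald form, and compare with the original.
inline ValidationTally validate(const std::vector<std::string>& gold_words,
                                const Transform& shave,
                                const Transform& restore) {
    ValidationTally tally;
    for (const auto& gold : gold_words) {
        const std::string bald = shave(gold);
        const std::string out = restore(bald);
        if (bald != gold) ++tally.fixable;
        if (out == bald) continue;  // no edit was made
        if (out == gold) {
            ++tally.correct_edits;
        } else {
            ++tally.wrong_edits;
        }
    }
    return tally;
}
```

Note that a wrong edit can also hit a word that was never fixable, for example turning a legitimate sto into što, which is why wrong edits are counted separately from fixable words.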

Known special cases:

  • Turkish needs careful ı/i handling.
  • German has digraph conventions such as ä -> ae, ß -> ss.
  • Vietnamese likely needs a stronger context model before dictionary-only restoration is good enough.

See BACKLOG.md for roadmap notes.

Prior Art

ioDiacritics is not the first diacritics restoration system. It is deliberately a small offline Swift package for product use.

Related work includes:

  • turanjanin/serbian-language-tools: PHP library for Serbian transliteration and diacritic restoration.
  • clarinsi/redi: Croatian, Serbian, and Slovene restoration tool with optional language models.
  • BERT/Transformer-based diacritics restoration research for many languages.

The distinguishing feature here is the packaging and product shape: SwiftPM, bundled BCS language packs, no network, no AI model, high edit precision, and keyboard/clipboard-friendly APIs.

Project Files

License

Swift source code, tests, scripts, and documentation are licensed under the MIT License. See LICENSE.

The bundled JSON dictionaries are generated data artifacts derived from third-party lexical and corpus resources. They have separate provenance and attribution requirements. See NOTICE.md and docs/DATA_LICENSE_AUDIT.md before publishing or shipping a public binary that includes the dictionaries.

ioDiacritics is provided "as is", without warranties of any kind, to the fullest extent permitted by applicable law. By downloading or using it you accept the terms of use.
