ioDiacritics

Offline Swift and C++ library for Bosnian, Croatian, and Serbian diacritics restoration.

ioDiacritics restores stripped Latin diacritics in Bosnian, Croatian, and Serbian text:

Drzava takodjer moze.  ->  Država također može.
nasa drzava            ->  naša država
Drzava takodje moze.   ->  Država takođe može.

It is a small, deterministic, AI-free diacritic restoration engine for Swift/SwiftPM apps and portable C++17 projects. It runs fully offline and ships bundled dictionaries. It needs no server, no Python, and no machine-learning model, and user text never leaves the device.

Designed for:

macOS and iOS apps that need local diacritics restoration
Windows and Linux applications that need a linkable C++ library
keyboard/input-method workflows where latency and trust matter
clipboard/text-cleanup tools
Bosnian/Croatian vraćanje dijakritike and Serbian dešišavanje
restoring BCS/Bosnian-Croatian-Serbian/Serbo-Croatian Latin text written without č, ć, š, ž, đ, dž

Also known as: diacritics restoration, accent restoration, diacritization, rediacritization, dešišavanje, ošišana latinica, restoring Bosnian diacritics, restoring Croatian diacritics, and restoring Serbian Latin diacritics.

Try it — ready-to-run demo app

Want to see the library in action without writing any code? Download the ready-made macOS demo — signed and notarized, so it opens on any Mac with no warnings (just double-click):

➡️ Download ioDiacriticsDemo.dmg — ~6 MB, macOS 13+ (hosted on the demos repo's Releases)

Paste ošišana Bosnian/Croatian/Serbian text and get it restored (and optionally transliterated to Serbian Cyrillic), copy with one button. Full source — plus a cross-platform C++ / Dear ImGui build for Windows, macOS and Linux — lives in ilya000/ioDiacritics-Demos.

⌨️ Or type with diacritics anywhere with Šišana, a macOS input method (IME). Pick it from the keyboard menu like a layout and type bald Latin in any app: diacritics appear live (citaj → čitaj), and ambiguous words open the system candidate window (the same mechanism Chinese and Japanese IMEs use), for example casa → časa · čaša · ćasa, where a candidate is picked with a number key. No window, no copy-paste, no Accessibility permission, fully offline.

➡️ Download Šišana installer (signed & notarized .pkg) — macOS 13+ · run the installer (choose all users or just me), then add Šišana under System Settings → Keyboard → Input Sources (it's listed under Serbian (Latin)). Source: Swift-macOS-InputMethod.

The package treats Bosnian, Croatian, and Serbian as closely related standard varieties under the Serbo-Croatian / BCS macrolanguage umbrella. ISO references:

Serbo-Croatian macrolanguage: ISO 639-3 hbs
Bosnian: ISO 639-1 bs, ISO 639-3 bos
Croatian: ISO 639-1 hr, ISO 639-3 hrv
Montenegrin: ISO 639-3 cnr
Serbian: ISO 639-1 sr, ISO 639-3 srp
Legacy Serbo-Croatian: ISO 639-1 sh (deprecated, kept here for historical reference)

Author: iLya Os (legal name: Ilya V. Osipov)
GitHub: https://github.com/ilya000
Home page: https://ctrl8.com/iodiacritics.html

Features

Swift-native: pure Swift package, no AppKit/UI dependency in the core engine.
C++17 port: portable CMake build for Windows, Linux, and non-Swift integrations.
Offline and private: no network calls, no cloud inference, no external service.
AI-free and deterministic: dictionary + frequency prior + conservative guards.
Precision-first: when uncertain, the library leaves the word unchanged.
Language packs: ship only the languages your app uses.
Validated dictionaries: published recall/edit-precision stats live with the package.
Keyboard-ready: one word lookup, streaming mode, left-context numeric guard.
Prepared-text mode: full-text restore can also use right context.
BCS-specific handling: đ/dj/d, dž, č/ć, numeric guard for sto, and Bosnian/Croatian/Serbian profile differences.

Demo Applications

Demo source code lives in a separate GitHub repository: github.com/ilya000/ioDiacritics-Demos.

Demo	Platforms	Source
SwiftUI desktop demo	macOS	`Swift-macOS`
C++ Dear ImGui demo	Windows, macOS, Linux	`Cpp-Windows-macOS-Linux`

Both demos use the same shipped dictionaries and show the main workflow: paste or type ošišana text, restore Bosnian/Croatian/Serbian diacritics, highlight changed words, and copy the result. The Swift demo exercises the SwiftPM API used by macOS/iOS apps; the C++ demo exercises the portable C++17 port for Windows/Linux/native desktop integrations.

Supported Languages

Language	ISO	SwiftPM product	Dictionary keys	Headline recall	Notes
Bosnian	`bs` / `bos`	`ioDiacriticsBosnian`	114,092	94.9%	Ijekavian-safe; validated on clean and UGC corpora.
Croatian	`hr` / `hrv`	`ioDiacriticsCroatian`	156,922	96.1%	First-tier pack; validated on clean and real-register corpora.
Montenegrin	`cnr`	`ioDiacriticsSerbian`	Serbian pack	Limited	Shared BCS/Serbian forms restore; Montenegrin-specific `ś`/`ź` letters are preserved, not normalized away.
Serbian	`sr` / `srp`	`ioDiacriticsSerbian`	156,931	93.7%	First-tier pack; validated on real Serbian UGC/forum text.

The important trust metric is edit precision: of the edits the library makes, how many are right. The current BCS packs measure around 99.5-100% edit precision depending on language/register. See RESEARCH.md and research/ for details.
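To make the two headline metrics concrete, here is a minimal C++17 sketch of how recall and edit precision could be computed from word-aligned evaluation triples. The `EvalToken`, `Metrics`, `score`, and `sample_tokens` names are hypothetical and not part of the library; the definitions follow the wording above (recall over fixable words, edit precision over edits actually made), and the exact protocol used for the published numbers is described in RESEARCH.md, not here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One evaluated token: the accented gold form, the stripped input, and the
// restorer's output. (Hypothetical struct for illustration only.)
struct EvalToken {
    std::string gold;
    std::string stripped;
    std::string restored;
};

struct Metrics {
    std::size_t fixable = 0;        // tokens where gold differs from stripped
    std::size_t edits = 0;          // tokens the restorer changed
    std::size_t correct_edits = 0;  // changed tokens that now equal gold

    // Share of fixable tokens that were restored correctly.
    double recall() const {
        return fixable == 0 ? 0.0 : static_cast<double>(correct_edits) / fixable;
    }
    // Share of edits that were right.
    double edit_precision() const {
        return edits == 0 ? 0.0 : static_cast<double>(correct_edits) / edits;
    }
};

Metrics score(const std::vector<EvalToken>& tokens) {
    Metrics m;
    for (const EvalToken& t : tokens) {
        if (t.gold != t.stripped) ++m.fixable;
        if (t.restored != t.stripped) {
            ++m.edits;
            if (t.restored == t.gold) ++m.correct_edits;
        }
    }
    return m;
}

// Made-up sample: two correct edits, two misses, one wrong edit on a valid
// bald word, and one untouched invariant word.
std::vector<EvalToken> sample_tokens() {
    return {
        {"država", "drzava", "država"},
        {"naša", "nasa", "naša"},
        {"želim", "zelim", "zelim"},
        {"može", "moze", "moze"},
        {"sto", "sto", "što"},
        {"telefon", "telefon", "telefon"},
    };
}
```

On the made-up sample, `score(sample_tokens())` counts 4 fixable tokens, 3 edits, and 2 correct edits, so recall is 2/4 and edit precision is 2/3. A missed restoration lowers only recall, while a wrong edit lowers only edit precision, which is why the two numbers are reported separately.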

Language neutrality matters for Bosnian, Croatian, and Serbian alike — none is singled out. All three are parallel first-tier language packs: separate dictionaries, separate profiles, separate resolve tables, and separate validation metrics, listed alphabetically throughout. No pack is the canonical/base language for the others. (Serbian additionally has two alphabets — Cyrillic and Latin — but that is a per-language note, not a privileged status.) Montenegrin is currently supported in a limited compatibility mode through the Serbian pack.

Research Basis

ioDiacritics is the software artifact of a larger empirical study on restoring stripped diacritics in the Serbo-Croatian / BCS macrolanguage (ISO 639-3 hbs). The package is not a toy word map: the shipped dictionaries and guard rules were built through repeated corpus collection, dictionary growth, ablation, error analysis, and independent validation.

Current shipped data volume and validation scope:

427,945 reverse-index keys across the three bundled language packs.
2,455 confident ambiguity-resolution rules: Bosnian 481, Croatian 578, Serbian 1,396. The Serbian table was trimmed from 5,150 rules in v0.8.2; only the dead tail was removed, and recall is unchanged.
Croatian was built to 171,938 candidate keys before being frequency-trimmed to 156,922 shipped keys; Bosnian was grown from 11,350 to 114,092 shipped keys.
The shipped reliability passports cover 10,562 documents/posts and 21,185 fixable word instances across Bosnian, Croatian, and Serbian.
A balanced language-register matrix covers Bosnian, Croatian, and Serbian across wiki, news, and Tatoeba cells: 11,310 documents and 30,573 fixable word instances under one comparable protocol.
Validation uses document-level bootstrap confidence intervals (B=2000, seed 12345), resampling documents rather than individual words.
Raw corpora are intentionally not bundled; reports and reproducibility notes live in research/ and tools/.
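The document-level bootstrap mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the project's evaluation script: it uses a percentile interval (the source does not say which interval type is used), and `DocCounts` and `bootstrap_recall_ci` are hypothetical names. The defaults mirror the stated B=2000 and seed 12345.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Per-document counts (hypothetical struct): fixable words in the document
// and how many of them were restored correctly.
struct DocCounts {
    std::size_t fixable;
    std::size_t restored_ok;
};

// Percentile bootstrap confidence interval for recall. Whole documents are
// resampled with replacement, so words from one document stay together.
std::pair<double, double> bootstrap_recall_ci(const std::vector<DocCounts>& docs,
                                              int replicates = 2000,
                                              unsigned seed = 12345,
                                              double level = 0.95) {
    if (docs.empty() || replicates <= 0) return {0.0, 0.0};
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, docs.size() - 1);

    std::vector<double> recalls;
    recalls.reserve(static_cast<std::size_t>(replicates));
    for (int b = 0; b < replicates; ++b) {
        std::size_t fixable = 0, ok = 0;
        for (std::size_t i = 0; i < docs.size(); ++i) {
            const DocCounts& d = docs[pick(rng)];
            fixable += d.fixable;
            ok += d.restored_ok;
        }
        recalls.push_back(fixable == 0 ? 0.0 : static_cast<double>(ok) / fixable);
    }

    // Read the interval off the sorted replicate recalls.
    std::sort(recalls.begin(), recalls.end());
    const double alpha = (1.0 - level) / 2.0;
    const std::size_t last = recalls.size() - 1;
    const std::size_t lo = static_cast<std::size_t>(alpha * last);
    const std::size_t hi = static_cast<std::size_t>((1.0 - alpha) * last);
    return {recalls[lo], recalls[hi]};
}
```

Resampling documents rather than words keeps correlated errors inside one document together, which generally yields wider and more honest intervals than treating every word as independent.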

A scientific paper describing the method, datasets, validation protocol, and results is in preparation. A link/citation will be added here after publication.

Statistical Results

Shipped Product Passport

These are per-pack deployment numbers. They summarize each shipped language pack on its own validation corpus, so they are useful for product confidence but should not be read as a language ranking.

Pack	Dictionary keys	Resolve rules	Validation sample	Recall	Edit precision	OOV
Bosnian	114,092	481	943 clean Tatoeba docs / 1,332 fixable words	94.9% [93.7, 96.1]	≥99.8%*	0.9%
Croatian	156,922	578	3,614 clean Tatoeba docs / 5,510 fixable words	96.1% [95.6, 96.6]	≥99.9% (99.98%)	0.5%
Serbian	156,931	1,396	6,005 real UGC/forum posts / 14,343 fixable words	93.7%	~99.6%	measured in validation reports

* Bosnian: 0 wrong edits observed on 1,332 fixable words. This gives a tight upper bound on the wrong-edit rate of roughly 0.2% at 95% confidence (rule of three: 3/1,332 ≈ 0.23%), not a literal 100%-precision guarantee.
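The rule-of-three bound in the Bosnian footnote can be checked with a few lines of C++. The sketch below shows both the 3/n approximation and the exact zero-failure bound it approximates; the function names are hypothetical, and both formulas assume independent trials, which is only an approximation for words drawn from the same documents.

```cpp
#include <cmath>
#include <cstddef>

// Exact one-sided upper bound on the error rate after observing zero errors
// in n independent trials: solve (1 - p)^n = 1 - confidence for p.
double zero_error_upper_bound(std::size_t n, double confidence = 0.95) {
    if (n == 0) return 1.0;
    return 1.0 - std::pow(1.0 - confidence, 1.0 / static_cast<double>(n));
}

// Rule-of-three approximation of the same 95% bound: 3 / n.
double rule_of_three(std::size_t n) {
    return n == 0 ? 1.0 : 3.0 / static_cast<double>(n);
}
```

For n = 1,332 both functions give a bound a little above 0.22%, which is consistent with the reported ≥99.8% edit precision.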

Comparable Cross-Language Matrix

For a fair Bosnian/Croatian/Serbian comparison, use the symmetric matrix: one protocol, matched registers, resolve off, independent corpora, document-level bootstrap CI. This is the main neutrality check.

Register	Bosnian recall	Croatian recall	Serbian recall	Edit precision
Wiki	86.9% [85.7, 87.9]	87.4% [86.2, 88.5]	84.1% [82.9, 85.3]	99.8-100%
News	88.0% [87.0, 89.0]	89.4% [88.5, 90.3]	82.2% [81.1, 83.3]	99.4-99.9%
Tatoeba	86.0% [83.9, 87.9]	86.6% [84.7, 88.5]	86.7% [84.6, 88.7]	99.9-100%

Main conclusion: when register, protocol, and evaluation setup are held fixed, the engine is not Serbian-first or Croatian-first. The differences are mostly dictionary coverage/register effects, while edit precision stays near 100%.

Register And Resolve Effects

Experiment	Before	After	Meaning
Bosnian wiki dictionary growth	60.4% at 11,350 keys	86.9% at 114,092 keys	Bosnian was coverage-limited, not algorithm-limited.
Croatian wiki dictionary growth	82.6% at 78,562 keys	87.4% at 156,922 keys	More validated coverage improved recall without hurting precision.
Croatian chat resolve table	80.0% resolve off	89.5% resolve on	Confident ambiguity rules add about +9.5pp recall.
Serbian chat resolve table	81.9% resolve off	91.4% resolve on	Same +9.5pp effect under the same protocol.
Serbian real human stripped text (`SentiComments.SR`)	82.8% resolve off	93.9% resolve on	Synthetic shaving does not overstate Serbian real-error performance.

Baselines

A naive most-frequent baseline can raise recall but makes many more wrong edits. On Serbian chat it measured 85.0% recall but only 89.5% edit precision; ioDiacritics with resolve measured 91.4% recall and 99.7% edit precision on the same cell. This is why the package optimizes for edit precision rather than raw recall alone.
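The difference between the two policies can be illustrated with a small C++17 sketch. Everything here is hypothetical: the `Candidate` struct, the made-up frequencies, and the 0.95 dominance threshold are illustrative assumptions, and the library's real guards and resolve rules are more involved than a single share threshold.

```cpp
#include <optional>
#include <string>
#include <vector>

struct Candidate {
    std::string form;  // a valid surface form for the bald key
    double frequency;  // corpus frequency on any non-negative scale
};

// Highest-frequency candidate, or nullptr for an empty list.
const Candidate* top_candidate(const std::vector<Candidate>& cands) {
    const Candidate* best = nullptr;
    for (const Candidate& c : cands)
        if (best == nullptr || c.frequency > best->frequency) best = &c;
    return best;
}

// Naive baseline: always emit the most frequent candidate.
// nullopt means "leave the word unchanged".
std::optional<std::string> most_frequent(const std::string& bald,
                                         const std::vector<Candidate>& cands) {
    const Candidate* best = top_candidate(cands);
    if (best == nullptr || best->form == bald) return std::nullopt;
    return best->form;
}

// Precision-first variant: edit only when the top candidate clearly
// dominates the total frequency mass; otherwise hold.
std::optional<std::string> guarded(const std::string& bald,
                                   const std::vector<Candidate>& cands,
                                   double min_share = 0.95) {
    const Candidate* best = top_candidate(cands);
    if (best == nullptr || best->form == bald) return std::nullopt;
    double total = 0.0;
    for (const Candidate& c : cands) total += c.frequency;
    if (total <= 0.0 || best->frequency / total < min_share) return std::nullopt;
    return best->form;
}

// Made-up frequencies for two bald keys.
std::vector<Candidate> sto_candidates() { return {{"što", 60.0}, {"sto", 40.0}}; }
std::vector<Candidate> zelim_candidates() { return {{"želim", 100.0}}; }
```

With these invented numbers the naive policy rewrites every `sto` to `što` and is wrong whenever the numeral was meant, while the guarded policy holds `sto` and still restores `zelim`. That trade, a few missed restorations in exchange for far fewer wrong edits, is the one the measured numbers above describe.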

Swift Installation

Requirements:

Swift 5.9+
macOS 13+ / iOS 15+
Swift Package Manager

Add the package to your Package.swift:

dependencies: [
    .package(url: "https://github.com/ilya000/ioDiacritics.git", from: "0.9.3")
]

Then depend on only the language packs you need:

.target(
    name: "YourApp",
    dependencies: [
        .product(name: "ioDiacriticsBosnian", package: "ioDiacritics"),
        .product(name: "ioDiacriticsCroatian", package: "ioDiacritics"),
        .product(name: "ioDiacriticsSerbian", package: "ioDiacritics")
    ]
)

For local development:

.package(path: "../ioDiacritics")

C++ / Windows / Linux

The C++ port lives in cpp/. It is a dependency-free C++17 library with a CMake build. It uses the same shipped JSON dictionaries as the Swift package, so the linguistic data and quality numbers stay shared.

Build on Linux/macOS:

cmake -S cpp -B build-cpp -DIODIACRITICS_CPP_BUILD_TESTS=ON
cmake --build build-cpp
ctest --test-dir build-cpp

Build on Windows with Visual Studio:

cmake -S cpp -B build-cpp -G "Visual Studio 17 2022" -DIODIACRITICS_CPP_BUILD_TESTS=ON
cmake --build build-cpp --config Release
ctest --test-dir build-cpp -C Release

Minimal C++ usage:

#include <iodiacritics/iodiacritics.hpp>

using iodiacritics::Restorer;

auto sr = Restorer::load_file(
    "Sources/ioDiacriticsSerbian/Resources/deshishana_sr.json",
    iodiacritics::serbian_profile());

auto fixed = sr.restore_prepared_text("Drzava takodje moze.");
// Država takođe može.

The C++ API mirrors the Swift engine:

restore(...) for one token;
restore_text(...) for streaming/live text with left-context numeric guard;
restore_prepared_text(...) for completed text with left and right numeric context;
is_language(...), available when the optional invariant word set is loaded.

Quick Start

import ioDiacriticsBosnian

let text = "Drzava takodjer moze."
let restored = Bosnian.shared?.restorePreparedText(text)

print(restored ?? text)
// Država također može.

Croatian:

import ioDiacriticsCroatian

let text = "nasa drzava"
let restored = Croatian.shared?.restorePreparedText(text)

print(restored ?? text)
// naša država

Serbian:

import ioDiacriticsSerbian

let text = "Drzava takodje moze."
let restored = Serbian.shared?.restorePreparedText(text)

print(restored ?? text)
// Država takođe može.

API

Each language pack exposes a shared lazy restorer and measured stats:

Bosnian.shared
Croatian.shared
Serbian.shared

Bosnian.stats.summary
Croatian.stats.summary
Serbian.stats.summary

Prepared Text

Use restorePreparedText(_:) when the whole text is available, such as clipboard content, imported text, files, notes, or a text area after editing.

let fixed = Serbian.shared?.restorePreparedText("sto evra")
// "sto evra" because `sto` can mean 100 and `evra` is visible on the right

Prepared-text mode can look both left and right for numeric/measure context.

Streaming Text

Use restoreText(_:) for streaming/live transforms where the future is not available yet.

let fixed = Serbian.shared?.restoreText("100 sto")
// "100 sto"

let live = Serbian.shared?.restoreText("sto evra")
// "što evra" because live mode intentionally cannot see right context

This is useful for keyboard-like workflows and preserves the original live-input behavior.
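The behavioral difference between the two modes comes down to which neighbors the numeric guard may inspect. The following C++17 sketch reproduces the three `sto` examples above with a toy guard; `looks_numeric`, `is_measure_word`, `hold_sto`, and the tiny measure-word list are hypothetical stand-ins, not the library's actual rules.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// True if the token starts with an ASCII digit, e.g. "100".
bool looks_numeric(const std::string& token) {
    return !token.empty() && std::isdigit(static_cast<unsigned char>(token[0])) != 0;
}

// Deliberately tiny, made-up measure-word list for illustration.
bool is_measure_word(const std::string& token) {
    static const std::unordered_set<std::string> words = {"evra", "dinara", "kuna", "maraka"};
    return words.count(token) != 0;
}

// Decide whether bald "sto" should be held (kept as the numeral 100) rather
// than restored to "što". Streaming callers pass use_right_context = false
// because the next word has not been typed yet.
bool hold_sto(const std::string& prev, const std::string& next, bool use_right_context) {
    if (looks_numeric(prev)) return true;                        // "100 sto"
    if (use_right_context && is_measure_word(next)) return true; // "sto evra"
    return false;
}
```

Under these assumptions `hold_sto("100", "", false)` holds in both modes, while `hold_sto("", "evra", ...)` holds only when right context is enabled, matching the `sto evra` outputs shown for prepared and streaming text.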

One Word

Use restore(_:prevWord:nextWord:isForeignWord:) when your app already tokenizes text.

let word = Serbian.shared?.restore("zelim")
// "želim"

let guarded = Serbian.shared?.restore("sto", prevWord: "100")
// nil: pass through unchanged

nil means "leave the token unchanged." That makes call sites simple and keeps the engine precision-first.
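The same "no value means leave it alone" convention maps naturally onto `std::optional` in C++17. The sketch below is a caller-side pattern only: `WordRestorer`, `restore_tokens`, and `toy_restore` are hypothetical, and the return type of the real C++ `restore(...)` function is not documented here, so treat the signature as an assumption.

```cpp
#include <functional>
#include <optional>
#include <sstream>
#include <string>

// A per-word restorer: returns the accented form, or nullopt for
// "leave the token unchanged". (Hypothetical signature for this sketch.)
using WordRestorer = std::function<std::optional<std::string>(const std::string&)>;

// Apply the restorer to whitespace-separated tokens, falling back to the
// original token whenever the restorer declines to edit. Runs of whitespace
// collapse to single spaces, and punctuation is not split off.
std::string restore_tokens(const std::string& text, const WordRestorer& restore_word) {
    std::istringstream in(text);
    std::string token;
    std::string out;
    while (in >> token) {
        if (!out.empty()) out += ' ';
        out += restore_word(token).value_or(token);
    }
    return out;
}

// Toy stand-in for a real restorer, with two hard-coded words.
std::optional<std::string> toy_restore(const std::string& word) {
    if (word == "zelim") return std::string("želim");
    if (word == "drzava") return std::string("država");
    return std::nullopt;
}
```

The call site needs no special case for unknown or guarded words: `value_or(token)` passes them through, so `restore_tokens("zelim sto", toy_restore)` yields `želim sto`.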

Why Not Just Use AI?

Neural systems such as BERT or ByT5 can reach very high accuracy for diacritics restoration, especially on clean benchmark corpora. ioDiacritics solves a different product problem:

it is tiny compared with an ML model
it loads locally inside a Swift app
it does not need GPU/CPU-heavy inference
it is deterministic and testable
it is easy to ship in keyboard and clipboard utilities
it prefers missed restorations over wrong edits

For end-user input tools, keeping the wrong-edit rate low is often more important than squeezing out the last point of recall.

How It Works

The engine builds a reverse index from a stripped, "bald" surface form to valid diacritic candidates:

Class	Example	Behavior
Invariant	`telefon`	Leave unchanged.
Deterministic	`zelim -> želim`	Restore directly.
Ambiguous	`casa -> časa/čaša/...`	Use frequency order when safe.
Valid bald homograph	`sto` vs `što`	Usually hold, unless a confident resolve rule applies.

Important details:

đ can be typed as dj or d, so dictionary keys are expanded at build time.
dž is handled as a digraph.
č and ć both collapse to c, so ambiguity is real.
Numeric guard protects words like sto near numbers and measure words.
The runtime hot path is a dictionary lookup plus small rule checks.
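The reverse index and the four word classes can be sketched in C++17 as follows. This is a simplified illustration, not the shipped implementation: it handles lowercase letters only, stores no frequencies, and all names (`strip`, `add_word`, `classify`, `demo_index`) are hypothetical. It also assumes the source file and execution character set are UTF-8 (the default for GCC and Clang; MSVC needs `/utf-8`).

```cpp
#include <algorithm>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `s` with `to`.
std::string replace_all(std::string s, const std::string& from, const std::string& to) {
    for (std::size_t pos = s.find(from); pos != std::string::npos;
         pos = s.find(from, pos + to.size()))
        s.replace(pos, from.size(), to);
    return s;
}

// Strip lowercase BCS diacritics from a UTF-8 word. `dj_style` selects
// whether đ is typed as "dj" or as "d". The digraph dž becomes "dz".
std::string strip(std::string word, bool dj_style) {
    const std::vector<std::pair<std::string, std::string>> rules = {
        {"č", "c"}, {"ć", "c"}, {"š", "s"}, {"ž", "z"}, {"đ", dj_style ? "dj" : "d"}};
    for (const auto& [from, to] : rules) word = replace_all(std::move(word), from, to);
    return word;
}

using ReverseIndex = std::map<std::string, std::vector<std::string>>;

// Register a word under every bald spelling it can be typed as, so a word
// containing đ gets both a "dj" key and a "d" key at build time.
void add_word(ReverseIndex& index, const std::string& word) {
    for (bool dj_style : {true, false}) {
        std::vector<std::string>& bucket = index[strip(word, dj_style)];
        if (std::find(bucket.begin(), bucket.end(), word) == bucket.end())
            bucket.push_back(word);
    }
}

enum class Kind { Unknown, Invariant, Deterministic, Ambiguous, BaldHomograph };

// Classify a bald key by looking at its candidate list.
Kind classify(const ReverseIndex& index, const std::string& bald) {
    const auto it = index.find(bald);
    if (it == index.end()) return Kind::Unknown;
    const std::vector<std::string>& cands = it->second;
    const bool bald_is_valid = std::find(cands.begin(), cands.end(), bald) != cands.end();
    if (cands.size() == 1) return bald_is_valid ? Kind::Invariant : Kind::Deterministic;
    return bald_is_valid ? Kind::BaldHomograph : Kind::Ambiguous;
}

// Tiny demo index built from a handful of words.
ReverseIndex demo_index() {
    ReverseIndex index;
    for (const char* w : {"telefon", "želim", "časa", "čaša", "sto", "što", "također"})
        add_word(index, w);
    return index;
}
```

In this demo index, `zelim` is deterministic, `casa` is ambiguous because č collapses to c in two different words, `sto` is a valid bald homograph, and `također` is reachable through both `takodjer` and `takoder`. At runtime the hot path is then a single map lookup followed by whatever guards apply to the resulting class.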

Architecture

Target	Contents
`ioDiacritics`	Generic engine: `Restorer`, `LanguageProfile`, `LangStats`, version.
`ioDiacriticsBosnian`	Bosnian profile, bundled dictionary, stats, `Bosnian.shared`.
`ioDiacriticsCroatian`	Croatian profile, bundled dictionary, stats, `Croatian.shared`.
`ioDiacriticsSerbian`	Serbian profile, bundled dictionary, stats, `Serbian.shared`.

The core target has no resources and no UI. A consumer depends only on the language packs it ships, so an app does not bundle dictionaries it does not use.

Reliability Passport

Each language pack includes a LangStats value:

print(Serbian.stats.summary)

This gives apps a reliability summary suitable for an About panel: dictionary size, recall, wrong-edit rate, edit precision, and validation corpus size. Drift tests ensure the published dictionary key counts stay in sync with the bundled JSON resources.

Adding Another Language

Most Latin-script languages need data plus a small profile:

Build a reverse dictionary with tools/build_deshishana.py.
Add language-specific strip rules in LanguageProfile.
Add a thin ioDiacritics<Language> target with bundled deshishana_<code>.json.
Validate by shaving accented text, restoring it, and comparing against the original.
Publish LangStats with recall/edit-error numbers.

Known special cases:

Turkish needs careful ı/i handling.
German has digraph conventions such as ä -> ae, ß -> ss.
Vietnamese likely needs a stronger context model before dictionary-only restoration is good enough.

See BACKLOG.md for roadmap notes.

Prior Art

ioDiacritics is not the first diacritics restoration system. It is deliberately a small offline Swift package for product use.

Related work includes:

turanjanin/serbian-language-tools: PHP library for Serbian transliteration and diacritic restoration.
clarinsi/redi: Croatian, Serbian, and Slovene restoration tool with optional language models.
BERT/Transformer-based diacritics restoration research for many languages.

The distinguishing feature here is the packaging and product shape: SwiftPM, bundled BCS language packs, no network, no AI model, high edit precision, and keyboard/clipboard-friendly APIs.

Project Files

CHANGELOG.md: version history.
RESEARCH.md: paper-facing architecture and validation snapshot.
research/: validation reports and corpus matrix notes.
tools/: dictionary build and evaluation scripts.
NOTICE.md: attribution and bundled-data notices.
docs/DATA_LICENSE_AUDIT.md: data licensing audit.

License

Swift source code, tests, scripts, and documentation are licensed under the MIT License. See LICENSE.

The bundled JSON dictionaries are generated data artifacts derived from third-party lexical and corpus resources. They have separate provenance and attribution requirements. See NOTICE.md and docs/DATA_LICENSE_AUDIT.md before publishing or shipping a public binary that includes the dictionaries.

ioDiacritics is provided "as is", without warranties of any kind, to the fullest extent permitted by applicable law. By downloading or using it you accept the terms of use.
