Offline Swift and C++ library for Bosnian, Croatian, and Serbian diacritics restoration.
ioDiacritics restores stripped Latin diacritics in Bosnian, Croatian, and Serbian text:
Drzava takodjer moze. -> Država također može.
nasa drzava -> naša država
Drzava takodje moze. -> Država takođe može.
It is a small, deterministic, AI-free diacritic restoration engine for Swift/SwiftPM apps and portable C++17 projects. It runs fully offline, ships bundled dictionaries, and needs no server, no Python, and no machine-learning model; user text never leaves the device.
Designed for:
- macOS and iOS apps that need local diacritics restoration
- Windows and Linux applications that need a linkable C++ library
- keyboard/input-method workflows where latency and trust matter
- clipboard/text-cleanup tools
- Bosnian/Croatian `vraćanje dijakritike` and Serbian `dešišavanje` - restoring BCS/Bosnian-Croatian-Serbian/Serbo-Croatian Latin text written without `č`, `ć`, `š`, `ž`, `đ`, `dž`
Also known as: diacritics restoration, accent restoration, diacritization, rediacritization,
dešišavanje, ošišana latinica, restoring Bosnian diacritics, restoring Croatian
diacritics, and restoring Serbian Latin diacritics.
Want to see the library in action without writing any code? Download the ready-made macOS demo — signed and notarized, so it opens on any Mac with no warnings (just double-click):
➡️ Download ioDiacriticsDemo.dmg — ~6 MB, macOS 13+ (hosted on the demos repo's Releases)
Paste ošišana (diacritic-stripped) Bosnian/Croatian/Serbian text, get it restored (and optionally transliterated to Serbian Cyrillic), and copy the result with one button. Full source, plus a cross-platform C++ / Dear ImGui build for Windows, macOS and Linux, lives in ilya000/ioDiacritics-Demos.
⌨️ Or type with diacritics, anywhere — Šišana. A macOS input method (IME): pick it from
the keyboard menu like a layout and type bald Latin in any app — diacritics appear live
(citaj → čitaj), and ambiguous words open the system candidate window (the same
Chinese/Japanese-IME mechanism), casa → časa · čaša · ćasa, picked with a number key. No
window, no copy-paste, no Accessibility permission, fully offline.
➡️ Download Šišana installer (signed & notarized .pkg) — macOS 13+ · run the installer (choose all users or just me), then add Šišana under System Settings → Keyboard → Input Sources (it's listed under Serbian (Latin)). Source: Swift-macOS-InputMethod.
The package treats Bosnian, Croatian, and Serbian as closely related standard varieties under the Serbo-Croatian / BCS macrolanguage umbrella. ISO references:
- Serbo-Croatian macrolanguage: ISO 639-3 `hbs`
- Bosnian: ISO 639-1 `bs`, ISO 639-3 `bos`
- Croatian: ISO 639-1 `hr`, ISO 639-3 `hrv`
- Montenegrin: ISO 639-3 `cnr`
- Serbian: ISO 639-1 `sr`, ISO 639-3 `srp`
- Legacy Serbo-Croatian ISO 639-1 `sh` exists historically but is deprecated
Author: iLya Os (legal name: Ilya V. Osipov)
GitHub: https://github.com/ilya000
Home page: https://ctrl8.com/iodiacritics.html
- Swift-native: pure Swift package, no AppKit/UI dependency in the core engine.
- C++17 port: portable CMake build for Windows, Linux, and non-Swift integrations.
- Offline and private: no network calls, no cloud inference, no external service.
- AI-free and deterministic: dictionary + frequency prior + conservative guards.
- Precision-first: when uncertain, the library leaves the word unchanged.
- Language packs: ship only the languages your app uses.
- Validated dictionaries: published recall/edit-precision stats live with the package.
- Keyboard-ready: one word lookup, streaming mode, left-context numeric guard.
- Prepared-text mode: full-text restore can also use right context.
- BCS-specific handling: `đ`/`dj`/`d`, `dž`, `č`/`ć`, numeric guard for `sto`, and Bosnian/Croatian/Serbian profile differences.
Demo source code lives in a separate GitHub repository: github.com/ilya000/ioDiacritics-Demos.
| Demo | Platforms | Source |
|---|---|---|
| SwiftUI desktop demo | macOS | Swift-macOS |
| C++ Dear ImGui demo | Windows, macOS, Linux | Cpp-Windows-macOS-Linux |
Both demos use the same shipped dictionaries and show the main workflow: paste or type ošišana text, restore Bosnian/Croatian/Serbian diacritics, highlight changed words, and copy the result. The Swift demo exercises the SwiftPM API used by macOS/iOS apps; the C++ demo exercises the portable C++17 port for Windows/Linux/native desktop integrations.
| Language | ISO | SwiftPM product | Dictionary keys | Headline recall | Notes |
|---|---|---|---|---|---|
| Bosnian | bs / bos | `ioDiacriticsBosnian` | 114,092 | 94.9% | Ijekavian-safe; validated on clean and UGC corpora. |
| Croatian | hr / hrv | `ioDiacriticsCroatian` | 156,922 | 96.1% | First-tier pack; validated on clean and real-register corpora. |
| Montenegrin | cnr | `ioDiacriticsSerbian` | Serbian pack | Limited | Shared BCS/Serbian forms restore; Montenegrin-specific ś/ź letters are preserved, not normalized away. |
| Serbian | sr / srp | `ioDiacriticsSerbian` | 156,931 | 93.7% | First-tier pack; validated on real Serbian UGC/forum text. |
The important trust metric is edit precision: of the edits the library makes, how many are right. The current BCS packs measure around 99.5-100% edit precision depending on language/register. See RESEARCH.md and research/ for details.
Language neutrality matters for Bosnian, Croatian, and Serbian alike — none is singled out. All three are parallel first-tier language packs: separate dictionaries, separate profiles, separate resolve tables, and separate validation metrics, listed alphabetically throughout. No pack is the canonical/base language for the others. (Serbian additionally has two alphabets — Cyrillic and Latin — but that is a per-language note, not a privileged status.) Montenegrin is currently supported in a limited compatibility mode through the Serbian pack.
ioDiacritics is the software artifact of a larger empirical study on restoring stripped
diacritics in the Serbo-Croatian / BCS macrolanguage (ISO 639-3 hbs). The package is not a
toy word map: the shipped dictionaries and guard rules were built through repeated corpus
collection, dictionary growth, ablation, error analysis, and independent validation.
Current shipped data volume and validation scope:
- 427,945 reverse-index keys across the three bundled language packs.
- 2,455 confident ambiguity-resolution rules: Bosnian 481, Croatian 578, Serbian 1,396 (Serbian trimmed from 5,150 in v0.8.2 — only the dead tail; recall unchanged).
- Croatian was built to 171,938 candidate keys before being frequency-trimmed to 156,922 shipped keys; Bosnian was grown from 11,350 to 114,092 shipped keys.
- The shipped reliability passports cover 10,562 documents/posts and 21,185 fixable word instances across Bosnian, Croatian, and Serbian.
- A balanced language-register matrix covers Bosnian, Croatian, and Serbian across wiki, news, and Tatoeba cells: 11,310 documents and 30,573 fixable word instances under one comparable protocol.
- Validation uses document-level bootstrap confidence intervals (`B=2000`, seed `12345`), resampling documents rather than individual words.
- Raw corpora are intentionally not bundled; reports and reproducibility notes live in research/ and tools/.
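Document-level resampling can be sketched as follows. This is an illustration under stated assumptions, not the evaluation script from tools/: it assumes a percentile bootstrap and a Mersenne Twister generator, and the type and function names are hypothetical. Only `B=2000` and the seed `12345` come from the description above.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Per-document tallies (illustrative type).
struct DocResult {
    std::size_t restored;  // correctly restored words in this document
    std::size_t fixable;   // fixable words in this document
};

// Percentile bootstrap CI for recall. Whole documents are resampled with
// replacement, so words from one document always stay together.
std::pair<double, double> bootstrap_recall_ci(
    const std::vector<DocResult>& docs, std::size_t B = 2000,
    unsigned seed = 12345, double alpha = 0.05) {
    if (docs.empty() || B == 0) return {0.0, 0.0};
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, docs.size() - 1);
    std::vector<double> samples;
    samples.reserve(B);
    for (std::size_t b = 0; b < B; ++b) {
        std::size_t restored = 0, fixable = 0;
        for (std::size_t i = 0; i < docs.size(); ++i) {
            const DocResult& d = docs[pick(rng)];
            restored += d.restored;
            fixable += d.fixable;
        }
        samples.push_back(
            fixable == 0 ? 0.0 : static_cast<double>(restored) / fixable);
    }
    std::sort(samples.begin(), samples.end());
    // Read the lower and upper percentiles off the sorted replicates.
    auto at = [&](double q) {
        return samples[static_cast<std::size_t>(q * (samples.size() - 1))];
    };
    return {at(alpha / 2), at(1.0 - alpha / 2)};
}
```

Resampling documents instead of words keeps the interval honest when errors cluster inside a document: word-level resampling would treat correlated words as independent and produce intervals that are too narrow.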
A scientific paper describing the method, datasets, validation protocol, and results is in preparation. A link/citation will be added here after publication.
These are per-pack deployment numbers. They summarize each shipped language pack on its own validation corpus, so they are useful for product confidence but should not be read as a language ranking.
| Pack | Dictionary keys | Resolve rules | Validation sample | Recall | Edit precision | OOV |
|---|---|---|---|---|---|---|
| Bosnian | 114,092 | 481 | 943 clean Tatoeba docs / 1,332 fixable words | 94.9% [93.7, 96.1] | ≥99.8%* | 0.9% |
| Croatian | 156,922 | 578 | 3,614 clean Tatoeba docs / 5,510 fixable words | 96.1% [95.6, 96.6] | ≥99.9% (99.98%) | 0.5% |
| Serbian | 156,931 | 1,396 | 6,005 real UGC/forum posts / 14,343 fixable words | 93.7% | ~99.6% | measured in validation reports |
* Bosnian: 0 wrong edits observed on 1,332 fixable words — a tight upper bound (rule-of-three ≤~0.2% at 95% CI), not a literal 100%-precision guarantee.
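The rule-of-three bound in the footnote is quick to reproduce. A minimal sketch with hypothetical function names; it follows the footnote in using the 1,332 fixable words as the number of trials.

```cpp
#include <cmath>
#include <cstddef>

// Rule-of-three approximation: with 0 errors observed in n trials, the
// 95% upper bound on the true error rate is roughly 3 / n.
double rule_of_three_upper(std::size_t n) {
    return 3.0 / static_cast<double>(n);
}

// Exact one-sided bound for comparison: the largest error rate p for which
// seeing 0 errors in n trials still has probability alpha, (1 - p)^n = alpha.
double exact_zero_error_upper(std::size_t n, double alpha = 0.05) {
    return 1.0 - std::pow(alpha, 1.0 / static_cast<double>(n));
}
```

For n = 1,332 both functions give a wrong-edit rate of roughly 0.22%, which corresponds to the "≥99.8%" edit-precision entry in the table.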
For a fair Bosnian/Croatian/Serbian comparison, use the symmetric matrix: one protocol, matched registers, resolve off, independent corpora, document-level bootstrap CI. This is the main neutrality check.
| Register | Bosnian recall | Croatian recall | Serbian recall | Edit precision |
|---|---|---|---|---|
| Wiki | 86.9% [85.7, 87.9] | 87.4% [86.2, 88.5] | 84.1% [82.9, 85.3] | 99.8-100% |
| News | 88.0% [87.0, 89.0] | 89.4% [88.5, 90.3] | 82.2% [81.1, 83.3] | 99.4-99.9% |
| Tatoeba | 86.0% [83.9, 87.9] | 86.6% [84.7, 88.5] | 86.7% [84.6, 88.7] | 99.9-100% |
Main conclusion: when register, protocol, and evaluation setup are held fixed, the engine is not Serbian-first or Croatian-first. The differences are mostly dictionary coverage/register effects, while edit precision stays near 100%.
| Experiment | Before | After | Meaning |
|---|---|---|---|
| Bosnian wiki dictionary growth | 60.4% at 11,350 keys | 86.9% at 114,092 keys | Bosnian was coverage-limited, not algorithm-limited. |
| Croatian wiki dictionary growth | 82.6% at 78,562 keys | 87.4% at 156,922 keys | More validated coverage improved recall without hurting precision. |
| Croatian chat resolve table | 80.0% resolve off | 89.5% resolve on | Confident ambiguity rules add about +9.5pp recall. |
| Serbian chat resolve table | 81.9% resolve off | 91.4% resolve on | Same +9.5pp effect under the same protocol. |
| Serbian real human stripped text (SentiComments.SR) | 82.8% resolve off | 93.9% resolve on | Synthetic shaving does not overstate Serbian real-error performance. |
A naive most-frequent baseline can raise recall but makes many more wrong edits. On Serbian
chat it measured 85.0% recall but only 89.5% edit precision; ioDiacritics with
resolve measured 91.4% recall and 99.7% edit precision on the same cell. This is why
the package optimizes for edit precision rather than raw recall alone.
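The trade-off between the two strategies can be illustrated with a toy selector. This is a sketch under loud assumptions, not the package's decision logic: it assumes each candidate carries a relative frequency and that a hypothetical `min_share` threshold decides when frequency order is safe. The threshold, names, and frequencies are invented for illustration.

```cpp
#include <optional>
#include <string>
#include <vector>

struct Candidate {
    std::string form;  // candidate spelling, with or without diacritics
    double frequency;  // relative corpus frequency (illustrative values)
};

// Most frequent candidate, or nullptr for an empty list.
const Candidate* top_candidate(const std::vector<Candidate>& cands) {
    const Candidate* best = nullptr;
    for (const Candidate& c : cands)
        if (!best || c.frequency > best->frequency) best = &c;
    return best;
}

// Naive baseline: always emit the most frequent candidate.
std::string naive_pick(const std::string& bald,
                       const std::vector<Candidate>& cands) {
    const Candidate* best = top_candidate(cands);
    return best ? best->form : bald;
}

// Conservative variant: edit only when the top candidate dominates.
// std::nullopt means "leave the word unchanged".
std::optional<std::string> conservative_pick(
    const std::string& bald, const std::vector<Candidate>& cands,
    double min_share = 0.95 /* hypothetical threshold */) {
    const Candidate* best = top_candidate(cands);
    if (!best) return std::nullopt;
    double total = 0.0;
    for (const Candidate& c : cands) total += c.frequency;
    if (total <= 0.0 || best->frequency / total < min_share)
        return std::nullopt;                      // too uncertain: hold
    if (best->form == bald) return std::nullopt;  // nothing to change
    return best->form;
}
```

On a contested key such as `casa` the naive picker always commits to one reading and is wrong whenever the writer meant another, while the conservative picker holds. That costs recall on that word but makes no wrong edit.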
Requirements:
- Swift 5.9+
- macOS 13+ / iOS 15+
- Swift Package Manager
Add the package to your `Package.swift`:

    dependencies: [
        .package(url: "https://github.com/ilya000/ioDiacritics.git", from: "0.9.3")
    ]

Then depend on only the language packs you need:

    .target(
        name: "YourApp",
        dependencies: [
            .product(name: "ioDiacriticsBosnian", package: "ioDiacritics"),
            .product(name: "ioDiacriticsCroatian", package: "ioDiacritics"),
            .product(name: "ioDiacriticsSerbian", package: "ioDiacritics")
        ]
    )

For local development:

    .package(path: "../ioDiacritics")

The C++ port lives in `cpp/`. It is a dependency-free C++17 library with a CMake build. It uses the same shipped JSON dictionaries as the Swift package, so the linguistic data and quality numbers stay shared.
Build on Linux/macOS:

    cmake -S cpp -B build-cpp -DIODIACRITICS_CPP_BUILD_TESTS=ON
    cmake --build build-cpp
    ctest --test-dir build-cpp

Build on Windows with Visual Studio:

    cmake -S cpp -B build-cpp -G "Visual Studio 17 2022" -DIODIACRITICS_CPP_BUILD_TESTS=ON
    cmake --build build-cpp --config Release
    ctest --test-dir build-cpp -C Release

Minimal C++ usage:

    #include <iodiacritics/iodiacritics.hpp>
    using iodiacritics::Restorer;
    auto sr = Restorer::load_file(
        "Sources/ioDiacriticsSerbian/Resources/deshishana_sr.json",
        iodiacritics::serbian_profile());
    auto fixed = sr.restore_prepared_text("Drzava takodje moze.");
    // Država takođe može.

The C++ API mirrors the Swift engine:

- `restore(...)` for one token;
- `restore_text(...)` for streaming/live text with left-context numeric guard;
- `restore_prepared_text(...)` for completed text with left and right numeric context;
- `is_language(...)` when loading the optional invariant word set.
Bosnian:

    import ioDiacriticsBosnian
    let text = "Drzava takodjer moze."
    let restored = Bosnian.shared?.restorePreparedText(text)
    print(restored ?? text)
    // Država također može.

Croatian:

    import ioDiacriticsCroatian
    let text = "nasa drzava"
    let restored = Croatian.shared?.restorePreparedText(text)
    print(restored ?? text)
    // naša država

Serbian:

    import ioDiacriticsSerbian
    let text = "Drzava takodje moze."
    let restored = Serbian.shared?.restorePreparedText(text)
    print(restored ?? text)
    // Država takođe može.

Each language pack exposes a shared lazy restorer and measured stats:

    Bosnian.shared
    Croatian.shared
    Serbian.shared
    Bosnian.stats.summary
    Croatian.stats.summary
    Serbian.stats.summary

Use `restorePreparedText(_:)` when the whole text is available, such as clipboard content, imported text, files, notes, or a text area after editing.

    let fixed = Serbian.shared?.restorePreparedText("sto evra")
    // "sto evra" because `sto` can mean 100 and `evra` is visible on the right

Prepared-text mode can look both left and right for numeric/measure context.
Use `restoreText(_:)` for streaming/live transforms where the future is not available yet.

    let fixed = Serbian.shared?.restoreText("100 sto")
    // "100 sto"
    let live = Serbian.shared?.restoreText("sto evra")
    // "što evra" because live mode intentionally cannot see right context

This is useful for keyboard-like workflows and preserves the original live-input behavior.
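The difference between the two modes comes down to which neighbours the numeric guard may inspect. A minimal sketch of that idea, assuming a tiny hypothetical measure-word list; the function names and the list contents (beyond `evra`) are illustrative and not taken from the library.

```cpp
#include <cctype>
#include <optional>
#include <set>
#include <string>

// True when the token consists only of ASCII digits, e.g. "100".
bool is_number(const std::string& w) {
    if (w.empty()) return false;
    for (unsigned char ch : w)
        if (!std::isdigit(ch)) return false;
    return true;
}

// Hypothetical measure-word list; a real profile would ship its own.
bool is_measure_word(const std::string& w) {
    static const std::set<std::string> words = {"evra", "dinara", "kuna"};
    return words.count(w) > 0;
}

// Decide whether `sto` should be held as the numeral 100 instead of being
// restored to `što`. Live mode passes use_right_context = false because
// the next word has not been typed yet.
bool hold_sto(const std::optional<std::string>& prev,
              const std::optional<std::string>& next,
              bool use_right_context) {
    if (prev && is_number(*prev)) return true;  // left context, both modes
    if (use_right_context && next && is_measure_word(*next)) return true;
    return false;
}
```

Under these assumptions `hold_sto("100", nullopt, false)` holds in both modes, while `sto` followed by `evra` is held only when right context is allowed, matching the prepared-text and live results shown above.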
Use `restore(_:prevWord:nextWord:isForeignWord:)` when your app already tokenizes text.

    let word = Serbian.shared?.restore("zelim")
    // "želim"
    let guarded = Serbian.shared?.restore("sto", prevWord: "100")
    // nil: pass through unchanged

`nil` means "leave the token unchanged." That makes call sites simple and keeps the engine precision-first.
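The same pass-through convention can be expressed in C++ with `std::optional`. This is a standalone sketch of the call-site pattern, not the signature of the C++ port: `TokenRestorer` and `restore_tokens` are hypothetical names, and the restorer is passed in as a plain callable.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// A token-level restorer: returns the restored form, or std::nullopt to
// say "leave this token unchanged" (the counterpart of Swift's nil).
using TokenRestorer =
    std::function<std::optional<std::string>(const std::string&)>;

// Apply a restorer to pre-tokenized text, falling back to the original
// token whenever the restorer declines to edit.
std::vector<std::string> restore_tokens(
    const std::vector<std::string>& tokens, const TokenRestorer& restore_one) {
    std::vector<std::string> out;
    out.reserve(tokens.size());
    for (const std::string& t : tokens)
        out.push_back(restore_one(t).value_or(t));
    return out;
}
```

Because "no edit" is a distinct value and not an echo of the input, the caller can tell an unchanged token from a confirmed one without comparing strings, and the fallback is a single `value_or`.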
Neural systems such as BERT or ByT5 can reach very high accuracy for diacritics restoration,
especially on clean benchmark corpora. ioDiacritics solves a different product problem:
- it is tiny compared with an ML model
- it loads locally inside a Swift app
- it does not need GPU/CPU-heavy inference
- it is deterministic and testable
- it is easy to ship in keyboard and clipboard utilities
- it prefers missed restorations over wrong edits
For end-user input tools, a low wrong-edit rate often matters more than squeezing out the last point of recall.
The engine builds a reverse index from a stripped, "bald" surface form to valid diacritic candidates:
| Class | Example | Behavior |
|---|---|---|
| Invariant | `telefon` | Leave unchanged. |
| Deterministic | `zelim -> želim` | Restore directly. |
| Ambiguous | `casa -> časa/čaša/...` | Use frequency order when safe. |
| Valid bald homograph | `sto` vs `što` | Usually hold, unless a confident resolve rule applies. |
Important details:
- `đ` can be typed as `dj` or `d`, so dictionary keys are expanded at build time.
- `dž` is handled as a digraph.
- `č` and `ć` both collapse to `c`, so ambiguity is real.
- Numeric guard protects words like `sto` near numbers and measure words.
- The runtime hot path is a dictionary lookup plus small rule checks.
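The reverse index and the four classes can be sketched in a few dozen lines. This is a toy illustration, not the shipped builder (`tools/build_deshishana.py`) or the engine's data layout: all names are hypothetical, only lowercase letters are handled, and `dž` is simply stripped via `ž`, without the digraph handling mentioned above.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` (non-empty) in `s` with `to`.
std::string replace_all(std::string s, const std::string& from,
                        const std::string& to) {
    for (std::size_t pos = 0;
         (pos = s.find(from, pos)) != std::string::npos; pos += to.size())
        s.replace(pos, from.size(), to);
    return s;
}

// Strip lowercase BCS diacritics from a UTF-8 word; `d_for` is the bald
// spelling used for đ ("d" or "dj"). Assumes a UTF-8 source file.
std::string strip(std::string w, const std::string& d_for) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"č", "c"}, {"ć", "c"}, {"š", "s"}, {"ž", "z"}};
    for (const auto& r : rules)
        w = replace_all(std::move(w), r.first, r.second);
    return replace_all(std::move(w), "đ", d_for);
}

using ReverseIndex = std::map<std::string, std::set<std::string>>;

// Map every bald key to the set of valid words that strip to it. Words
// containing đ are indexed under both the "d" and the "dj" spelling.
ReverseIndex build_index(const std::vector<std::string>& lexicon) {
    ReverseIndex index;
    for (const std::string& word : lexicon) {
        index[strip(word, "d")].insert(word);
        index[strip(word, "dj")].insert(word);  // same key if there is no đ
    }
    return index;
}

enum class WordClass { Unknown, Invariant, Deterministic, Ambiguous,
                       BaldHomograph };

WordClass classify(const ReverseIndex& index, const std::string& bald) {
    auto it = index.find(bald);
    if (it == index.end()) return WordClass::Unknown;
    const std::set<std::string>& cands = it->second;
    if (cands.size() > 1)
        return cands.count(bald) ? WordClass::BaldHomograph
                                 : WordClass::Ambiguous;
    return *cands.begin() == bald ? WordClass::Invariant
                                  : WordClass::Deterministic;
}
```

With a lexicon of `telefon`, `želim`, `časa`, `čaša`, `sto`, `što`, and `također`, the keys `telefon`, `zelim`, `casa`, and `sto` fall into the four classes from the table, and `također` is reachable from both `takoder` and `takodjer`. All the expensive work happens at build time, so a lookup is one map access plus a small classification step.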
| Target | Contents |
|---|---|
| `ioDiacritics` | Generic engine: `Restorer`, `LanguageProfile`, `LangStats`, version. |
| `ioDiacriticsBosnian` | Bosnian profile, bundled dictionary, stats, `Bosnian.shared`. |
| `ioDiacriticsCroatian` | Croatian profile, bundled dictionary, stats, `Croatian.shared`. |
| `ioDiacriticsSerbian` | Serbian profile, bundled dictionary, stats, `Serbian.shared`. |
The core target has no resources and no UI. A consumer depends only on the language packs it ships, so an app does not bundle dictionaries it does not use.
Each language pack includes a `LangStats` value:

    print(Serbian.stats.summary)

This gives apps an About-panel-friendly reliability summary: dictionary size, recall, wrong-edit rate, edit precision, and validation corpus size. Drift tests ensure the published dictionary key counts stay in sync with the bundled JSON resources.
Most Latin-script languages need data plus a small profile:
- Build a reverse dictionary with `tools/build_deshishana.py`.
- Add language-specific strip rules in `LanguageProfile`.
- Add a thin `ioDiacritics<Language>` target with bundled `deshishana_<code>.json`.
- Validate by shaving accented text, restoring it, and comparing against the original.
- Publish `LangStats` with recall/edit-error numbers.
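The validation step in that list (shave, restore, compare) can be sketched as a small round-trip harness. This is an illustration only: the names are hypothetical, the shaving rules cover just lowercase BCS letters with `đ -> dj`, and the restorer is any callable that maps a bald token to its output.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Shave lowercase BCS diacritics from a UTF-8 word (đ becomes "dj" here;
// a real language profile may define different strip rules).
std::string shave(std::string w) {
    static const std::vector<std::pair<std::string, std::string>> rules = {
        {"č", "c"}, {"ć", "c"}, {"š", "s"}, {"ž", "z"}, {"đ", "dj"}};
    for (const auto& r : rules)
        for (std::size_t pos = 0;
             (pos = w.find(r.first, pos)) != std::string::npos;
             pos += r.second.size())
            w.replace(pos, r.first.size(), r.second);
    return w;
}

struct RoundTrip {
    std::size_t fixable = 0;      // gold words that change when shaved
    std::size_t restored = 0;     // fixable words restored to the gold form
    std::size_t edits = 0;        // words the restorer changed
    std::size_t wrong_edits = 0;  // changed words that differ from gold
};

using RestoreFn = std::function<std::string(const std::string&)>;

// Shave each gold token, restore it, and compare against the original.
RoundTrip round_trip(const std::vector<std::string>& gold,
                     const RestoreFn& restore) {
    RoundTrip r;
    for (const std::string& g : gold) {
        const std::string bald = shave(g);
        const std::string out = restore(bald);
        if (bald != g) {
            ++r.fixable;
            if (out == g) ++r.restored;
        }
        if (out != bald) {
            ++r.edits;
            if (out != g) ++r.wrong_edits;
        }
    }
    return r;
}
```

Recall is then `restored / fixable` and the wrong-edit share is `wrong_edits / edits`. Note that a gold word with no diacritics, such as the numeral `sto`, is not fixable but can still attract a wrong edit, which is why both tallies are kept.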
Known special cases:
- Turkish needs careful `ı`/`i` handling.
- German has digraph conventions such as `ä -> ae`, `ß -> ss`.
- Vietnamese likely needs a stronger context model before dictionary-only restoration is good enough.
See BACKLOG.md for roadmap notes.
ioDiacritics is not the first diacritics restoration system. It is deliberately a small
offline Swift package for product use.
Related work includes:
- `turanjanin/serbian-language-tools`: PHP library for Serbian transliteration and diacritic restoration.
- `clarinsi/redi`: Croatian, Serbian, and Slovene restoration tool with optional language models.
- BERT/Transformer-based diacritics restoration research for many languages.
The distinguishing feature here is the packaging and product shape: SwiftPM, bundled BCS language packs, no network, no AI model, high edit precision, and keyboard/clipboard-friendly APIs.
- CHANGELOG.md: version history.
- RESEARCH.md: paper-facing architecture and validation snapshot.
- research/: validation reports and corpus matrix notes.
- tools/: dictionary build and evaluation scripts.
- NOTICE.md: attribution and bundled-data notices.
- docs/DATA_LICENSE_AUDIT.md: data licensing audit.
Swift source code, tests, scripts, and documentation are licensed under the MIT License. See LICENSE.
The bundled JSON dictionaries are generated data artifacts derived from third-party lexical and corpus resources. They have separate provenance and attribution requirements. See NOTICE.md and docs/DATA_LICENSE_AUDIT.md before publishing or shipping a public binary that includes the dictionaries.
ioDiacritics is provided "as is", without warranties of any kind, to the fullest extent permitted by applicable law. By downloading or using it you accept the terms of use.
