This library contains data validation and cleaning routines that are meant to be precise. These methods are supported by resource data. The resource data is derived from real-world datasets and meant to cover common cases of data variations in corporate registries, and AML/KYC screening data.
rigour is a hybrid Python + Rust library. The Python package under
rigour/ is the public surface; a Rust crate under rust/
(rigour-core) is compiled by maturin into the private
rigour._core extension module. Importers should always go through
the nominated Python modules (rigour.text.*, rigour.names.*,
rigour.territories.*, rigour.addresses.*, rigour.ids.*, …);
rigour._core is an implementation detail and may change shape
between versions.
| Area | Python entry point | Rust module |
|---|---|---|
Phonetics (metaphone, soundex) |
rigour.text.phonetics |
rust/src/text/phonetics.rs |
Unicode scripts (codepoint_script, text_scripts, can_latinize) |
rigour.text.scripts |
rust/src/text/scripts.rs |
Transliteration (should_ascii, maybe_ascii — narrow, 6 admitted scripts) |
rigour.text.translit |
rust/src/text/translit.rs |
Flag-based normalisation (normalize, Normalize, Cleanup) |
rigour.text.normalize |
rust/src/text/normalize.rs |
Tokenisation (tokenize_name — Rust version) |
— (internal; Python copy still lives at rigour.names.tokenize.tokenize_name) |
rust/src/text/tokenize.rs |
Numeric-string parsing (string_number) |
rigour._core.string_number (no Python wrapper) |
rust/src/text/numbers.rs |
Org-type replacement (replace_org_types_compare / _display, remove_org_types, extract_org_types) |
rigour.names.org_types |
rust/src/names/org_types.rs + matcher.rs |
| AC name tagger | — (internal; called from analyze_names) |
rust/src/names/tagger.rs |
Name analysis pipeline (analyze_names, pair_symbols) |
rigour.names.analyze, rigour.names.symbol |
rust/src/names/analyze.rs + pairing.rs |
Name picking (pick_name, pick_case, reduce_names) |
rigour.names.pick |
rust/src/names/pick.rs |
Symbol / SymbolCategory |
rigour.names.symbol (re-exports Rust class) |
rust/src/names/symbol.rs |
| Territories database | rigour.territories.* |
rust/src/territories.rs (reads the JSONL blob) |
Broad-script transliteration (Han / Arabic / Devanagari / …) is
deliberately not in Rust — internal callers in rigour.names.part,
rigour.addresses.normalize, rigour.ids.strict,
rigour.territories.util call normality.ascii_text /
normality.latinize_text directly. rigour.text.translit.maybe_ascii
is the minimal self-contained primitive over Latin / Cyrillic / Greek
/ Armenian / Georgian / Hangul; non-admitted scripts identity-pass
(drop=False) or return empty (drop=True).
Source of truth lives under resources/ (YAML + a few text blobs).
genscripts/ regenerates per-consumer artifacts under rust/data/
(e.g. rust/data/names/stopwords.json,
rust/data/territories/data.jsonl,
rust/data/names/person_names.txt) — these are committed. Large
blobs (person-names corpus, territories JSONL) get zstd-compressed
at crate-build time by rust/build.rs and embedded into the binary
via include_bytes!.
The Python-side rigour/data/ directory is mostly drained: the
wordlist / stopword / ordinals / names / territories / org-types
tables all moved to Rust and are read via rigour._core.*_list() /
rigour._core.*_dict() accessors. Two Python data modules are still
there because their ports haven't happened yet: rigour/data/langs/ iso639.py (ISO-639 tables for rigour.langs) and
rigour/data/addresses/ (format templates for rigour.addresses).
Treat them the same as any Python source — not a pattern to
replicate for new data.
make rust-data regenerates everything under rust/data/ from
resources/. CI runs it and fails on diff, so after editing anything
under resources/ you need to run make rust-data and commit the
regenerated artifacts.
make develop→maturin develop --releaserebuilds the extension into your active venv. Always use release builds for benchmarks and any performance testing — ICU4X's trie-heavy paths are ~100× slower in debug.make develop-debugis the fast-iteration alternative when you don't care about runtime speed of the compiled code.- After editing
rust/src/**, rebuild withmake develop(ormake develop-debug) before runningpytest/mypy, otherwise tests will be running against the previous.so. rigour/_core.pyiis the type stub for the extension; update it whenever you add / change a PyO3 entry point.rigour/py.typedalready marks the package typed for downstreammypy --strict.
- When supplementing the YAML resources in this directory, always prioritise precision over quantity.
- In alias/symbol mappings, make sure to include only aliases that would be commonly used in a business database.
- In mappings where common misspellings or variations are supplied, propose variants found in contractual documents, such as partial abbreviations, misspellings, simplifications, etc.
- Our resources should target always supporting the following languages: English, French, Spanish, Russian, Ukrainian, Arabic, Simplified Chinese, Korean, Japanese, Portuguese (Brazilian and European), Turkish, Polish, German, Swedish, Norwegian, Danish, Lithuanian, Estonian, Finnish, Hungarian, Dutch
- Generate fully-typed, minimal Python code.
- Always explicitly check
if x is None:, notif x: - Run tests using
pytest --cov rigour - Run typechecking using
mypy --strict rigour
- Run
make rust-fmtbefore committing any Rust changes — CI runscargo fmt --checkand fails on drift.make rust-fmt-checkmirrors the CI check locally. - Run
cargo clippy --all-targets -- -D warnings(both with and without--features python) before committing — CI treats warnings as errors.
Public API docs are built with mkdocs Material + mkdocstrings (Python
handler). The docs/*.md pages are thin — most of them just do
::: rigour.module.name and let mkdocstrings pull docstrings from
the source. Which means the docstring is the docs: there is no
separate hand-written reference.
- Use Google-style docstrings: prose intro, then explicit
Args:,Returns:, optionallyRaises:sections. - Document every parameter a public function takes — including the obvious ones. The Args section renders as a Parameters table in the mkdocs site, and skipped params show as blank rows.
- For classes (including
IntFlag/IntEnum), document members under anAttributes:section in the class docstring — it renders as a named table with descriptions. - Cross-reference other rigour symbols with mkdocs-autorefs
Markdown syntax:
[Display][full.dotted.path]or the shorthand[full.dotted.path][]when you want the name itself as the label. Example:[Normalize][rigour.text.normalize.Normalize]or just[rigour.text.normalize][]for a module. Sphinx-style:class:/:func:roles do not render here — they come out as literal text. The Type column of a Parameters table auto-links type annotations without any markup, so description text rarely needs an explicit link — plain backticks are fine when you just want code formatting. - Prefer one canonical reference per concept. When many functions
share a parameter with non-obvious semantics (e.g.
normalize_flags,cleanup), write the long-form explanation once in the owning module's docstring and have individual function docstrings just link to it via:class:/:mod:. This is howrigour.names.org_typesdefers torigour.text.normalizefor the flag vocabulary. - Verify rendering before landing a docs-affecting change:
mkdocs build --strict -f mkdocs.yml(treats warnings as errors). If a new submodule is added, remember to add::: rigour.my.submoduleto the relevantdocs/*.md— mkdocstrings doesn't auto-recurse.