Thai natural language processing library in Rust, with Python and Node bindings.

PyThaiNLP/nlpo3

SPDX-FileCopyrightText: 2024-2026 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0

nlpO3

A Thai natural language processing library written in Rust with optional Python and Node.js bindings. Formerly known as oxidized-thainlp.

Using in a Rust project

cargo add nlpo3

Using in a Python project

pip install nlpo3

Features

  • Thai word tokenizer
    • Uses a maximal-matching, dictionary-based tokenization algorithm and respects Thai Character Cluster boundaries.
      • Approximately 2.5× faster than the comparable pure-Python implementation (PyThaiNLP's newmm).
    • Loads a dictionary from a plain-text file (one word per line) or from a Vec<String>.
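
To give a feel for dictionary-based matching, here is a minimal sketch of greedy longest-match segmentation using only the standard library. It is a deliberate simplification and not nlpO3's actual newmm implementation: it ignores Thai Character Cluster boundaries and does no ambiguity resolution. The function name `longest_match` and the tiny dictionary are hypothetical, chosen for illustration.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation (illustrative only, not nlpO3's newmm):
/// at each position, take the longest dictionary word that starts there;
/// if no word matches, emit a single character and move on.
fn longest_match(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // The longest dictionary entry (in characters) bounds the search window.
    let max_len = dict
        .iter()
        .map(|w| w.chars().count())
        .max()
        .unwrap_or(0)
        .max(1);
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let longest_end = (start + max_len).min(chars.len());
        // Try the longest candidate first, shrinking until a word matches.
        let end = (start + 1..=longest_end)
            .rev()
            .find(|&end| {
                let candidate: String = chars[start..end].iter().collect();
                dict.contains(candidate.as_str())
            })
            .unwrap_or(start + 1); // unknown character: emit it alone
        tokens.push(chars[start..end].iter().collect());
        start = end;
    }
    tokens
}

fn main() {
    // Hypothetical four-word dictionary for demonstration.
    let dict: HashSet<&str> = HashSet::from(["ห้อง", "ห้องสมุด", "สมุด", "ประชาชน"]);
    let tokens = longest_match("ห้องสมุดประชาชน", &dict);
    println!("{}", tokens.join("|")); // prints "ห้องสมุด|ประชาชน"
}
```

The sketch prefers "ห้องสมุด" over the shorter "ห้อง" because it always tries the longest candidate first. A production tokenizer additionally has to avoid cutting inside a character cluster, which this sketch does not attempt.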

Use

Node.js binding

See nlpo3-nodejs.

Python binding

Example:

from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")

See more at nlpo3-python.

Rust library

Add as a dependency

To add nlpo3 to your project's dependencies:

cargo add nlpo3

This updates Cargo.toml with:

[dependencies]
nlpo3 = "1.4.0"

Example

Create a tokenizer from a dictionary file and use it to tokenize a string (safe mode = true, parallel mode = false):

use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();

Create a tokenizer from a vector of strings:

let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
// Declared `mut` so that words can be added or removed later.
let mut tokenizer = NewmmTokenizer::from_word_list(words);
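
If the word list lives in your own text source rather than in a file path you can hand to the tokenizer, one possible way to build the Vec<String> is sketched below. The helper `parse_word_list` is hypothetical and not part of nlpO3. Trimming whitespace and skipping blank lines are assumptions about what a clean word list should look like, not documented library behavior.

```rust
/// Parse dictionary text (one word per line) into a word list,
/// trimming surrounding whitespace and skipping blank lines.
/// Hypothetical helper; not part of the nlpo3 crate.
fn parse_word_list(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

fn main() {
    // In practice the text might come from
    // std::fs::read_to_string("path/to/dict.file").
    let contents = "ปาลิเมนต์\n\n  คอนสติติวชั่น  \n";
    let words = parse_word_list(contents);
    println!("{:?}", words);
    // `words` could then be passed to NewmmTokenizer::from_word_list(words).
}
```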

Add words to an existing tokenizer:

tokenizer.add_word(&["มิวเซียม"]);

Remove words from an existing tokenizer:

tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);

Command-line interface

Example:

echo "ฉันกินข้าว" | nlpo3 segment

See more at nlpo3-cli.

Dictionary

  • To keep the library small, nlpO3 does not bundle a dictionary. The dictionary-based word tokenizer requires one, so users must supply their own.
  • For tokenization dictionary, try

Build

Requirements

Steps

Run the test suite:

cargo test

Build the API documentation and open it in a browser to check:

cargo doc --open

Build (remove --release to keep debug information):

cargo build --release

Check target/ for build artifacts.

Develop

Development document

Issues

License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache License 2.0 (Apache-2.0). See the LICENSE file for details.