| SPDX-FileCopyrightText | SPDX-License-Identifier |
|---|---|
| 2024-2026 PyThaiNLP Project | Apache-2.0 |

# nlpO3
A Thai natural language processing library written in Rust with optional
Python and Node.js bindings. Formerly known as oxidized-thainlp.
## Using in a Rust project

```bash
cargo add nlpo3
```

## Using in a Python project

```bash
pip install nlpo3
```

## Features

- Thai word tokenizer
  - Uses a maximal-matching, dictionary-based tokenization algorithm and respects Thai Character Cluster boundaries (see the sketch after this list).
  - Approximately 2.5× faster than the comparable pure-Python implementation (PyThaiNLP's newmm).
- Load a dictionary from a plain text file (one word per line) or from `Vec<String>`.
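The core idea of dictionary-based maximal matching can be sketched as a greedy longest-match loop. This is an illustration only, not the library's implementation: the real newmm algorithm additionally backtracks to find a globally maximal segmentation and never cuts inside a Thai Character Cluster, and the word set below is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a word set.
/// Illustration only: unlike newmm, this neither backtracks for a
/// globally maximal matching nor respects Thai Character Cluster
/// boundaries.
fn longest_match_segment(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    // Work on char indices so we never split inside a UTF-8 sequence.
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Try the longest candidate first, shrinking until a match.
        let mut end = chars.len();
        loop {
            let candidate: String = chars[i..end].iter().collect();
            if dict.contains(candidate.as_str()) || end == i + 1 {
                // Unknown single characters fall through as one-char tokens.
                tokens.push(candidate);
                i = end;
                break;
            }
            end -= 1;
        }
    }
    tokens
}

fn main() {
    let dict: HashSet<&str> = HashSet::from(["ห้องสมุด", "ประชาชน"]);
    // Prints ["ห้องสมุด", "ประชาชน"]
    println!("{:?}", longest_match_segment("ห้องสมุดประชาชน", &dict));
}
```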
## Using in a Node.js project

See nlpo3-nodejs.
Python example:

```python
from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```

See more at nlpo3-python.
## Using the Rust library

To add nlpo3 to your project's dependencies:

```bash
cargo add nlpo3
```

This updates Cargo.toml with:

```toml
[dependencies]
nlpo3 = "1.4.0"
```

Create a tokenizer from a dictionary file and use it to tokenize a string (safe mode = true, parallel mode = false):
```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
```

Create a tokenizer from a vector of strings:
```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```

Add words to an existing tokenizer:

```rust
tokenizer.add_word(&["มิวเซียม"]);
```

Remove words from an existing tokenizer:

```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```
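For reference, the fragments above combine into a complete program. This is a minimal sketch that builds the tokenizer from an in-memory word list so it runs without a dictionary file; the word list and the `mut` binding are illustrative:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() {
    // Build a tokenizer from an in-memory word list (no dictionary file).
    let words = vec!["ห้องสมุด".to_string(), "ประชาชน".to_string()];
    let mut tokenizer = NewmmTokenizer::from_word_list(words);

    // Extend the dictionary at runtime.
    tokenizer.add_word(&["มิวเซียม"]);

    // safe mode = true, parallel mode = false, as in the example above.
    let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
    println!("{:?}", tokens); // expected: ["ห้องสมุด", "ประชาชน"]
}
```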
echo "ฉันกินข้าว" | nlpo3 segmentSee more at nlpo3-cli.
## Dictionary

- To keep the library small, nlpO3 does not include a dictionary; a dictionary is required for the dictionary-based word tokenizer, so users should provide their own.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP
    - approximately 62,000 words
    - CC0-1.0
  - the word break dictionary from libthai
    - consists of dictionaries in different categories, with a make script
    - LGPL-2.1
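Since these dictionaries are plain text with one word per line, you can also load one yourself and pass the result to `from_word_list`. A minimal sketch, assuming words_th.txt has already been downloaded into the current directory and contains the words being segmented:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() -> std::io::Result<()> {
    // Read the dictionary file: one word per line, blank lines skipped.
    let words: Vec<String> = std::fs::read_to_string("words_th.txt")?
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    // Roughly what NewmmTokenizer::new("words_th.txt") does, but this way
    // you can filter or merge word lists before building the tokenizer.
    let tokenizer = NewmmTokenizer::from_word_list(words);

    // expected: something like ["ฉัน", "กิน", "ข้าว"], if those words
    // are present in the dictionary
    println!("{:?}", tokenizer.segment("ฉันกินข้าว", true, false).unwrap());
    Ok(())
}
```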
## Build and test

Generic test:

```bash
cargo test
```

Build the API documentation and open it for review:

```bash
cargo doc --open
```

Build (remove `--release` to keep debug information):

```bash
cargo build --release
```

Check `target/` for build artifacts.
## Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues.
## License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.