| SPDX-FileCopyrightText | SPDX-License-Identifier |
|---|---|
| 2024-2026 PyThaiNLP Project | Apache-2.0 |

# nlpO3
A Thai natural language processing library written in Rust with optional
Python and Node.js bindings. Formerly known as oxidized-thainlp.
## Using in a Rust project

```bash
cargo add nlpo3
```

## Using in a Python project

```bash
pip install nlpo3
```

## Features

- Thai word tokenizer
  - Uses a maximal-matching, dictionary-based tokenization algorithm and respects Thai Character Cluster boundaries (see the sketch after this list).
  - Approximately 2.5× faster than the comparable pure-Python implementation (PyThaiNLP's newmm).
- Load a dictionary from a plain text file (one word per line) or from `Vec<String>`.
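The core idea of dictionary-based maximal matching can be sketched as a greedy longest-match loop. This is an illustration only, not the library's implementation: the real newmm algorithm additionally backtracks to find a globally maximal segmentation and never cuts inside a Thai Character Cluster, and the word set below is hypothetical.

```rust
use std::collections::HashSet;

/// Greedy longest-match segmentation over a word set.
/// Illustration only: unlike newmm, this neither backtracks for a
/// globally maximal matching nor respects Thai Character Cluster
/// boundaries.
fn longest_match_segment(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    // Work on char indices so we never split inside a UTF-8 sequence.
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        // Try the longest candidate first, shrinking until a match.
        let mut end = chars.len();
        loop {
            let candidate: String = chars[i..end].iter().collect();
            if dict.contains(candidate.as_str()) || end == i + 1 {
                // Unknown single characters fall through as one-char tokens.
                tokens.push(candidate);
                i = end;
                break;
            }
            end -= 1;
        }
    }
    tokens
}

fn main() {
    let dict: HashSet<&str> = HashSet::from(["ห้องสมุด", "ประชาชน"]);
    // Prints ["ห้องสมุด", "ประชาชน"]
    println!("{:?}", longest_match_segment("ห้องสมุดประชาชน", &dict));
}
```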
## Using in a Node.js project

See nlpo3-nodejs.
Python example:

```python
from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```

See more at nlpo3-python.
## Using the Rust library

To add nlpo3 to your project's dependencies:

```bash
cargo add nlpo3
```

This updates Cargo.toml with:

```toml
[dependencies]
nlpo3 = "1.4.0"
```

Create a tokenizer from a dictionary file and use it to tokenize a string (safe mode = true, parallel mode = false):
```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
```

Create a tokenizer from a vector of strings:
```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```

Add words to an existing tokenizer:

```rust
tokenizer.add_word(&["มิวเซียม"]);
```

Remove words from an existing tokenizer:

```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```
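For reference, the fragments above combine into a complete program. This is a minimal sketch that builds the tokenizer from an in-memory word list so it runs without a dictionary file; the word list and the `mut` binding are illustrative:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() {
    // Build a tokenizer from an in-memory word list (no dictionary file).
    let words = vec!["ห้องสมุด".to_string(), "ประชาชน".to_string()];
    let mut tokenizer = NewmmTokenizer::from_word_list(words);

    // Extend the dictionary at runtime.
    tokenizer.add_word(&["มิวเซียม"]);

    // safe mode = true, parallel mode = false, as in the example above.
    let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
    println!("{:?}", tokens); // expected: ["ห้องสมุด", "ประชาชน"]
}
```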
echo "ฉันกินข้าว" | nlpo3 segmentSee more at nlpo3-cli.
## Dictionary

- To keep the library small, nlpO3 does not include a dictionary; a dictionary is required for the dictionary-based word tokenizer, so users should provide their own.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP
    - approximately 62,000 words
    - CC0-1.0
  - the word break dictionary from libthai
    - consists of dictionaries in different categories, with a make script
    - LGPL-2.1
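Since these dictionaries are plain text with one word per line, you can also load one yourself and pass the result to `from_word_list`. A minimal sketch, assuming words_th.txt has already been downloaded into the current directory and contains the words being segmented:

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

fn main() -> std::io::Result<()> {
    // Read the dictionary file: one word per line, blank lines skipped.
    let words: Vec<String> = std::fs::read_to_string("words_th.txt")?
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();

    // Roughly what NewmmTokenizer::new("words_th.txt") does, but this way
    // you can filter or merge word lists before building the tokenizer.
    let tokenizer = NewmmTokenizer::from_word_list(words);

    // expected: something like ["ฉัน", "กิน", "ข้าว"], if those words
    // are present in the dictionary
    println!("{:?}", tokenizer.segment("ฉันกินข้าว", true, false).unwrap());
    Ok(())
}
```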
## Build and test

Generic test:

```bash
cargo test
```

Build the API documentation and open it for review:

```bash
cargo doc --open
```

Build (remove `--release` to keep debug information):

```bash
cargo build --release
```

Check `target/` for build artifacts.
## Issues

Please report issues at https://github.com/PyThaiNLP/nlpo3/issues.
## License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.