C++ TfidfVectorizer

Convert raw documents to a matrix of TF-IDF features.

Requirements:

Armadillo, g++ and Boost. On Debian or Ubuntu, install them with:

sudo apt install g++ libboost-all-dev libarmadillo-dev

Compiling and running the example in main.cc:

g++ main.cc src/tfidf_vectorizer.cc -larmadillo -std=c++11 && ./a.out

Features:

Tokenizes raw documents.
Works with both tf-idf and binary values.
Can restrict the output to a selected number of features (the ones with the highest idf).
Offers an interface similar to sklearn's: fit, transform and fit_transform methods, as well as idf_ and vocabulary_ members. This is not a port of sklearn's TfidfVectorizer, but it tries to mimic its behavior. The example given here produces the same tf-idf matrix as the one in the sklearn documentation at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
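For readers who want to see what such a matrix contains, here is a minimal standard-library sketch of the computation, assuming sklearn's default settings (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document). It is not this library's implementation and does not use its API: the names Tokenize, TfidfResult and FitTransform are hypothetical, and the tokenizer only approximates sklearn's default token pattern.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: lowercase the text, split on non-alphanumeric
// characters and keep tokens of at least two characters.
std::vector<std::string> Tokenize(const std::string& text) {
  std::vector<std::string> tokens;
  std::string current;
  for (unsigned char c : text) {
    if (std::isalnum(c)) {
      current.push_back(static_cast<char>(std::tolower(c)));
    } else {
      if (current.size() >= 2) tokens.push_back(current);
      current.clear();
    }
  }
  if (current.size() >= 2) tokens.push_back(current);
  return tokens;
}

// Hypothetical result type: features in rows, documents in columns.
struct TfidfResult {
  std::map<std::string, std::size_t> vocabulary;  // term -> row index
  std::vector<double> idf;                        // one value per term
  std::vector<std::vector<double>> matrix;        // matrix[term][document]
};

TfidfResult FitTransform(const std::vector<std::string>& documents) {
  assert(!documents.empty());
  const std::size_t n_docs = documents.size();

  // Tokenize every document and build an alphabetically sorted vocabulary.
  TfidfResult result;
  std::vector<std::vector<std::string>> tokenized;
  for (const auto& doc : documents) {
    tokenized.push_back(Tokenize(doc));
    for (const auto& tok : tokenized.back()) result.vocabulary[tok] = 0;
  }
  std::size_t index = 0;
  for (auto& entry : result.vocabulary) entry.second = index++;

  const std::size_t n_terms = result.vocabulary.size();
  result.matrix.assign(n_terms, std::vector<double>(n_docs, 0.0));

  // Raw term counts per document.
  for (std::size_t d = 0; d < n_docs; ++d)
    for (const auto& tok : tokenized[d])
      result.matrix[result.vocabulary[tok]][d] += 1.0;

  // Smoothed idf: ln((1 + n_docs) / (1 + df)) + 1.
  result.idf.assign(n_terms, 0.0);
  for (std::size_t t = 0; t < n_terms; ++t) {
    std::size_t df = 0;
    for (std::size_t d = 0; d < n_docs; ++d)
      if (result.matrix[t][d] > 0.0) ++df;
    result.idf[t] = std::log((1.0 + n_docs) / (1.0 + df)) + 1.0;
  }

  // Weight the counts by idf, then L2-normalize each document (column).
  for (std::size_t d = 0; d < n_docs; ++d) {
    double squared_norm = 0.0;
    for (std::size_t t = 0; t < n_terms; ++t) {
      result.matrix[t][d] *= result.idf[t];
      squared_norm += result.matrix[t][d] * result.matrix[t][d];
    }
    if (squared_norm > 0.0) {
      const double norm = std::sqrt(squared_norm);
      for (std::size_t t = 0; t < n_terms; ++t) result.matrix[t][d] /= norm;
    }
  }
  return result;
}
```

Calling FitTransform on the four-sentence corpus from the sklearn documentation page yields a vocabulary of 9 terms, and the weight of "document" in the first sentence comes out at about 0.4698.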

Notes:

Features are in rows and documents (objects) are in columns.
This is the opposite of the usual Python convention, but it is the default in C++ libraries such as MLPack.
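To compare results with a documents-by-features matrix, such as the one sklearn prints, the matrix has to be transposed. The sketch below shows the idea with plain standard-library containers; the Matrix alias and the Transpose function are hypothetical names used only for illustration. With an Armadillo matrix, the .t() member function gives the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical alias: a dense matrix stored as a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Turn a features-by-documents matrix into a documents-by-features one.
Matrix Transpose(const Matrix& features_by_docs) {
  assert(!features_by_docs.empty());
  const std::size_t n_features = features_by_docs.size();
  const std::size_t n_docs = features_by_docs[0].size();
  Matrix docs_by_features(n_docs, std::vector<double>(n_features, 0.0));
  for (std::size_t f = 0; f < n_features; ++f) {
    // Every row must have one entry per document.
    assert(features_by_docs[f].size() == n_docs);
    for (std::size_t d = 0; d < n_docs; ++d)
      docs_by_features[d][f] = features_by_docs[f][d];
  }
  return docs_by_features;
}
```

With features in rows, the weight of feature f in document d is read as m[f][d]; after Transpose it is read as m[d][f].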

Optional: unit tests

Install Catch2:

git clone https://github.com/catchorg/Catch2.git # somewhere else
cd Catch2
cmake -Bbuild -H. -DBUILD_TESTING=OFF
sudo cmake --build build/ --target install

Run the tests:

cd tests/
g++ t1.cc -larmadillo -std=c++11 -o tests
./tests
