Convert raw documents to a matrix of TF-IDF features.
- Requirements: Armadillo, g++, Boost
sudo apt install g++ libboost-all-dev libarmadillo-dev
g++ main.cc src/tfidf_vectorizer.cc -larmadillo -std=c++11 && ./a.out
- Tokenizes raw documents.
- Works with both TF-IDF and binary values.
- Can restrict the output to a selected number of features (those with the highest idf).
- Interface similar to sklearn: fit, transform and fit_transform methods, as well as idf_ and vocabulary_ members. This is not a port of sklearn's TfidfVectorizer, but it mimics its behavior. The example given here produces the same TF-IDF matrix as the sklearn example at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html.
- Features are in rows, documents (objects) are in columns.
- This is the opposite of the usual Python convention (documents in rows, features in columns), but it is the default in C++ libraries such as mlpack.
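
To make the features above concrete, here is a minimal standard-library sketch of the computation. It is not this library's code and does not use Armadillo. The names `tokenize`, `tfidf` and `TfidfResult` are hypothetical, the tokenizer is a simplification (lowercase alphanumeric runs of at least two characters), and the weighting assumes sklearn's default settings: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document. The result is stored with features in rows and documents in columns, as described above.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: split text into lowercase alphanumeric tokens,
// keeping only tokens with at least two characters (a simplification).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string cur;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            cur += static_cast<char>(std::tolower(c));
        } else {
            if (cur.size() >= 2) tokens.push_back(cur);
            cur.clear();
        }
    }
    if (cur.size() >= 2) tokens.push_back(cur);
    return tokens;
}

// Hypothetical result type, loosely named after the sklearn members.
struct TfidfResult {
    std::map<std::string, std::size_t> vocabulary;  // term -> row index
    std::vector<double> idf;                        // one value per term
    std::vector<std::vector<double>> matrix;        // matrix[term][document]
};

// Assumes smoothed idf and L2-normalized documents (sklearn defaults).
TfidfResult tfidf(const std::vector<std::string>& docs) {
    assert(!docs.empty());
    const std::size_t n_docs = docs.size();

    // Tokenize every document and count document frequencies.
    std::vector<std::vector<std::string>> tokens;
    std::map<std::string, std::size_t> df;
    for (const auto& d : docs) {
        tokens.push_back(tokenize(d));
        std::set<std::string> uniq(tokens.back().begin(), tokens.back().end());
        for (const auto& t : uniq) ++df[t];
    }

    // std::map iterates in sorted order, so row indices follow sorted terms.
    TfidfResult r;
    for (const auto& kv : df) {
        r.vocabulary[kv.first] = r.idf.size();
        r.idf.push_back(std::log((1.0 + n_docs) / (1.0 + kv.second)) + 1.0);
    }

    // Fill one column per document: raw counts, then idf weighting, then L2 norm.
    r.matrix.assign(r.idf.size(), std::vector<double>(n_docs, 0.0));
    for (std::size_t j = 0; j < n_docs; ++j) {
        for (const auto& t : tokens[j]) r.matrix[r.vocabulary[t]][j] += 1.0;
        double norm = 0.0;
        for (std::size_t i = 0; i < r.idf.size(); ++i) {
            r.matrix[i][j] *= r.idf[i];
            norm += r.matrix[i][j] * r.matrix[i][j];
        }
        norm = std::sqrt(norm);
        if (norm > 0.0)
            for (std::size_t i = 0; i < r.idf.size(); ++i) r.matrix[i][j] /= norm;
    }
    return r;
}
```

Calling `tfidf({"the cat sat", "the dog"})` in this sketch yields a matrix with four rows (one per term) and two columns (one per document). A term that occurs in every document, such as "the" here, gets an idf of exactly 1 under the smoothed formula.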
- Install Catch2
git clone https://github.com/catchorg/Catch2.git # somewhere else
cd Catch2
cmake -Bbuild -H. -DBUILD_TESTING=OFF
sudo cmake --build build/ --target install
- Run tests
cd tests/
g++ t1.cc -larmadillo -std=c++11 -o tests
./tests