This is a classifier for arXiv papers. It is trained on ~50K papers from Kaggle and classifies a paper's category based on its title and/or abstract/description.
The model consists of 4 fully-connected linear layers, has about 11M parameters, and performs classification across 154 categories. Training uses backpropagation with the Adam optimizer and the BCEWithLogitsLoss criterion, and applies early stopping with patience = 5: training stops once the loss has failed to improve on the best observed loss for 5 consecutive epochs (i.e., the loss has stopped decreasing).
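As a rough, minimal sketch of the architecture and training loop described above (layer sizes, variable names, and the validation loop are illustrative assumptions, not the exact implementation):

import torch
import torch.nn as nn

NUM_CATEGORIES = 154  # arXiv categories from the taxonomy

class ArxivClassifier(nn.Module):
    """Four fully-connected layers over a TF-IDF input vector (hidden sizes are illustrative)."""
    def __init__(self, input_dim: int, num_classes: int = NUM_CATEGORIES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # raw logits; BCEWithLogitsLoss applies the sigmoid internally
        )

    def forward(self, x):
        return self.net(x)

def train(model, train_loader, val_loader, epochs, patience=5):
    """Adam + BCEWithLogitsLoss, with early stopping on the best observed loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.BCEWithLogitsLoss()
    best_loss, epochs_without_improvement = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for features, labels in train_loader:  # labels are float multi-hot vectors of size 154
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()

        # Early stopping: stop once the loss has not improved for `patience` consecutive epochs.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(f), l).item() for f, l in val_loader) / len(val_loader)
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break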
The tool is implemented in Python using the PyTorch API, as well as the ONNX API when ONNX is selected as the output format. TF-IDF vectorization is applied to the tokens.
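A minimal sketch of the TF-IDF step, assuming scikit-learn's TfidfVectorizer; the feature count, vectorizer settings, and sample texts are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer
import torch

# Title and abstract are combined into a single text field per paper (illustrative).
texts = [
    "Attention Is All You Need. The dominant sequence transduction models ...",
    "Deep Residual Learning for Image Recognition. Deeper neural networks ...",
]

vectorizer = TfidfVectorizer(max_features=30000, stop_words="english")
tfidf = vectorizer.fit_transform(texts)  # sparse matrix of shape (n_papers, n_features)

# Dense float tensor fed to the fully-connected classifier.
features = torch.tensor(tfidf.toarray(), dtype=torch.float32)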
The system configuration is defined in config.json. In particular:
{
"taxonomy_path": "<The dataset taxonomy relative path.>",
"model_path": "<The base classifier model path. The extension is automatically added based on the -f parameter.>",
"dataset_path": <The dataset relative path.>,
"prediction_threshold": <The prediction threshold for inference.>
}
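For example, a filled-in config.json might look like the following (the paths and threshold value are purely illustrative):

{
"taxonomy_path": "datasets/taxonomy.csv",
"model_path": "models/arxiv_classifier",
"dataset_path": "datasets/arxiv_papers.json",
"prediction_threshold": 0.5
}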
You will need Python 3.5+ and pip. Run pip install -r requirements.txt.
You need to download the dataset and place it in the datasets folder.
After that, you can train the model by running python main.py -t <Epochs> -f <Format: (ONNX | PyTorch)>. Once training is complete, the model is automatically used for inference.
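As a sketch of how inference might apply the prediction_threshold from config.json (the function and variable names here are illustrative, not the tool's actual API):

import torch

def predict_categories(model, features, category_names, threshold=0.5):
    """Sigmoid over the logits, then keep every category whose probability passes the threshold."""
    model.eval()
    with torch.no_grad():
        probabilities = torch.sigmoid(model(features))  # BCEWithLogitsLoss trains on raw logits
    return [
        [category_names[i] for i, p in enumerate(row) if p >= threshold]
        for row in probabilities
    ]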
To load an already trained model without retraining, simply run python main.py -f <FORMAT>; this loads the saved model file. If no format is specified, a PyTorch (.pth) model is loaded.
You can choose the model format with the -f parameter, set to either ONNX or PyTorch, e.g., python main.py -t 100 -f ONNX, for both training and loading.
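A minimal sketch of how the two formats might be written and read back, assuming torch.save / torch.onnx.export and onnxruntime; the file paths and tensor names are illustrative, with the base path coming from model_path in config.json:

import torch
import onnxruntime as ort

# PyTorch format: save and reload the model weights (.pth).
torch.save(model.state_dict(), "models/arxiv_classifier.pth")
model.load_state_dict(torch.load("models/arxiv_classifier.pth"))

# ONNX format: export the traced graph (.onnx) using a dummy TF-IDF input.
dummy_input = torch.zeros(1, input_dim)  # input_dim = size of the TF-IDF vocabulary
torch.onnx.export(model, dummy_input, "models/arxiv_classifier.onnx",
                  input_names=["tfidf"], output_names=["logits"])

# Inference with the exported ONNX model via onnxruntime.
session = ort.InferenceSession("models/arxiv_classifier.onnx")
logits = session.run(None, {"tfidf": dummy_input.numpy()})[0]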
