A metal-organic framework (MOF) recommendation system based on Doc2Vec.
We suggest installing the package into a separate conda environment with Python 3.8 or higher. Follow the steps below:
$ conda create -n mof2vec python=3.8 -y
$ conda activate mof2vec
$ git clone https://github.com/XiaoqZhang/mofgraph2vec.git
$ cd mofgraph2vec
$ python -m pip install -e .
- To prepare the data, you need a folder that contains a folder with all MOF CIFs and a .csv file with geometry and topology information. Examples are shown in example_data.
- Change the configurations in conf/. See Configuration parameters for more details.
- Navigate to the folder experiments/.
- Run the model with
$ python train.py
The first featurization run may take some time.
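Before the first run, it can help to confirm that every CIF has a matching row in the .csv file. The sketch below is not part of the package: the function name find_unlabelled_cifs is hypothetical, and it assumes the ID column stores CIF file names without the .cif extension.

```python
import csv
from pathlib import Path


def find_unlabelled_cifs(cif_dir, csv_path, id_column):
    """Return the CIF names (without extension) that have no row in the CSV.

    Assumption: the values in id_column are CIF file names without ".cif".
    """
    # Collect the stem of every .cif file in the folder.
    cif_names = {p.stem for p in Path(cif_dir).glob("*.cif")}
    # Collect the identifiers listed in the chosen column of the CSV file.
    with open(csv_path, newline="") as handle:
        labelled = {row[id_column] for row in csv.DictReader(handle)}
    # Names present on disk but missing from the table, in a stable order.
    return sorted(cif_names - labelled)
```

An empty list means every structure in the folder has geometry information available.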
- The pre-trained models for ARC-MOF and QMOF databases are attached in Releases.
- An example of loading the pre-trained models is provided in dev/example.ipynb.
- Examples are given in dev/example.ipynb.
You can easily tune the model parameters in the conf folder.
- mof2vec_data: Configuration specific to the MOF embedding data.
- mof2vec_model: Configuration related to the MOF embedding model.
- doc2label_data: Configuration for the data used to train the downstream regression model.
- doc2label_model: Configuration for the downstream regression model.
- sweep: Wandb hyperparameter sweeping configuration.
This section contains settings for wandb logging and experiment tracking.
Fixes the random seed to ensure reproducibility across experiments.
A boolean indicating whether wandb hyperparameter sweeping is enabled.
- mof2vec: Only run the MOF embedding.
- doc2label: Only run the downstream regression model. In this mode, pretraining should be set to False and MOF embeddings should be provided in embedding_path in the configuration file doc2label_data/default.yaml.
- workflow: Run the MOF embedding and use the embeddings as features for the downstream regression model.
- cif_path: The path to the folder that contains all the .cif files to embed.
- embed_label: A boolean indicating whether to parse geometric properties of MOFs.
- label_path: If embed_label is set to True, the .csv file that contains the geometry information should be provided.
- descriptors_to_embed: The numeric columns to parse in the .csv file.
- category_to_embed: The category columns to parse in the .csv file.
- id_column: The column that contains the CIF names.
- wl_step: The number of steps in extracting the rooted substructures.
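To illustrate what the number of steps in wl_step controls, here is a minimal sketch of generic Weisfeiler-Lehman relabeling on a small labelled graph. It is not the package's featurization code: the function name wl_words, the dictionary-based graph format, and the use of a truncated SHA-256 hash are assumptions made for illustration only.

```python
import hashlib


def wl_words(adjacency, labels, wl_step):
    """Return the 'words' of a graph document after wl_step WL iterations.

    adjacency: dict mapping node -> list of neighbour nodes
    labels:    dict mapping node -> initial label (e.g. an element symbol)
    """
    current = dict(labels)
    words = list(current.values())  # step 0: the plain node labels
    for _ in range(wl_step):
        updated = {}
        for node, neighbours in adjacency.items():
            # A rooted substructure: the node label plus its sorted neighbour labels.
            neighbour_part = ",".join(sorted(current[n] for n in neighbours))
            signature = current[node] + "|" + neighbour_part
            # Hash the signature so labels stay short as the radius grows.
            updated[node] = hashlib.sha256(signature.encode()).hexdigest()[:8]
        current = updated
        words.extend(current.values())
    return words
```

Each extra step adds one word per node describing a neighbourhood that is one bond larger, so a larger wl_step yields longer documents that encode wider substructures.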
- vector_size: Dimensionality of the embeddings.
- window: The maximum distance between the current and predicted word within a MOF document.
- min_count: Ignores all words with a total frequency lower than this.
- dm: Defines the training algorithm. If dm=1, ‘distributed memory’ (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
- sample: The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
- workers: The number of worker threads used to train the model (faster training on multicore machines).
- alpha: The initial learning rate.
- epochs: Number of iterations (epochs) over the corpus.
More parameters can be added. See gensim.models.doc2vec.Doc2Vec for details.
- pretraining: If set to True, the MOF embedding model is trained from scratch. If set to False, MOF embeddings are loaded from pre-trained models. In this case, either embedding_model_path or embedding_path should be provided.
- embedding_model_path: Pretrained embedding model path.
- embedding_path: MOF embeddings from pre-trained models.
- label_path: The path to the .csv file that contains the downstream regression data.
- task: The task column for the downstream regression model.
- MOF_id: The column that contains MOF names.
- train_frac: The training data size.
- test_frac: The test data size.
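As a rough picture of how train_frac, test_frac, and a fixed random seed can interact, the sketch below splits a list of MOF names reproducibly. It assumes both values are fractions of the dataset (suggested by the names but not stated here), and the helper split_ids is hypothetical rather than the package's own splitting routine.

```python
import random


def split_ids(mof_ids, train_frac, test_frac, seed):
    """Shuffle the IDs reproducibly and cut them into train and test subsets.

    Assumption: train_frac and test_frac are fractions between 0 and 1.
    """
    if train_frac + test_frac > 1.0 + 1e-9:
        raise ValueError("train_frac + test_frac must not exceed 1")
    shuffled = sorted(mof_ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)  # same seed, same permutation
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    # The two slices never overlap, so no MOF lands in both subsets.
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```

Calling the function twice with the same seed returns the same subsets, which is what makes downstream scores comparable between runs.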
Specifies whether to run a grid search. If True, all the values in params should be provided as lists, and the grid search is performed with 5-fold cross-validation. If False, all the values should be float or int.
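The list requirement follows from how a grid is built: every combination of the listed values becomes one candidate. The sketch below shows that expansion with the standard library only. The helper expand_grid and the example values are illustrative assumptions, not the package's code or its defaults.

```python
from itertools import product


def expand_grid(params):
    """Turn {'name': [v1, v2], ...} into a list of single-valued parameter dicts."""
    for name, values in params.items():
        if not isinstance(values, list):
            raise TypeError(f"grid search expects a list for '{name}'")
    names = list(params)
    # One candidate per element of the Cartesian product of all value lists.
    return [dict(zip(names, combo)) for combo in product(*(params[n] for n in names))]


# Illustrative values only.
grid = expand_grid({"max_depth": [3, 6], "learning_rate": [0.05, 0.1, 0.3]})
# With 5-fold cross-validation, each candidate is fitted once per fold.
n_fits = len(grid) * 5
```

Because the number of fits grows multiplicatively with each added list entry, short lists keep the search affordable.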
The hyperparameters for the supervised XGBoost model.
Number of jobs to run in parallel.
This project is licensed under the MIT License. See the LICENSE file for more information.