Cleanup works + bug fixes by xiaoruiDong · Pull Request #15 · PattanaikL/GeoMol

xiaoruiDong · 2023-04-19T02:55:06Z

Motivation

Recently, I was trying to enable the GeoMol embedder in RDMC.conformer_generation on my M2 Macbook. Since I am using a completely different environment with different versions, inevitably, I ran into several issues (e.g., issue #14 and dimension mismatch errors) and tried to fix them.

Environment (most relevant dependencies)

MacOS Ventura 13.3.1
python 3.8.16 (conda-forge)
pytorch 1.13.1 (pytorch)
rdkit 2022.09.4 (conda-forge)
pytorch_geometric 2.2.0 (local build)
torch-scatter 2.1.0 (pip)
torch-sparse 0.6.16 (pip)
torchmetrics 0.11.1 (conda-forge)

Changes and Why in such a way

In order to understand what went wrong, I first cleaned up the featurization module (major effort) and then targeted two major causes 1. Batch.from_data_list performs differently than expected. 2. there are changes in the newer version of PyG.

For (1), before the discussion, I need to say that I only tried Batch.from_data_list to form a batch of data, and all discussions are based on that. In my trials, I first featurize molecules into torch_geometric.Data (data1, data2, ...) and then use Batch.from_data_list([data1, data2, ...]) to naively construct a Batch (usually termed data in this repo). According to the function utils.get_neighbor_ids, data.neighbors is expected to be a list of dict. But, I found what Batch.from_data_list does is taking the keys in the data1.neighbors and extending the values according to data2.neighbors, data3.neighbors, ..... So it actually ends up with a single dict object. I tried to identify how such differences are introduced but failed, and I ended up creating a customized from_data_list in commit "f341404".

   if tg_version_ge_2:
        batch_data = Batch.from_data_list(data_list,
                                          exclude_keys=['neighbors'])
        batch_data.neighbors = [d.neighbors for d in data_list]
    else:
        batch_data = Batch.from_data_list(data_list)
    return batch_data

To avoid introducing new bugs, I used version>2 (therefore, won't influence version < 2, which is recommended) identifying whether a special treatment to data.neighbors is needed. However, I believe the action in the if can also be applied to version < 1 without issues.

For (2), I found a dimension mismatch when calling (h_mol = self.h_mol_mlp(global_add_pool(x2, batch)) in model.py line 248). It is due to PR#4827 in PyG changing how scatter is called in global_add_pool. In new implementations, dim now defaults to -2 instead of 0. I simply avoid using global_add_pool but directly call scatter to avoid the issue despite the versions.

Other comments

I modified featurization.py a lot, mostly to increase readability for myself. I also changed module names and added a few miscellaneous files for my convenience in importing modules. I can create another PR that only contains the change (1) and (2) I highlighted above if you think it is more appropriate.

This can be useful for other softwares using rdkit

PR#4827 in PYG changes how scatter is called, defaults to the dim=-2 instead of 0. This change directly calls scatter instead of global_add_pool to avoid the incompatibility.

1. Modularization 2. able to user provided CUDA version 3. Found a workaround for Mchip Mac 4. Remove the constraints for deps 5. backup the original dep constraints in environment_reproduce.yml

Originally, model is assigned to GPU whenever possible. This causes conflicts when using a model on a CPU but on a machine with GPU.

angle_mask_ref and angle_combos are predefined without awareness of where the operation will be. Add a temporary workaround.

Previously, inference's device are kind of hardcoded. This commit makes sure that device are corrected assigned, so that you can run inference on CPU or GPU on a machine with GPU or inference on CPU on a machine without GPU

Molecule CN1C2=C(C=C(C=C2)Cl)C(=NCC1=O)C3=CC=CC=C3 helps identify this problem

xiaoruiDong added 28 commits April 18, 2023 12:44

Add a setup file for the project

02378a0

Utilize unused imports in featurization

3079937

clean up imports in featurization

9563cbe

Move constant variables to the top in featurization

79c122f

Simplify dataset definition and reference

e5959c1

Add two convenient function check_mol and smiles_to_mol

2f1dbdb

Add a helper function to create necessary features

9af9a8d

Add a function to featurize molecule given a RDKit Mol object

2f51656

This can be useful for other softwares using rdkit

Bugfix: featurize mol missing returns

aa04db4

Simplify featurize_mol_from_smiles

1616e53

Simplify featurize_mol method of geom_confs

ae796e5

Add from_data_list to avoid errors in neighbors in pyg 2

f341404

Reame model to geomol

398897b

Rename model to geomol continue

0b0b64c

Adjust for the change in global_add_pool

39a60fb

PR#4827 in PYG changes how scatter is called, defaults to the dim=-2 instead of 0. This change directly calls scatter instead of global_add_pool to avoid the incompatibility.

update generate_confs with from_data_list

5e3cb5a

Add a dummy __init__ file to root for easy external import

a59e38b

AutoPEP8 correction

6a87769

Ignore Mac .DS_Store

cf1daef

Update the environment setup script

6bfda9c

1. Modularization 2. able to user provided CUDA version 3. Found a workaround for Mchip Mac 4. Remove the constraints for deps 5. backup the original dep constraints in environment_reproduce.yml

Add a convenient path to the trained models

acdc7a8

Add YAML as requirement

50475e0

Decouple model device assignment with available device

9d6f768

Originally, model is assigned to GPU whenever possible. This causes conflicts when using a model on a CPU but on a machine with GPU.

Workaround for batch_angles_from_coords

563ef0c

angle_mask_ref and angle_combos are predefined without awareness of where the operation will be. Add a temporary workaround.

introduce more accurate device control during inference

d8d4058

Previously, inference's device are kind of hardcoded. This commit makes sure that device are corrected assigned, so that you can run inference on CPU or GPU on a machine with GPU or inference on CPU on a machine without GPU

Make the model device agnotic

f8cccd4

Fix shapes for dihedral_pairs-related operations in model

689d572

Fix device mismatch during inference for a ring molecule

70341a4

Molecule CN1C2=C(C=C(C=C2)Cl)C(=NCC1=O)C3=CC=CC=C3 helps identify this problem

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup works + bug fixes#15

Cleanup works + bug fixes#15
xiaoruiDong wants to merge 28 commits into
PattanaikL:mainfrom
xiaoruiDong:main

xiaoruiDong commented Apr 19, 2023 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xiaoruiDong commented Apr 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Environment (most relevant dependencies)

Changes and Why in such a way

Other comments

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xiaoruiDong commented Apr 19, 2023 •

edited

Loading