Skip to content

Cleanup works + bug fixes#15

Open
xiaoruiDong wants to merge 28 commits into
PattanaikL:mainfrom
xiaoruiDong:main
Open

Cleanup works + bug fixes#15
xiaoruiDong wants to merge 28 commits into
PattanaikL:mainfrom
xiaoruiDong:main

Conversation

@xiaoruiDong

@xiaoruiDong xiaoruiDong commented Apr 19, 2023

Copy link
Copy Markdown

Motivation

Recently, I was trying to enable the GeoMol embedder in RDMC.conformer_generation on my M2 Macbook. Since I am using a completely different environment with different versions, inevitably, I ran into several issues (e.g., issue #14 and dimension mismatch errors) and tried to fix them.

Environment (most relevant dependencies)

  • MacOS Ventura 13.3.1
  • python 3.8.16 (conda-forge)
  • pytorch 1.13.1 (pytorch)
  • rdkit 2022.09.4 (conda-forge)
  • pytorch_geometric 2.2.0 (local build)
  • torch-scatter 2.1.0 (pip)
  • torch-sparse 0.6.16 (pip)
  • torchmetrics 0.11.1 (conda-forge)

Changes and Why in such a way

In order to understand what went wrong, I first cleaned up the featurization module (major effort) and then targeted two major causes 1. Batch.from_data_list performs differently than expected. 2. there are changes in the newer version of PyG.

For (1), before the discussion, I need to say that I only tried Batch.from_data_list to form a batch of data, and all discussions are based on that. In my trials, I first featurize molecules into torch_geometric.Data (data1, data2, ...) and then use Batch.from_data_list([data1, data2, ...]) to naively construct a Batch (usually termed data in this repo). According to the function utils.get_neighbor_ids, data.neighbors is expected to be a list of dict. But, I found what Batch.from_data_list does is taking the keys in the data1.neighbors and extending the values according to data2.neighbors, data3.neighbors, ..... So it actually ends up with a single dict object. I tried to identify how such differences are introduced but failed, and I ended up creating a customized from_data_list in commit "f341404".

   if tg_version_ge_2:
        batch_data = Batch.from_data_list(data_list,
                                          exclude_keys=['neighbors'])
        batch_data.neighbors = [d.neighbors for d in data_list]
    else:
        batch_data = Batch.from_data_list(data_list)
    return batch_data

To avoid introducing new bugs, I used version>2 (therefore, won't influence version < 2, which is recommended) identifying whether a special treatment to data.neighbors is needed. However, I believe the action in the if can also be applied to version < 1 without issues.

For (2), I found a dimension mismatch when calling (h_mol = self.h_mol_mlp(global_add_pool(x2, batch)) in model.py line 248). It is due to PR#4827 in PyG changing how scatter is called in global_add_pool. In new implementations, dim now defaults to -2 instead of 0. I simply avoid using global_add_pool but directly call scatter to avoid the issue despite the versions.

Other comments

I modified featurization.py a lot, mostly to increase readability for myself. I also changed module names and added a few miscellaneous files for my convenience in importing modules. I can create another PR that only contains the change (1) and (2) I highlighted above if you think it is more appropriate.

This can be useful for other softwares using rdkit
PR#4827 in PYG changes how scatter is called, defaults to the dim=-2 instead of 0. This change directly calls scatter instead of global_add_pool to avoid the incompatibility.
1. Modularization 2. able to user provided CUDA version 3. Found a workaround for Mchip Mac 4. Remove the constraints for deps 5. backup the original dep constraints in environment_reproduce.yml
Originally, model is assigned to GPU whenever possible. This causes conflicts when using a model on a CPU but on a machine with GPU.
angle_mask_ref and angle_combos are predefined without awareness of where the operation will be. Add a temporary workaround.
Previously, inference's device are kind of hardcoded. This commit makes sure that device are corrected assigned, so that you can run inference on CPU or GPU on a machine with GPU or inference on CPU on a machine without GPU
Molecule CN1C2=C(C=C(C=C2)Cl)C(=NCC1=O)C3=CC=CC=C3 helps identify this problem
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant