|
1 | | -# About *B3DB* |
2 | | - |
3 | | -In this repo, we present a large benchmark dataset, [Blood-Brain Barrier Database (B3DB)](https://www.nature.com/articles/s41597-021-01069-5), compiled |
4 | | -from 50 published resources (as summarized at |
5 | | -[raw_data/raw_data_summary.tsv](raw_data/raw_data_summary.tsv)) and categorized based on |
6 | | -the consistency between different experimental references/measurements. This dataset was [published in Scientific Data](https://www.nature.com/articles/s41597-021-01069-5) and this repository is occasionally updated with new experimental data. Scientists who would like to contribute data should contact the database's maintainers (e.g., by creating a new issue in this repository). |
7 | | - |
8 | | -A subset of the |
9 | | -molecules in B3DB has numerical `logBB` values (1058 compounds), while the whole dataset |
10 | | -has categorical (BBB+ or BBB-) BBB permeability labels (7807 compounds). Some physicochemical properties |
11 | | -of the molecules are also provided. |
12 | | - |
13 | | -## Citation |
14 | | - |
15 | | -Please use the following citation in any publication using our *B3DB* dataset: |
16 | | - |
17 | | -```bibtex |
18 | | -@article{Meng_A_curated_diverse_2021, |
19 | | -author = {Meng, Fanwang and Xi, Yang and Huang, Jinfeng and Ayers, Paul W.}, |
20 | | -doi = {10.1038/s41597-021-01069-5}, |
21 | | -journal = {Scientific Data}, |
22 | | -number = {289}, |
23 | | -title = {A curated diverse molecular database of blood-brain barrier permeability with chemical descriptors}, |
24 | | -volume = {8}, |
25 | | -year = {2021}, |
26 | | -url = {https://www.nature.com/articles/s41597-021-01069-5}, |
27 | | -publisher = {Springer Nature} |
28 | | -} |
29 | | -``` |
30 | | - |
31 | | -## Features of *B3DB* |
32 | | - |
33 | | -1. The largest dataset with numerical and categorical blood-brain barrier permeability values for small molecules |
34 | | - (to the best of our knowledge, as of February 25, 2021). |
35 | | - |
36 | | -2. Inclusion of stereochemistry information with isomeric SMILES with chiral specifications if |
37 | | - available. Otherwise, canonical SMILES are used. |
38 | | - |
39 | | -3. Characterization of uncertainty of experimental measurements by grouping the collected molecular |
40 | | - data records. |
41 | | - |
42 | | -4. Extended datasets for numerical and categorical data with precomputed physicochemical properties |
43 | | - using [mordred](https://github.com/mordred-descriptor/mordred). |
44 | | - |
45 | | -## Usage |
46 | | - |
47 | | -There are two types of datasets in [B3DB](B3DB), [regression data](B3DB/B3DB_regression.tsv) |
48 | | -and [classification data](B3DB/B3DB_classification.tsv), and both can be loaded with *pandas*. For example: |
49 | | - |
50 | | -```python |
51 | | -import pandas as pd |
52 | | - |
53 | | -# load regression dataset |
54 | | -regression_data = pd.read_csv("B3DB/B3DB_regression.tsv", |
55 | | - sep="\t") |
56 | | - |
57 | | -# load classification dataset |
58 | | -classification_data = pd.read_csv("B3DB/B3DB_classification.tsv", |
59 | | - sep="\t") |
60 | | - |
61 | | -# load extended regression dataset |
62 | | -regression_data_extended = pd.read_csv("B3DB/B3DB_regression_extended.tsv.gz", |
63 | | - sep="\t", compression="gzip") |
64 | | - |
65 | | -# load extended classification dataset |
66 | | -classification_data_extended = pd.read_csv("B3DB/B3DB_classification_extended.tsv.gz", |
67 | | - sep="\t", compression="gzip") |
68 | | - |
69 | | -``` |
70 | | - |
71 | | -We also have three examples to show how to use our dataset, |
72 | | -[numerical_data_analysis.ipynb](notebooks/numerical_data_analysis.ipynb), |
73 | | -[PCA_projection_fingerprint.ipynb](notebooks/PCA_projection_fingerprint.ipynb) and |
74 | | -[PCA_projection_descriptors.ipynb](notebooks/PCA_projection_descriptors.ipynb). |
75 | | -[PCA_projection_descriptors.ipynb](notebooks/PCA_projection_descriptors.ipynb) uses precomputed |
76 | | -chemical descriptors for visualization of chemical space of `B3DB`, and can be used directly |
77 | | -using *MyBinder*, |
78 | | -[launch Binder](https://mybinder.org/v2/gh/theochem/B3DB/main?filepath=notebooks%2FPCA_projection_descriptors.ipynb). |
79 | | -Due to the difficulty of installing `RDKit` in *MyBinder*, only |
80 | | -`PCA_projection_descriptors.ipynb` is set up in *MyBinder*. |
| 1 | +# Data Curation Process of B3DB |
81 | 2 |
|
82 | 3 | ## Working environment setting up |
83 | 4 |
|
@@ -136,21 +57,3 @@ pip install -r requirements.txt |
136 | 57 | ``` |
137 | 58 |
|
138 | 59 | `ALOGPS` version 2.1 can be accessed at http://www.vcclab.org/lab/alogps/. |
139 | | - |
140 | | -The materials and data under this repo are distributed under the |
141 | | -[CC0 Licence](http://creativecommons.org/publicdomain/zero/1.0/). |
142 | | - |
143 | | -## Update: New External Dataset Available |
144 | | - |
145 | | -We’ve expanded the B3DB dataset by adding a new file: `B3DB_classification_external.tsv`. This file introduces additional compounds (171 BBB+ and 4 BBB-) that were not present in the original B3DB dataset. These compounds were carefully selected and incorporated to further enrich B3DB. |
146 | | - |
147 | | -### Usage |
148 | | - |
149 | | -To load and work with the new classification data in Python, you can use the following code snippet: |
150 | | - |
151 | | -```python |
152 | | -import pandas as pd |
153 | | - |
154 | | -# Load the new external classification dataset |
155 | | -external_classification_data = pd.read_csv("B3DB/B3DB_classification_external.tsv", sep="\t") |
156 | | -``` |