Bases: BaseDataset
COMP6 is a benchmark suite consisting of broad regions of biochemical and organic space developed for testing the
ANI-1x potential. It is curated from 6 benchmark sets: S66x8, ANI-MD, GDB7to9, GDB10to13, DrugBank, and
Tripeptides. Energies and forces for all non-equilibrium molecular conformations are calculated using
the wB97x density functional with the 6-31G(d) basis set. The dataset also includes Hirshfeld charges and
molecular dipoles.
Details of the benchmark sets are as follows:
S66x8: Consists of 66 dimeric systems involving hydrogen bonding, pi-pi stacking, London interactions and
mixed influence interactions.
ANI Molecular Dynamics (ANI-MD): Forces from the ANI-1x potential are used to run 1 ns vacuum molecular
dynamics simulations of 14 well-known drug molecules and 2 small proteins, with a 0.25 fs time step at 300 K
using the Langevin thermostat. A random subsample of 128 frames from each 1 ns trajectory is selected, and
reference DFT single-point calculations are performed to obtain energies and forces.
GDB7to9: Consists of 1500 molecules, 500 each with 7, 8 and 9 heavy atoms, subsampled from the GDB-11 dataset.
The initial structures are randomly embedded into 3D space using RDKit and are optimized with tight convergence
criteria. Normal modes/force constants are computed using the reference DFT model. Finally, diverse normal
mode sampling (DNMS) is carried out to generate non-equilibrium conformations.
GDB10to13: Consists of 3000 molecules: 500 molecules each with 10 and 11 heavy atoms are subsampled from GDB-11,
and 1000 molecules each with 12 and 13 heavy atoms are subsampled from GDB-13. Non-equilibrium conformations are
generated via DNMS.
Tripeptide: Consists of 248 random tripeptides. Structures are optimized in the same way as for GDB7to9.
DrugBank: Consists of 837 molecules subsampled from the original DrugBank database of real drug molecules.
Structures are optimized in the same way as for GDB7to9.
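The figures quoted above for GDB7to9, GDB10to13 and ANI-MD can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not part of openqdc: the helper `subsample_frames` and the seed are hypothetical, and it assumes that every integration step of a trajectory is available as a frame, which the description does not state.

```python
import random

# Molecule counts per heavy-atom count, as quoted in the descriptions above.
gdb7to9 = {7: 500, 8: 500, 9: 500}
gdb10to13 = {10: 500, 11: 500, 12: 1000, 13: 1000}
total_gdb7to9 = sum(gdb7to9.values())      # 1500
total_gdb10to13 = sum(gdb10to13.values())  # 3000

# ANI-MD: a 1 ns trajectory with a 0.25 fs time step.
TRAJECTORY_LENGTH_FS = 1_000_000  # 1 ns expressed in femtoseconds
TIME_STEP_FS = 0.25
n_steps = round(TRAJECTORY_LENGTH_FS / TIME_STEP_FS)  # 4_000_000 steps


def subsample_frames(n_frames, n_samples=128, seed=0):
    """Pick n_samples distinct frame indices uniformly at random.

    Hypothetical helper: the original sampling procedure is only
    described as "a random subsample of 128 frames".
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_frames), n_samples))


frame_indices = subsample_frames(n_steps)
```

Sampling without replacement keeps the 128 selected frames distinct, so each one would need exactly one reference DFT single-point calculation.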
Usage:
from openqdc.datasets import COMP6
dataset = COMP6()
References
https://aip.scitation.org/doi/abs/10.1063/1.5023802
https://github.com/isayev/COMP6
S66x8: https://pubs.rsc.org/en/content/articlehtml/2016/cp/c6cp00688d
GDB-11: https://pubmed.ncbi.nlm.nih.gov/15674983/
GDB-13: https://pubmed.ncbi.nlm.nih.gov/19505099/
DrugBank: https://pubs.acs.org/doi/10.1021/ja902302h
Source code in openqdc/datasets/potential/comp6.py
class COMP6(BaseDataset):
    """
    COMP6 is a benchmark suite consisting of broad regions of biochemical and organic space developed for testing the
    ANI-1x potential. It is curated from 6 benchmark sets: S66x8, ANI-MD, GDB7to9, GDB10to13, DrugBank, and
    Tripeptides. Energies and forces for all non-equilibrium molecular conformations are calculated using
    the wB97x density functional with the 6-31G(d) basis set. The dataset also includes Hirshfeld charges and
    molecular dipoles.

    Details of the benchmark sets are as follows:
        S66x8: Consists of 66 dimeric systems involving hydrogen bonding, pi-pi stacking, London interactions and
    mixed influence interactions.\n
        ANI Molecular Dynamics (ANI-MD): Forces from the ANI-1x potential are used to run 1 ns vacuum molecular
    dynamics simulations of 14 well-known drug molecules and 2 small proteins, with a 0.25 fs time step at 300 K
    using the Langevin thermostat. A random subsample of 128 frames from each 1 ns trajectory is selected, and
    reference DFT single-point calculations are performed to obtain energies and forces.\n
        GDB7to9: Consists of 1500 molecules, 500 each with 7, 8 and 9 heavy atoms, subsampled from the GDB-11 dataset.
    The initial structures are randomly embedded into 3D space using RDKit and are optimized with tight convergence
    criteria. Normal modes/force constants are computed using the reference DFT model. Finally, diverse normal
    mode sampling (DNMS) is carried out to generate non-equilibrium conformations.\n
        GDB10to13: Consists of 3000 molecules: 500 molecules each with 10 and 11 heavy atoms are subsampled from GDB-11,
    and 1000 molecules each with 12 and 13 heavy atoms are subsampled from GDB-13. Non-equilibrium conformations are
    generated via DNMS.\n
        Tripeptide: Consists of 248 random tripeptides. Structures are optimized in the same way as for GDB7to9.\n
        DrugBank: Consists of 837 molecules subsampled from the original DrugBank database of real drug molecules.
    Structures are optimized in the same way as for GDB7to9.

    Usage:
    ```python
    from openqdc.datasets import COMP6
    dataset = COMP6()
    ```

    References:
        https://aip.scitation.org/doi/abs/10.1063/1.5023802\n
        https://github.com/isayev/COMP6\n
        S66x8: https://pubs.rsc.org/en/content/articlehtml/2016/cp/c6cp00688d\n
        GDB-11: https://pubmed.ncbi.nlm.nih.gov/15674983/\n
        GDB-13: https://pubmed.ncbi.nlm.nih.gov/19505099/\n
        DrugBank: https://pubs.acs.org/doi/10.1021/ja902302h
    """

    __name__ = "comp6"

    # watch out: forces are stored as -grad(E)
    __energy_unit__ = "kcal/mol"
    __distance_unit__ = "ang"  # angstrom
    __forces_unit__ = "kcal/mol/ang"

    __energy_methods__ = [
        PotentialMethod.WB97X_6_31G_D,  # "wb97x/6-31g*",
        PotentialMethod.B3LYP_D3_BJ_DEF2_TZVP,  # "b3lyp-d3(bj)/def2-tzvp",
        PotentialMethod.B3LYP_DEF2_TZVP,  # "b3lyp/def2-tzvp",
        PotentialMethod.HF_DEF2_TZVP,  # "hf/def2-tzvp",
        PotentialMethod.PBE_D3_BJ_DEF2_TZVP,  # "pbe-d3(bj)/def2-tzvp",
        PotentialMethod.PBE_DEF2_TZVP,  # "pbe/def2-tzvp",
        PotentialMethod.SVWN_DEF2_TZVP,  # "svwn/def2-tzvp",
    ]

    energy_target_names = [
        "Energy",
        "B3LYP-D3M(BJ):def2-tzvp",
        "B3LYP:def2-tzvp",
        "HF:def2-tzvp",
        "PBE-D3M(BJ):def2-tzvp",
        "PBE:def2-tzvp",
        "SVWN:def2-tzvp",
    ]

    __force_mask__ = [True, False, False, False, False, False, False]

    force_target_names = [
        "Gradient",
    ]

    def __smiles_converter__(self, x):
        """Utility function to convert a string to SMILES: useful if the SMILES is
        encoded in a different format than its display format
        """
        return "-".join(x.decode("ascii").split("_")[:-1])

    def read_raw_entries(self):
        samples = []
        for subset in ["ani_md", "drugbank", "gdb7_9", "gdb10_13", "s66x8", "tripeptides"]:
            raw_path = p_join(self.root, f"{subset}.h5.gz")
            samples += read_qc_archive_h5(raw_path, subset, self.energy_target_names, self.force_target_names)
        return samples
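The `__force_mask__` list in the class above has one entry per item in `energy_target_names`, and the single `True` lines up with the single entry in `force_target_names`. A minimal sketch of that pairing follows, assuming the mask marks which energy columns have forces available; the variable names outside the two lists copied from the class are illustrative, not openqdc API.

```python
# Values copied from the COMP6 class attributes shown above.
energy_target_names = [
    "Energy",
    "B3LYP-D3M(BJ):def2-tzvp",
    "B3LYP:def2-tzvp",
    "HF:def2-tzvp",
    "PBE-D3M(BJ):def2-tzvp",
    "PBE:def2-tzvp",
    "SVWN:def2-tzvp",
]
force_mask = [True, False, False, False, False, False, False]
force_target_names = ["Gradient"]

# Assumption: a True entry means the energy column at the same
# position has a matching force (gradient) column.
energies_with_forces = [
    name for name, has_forces in zip(energy_target_names, force_mask) if has_forces
]

# Pair each such energy column with its force column, in order.
energy_to_force = dict(zip(energies_with_forces, force_target_names))
```

Under that reading, only the first energy column (the wB97x/6-31G(d) reference) carries forces, which matches the statement that reference forces are computed at that level of theory.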
__smiles_converter__(x)
Utility function to convert a string to SMILES: useful if the SMILES is
encoded in a different format than its display format.
Source code in openqdc/datasets/potential/comp6.py
def __smiles_converter__(self, x):
    """util function to convert string to smiles: useful if the smiles is
    encoded in a different format than its display format
    """
    return "-".join(x.decode("ascii").split("_")[:-1])
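To make the behaviour of the one-line converter concrete, the sketch below reimplements it as a standalone function and applies it to a few byte strings. The function name `smiles_converter` and the input keys are made up for illustration; the actual key format stored in the COMP6 archives is not shown here.

```python
def smiles_converter(x: bytes) -> str:
    """Standalone copy of the method body shown above.

    Decodes the bytes as ASCII, splits on underscores, drops the
    last field, and joins the remaining fields with hyphens.
    """
    return "-".join(x.decode("ascii").split("_")[:-1])


# Hypothetical keys, chosen only to exercise the string handling.
two_fields = smiles_converter(b"C4H10_N2_17")   # last field "17" is dropped
one_field = smiles_converter(b"water_0")        # a single field remains
no_separator = smiles_converter(b"benzene")     # nothing left after the drop
```

Note the edge case: an input with no underscore has only one field, so dropping the last field leaves an empty string.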