Bases: BaseDataset
Gradient Domain Machine Learning (GDML) is a dataset consisting of samples from ab initio
molecular dynamics (AIMD) trajectories at a resolution of 0.5 fs. The dataset consists of Benzene
(627000 conformations), Uracil (133000 conformations), Naphthalene (326000 conformations), Aspirin
(211000 conformations), Salicylic Acid (320000 conformations), Malonaldehyde (993000 conformations),
Ethanol (555000 conformations) and Toluene (100000 conformations). Energy and force labels for
each conformation are computed using the PBE + vdW-TS electronic structure method.
The dataset consists of the following trajectories:
Benzene: 627000 samples
Uracil: 133000 samples
Naphthalene: 326000 samples
Aspirin: 211000 samples
Salicylic Acid: 320000 samples
Malonaldehyde: 993000 samples
Ethanol: 555000 samples
Toluene: 100000 samples
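The sample counts above can be turned into approximate trajectory lengths with a little arithmetic. The following is a minimal sketch, not part of openqdc: it assumes that consecutive samples are spaced 0.5 fs apart (one reading of the stated resolution), and the names `TRAJECTORY_SAMPLES`, `TIMESTEP_FS` and `trajectory_length_ps` are hypothetical helpers introduced only for illustration.

```python
# Sample counts copied from the trajectory list above.
TRAJECTORY_SAMPLES = {
    "Benzene": 627_000,
    "Uracil": 133_000,
    "Naphthalene": 326_000,
    "Aspirin": 211_000,
    "Salicylic Acid": 320_000,
    "Malonaldehyde": 993_000,
    "Ethanol": 555_000,
    "Toluene": 100_000,
}

# Assumption: consecutive samples are 0.5 fs apart.
TIMESTEP_FS = 0.5


def trajectory_length_ps(n_samples: int, timestep_fs: float = TIMESTEP_FS) -> float:
    """Approximate simulated time in picoseconds covered by n_samples frames.

    Uses n_samples * timestep, ignoring the off-by-one between frames and intervals.
    """
    return n_samples * timestep_fs / 1000.0


total_samples = sum(TRAJECTORY_SAMPLES.values())

for name, n in TRAJECTORY_SAMPLES.items():
    share = 100.0 * n / total_samples  # percentage of all conformations
    print(f"{name:<15} {n:>8} samples  {trajectory_length_ps(n):7.1f} ps  {share:5.1f}%")
print(f"{'Total':<15} {total_samples:>8} samples")
```

Under that spacing assumption, the molecules differ by roughly a factor of ten in simulated time, which may matter when sampling training data evenly across them.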
Usage:
from openqdc.datasets import GDML
dataset = GDML()
References
https://www.science.org/doi/10.1126/sciadv.1603015
http://www.sgdml.org/#datasets
Source code in openqdc/datasets/potential/gdml.py
class GDML(BaseDataset):
    """
    Gradient Domain Machine Learning (GDML) is a dataset consisting of samples from ab initio
    molecular dynamics (AIMD) trajectories at a resolution of 0.5 fs. The dataset consists of Benzene
    (627000 conformations), Uracil (133000 conformations), Naphthalene (326000 conformations), Aspirin
    (211000 conformations), Salicylic Acid (320000 conformations), Malonaldehyde (993000 conformations),
    Ethanol (555000 conformations) and Toluene (100000 conformations). Energy and force labels for
    each conformation are computed using the PBE + vdW-TS electronic structure method.

    The dataset consists of the following trajectories:
        Benzene: 627000 samples\n
        Uracil: 133000 samples\n
        Naphthalene: 326000 samples\n
        Aspirin: 211000 samples\n
        Salicylic Acid: 320000 samples\n
        Malonaldehyde: 993000 samples\n
        Ethanol: 555000 samples\n
        Toluene: 100000 samples\n

    Usage:
    ```python
    from openqdc.datasets import GDML
    dataset = GDML()
    ```

    References:
        https://www.science.org/doi/10.1126/sciadv.1603015
        http://www.sgdml.org/#datasets
    """

    __name__ = "gdml"

    __energy_methods__ = [
        PotentialMethod.CCSD_CC_PVDZ,  # "ccsd/cc-pvdz",
        PotentialMethod.CCSD_T_CC_PVDZ,  # "ccsd(t)/cc-pvdz",
        # TODO: verify if basis set vdw-ts == def2-tzvp and
        # it is the same in ISO17 and revmd17
        PotentialMethod.PBE_DEF2_TZVP,  # "pbe/def2-tzvp",  # MD17
    ]

    energy_target_names = [
        "CCSD Energy",
        "CCSD(T) Energy",
        "PBE-TS Energy",
    ]

    # One flag per energy method: all three provide force labels.
    __force_mask__ = [True, True, True]

    force_target_names = [
        "CCSD Gradient",
        "CCSD(T) Gradient",
        "PBE-TS Gradient",
    ]

    __energy_unit__ = "kcal/mol"
    __distance_unit__ = "ang"
    __forces_unit__ = "kcal/mol/ang"

    # NOTE: none of these archive names matches "gdml.h5.gz", the file read in read_raw_entries.
    __links__ = {
        "gdb7_9.hdf5.gz": "https://zenodo.org/record/3588361/files/208.hdf5.gz",
        "gdb10_13.hdf5.gz": "https://zenodo.org/record/3588364/files/209.hdf5.gz",
        "drugbank.hdf5.gz": "https://zenodo.org/record/3588361/files/207.hdf5.gz",
        "tripeptides.hdf5.gz": "https://zenodo.org/record/3588368/files/211.hdf5.gz",
        "ani_md.hdf5.gz": "https://zenodo.org/record/3588341/files/205.hdf5.gz",
        "s66x8.hdf5.gz": "https://zenodo.org/record/3588367/files/210.hdf5.gz",
    }

    def read_raw_entries(self):
        # Read the samples from the raw gdml.h5.gz archive located under self.root.
        raw_path = p_join(self.root, "gdml.h5.gz")
        samples = read_qc_archive_h5(raw_path, "gdml", self.energy_target_names, self.force_target_names)
        return samples
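The class declares its labels in kcal/mol (energies), ang (distances) and kcal/mol/ang (forces). When comparing against values reported in eV, a conversion along the following lines may help. This is a standalone sketch that does not use any openqdc conversion utilities (the library may provide its own); the constant and function names are hypothetical, and the numeric inputs in the usage lines are made-up placeholders rather than values from the dataset.

```python
# Exact SI defining constants (2019 redefinition) and the thermochemical calorie.
AVOGADRO = 6.02214076e23             # particles per mole
ELEMENTARY_CHARGE = 1.602176634e-19  # joules per electronvolt
JOULE_PER_KCAL = 4184.0              # joules per thermochemical kilocalorie

# 1 kcal/mol expressed in eV per particle (about 0.0433641).
KCAL_PER_MOL_TO_EV = JOULE_PER_KCAL / AVOGADRO / ELEMENTARY_CHARGE


def energies_to_ev(energies_kcal_per_mol):
    """Convert a sequence of energies from kcal/mol to eV."""
    return [e * KCAL_PER_MOL_TO_EV for e in energies_kcal_per_mol]


def forces_to_ev_per_ang(forces_kcal_per_mol_ang):
    """Convert per-atom force vectors from kcal/mol/ang to eV/ang.

    The length unit is unchanged, so the same scalar factor applies.
    """
    return [[c * KCAL_PER_MOL_TO_EV for c in vec] for vec in forces_kcal_per_mol_ang]


# Placeholder inputs for illustration only (not taken from GDML).
example_energies = [-100.0, 23.060548]
example_forces = [[1.0, 0.0, -2.0], [0.5, 0.5, 0.5]]

print(energies_to_ev(example_energies))
print(forces_to_ev_per_ang(example_forces))
```

Deriving the factor from the defining constants, rather than hard-coding a rounded number, keeps the conversion traceable to its definition.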