Bases: BaseDataset
The Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules
from QM9 and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry.
For each molecule, the initial structure is generated with RDKit and optimized with GFN2-xTB, and
the lowest-energy conformer is passed to CREST. CREST uses metadynamics to explore the
conformational space of each molecule. Energies in the dataset are computed with the semi-empirical GFN2-xTB method.
Usage:
from openqdc.datasets import GEOM
dataset = GEOM()
References
https://www.nature.com/articles/s41597-022-01288-4
https://github.com/learningmatter-mit/geom
CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d
Source code in openqdc/datasets/potential/geom.py
class GEOM(BaseDataset):
    """
    Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules
    from QM9, and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry.
    For each molecule, the initial structure is generated with RDKit, optimized with the GFN2-xTB energy method and
    the lowest energy conformer is fed to the CREST software. CREST software uses metadynamics for exploring the
    conformational space for each molecule. Energies in the dataset are computed using semi-empirical method GFN2-xTB.

    Usage:
    ```python
    from openqdc.datasets import GEOM
    dataset = GEOM()
    ```

    References:
        https://www.nature.com/articles/s41597-022-01288-4\n
        https://github.com/learningmatter-mit/geom\n
        CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d
    """

    __name__ = "geom"
    __energy_methods__ = [PotentialMethod.GFN2_XTB]

    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"

    energy_target_names = ["gfn2_xtb.energy"]
    force_target_names = []

    partitions = ["qm9", "drugs"]
    __links__ = {"rdkit_folder.tar.gz": "https://dataverse.harvard.edu/api/access/datafile/4327252"}

    def _read_raw_(self, partition):
        raw_path = p_join(self.root, "rdkit_folder")

        # Each partition has its own summary file listing its molecules.
        mols = load_json(p_join(raw_path, f"summary_{partition}.json"))
        mols = list(mols.items())

        fn = lambda x: read_mol(x[0], x[1], raw_path, partition)  # noqa E731
        samples = dm.parallelized(fn, mols, n_jobs=1, progress=True)  # don't use more than 1 job
        return samples

    def read_raw_entries(self):
        # Concatenate the per-partition sample lists into one flat list.
        samples = sum([self._read_raw_(partition) for partition in self.partitions], [])
        return samples
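The reading flow in the listing above (load one summary file per partition, build one sample per entry, then join the per-partition lists) can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the library's implementation: `read_mol_stub`, `read_partition`, and `read_all` are hypothetical names, the JSON contents are made up, and the summary keys are assumed to be molecule identifiers.

```python
import json
import tempfile
from pathlib import Path

PARTITIONS = ["qm9", "drugs"]  # same partition names as in the class above


def read_mol_stub(key, summary_entry, partition):
    """Hypothetical stand-in for read_mol: returns one record per summary entry."""
    return {"key": key, "partition": partition, "meta": summary_entry}


def read_partition(raw_path, partition):
    """Load summary_<partition>.json and build one record per (key, value) item."""
    with open(Path(raw_path) / f"summary_{partition}.json") as fh:
        summary = json.load(fh)
    return [read_mol_stub(k, v, partition) for k, v in summary.items()]


def read_all(raw_path, partitions=PARTITIONS):
    """Join the per-partition lists; sum(list_of_lists, []) concatenates them."""
    return sum([read_partition(raw_path, p) for p in partitions], [])


# Tiny illustrative directory; the file contents below are invented for the demo.
with tempfile.TemporaryDirectory() as tmp:
    demo = {"qm9": {"C": {}, "CC": {}}, "drugs": {"CCO": {}}}
    for name, content in demo.items():
        (Path(tmp) / f"summary_{name}.json").write_text(json.dumps(content))
    samples = read_all(tmp)

print(len(samples))
print([s["partition"] for s in samples])
```

Starting `sum` from an empty list concatenates the lists in partition order. With many partitions, `itertools.chain.from_iterable` would avoid the repeated list copies, but for two partitions the difference is negligible.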