Bases: BaseDataset
The Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules
from QM9 and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry.
For each molecule, the initial structure is generated with RDKit and optimized with the GFN2-xTB method, and
the lowest-energy conformer is then passed to CREST, which uses metadynamics to explore the molecule's
conformational space. Energies in the dataset are computed with the semi-empirical GFN2-xTB method.
Usage:
from openqdc.datasets import GEOM
dataset = GEOM()
References:
https://www.nature.com/articles/s41597-022-01288-4
https://github.com/learningmatter-mit/geom
CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d
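
The source listing for this class declares energies in hartree and distances in ångström. As a minimal sketch for readers who prefer kcal/mol, the conversion can be done by hand as shown here. The helper name `hartree_to_kcal_per_mol` is hypothetical and not part of openqdc, and the conversion factor is the standard physical constant rather than a value taken from the library.

```python
# Hypothetical helper, not part of openqdc: convert energies from hartree to kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5094740631  # standard conversion factor


def hartree_to_kcal_per_mol(energy_hartree: float) -> float:
    """Convert a single energy value from hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


# Example: a 0.001 hartree gap between two conformers, expressed in kcal/mol.
print(round(hartree_to_kcal_per_mol(0.001), 4))
```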
 
              
Source code in openqdc/datasets/potential/geom.py

class GEOM(BaseDataset):
    """
    The Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules
    from QM9 and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry.
    For each molecule, the initial structure is generated with RDKit and optimized with the GFN2-xTB method, and
    the lowest-energy conformer is then passed to CREST, which uses metadynamics to explore the molecule's
    conformational space. Energies in the dataset are computed with the semi-empirical GFN2-xTB method.
    Usage:
    ```python
    from openqdc.datasets import GEOM
    dataset = GEOM()
    ```
    References:
        https://www.nature.com/articles/s41597-022-01288-4\n
        https://github.com/learningmatter-mit/geom\n
        CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d
    """
    __name__ = "geom"
    # Energies come from a single level of theory (GFN2-xTB); no force targets are listed.
    __energy_methods__ = [PotentialMethod.GFN2_XTB]
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    energy_target_names = ["gfn2_xtb.energy"]
    force_target_names = []

    partitions = ["qm9", "drugs"]
    __links__ = {"rdkit_folder.tar.gz": "https://dataverse.harvard.edu/api/access/datafile/4327252"}

    def _read_raw_(self, partition):
        # Load the partition's summary JSON and pass each (key, value) item to read_mol.
        raw_path = p_join(self.root, "rdkit_folder")
        mols = load_json(p_join(raw_path, f"summary_{partition}.json"))
        mols = list(mols.items())
        fn = lambda x: read_mol(x[0], x[1], raw_path, partition)  # noqa E731
        samples = dm.parallelized(fn, mols, n_jobs=1, progress=True)  # don't use more than 1 job
        return samples

    def read_raw_entries(self):
        # Concatenate the per-partition sample lists into one flat list.
        samples = sum([self._read_raw_(partition) for partition in self.partitions], [])
        return samples
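
`read_raw_entries` merges the per-partition lists with `sum(..., [])`. The sketch below reproduces that pattern on stand-in data and compares it with `itertools.chain.from_iterable`, which yields the same flat list. The `fake_read_raw` function and its records are hypothetical placeholders for `_read_raw_`, not openqdc code.

```python
from itertools import chain

PARTITIONS = ["qm9", "drugs"]  # same partition names as in the GEOM class


def fake_read_raw(partition: str) -> list[dict]:
    """Hypothetical stand-in for GEOM._read_raw_: two dummy records per partition."""
    return [{"subset": partition, "index": i} for i in range(2)]


# Pattern used in read_raw_entries: list concatenation via sum() with an empty-list start.
flat_sum = sum([fake_read_raw(p) for p in PARTITIONS], [])

# Alternative: chain the per-partition lists lazily, then materialize once.
flat_chain = list(chain.from_iterable(fake_read_raw(p) for p in PARTITIONS))

# Both approaches preserve partition order and produce identical results.
assert flat_sum == flat_chain
print(len(flat_sum), [record["subset"] for record in flat_sum])
```

With only two partitions the difference is negligible; `chain` mainly matters when many lists are merged, because `sum` copies the accumulated list at every step.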