GEOM

Bases: BaseDataset

The Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules from QM9 and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry. For each molecule, the initial structure is generated with RDKit and optimized with the GFN2-xTB method, and the lowest-energy conformer is passed to the CREST software, which uses metadynamics to explore the molecule's conformational space. Energies in the dataset are computed with the semi-empirical GFN2-xTB method.
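Because each molecule comes with many conformers, it is often convenient to express their energies relative to the lowest one. The following is a minimal sketch, assuming the energies are available as plain floats in hartree (the unit declared by `__energy_unit__` in the class). The function name and the numeric values are illustrative only and are not part of openqdc or taken from GEOM.

```python
# 1 hartree expressed in kcal/mol (standard conversion factor).
HARTREE_TO_KCAL_PER_MOL = 627.509474


def relative_energies_kcal(energies_hartree):
    """Return each energy relative to the lowest one, converted to kcal/mol.

    Hypothetical helper for illustration; not an openqdc function.
    """
    lowest = min(energies_hartree)
    return [(e - lowest) * HARTREE_TO_KCAL_PER_MOL for e in energies_hartree]


# Made-up conformer energies for a single molecule, in hartree.
example_energies = [-10.002, -10.000, -10.001]
rel = relative_energies_kcal(example_energies)
```

The lowest-energy conformer maps to zero and every other conformer to a positive offset, which makes ensembles of different molecules easier to compare.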

Usage:

from openqdc.datasets import GEOM
dataset = GEOM()

References

https://www.nature.com/articles/s41597-022-01288-4

https://github.com/learningmatter-mit/geom

CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d

Source code in openqdc/datasets/potential/geom.py

class GEOM(BaseDataset):
    """
    The Geometric Ensemble Of Molecules (GEOM) dataset contains 37 million conformers for 133,000 molecules
    from QM9 and 317,000 molecules with experimental data related to biophysics, physiology, and physical chemistry.
    For each molecule, the initial structure is generated with RDKit and optimized with the GFN2-xTB method, and
    the lowest-energy conformer is passed to the CREST software, which uses metadynamics to explore the
    molecule's conformational space. Energies in the dataset are computed with the semi-empirical GFN2-xTB method.

    Usage:
    ```python
    from openqdc.datasets import GEOM
    dataset = GEOM()
    ```

    References:
        https://www.nature.com/articles/s41597-022-01288-4\n
        https://github.com/learningmatter-mit/geom\n
        CREST Software: https://pubs.rsc.org/en/content/articlelanding/2020/cp/c9cp06869d
    """

    __name__ = "geom"
    __energy_methods__ = [PotentialMethod.GFN2_XTB]

    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"

    energy_target_names = ["gfn2_xtb.energy"]
    force_target_names = []

    partitions = ["qm9", "drugs"]
    __links__ = {"rdkit_folder.tar.gz": "https://dataverse.harvard.edu/api/access/datafile/4327252"}

    def _read_raw_(self, partition):
        raw_path = p_join(self.root, "rdkit_folder")

        mols = load_json(p_join(raw_path, f"summary_{partition}.json"))
        mols = list(mols.items())

        fn = lambda x: read_mol(x[0], x[1], raw_path, partition)  # noqa E731
        samples = dm.parallelized(fn, mols, n_jobs=1, progress=True)  # don't use more than 1 job
        return samples

    def read_raw_entries(self):
        samples = sum([self._read_raw_(partition) for partition in self.partitions], [])
        return samples
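
The reading pattern above (load one `summary_{partition}.json` per partition, build one record per entry, then concatenate the per-partition lists) can be sketched in a self-contained form. This is only an illustration under stated assumptions: `read_partition` and `read_all` are hypothetical stand-ins for `_read_raw_` and `read_raw_entries`, `p_join` is assumed to be `os.path.join`, the summary contents are made up, and `read_mol` is replaced by a trivial record builder. `itertools.chain.from_iterable` is used in place of `sum(lists, [])`; it gives the same ordered result without repeatedly copying the accumulated list.

```python
import json
import tempfile
from itertools import chain
from os.path import join as p_join  # assumption: p_join is os.path.join

PARTITIONS = ["qm9", "drugs"]  # same partition names as in the class above


def read_partition(raw_path, partition):
    """Hypothetical stand-in for GEOM._read_raw_: one record per summary entry."""
    with open(p_join(raw_path, f"summary_{partition}.json")) as fh:
        summary = json.load(fh)
    # The real method passes each (key, value) pair to read_mol; here each
    # entry is simply tagged with the partition it came from.
    return [
        {"key": key, "partition": partition, "meta": meta}
        for key, meta in summary.items()
    ]


def read_all(raw_path, partitions=PARTITIONS):
    """Hypothetical stand-in for read_raw_entries: flatten all partitions."""
    per_partition = (read_partition(raw_path, p) for p in partitions)
    return list(chain.from_iterable(per_partition))


with tempfile.TemporaryDirectory() as tmp:
    # Made-up summary files; the real files come from rdkit_folder.tar.gz.
    fake_summaries = {"qm9": {"C": {}, "CC": {}}, "drugs": {"CCO": {}}}
    for partition, content in fake_summaries.items():
        with open(p_join(tmp, f"summary_{partition}.json"), "w") as fh:
            json.dump(content, fh)
    samples = read_all(tmp)
```

Records keep the partition order given in `PARTITIONS`, so entries from `qm9` precede those from `drugs`, matching the order in which `read_raw_entries` iterates over `self.partitions`.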