Qmugs

QMugs

Bases: BaseDataset

The QMugs dataset contains 2 million conformers for 665k biologically and pharmacologically relevant molecules extracted from the ChEMBL database. Three geometries per molecule are generated and optimized with the GFN2-xTB method. For each optimized geometry, atomic and molecular properties are calculated with both a semi-empirical method (GFN2-xTB) and a DFT method (ωB97X-D/def2-SVP).

Usage:

from openqdc.datasets import QMugs
dataset = QMugs()
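
The class declares energies in hartree, distances in ångström (`ang`), and forces in hartree/ang. As a minimal sketch, assuming you want to convert labels by hand, the snippet below applies the CODATA 2018 conversion factors. The function names are hypothetical and are not part of openqdc.

```python
# CODATA 2018 conversion factors (not taken from openqdc).
HARTREE_TO_EV = 27.211386245988
HARTREE_TO_KCAL_PER_MOL = 627.5094740631


def hartree_to_ev(energy_hartree: float) -> float:
    """Convert an energy from hartree to electronvolts."""
    return energy_hartree * HARTREE_TO_EV


def hartree_to_kcal_per_mol(energy_hartree: float) -> float:
    """Convert an energy from hartree to kcal/mol."""
    return energy_hartree * HARTREE_TO_KCAL_PER_MOL


def force_hartree_per_ang_to_ev_per_ang(force: float) -> float:
    """Convert a force component from hartree/ang to eV/ang.

    Only the energy part of the unit changes, so the factor is the
    same as for energies.
    """
    return force * HARTREE_TO_EV
```

Because distances are already in ångström, only the energy factor is needed for forces.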

References

https://arxiv.org/abs/2107.00367

https://www.nature.com/articles/s41597-022-01390-7#ethics

https://www.research-collection.ethz.ch/handle/20.500.11850/482129

Source code in openqdc/datasets/potential/qmugs.py
class QMugs(BaseDataset):
    """
    The QMugs dataset contains 2 million conformers for 665k biologically and pharmacologically relevant molecules
    extracted from the ChEMBL database. Three geometries per molecule are generated and optimized with the GFN2-xTB
    method. For each optimized geometry, atomic and molecular properties are calculated with both a semi-empirical
    method (GFN2-xTB) and a DFT method (ωB97X-D/def2-SVP).

    Usage:
    ```python
    from openqdc.datasets import QMugs
    dataset = QMugs()
    ```

    References:
        https://arxiv.org/abs/2107.00367\n
        https://www.nature.com/articles/s41597-022-01390-7#ethics\n
        https://www.research-collection.ethz.ch/handle/20.500.11850/482129
    """

    __name__ = "qmugs"
    __energy_methods__ = [PotentialMethod.GFN2_XTB, PotentialMethod.WB97X_D_DEF2_SVP]  # "gfn2_xtb", "wb97x-d/def2-svp"
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __links__ = {
        "summary.csv": "https://libdrive.ethz.ch/index.php/s/X5vOBNSITAG5vzM/download?path=%2F&files=summary.csv",
        "structures.tar.gz": "https://libdrive.ethz.ch/index.php/s/X5vOBNSITAG5vzM/download?path=%2F&files=structures.tar.gz",  # noqa
    }

    energy_target_names = [
        "GFN2:TOTAL_ENERGY",
        "DFT:TOTAL_ENERGY",
    ]

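    def read_raw_entries(self):
        # Collect every entry under <root>/structures; each one is passed to read_mol.
        raw_path = p_join(self.root, "structures")
        mol_dirs = [p_join(raw_path, d) for d in os.listdir(raw_path)]

        # Read the entries in parallel with a thread-based scheduler (n_jobs=-1).
        samples = dm.parallelized(read_mol, mol_dirs, n_jobs=-1, progress=True, scheduler="threads")
        return samples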
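
The pattern in `read_raw_entries` (list the entries under a directory, then map a reader over them with threads) can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the openqdc implementation: `ThreadPoolExecutor` stands in for `dm.parallelized`, `read_mol_stub` is a hypothetical replacement for `read_mol` that only counts files, and the directory and file names are made up for the demo.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_mol_stub(mol_dir: str) -> dict:
    """Hypothetical stand-in for read_mol: report the directory name and file count."""
    return {"name": os.path.basename(mol_dir), "n_files": len(os.listdir(mol_dir))}


def read_all(raw_path: str, max_workers: int = 4) -> list:
    """Apply read_mol_stub to every entry under raw_path using a thread pool."""
    # Sorted here only to make the demo deterministic; the original does not sort.
    mol_dirs = [os.path.join(raw_path, d) for d in sorted(os.listdir(raw_path))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as mol_dirs.
        return list(pool.map(read_mol_stub, mol_dirs))


def demo() -> list:
    """Build a small fake 'structures' tree and read it back."""
    with tempfile.TemporaryDirectory() as root:
        for name, n_conf in [("CHEMBL1", 3), ("CHEMBL2", 2)]:
            os.makedirs(os.path.join(root, name))
            for i in range(n_conf):
                # Empty placeholder files; real entries would hold geometries.
                open(os.path.join(root, name, f"conf_{i:02d}.sdf"), "w").close()
        return read_all(root)
```

Threads suit this kind of work when the reader is dominated by file I/O; a CPU-bound parser may benefit more from processes.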

QMugs_V2

Bases: QMugs

QMugs_V2 is an extension of the QMugs dataset containing PM6 labels for each of the 4.2M geometries.

Usage:

from openqdc.datasets import QMugs_V2
dataset = QMugs_V2()

Source code in openqdc/datasets/potential/qmugs.py
class QMugs_V2(QMugs):
    """
    QMugs_V2 is an extension of the QMugs dataset containing PM6 labels for each of the 4.2M geometries.

    Usage:
    ```python
    from openqdc.datasets import QMugs_V2
    dataset = QMugs_V2()
    ```
    """

    __name__ = "qmugs_v2"
    __energy_methods__ = QMugs.__energy_methods__ + [PotentialMethod.PM6]
    energy_target_names = QMugs.energy_target_names + ["PM6"]
    __force_mask__ = QMugs.__force_mask__ + [False]
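
`QMugs_V2` adds the PM6 label by concatenating onto the parent's class-level lists rather than modifying them in place. The sketch below shows why that matters, using hypothetical classes `Base` and `Extended` with plain strings in place of `PotentialMethod` members. The parent's force mask is not shown in the source above, so the `[False, False]` value is a placeholder assumption.

```python
class Base:
    # Stand-ins for the parent's class attributes (strings replace enum members).
    energy_methods = ["gfn2_xtb", "wb97x-d/def2-svp"]
    energy_target_names = ["GFN2:TOTAL_ENERGY", "DFT:TOTAL_ENERGY"]
    force_mask = [False, False]  # placeholder: actual parent value not shown


class Extended(Base):
    # `+` builds new lists, so Base's attributes are left untouched.
    energy_methods = Base.energy_methods + ["pm6"]
    energy_target_names = Base.energy_target_names + ["PM6"]
    force_mask = Base.force_mask + [False]


def is_consistent(cls) -> bool:
    """Check that methods, target names, and the force mask have equal length."""
    return len(cls.energy_methods) == len(cls.energy_target_names) == len(cls.force_mask)
```

Using `Base.energy_methods.append("pm6")` instead would mutate the shared list and silently add the extra entry to the parent class as well.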