Orbnet Denali

OrbnetDenali

Bases: BaseDataset

Orbnet Denali is a collection of 2.3 million conformers from 212,905 unique molecules. The molecules cover a range of organic species, including protonation and tautomeric states, non-covalent interactions, common salts, and counterions, and span the most common elements in bio and organic chemistry. Geometries are generated in two steps. First, four energy-minimized conformations are generated for each molecule using the ENTOS BREEZE conformer generator. Second, starting from these four conformers, non-equilibrium geometries are generated at the GFN1-xTB level of theory, using either normal mode sampling at 300 K or ab initio molecular dynamics (AIMD) for 200 fs at 500 K. Energies are calculated with the DFT method wB97X-D3/def2-TZVP and the semi-empirical method GFN1-xTB.

Usage:

from openqdc.datasets import OrbnetDenali
dataset = OrbnetDenali()

References

https://arxiv.org/abs/2107.00299

https://figshare.com/articles/dataset/OrbNet_Denali_Training_Data/14883867

Source code in openqdc/datasets/potential/orbnet_denali.py
class OrbnetDenali(BaseDataset):
    """
    Orbnet Denali is a collection of 2.3 million conformers from 212,905 unique molecules. The molecules cover a
    range of organic species, including protonation and tautomeric states, non-covalent interactions, common salts,
    and counterions, and span the most common elements in bio and organic chemistry. Geometries are generated in two
    steps. First, four energy-minimized conformations are generated for each molecule using the ENTOS BREEZE
    conformer generator. Second, starting from these four conformers, non-equilibrium geometries are generated at the
    GFN1-xTB level of theory, using either normal mode sampling at 300 K or ab initio molecular dynamics (AIMD) for
    200 fs at 500 K. Energies are calculated with the DFT method wB97X-D3/def2-TZVP and the semi-empirical method
    GFN1-xTB.

    Usage:
    ```python
    from openqdc.datasets import OrbnetDenali
    dataset = OrbnetDenali()
    ```

    References:
        https://arxiv.org/abs/2107.00299\n
        https://figshare.com/articles/dataset/OrbNet_Denali_Training_Data/14883867
    """

    __name__ = "orbnet_denali"
    __energy_methods__ = [
        PotentialMethod.WB97X_D3_DEF2_TZVP,
        PotentialMethod.GFN1_XTB,
    ]  # ["wb97x-d3/def2-tzvp", "gfn1_xtb"]
    energy_target_names = ["dft_energy", "xtb1_energy"]
    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __links__ = {
        "orbnet_denali.tar.gz": "https://figshare.com/ndownloader/files/28672287",
        "orbnet_denali_targets.tar.gz": "https://figshare.com/ndownloader/files/28672248",
    }

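    def read_raw_entries(self):
        # Load only the label columns needed to build the per-conformer energy targets.
        label_path = p_join(self.root, "denali_labels.csv")
        df = pd.read_csv(label_path, usecols=["sample_id", "mol_id", "subset", "dft_energy", "xtb1_energy"])
        # Map each mol_id to {sample_id: {remaining columns}}, keeping the first row per sample_id.
        labels = {
            mol_id: group.drop(["mol_id"], axis=1).drop_duplicates("sample_id").set_index("sample_id").to_dict("index")
            for mol_id, group in df.groupby("mol_id")
        }

        # Read each molecule's archive in a thread pool; each call yields a list of samples.
        fn = lambda x: read_archive(x[0], x[1], self.root, self.energy_target_names)
        res = dm.parallelized(fn, list(labels.items()), scheduler="threads", n_jobs=-1, progress=True)
        # Flatten the per-molecule lists into a single list of samples.
        samples = sum(res, [])
        return samples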
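
The nested `labels` mapping built in `read_raw_entries` can be hard to picture from the one-line pandas chain. The following is a minimal sketch, using only the standard library, of the shape it appears to produce and of the final flattening step. The CSV contents, the sample and molecule identifiers, and the `fake_read_archive` helper are all hypothetical and exist only for illustration; this is not the library's implementation.

```python
import csv
import io
from itertools import chain

# Hypothetical stand-in for denali_labels.csv; identifiers and energies are made up.
TOY_CSV = """sample_id,mol_id,subset,dft_energy,xtb1_energy
m1_c0,m1,train,-1.0,-0.9
m1_c1,m1,train,-1.1,-1.0
m1_c0,m1,train,-9.9,-9.9
m2_c0,m2,test,-2.0,-1.8
"""


def group_labels(csv_text):
    """Return {mol_id: {sample_id: {column: value}}}, keeping the first row per sample_id."""
    labels = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_mol = labels.setdefault(row["mol_id"], {})
        if row["sample_id"] in per_mol:
            # Mirrors drop_duplicates("sample_id"), which keeps the first occurrence.
            continue
        per_mol[row["sample_id"]] = {
            "subset": row["subset"],
            "dft_energy": float(row["dft_energy"]),
            "xtb1_energy": float(row["xtb1_energy"]),
        }
    return labels


def fake_read_archive(mol_id, conformers):
    """Hypothetical placeholder for read_archive: one record per conformer."""
    return [{"mol_id": mol_id, "sample_id": sid, **vals} for sid, vals in conformers.items()]


labels = group_labels(TOY_CSV)
per_molecule = [fake_read_archive(mol_id, confs) for mol_id, confs in labels.items()]

# Flatten the list of per-molecule lists; same result as sum(per_molecule, []),
# but without building a new intermediate list at every step.
samples = list(chain.from_iterable(per_molecule))
```

In this toy input the repeated `m1_c0` row is dropped, so `samples` holds three records. For a dataset with hundreds of thousands of molecules, `chain.from_iterable` may be a cheaper way to flatten than `sum(res, [])`, since the latter copies the accumulated list on every addition.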