NablaDFT

Bases: BaseDataset

NablaDFT is a dataset constructed from a subset of the Molecular Sets (MOSES) dataset, consisting of 1 million molecules with 5,340,152 unique conformations. Conformations for each molecule are generated in two steps. First, a set of conformations is generated with RDKit. Second, the Butina clustering method is applied to these conformations; the clusters that cover 95% of the conformations are kept, and their centroids form the final set. This yields 1-62 conformations per molecule. Quantum properties are computed with the Kohn-Sham method at the wB97X-D/def2-SVP level of theory.
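
The conformer-selection step can be sketched with RDKit directly. The snippet below is only an illustration of the idea, not the authors' pipeline: the example SMILES, conformer count, and RMSD cutoff are assumptions, and the 95%-coverage selection of clusters is omitted for brevity.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# Step 1: embed a pool of raw conformers (molecule and settings are illustrative).
mol = Chem.AddHs(Chem.MolFromSmiles("CCOC(=O)c1ccccc1"))
AllChem.EmbedMultipleConfs(mol, numConfs=100, randomSeed=42)

# Pairwise RMSD between conformers, as a condensed (lower-triangular) matrix.
dists = AllChem.GetConformerRMSMatrix(mol, prealigned=False)

# Step 2: Butina clustering on the RMSD matrix; each cluster's first element is its centroid.
clusters = Butina.ClusterData(dists, mol.GetNumConformers(), distThresh=0.5, isDistData=True)
centroid_conf_ids = [cluster[0] for cluster in clusters]
print(f"{mol.GetNumConformers()} conformers -> {len(centroid_conf_ids)} centroid conformers")
```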

Usage:

from openqdc.datasets import NablaDFT
dataset = NablaDFT()
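
Beyond construction, a dataset instance behaves like the other openqdc datasets. The following is a hedged sketch: the unit keyword arguments and the record field names (`positions`, `atomic_numbers`, `energies`) follow the common `BaseDataset` interface and may differ between openqdc versions.

```python
from openqdc.datasets import NablaDFT

# Request converted units on load (assumed BaseDataset keywords; the stored
# units are hartree and bohr, see __energy_unit__ / __distance_unit__ below).
dataset = NablaDFT(energy_unit="ev", distance_unit="ang")

print(len(dataset))   # number of conformations
sample = dataset[0]   # one conformation record (dict-like)
print(sample["positions"].shape, sample["energies"])  # field names may vary by version
```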

References

https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D

https://github.com/AIRI-Institute/nablaDFT

Source code in openqdc/datasets/potential/nabladft.py
class NablaDFT(BaseDataset):
    """
    NablaDFT is a dataset constructed from a subset of the
    [Molecular Sets (MOSES) dataset](https://github.com/molecularsets/moses), consisting of 1 million molecules
    with 5,340,152 unique conformations. Conformations for each molecule are generated in two steps. First, a set
    of conformations is generated with RDKit. Second, the Butina clustering method is applied to these
    conformations; the clusters that cover 95% of the conformations are kept, and their centroids form the
    final set. This yields 1-62 conformations per molecule. Quantum properties are computed with the Kohn-Sham
    method at the wB97X-D/def2-SVP level of theory.

    Usage:
    ```python
    from openqdc.datasets import NablaDFT
    dataset = NablaDFT()
    ```

    References:
        https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D\n
        https://github.com/AIRI-Institute/nablaDFT
    """

    __name__ = "nabladft"
    __energy_methods__ = [
        PotentialMethod.WB97X_D_DEF2_SVP,
    ]  # "wb97x-d/def2-svp"

    energy_target_names = ["wb97x-d/def2-svp"]
    __energy_unit__ = "hartree"
    __distance_unit__ = "bohr"
    __forces_unit__ = "hartree/bohr"
    __links__ = {"nabladft.db": "https://n-usr-31b1j.s3pd12.sbercloud.ru/b-usr-31b1j-qz9/data/moses_db/dataset_full.db"}

    @property
    def data_types(self):
        return {
            "atomic_inputs": np.float32,
            "position_idx_range": np.int32,
            "energies": np.float32,
            "forces": np.float32,
        }

    @requires_package("nablaDFT")
    def read_raw_entries(self):
        from nablaDFT.dataset import HamiltonianDatabase

        # Per-conformer labels (DFT total energy) keyed by (MOSES id, CONFORMER id).
        label_path = p_join(self.root, "summary.csv")
        df = pd.read_csv(label_path, usecols=["MOSES id", "CONFORMER id", "SMILES", "DFT TOTAL ENERGY"])
        labels = df.set_index(keys=["MOSES id", "CONFORMER id"]).to_dict("index")

        # Split the full database into index ranges and read them in parallel threads.
        raw_path = p_join(self.root, "dataset_full.db")
        train = HamiltonianDatabase(raw_path)
        n, c = len(train), 20
        step_size = int(np.ceil(n / os.cpu_count()))

        fn = lambda i: read_chunk_from_db(raw_path, i * step_size, min((i + 1) * step_size, n), labels=labels)
        samples = dm.parallelized(
            fn, list(range(c)), n_jobs=c, progress=False, scheduler="threads"
        )  # don't use more than 1 job

        # Flatten the per-chunk lists into a single list of entries.
        return sum(samples, [])
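
The chunked parallel read in `read_raw_entries` can be illustrated in isolation. Below is a minimal, self-contained sketch of the same pattern built on `datamol.parallelized`; `read_range` is a hypothetical stand-in for `read_chunk_from_db`, and the row and chunk counts are made up.

```python
import numpy as np
import datamol as dm

def read_range(start: int, stop: int) -> list:
    """Hypothetical stand-in for read_chunk_from_db: read rows [start, stop)."""
    return list(range(start, stop))  # placeholder payload

n_rows, n_chunks = 1000, 20
step = int(np.ceil(n_rows / n_chunks))

# Each chunk i reads rows [i * step, min((i + 1) * step, n_rows)).
chunks = dm.parallelized(
    lambda i: read_range(i * step, min((i + 1) * step, n_rows)),
    list(range(n_chunks)),
    n_jobs=n_chunks,
    scheduler="threads",  # threads keep everything in a single process
    progress=False,
)

samples = sum(chunks, [])  # flatten the list of per-chunk lists
assert len(samples) == n_rows
```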