Bases: BaseDataset
NablaDFT is a dataset constructed from a subset of the
[Molecular Sets (MOSES) dataset](https://github.com/molecularsets/moses), consisting of 1 million molecules
with 5,340,152 unique conformations. Conformations for each molecule are generated in two steps. First, a set of
conformations is generated with RDKit. Second, the Butina clustering method is applied to these conformations,
clusters that cover 95% of the conformations are selected, and the centroids of those clusters form the final set.
This results in 1-62 conformations per molecule. The quantum properties are computed with the Kohn-Sham method at
the wB97X-D/def2-SVP level of theory.
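The two-step conformer selection can be sketched directly with RDKit. The snippet below is a minimal illustration of the general procedure only; the embedding parameters, RMSD threshold, and exact coverage rule are assumptions, not the settings used to build nablaDFT:

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

# Step 1: embed a pool of conformers for one molecule (parameters are illustrative).
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=100, randomSeed=42))

# Step 2: cluster conformers by pairwise RMSD (threshold is an assumption) ...
dists = []
for i in range(len(conf_ids)):
    for j in range(i):
        dists.append(AllChem.GetConformerRMS(mol, conf_ids[i], conf_ids[j]))
clusters = sorted(
    Butina.ClusterData(dists, len(conf_ids), distThresh=1.0, isDistData=True),
    key=len,
    reverse=True,
)

# ... then keep the largest clusters until they cover 95% of the conformations,
# and take each kept cluster's centroid (its first member) as the final set.
kept, covered = [], 0
for cluster in clusters:
    kept.append(cluster)
    covered += len(cluster)
    if covered >= 0.95 * len(conf_ids):
        break
centroid_ids = [conf_ids[cluster[0]] for cluster in kept]
```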
Usage:

```python
from openqdc.datasets import NablaDFT
dataset = NablaDFT()
```
References
https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D
https://github.com/AIRI-Institute/nablaDFT
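A slightly fuller usage sketch is shown below. It assumes the common openqdc `BaseDataset` interface; the unit keyword arguments and item access are assumptions rather than NablaDFT-specific guarantees:

```python
from openqdc.datasets import NablaDFT

# Assumed BaseDataset keywords: request energies in kcal/mol and positions in angstrom.
dataset = NablaDFT(energy_unit="kcal/mol", distance_unit="ang")

print(len(dataset))  # number of conformations
sample = dataset[0]  # one conformation record (atomic numbers, positions, energies, ...)
```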
Source code in openqdc/datasets/potential/nabladft.py
class NablaDFT(BaseDataset):
"""
NablaDFT is a dataset constructed from a subset of the
[Molecular Sets (MOSES) dataset](https://github.com/molecularsets/moses) consisting of 1 million molecules
with 5,340,152 unique conformations. Conformations for each molecule are generated in 2 steps. First, a set of
conformations are generated using RDKit. Second, using Butina Clustering Method on conformations, clusters that
cover 95% of the conformations are selected and the centroids of those clusters are selected as the final set.
This results in 1-62 conformations per molecule. For generating quantum properties, Kohn-Sham method at
wB97X-D/def2-XVP levels are used to generate the energy.
Usage:
```python
from openqdc.datasets import NablaDFT
dataset = NablaDFT()
```
References:
https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D\n
https://github.com/AIRI-Institute/nablaDFT
"""
    __name__ = "nabladft"
    __energy_methods__ = [
        PotentialMethod.WB97X_D_DEF2_SVP,
    ]  # "wb97x-d/def2-svp"
    energy_target_names = ["wb97x-d/def2-svp"]

    __energy_unit__ = "hartree"
    __distance_unit__ = "bohr"
    __forces_unit__ = "hartree/bohr"
    __links__ = {"nabladft.db": "https://n-usr-31b1j.s3pd12.sbercloud.ru/b-usr-31b1j-qz9/data/moses_db/dataset_full.db"}

    @property
    def data_types(self):
        return {
            "atomic_inputs": np.float32,
            "position_idx_range": np.int32,
            "energies": np.float32,
            "forces": np.float32,
        }
    @requires_package("nablaDFT")
    def read_raw_entries(self):
        from nablaDFT.dataset import HamiltonianDatabase

        # Per-conformer labels (energies), keyed by (MOSES id, CONFORMER id).
        label_path = p_join(self.root, "summary.csv")
        df = pd.read_csv(label_path, usecols=["MOSES id", "CONFORMER id", "SMILES", "DFT TOTAL ENERGY"])
        labels = df.set_index(keys=["MOSES id", "CONFORMER id"]).to_dict("index")

        # Read the raw database in contiguous index chunks, in parallel threads.
        raw_path = p_join(self.root, "dataset_full.db")
        train = HamiltonianDatabase(raw_path)
        n, c = len(train), 20
        step_size = int(np.ceil(n / os.cpu_count()))

        fn = lambda i: read_chunk_from_db(raw_path, i * step_size, min((i + 1) * step_size, n), labels=labels)
        samples = dm.parallelized(
            fn, list(range(c)), n_jobs=c, progress=False, scheduler="threads"
        )  # don't use more than 1 job

        # Flatten the per-chunk lists into a single list of entries.
        return sum(samples, [])
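For context on `read_raw_entries`, the sketch below shows the same chunked parallel-read pattern with a toy chunk reader standing in for `read_chunk_from_db`. Unlike the source, which derives `step_size` from `os.cpu_count()`, this sketch sizes chunks from `c` so that the `c` chunks exactly cover all `n` entries:

```python
import numpy as np
import datamol as dm

n, c = 1_000, 20                # total entries and number of chunks
step = int(np.ceil(n / c))      # chunk size derived from c for full coverage

def read_chunk(i):
    # Each worker handles a contiguous, non-overlapping index range [start, stop).
    start, stop = i * step, min((i + 1) * step, n)
    return list(range(start, stop))  # stand-in for rows read from the database

chunks = dm.parallelized(read_chunk, list(range(c)), n_jobs=c, scheduler="threads", progress=False)
entries = sum(chunks, [])       # flatten per-chunk lists, as read_raw_entries does
assert len(entries) == n
```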