Bases: BaseDataset
The core of the MACEOFF dataset is the SPICE V1 dataset.
95% of the data are used for training and validation under the "train" split,
and 5% for testing. The dataset uses the SPICE level of theory,
ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software.
MACEOFF uses the subset of SPICE limited to the ten chemical elements
H, C, N, O, F, P, S, Cl, Br, and I, and to molecules with a neutral formal charge.
MACEOFF does not contain ion pairs. To facilitate the learning of intramolecular
non-bonded interactions, the dataset also contains larger molecules of 50–90 atoms
randomly selected from the QMugs dataset.
It further contains water clusters carved out of molecular dynamics simulations
of liquid water, with sizes of up to 50 water molecules, and part of the
COMP6 tripeptide geometry dataset.
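The selection criteria described above (ten allowed elements, neutral formal charge) can be illustrated with a small check. This is only a sketch: `ALLOWED_ELEMENTS` and `fits_maceoff_criteria` are hypothetical names introduced here, not part of openqdc, and the snippet does not reproduce how the dataset authors actually filtered SPICE.

```python
# Hypothetical helper (not part of openqdc): checks a molecule against the
# element set and neutrality criterion described in the text above.
ALLOWED_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"})


def fits_maceoff_criteria(symbols, formal_charge):
    """Return True if every atom is an allowed element and the charge is zero."""
    return formal_charge == 0 and set(symbols) <= ALLOWED_ELEMENTS


# Water: allowed elements, neutral.
print(fits_maceoff_criteria(["O", "H", "H"], 0))
# Ammonium: allowed elements, but charged.
print(fits_maceoff_criteria(["N", "H", "H", "H", "H"], 1))
# Sodium chloride: Na is outside the element set.
print(fits_maceoff_criteria(["Na", "Cl"], 0))
```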
Usage:
from openqdc.datasets import MACEOFF
dataset = MACEOFF()
Species
[H, C, N, O, F, P, S, Cl, Br, I]
References
https://arxiv.org/pdf/2312.15211
https://doi.org/10.17863/CAM.107498
Source code in openqdc/datasets/potential/maceoff.py
class MACEOFF(BaseDataset):
    """
    The core of the MACEOFF dataset is the SPICE V1 dataset.
    95% of the data are used for training and validation under the "train" split,
    and 5% for testing. The dataset uses the SPICE level of theory,
    ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software.
    MACEOFF uses the subset of SPICE limited to the ten chemical elements
    H, C, N, O, F, P, S, Cl, Br, and I, and to molecules with a neutral formal charge.
    MACEOFF does not contain ion pairs. To facilitate the learning of intramolecular
    non-bonded interactions, the dataset also contains larger molecules of 50–90 atoms
    randomly selected from the QMugs dataset.
    It further contains water clusters carved out of molecular dynamics simulations
    of liquid water, with sizes of up to 50 water molecules, and part of the
    COMP6 tripeptide geometry dataset.

    Usage:
    ```python
    from openqdc.datasets import MACEOFF
    dataset = MACEOFF()
    ```

    Species:
        [H, C, N, O, F, P, S, Cl, Br, I]

    References:
        https://arxiv.org/pdf/2312.15211\n
        https://doi.org/10.17863/CAM.107498
    """
    __name__ = "maceoff"

    __energy_methods__ = [PotentialMethod.WB97M_D3BJ_DEF2_TZVPPD]
    __force_mask__ = [True]
    __energy_unit__ = "ev"
    __distance_unit__ = "ang"
    __forces_unit__ = "ev/ang"
    energy_target_names = ["dft_total_energy"]
    force_target_names = ["dft_total_gradient"]

    __links__ = {
        "train_large_neut_no_bad_clean.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/b185b5ab-91cf-489a-9302-63bfac42824a/content",  # noqa: E501
        "test_large_neut_all.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/cb8351dd-f09c-413f-921c-67a702a7f0c5/content",  # noqa: E501
    }

    def read_raw_entries(self):
        entries = []
        for filename in self.__links__:
            # Drop the ".tar.gz" suffix to get the archive stem.
            filename = filename.split(".")[0]
            xyzpath = p_join(self.root, f"{filename}.xyz")
            # The stem's first token ("train" or "test") is the split label.
            split = filename.split("_")[0]
            structure_iterator = parse_mace_xyz(xyzpath)
            func = partial(build_data_object, split=split)
            entries.extend(dm.utils.parallelized(func, structure_iterator))
        return entries

    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        # Attach the split label of this entry to the returned item.
        data.__setattr__("split", self._convert_array(self.data["split"][idx]))
        return data
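The way `read_raw_entries` derives the `.xyz` path and the split label from each key of `__links__` can be traced in isolation. The sketch below assumes `p_join` is `os.path.join` (the listing does not show its import) and uses a hypothetical `xyz_path_and_split` helper and a placeholder root directory; it performs only the string handling, not the parsing or download.

```python
from os.path import join as p_join  # assumption: p_join in the listing is os.path.join

# Archive names copied from the __links__ keys in the listing above.
ARCHIVES = ["train_large_neut_no_bad_clean.tar.gz", "test_large_neut_all.tar.gz"]


def xyz_path_and_split(root, archive_name):
    """Hypothetical helper mirroring the string handling in read_raw_entries."""
    stem = archive_name.split(".")[0]      # strip ".tar.gz"
    xyzpath = p_join(root, f"{stem}.xyz")  # expected extracted file
    split = stem.split("_")[0]             # "train" or "test"
    return xyzpath, split


# "cache/maceoff" is a placeholder root, not the real openqdc cache location.
for name in ARCHIVES:
    path, split = xyz_path_and_split("cache/maceoff", name)
    print(split, path)
```

Because the split label comes from the file name, the "train" and "test" partitions are fixed by the published archives rather than drawn at load time.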