MaceOFF

MACEOFF

Bases: BaseDataset

The core of the MACEOFF dataset is the SPICE V1 dataset. 95% of the data is used for training and validation under the "train" split, and 5% for testing. The dataset uses the SPICE level of theory, ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software. MACEOFF uses a subset of SPICE limited to the ten chemical elements H, C, N, O, F, P, S, Cl, Br, and I, and to a neutral formal charge. It does not contain ion pairs. To facilitate the learning of intramolecular non-bonded interactions, MACEOFF also contains larger molecules of 50–90 atoms randomly selected from the QMugs dataset. It further contains water clusters carved out of molecular dynamics simulations of liquid water, with sizes of up to 50 water molecules, and part of the COMP6 tripeptide geometry dataset.

Usage:

from openqdc.datasets import MACEOFF
dataset = MACEOFF()

Species

[H, C, N, O, F, P, S, Cl, Br, I]
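
As a quick illustration of the species list, the following sketch checks whether a molecule's elements all fall within the ten supported species. The names `MACEOFF_SPECIES` and `is_supported` are hypothetical helpers written for this example and are not part of openqdc.

```python
# Hypothetical helper, not part of openqdc: atomic numbers of the ten
# elements listed under "Species".
MACEOFF_SPECIES = {
    "H": 1, "C": 6, "N": 7, "O": 8, "F": 9,
    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}


def is_supported(symbols):
    """Return True if every element symbol is one of the ten MACEOFF species."""
    return all(symbol in MACEOFF_SPECIES for symbol in symbols)


print(is_supported(["C", "H", "H", "H", "Cl"]))  # all symbols are in the list
print(is_supported(["Na", "Cl"]))  # sodium is outside the species list
```

A check like this only covers the element restriction; the neutral formal charge condition described above would need separate handling.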

References

https://arxiv.org/pdf/2312.15211

https://doi.org/10.17863/CAM.107498

Source code in openqdc/datasets/potential/maceoff.py
class MACEOFF(BaseDataset):
    """
    The core of the MACEOFF dataset is the SPICE V1 dataset.
    95% of the data is used for training and validation under the "train" split,
    and 5% for testing. The dataset uses the SPICE level of theory,
    ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software.
    MACEOFF uses a subset of SPICE limited to the ten chemical elements
    H, C, N, O, F, P, S, Cl, Br, and I, and to a neutral formal charge.
    It does not contain ion pairs. To facilitate the learning of intramolecular
    non-bonded interactions, MACEOFF also contains larger molecules of 50–90 atoms
    randomly selected from the QMugs dataset.
    It further contains water clusters carved out of molecular dynamics simulations
    of liquid water, with sizes of up to 50 water molecules, and part of the
    COMP6 tripeptide geometry dataset.

    Usage:
    ```python
    from openqdc.datasets import MACEOFF
    dataset = MACEOFF()
    ```

    Species:
        [H, C, N, O, F, P, S, Cl, Br, I]

    References:
        https://arxiv.org/pdf/2312.15211\n
        https://doi.org/10.17863/CAM.107498
    """

    __name__ = "maceoff"

    __energy_methods__ = [PotentialMethod.WB97M_D3BJ_DEF2_TZVPPD]
    __force_mask__ = [True]
    __energy_unit__ = "ev"
    __distance_unit__ = "ang"
    __forces_unit__ = "ev/ang"

    energy_target_names = ["dft_total_energy"]
    force_target_names = ["dft_total_gradient"]

    __links__ = {
        "train_large_neut_no_bad_clean.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/b185b5ab-91cf-489a-9302-63bfac42824a/content",  # noqa: E501
        "test_large_neut_all.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/cb8351dd-f09c-413f-921c-67a702a7f0c5/content",  # noqa: E501
    }

    def read_raw_entries(self):
        entries = []
        for archive_name in self.__links__:
            # "train_large_neut_no_bad_clean.tar.gz" -> "train_large_neut_no_bad_clean"
            stem = archive_name.split(".")[0]
            xyzpath = p_join(self.root, f"{stem}.xyz")
            # The first token of the file name is the split label: "train" or "test".
            split = stem.split("_")[0]
            structure_iterator = parse_mace_xyz(xyzpath)
            func = partial(build_data_object, split=split)
            entries.extend(dm.utils.parallelized(func, structure_iterator))
        return entries

    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        # Attach the split label ("train" or "test") to the returned item.
        setattr(data, "split", self._convert_array(self.data["split"][idx]))
        return data
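
The string handling in `read_raw_entries` can be traced without downloading anything. The following is a minimal sketch using only the standard library: `ROOT` is a placeholder for `self.root`, `stem_and_split` is a hypothetical helper, and `os.path.join` stands in for `p_join` on the assumption that the two behave the same.

```python
from os.path import join

# Archive names copied from the keys of MACEOFF.__links__.
ARCHIVES = [
    "train_large_neut_no_bad_clean.tar.gz",
    "test_large_neut_all.tar.gz",
]
ROOT = "maceoff_root"  # placeholder for self.root


def stem_and_split(archive_name):
    """Mirror the string handling in read_raw_entries (hypothetical helper)."""
    stem = archive_name.split(".")[0]  # drop the ".tar.gz" suffix
    split = stem.split("_")[0]  # first token of the name: "train" or "test"
    return stem, split


for name in ARCHIVES:
    stem, split = stem_and_split(name)
    print(split, join(ROOT, f"{stem}.xyz"))
```

Every structure parsed from a given file is built with that file's label, and `__getitem__` then exposes it as `split` on each returned item.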