MaceOFF

`MACEOFF`

Bases: BaseDataset

The core of the MACEOFF dataset is the SPICE V1 dataset. 95% of the data are used for training and validation under the "train" split, and 5% for testing. The dataset uses the SPICE level of theory, ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software. MACEOFF uses a subset of SPICE that contains the ten chemical elements H, C, N, O, F, P, S, Cl, Br, and I, restricted to molecules with a neutral formal charge. MACEOFF does not contain ion pairs. To facilitate the learning of intramolecular non-bonded interactions, the dataset also contains larger 50–90 atom molecules randomly selected from the QMugs dataset. It further contains a number of water clusters carved out of molecular dynamics simulations of liquid water, with sizes of up to 50 water molecules, and part of the COMP6 tripeptide geometry dataset.

Usage:

from openqdc.datasets import MACEOFF
dataset = MACEOFF()

Species

[H, C, N, O, F, P, S, Cl, Br, I]
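
To illustrate what the species list implies, here is a minimal sketch that checks whether every element in a molecule is one of the ten supported species. The constant `SUPPORTED_SPECIES` and the helper `is_supported` are hypothetical names introduced for this example and are not part of openqdc. The check covers element coverage only and does not test the neutral formal charge requirement.

```python
# Hypothetical helper, not part of openqdc: checks element coverage only.
SUPPORTED_SPECIES = frozenset(["H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"])


def is_supported(symbols):
    """Return True if every element symbol is one of the ten MACEOFF species."""
    return set(symbols) <= SUPPORTED_SPECIES


water = ["O", "H", "H"]
borane = ["B", "H", "H", "H"]  # boron is outside the supported set

print(is_supported(water))
print(is_supported(borane))
```

A model trained on this dataset has seen no other elements, so a check of this kind can serve as a guard before inference.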

References

https://arxiv.org/pdf/2312.15211

https://doi.org/10.17863/CAM.107498

Source code in openqdc/datasets/potential/maceoff.py

class MACEOFF(BaseDataset):
    """
    The core of the MACEOFF dataset is the SPICE V1 dataset.
    95% of the data are used for training and validation under the "train" split,
    and 5% for testing. The dataset uses the SPICE level of theory,
    ωB97M-D3(BJ)/def2-TZVPPD, as implemented in the PSI4 software.
    MACEOFF uses a subset of SPICE that contains the ten chemical elements
    H, C, N, O, F, P, S, Cl, Br, and I, restricted to molecules with a neutral
    formal charge. MACEOFF does not contain ion pairs. To facilitate the learning
    of intramolecular non-bonded interactions, the dataset also contains larger
    50–90 atom molecules randomly selected from the QMugs dataset.
    It further contains a number of water clusters carved out of molecular dynamics
    simulations of liquid water, with sizes of up to 50 water molecules, and part of
    the COMP6 tripeptide geometry dataset.

    Usage:
    ```python
    from openqdc.datasets import MACEOFF
    dataset = MACEOFF()
    ```

    Species:
        [H, C, N, O, F, P, S, Cl, Br, I]

    References:
        https://arxiv.org/pdf/2312.15211\n
        https://doi.org/10.17863/CAM.107498
    """

    __name__ = "maceoff"

    __energy_methods__ = [PotentialMethod.WB97M_D3BJ_DEF2_TZVPPD]
    __force_mask__ = [True]
    __energy_unit__ = "ev"
    __distance_unit__ = "ang"
    __forces_unit__ = "ev/ang"

    energy_target_names = ["dft_total_energy"]
    force_target_names = ["dft_total_gradient"]

    __links__ = {
        "train_large_neut_no_bad_clean.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/b185b5ab-91cf-489a-9302-63bfac42824a/content",  # noqa: E501
        "test_large_neut_all.tar.gz": "https://api.repository.cam.ac.uk/server/api/core/bitstreams/cb8351dd-f09c-413f-921c-67a702a7f0c5/content",  # noqa: E501
    }

    def read_raw_entries(self):
        entries = []
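        for filename in self.__links__:
            # Strip the archive extension, e.g. "test_large_neut_all.tar.gz" -> "test_large_neut_all"
            filename = filename.split(".")[0]
            xyzpath = p_join(self.root, f"{filename}.xyz")
            # The leading token of the file name ("train" or "test") is the split label
            split = filename.split("_")[0]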
            structure_iterator = parse_mace_xyz(xyzpath)
            func = partial(build_data_object, split=split)
            entries.extend(dm.utils.parallelized(func, structure_iterator))
        return entries

    def __getitem__(self, idx):
        data = super().__getitem__(idx)
        data.__setattr__("split", self._convert_array(self.data["split"][idx]))
        return data
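
The string handling in `read_raw_entries` can be sketched in isolation as follows. This is a minimal illustration, not the library's implementation. The helper `describe_archive` and the `root` value are hypothetical. The archive names are copied from `__links__`.

```python
from os.path import join

# Archive names copied from the __links__ mapping of the class.
ARCHIVES = [
    "train_large_neut_no_bad_clean.tar.gz",
    "test_large_neut_all.tar.gz",
]


def describe_archive(archive, root):
    """Hypothetical helper mirroring the name handling in read_raw_entries."""
    stem = archive.split(".")[0]          # drop the ".tar.gz" suffix
    xyz_path = join(root, f"{stem}.xyz")  # .xyz file expected under root
    split = stem.split("_")[0]            # leading token: "train" or "test"
    return xyz_path, split


for name in ARCHIVES:
    # "maceoff_root" is a placeholder directory, assumed for illustration.
    print(describe_archive(name, root="maceoff_root"))
```

Because the split label comes from the file name, every structure parsed from one file gets the same label. `__getitem__` then exposes that label on each item as `split`.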
