Spice

Bases: BaseDataset

The Spice dataset consists of 1.1 million conformations for a diverse set of 19k unique molecules, including small molecules, dimers, dipeptides, and solvated amino acids. Initial conformations are generated with RDKit; molecular dynamics simulations of 100 ps at 500 K, run with OpenMM and the Amber force field, then produce 100 high-energy conformations. Low-energy conformations are generated by L-BFGS energy minimization and 1 ps of molecular dynamics at 100 K. Forces and energies for all conformations are calculated at the wB97M-D3(BJ)/def2-TZVPPD level of theory.

Usage:

from openqdc.datasets import Spice
dataset = Spice()

References

https://arxiv.org/abs/2209.10702

https://github.com/openmm/spice-dataset

Source code in openqdc/datasets/potential/spice.py
class Spice(BaseDataset):
    """
    The Spice dataset consists of 1.1 million conformations for a diverse set of 19k unique molecules, including
    small molecules, dimers, dipeptides, and solvated amino acids. Initial conformations are generated with RDKit;
    molecular dynamics simulations of 100 ps at 500 K, run with OpenMM and the Amber force field, then produce
    100 high-energy conformations. Low-energy conformations are generated by L-BFGS energy minimization and
    1 ps of molecular dynamics at 100 K. Forces and energies for all conformations are calculated at the
    wB97M-D3(BJ)/def2-TZVPPD level of theory.

    Usage:
    ```python
    from openqdc.datasets import Spice
    dataset = Spice()
    ```

    References:
        https://arxiv.org/abs/2209.10702\n
        https://github.com/openmm/spice-dataset
    """

    __name__ = "spice"
    __energy_methods__ = [PotentialMethod.WB97M_D3BJ_DEF2_TZVPPD]
    __force_mask__ = [True]
    __energy_unit__ = "hartree"
    __distance_unit__ = "bohr"
    __forces_unit__ = "hartree/bohr"

    energy_target_names = ["dft_total_energy"]

    force_target_names = ["dft_total_gradient"]

    subset_mapping = {
        "SPICE Solvated Amino Acids Single Points Dataset v1.1": "Solvated Amino Acids",
        "SPICE Dipeptides Single Points Dataset v1.2": "Dipeptides",
        "SPICE DES Monomers Single Points Dataset v1.1": "DES370K Monomers",
        "SPICE DES370K Single Points Dataset v1.0": "DES370K Dimers",
        "SPICE DES370K Single Points Dataset Supplement v1.0": "DES370K Dimers",
        "SPICE PubChem Set 1 Single Points Dataset v1.2": "PubChem",
        "SPICE PubChem Set 2 Single Points Dataset v1.2": "PubChem",
        "SPICE PubChem Set 3 Single Points Dataset v1.2": "PubChem",
        "SPICE PubChem Set 4 Single Points Dataset v1.2": "PubChem",
        "SPICE PubChem Set 5 Single Points Dataset v1.2": "PubChem",
        "SPICE PubChem Set 6 Single Points Dataset v1.2": "PubChem",
        "SPICE Ion Pairs Single Points Dataset v1.1": "Ion Pairs",
    }
    __links__ = {"SPICE-1.1.4.hdf5": "https://zenodo.org/record/8222043/files/SPICE-1.1.4.hdf5"}

    def convert_forces(self, x):
        return (-1.0) * super().convert_forces(x)

    def read_raw_entries(self):
        raw_path = p_join(self.root, "SPICE-1.1.4.hdf5")

        data = load_hdf5_file(raw_path)
        tmp = [read_record(data[mol_name], self) for mol_name in tqdm(data)]  # don't use parallelized here

        return tmp
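
The raw file stores `dft_total_gradient`, and `convert_forces` flips its sign on top of whatever conversion the parent class applies. A minimal sketch of that sign convention, assuming NumPy arrays; `gradient_to_forces` and the sample values are hypothetical and not part of openqdc:

```python
import numpy as np


def gradient_to_forces(gradient: np.ndarray) -> np.ndarray:
    """Return forces as the negative of an energy gradient (F = -dE/dx)."""
    return -1.0 * np.asarray(gradient, dtype=np.float64)


# Hypothetical gradient for a two-atom system, in hartree/bohr.
gradient = np.array([[0.01, 0.0, -0.02], [-0.01, 0.0, 0.02]])
forces = gradient_to_forces(gradient)
```

Applying the sign flip in one place keeps `read_record` free of unit and sign handling, so the raw arrays it returns match the file contents.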

SpiceV2

Bases: Spice

The SpiceV2 dataset augments the Spice data with amino acid complexes, water boxes, and solvated PubChem molecules. The main changes are: (1) over 13,000 new PubChem molecules, of which 1500 contain boron and 1900 contain silicon; (2) 194,000 conformations of dimers containing an amino acid and a ligand; (3) 1000 water clusters to improve sampling of interactions in bulk water; (4) 1397 PubChem molecules solvated with a shell of water molecules; and (5) fixes for bad calculations in the Spice dataset. The data generation process is the same as for the Spice dataset.

Usage:

from openqdc.datasets import SpiceV2
dataset = SpiceV2()

References

https://github.com/openmm/spice-dataset/releases/tag/2.0.0

https://github.com/openmm/spice-dataset

Source code in openqdc/datasets/potential/spice.py
class SpiceV2(Spice):
    """
    The SpiceV2 dataset augments the Spice data with amino acid complexes, water boxes, and solvated PubChem
    molecules. The main changes are: (1) over 13,000 new PubChem molecules, of which 1500 contain boron and 1900
    contain silicon; (2) 194,000 conformations of dimers containing an amino acid and a ligand; (3) 1000 water
    clusters to improve sampling of interactions in bulk water; (4) 1397 PubChem molecules solvated with a shell
    of water molecules; and (5) fixes for bad calculations in the Spice dataset. The data generation process is
    the same as for the Spice dataset.

    Usage:
    ```python
    from openqdc.datasets import SpiceV2
    dataset = SpiceV2()
    ```

    References:
        https://github.com/openmm/spice-dataset/releases/tag/2.0.0\n
        https://github.com/openmm/spice-dataset
    """

    __name__ = "spicev2"

    subset_mapping = {
        "SPICE Dipeptides Single Points Dataset v1.3": "Dipeptides",
        "SPICE Solvated Amino Acids Single Points Dataset v1.1": "Solvated Amino Acids",
        "SPICE Water Clusters v1.0": "Water Clusters",
        "SPICE Solvated PubChem Set 1 v1.0": "Solvated PubChem",
        "SPICE Amino Acid Ligand v1.0": "Amino Acid Ligand",
        "SPICE PubChem Set 1 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 2 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 3 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 4 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 5 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 6 Single Points Dataset v1.3": "PubChem",
        "SPICE PubChem Set 7 Single Points Dataset v1.0": "PubChemv2",
        "SPICE PubChem Set 8 Single Points Dataset v1.0": "PubChemv2",
        "SPICE PubChem Set 9 Single Points Dataset v1.0": "PubChemv2",
        "SPICE PubChem Set 10 Single Points Dataset v1.0": "PubChemv2",
        "SPICE DES Monomers Single Points Dataset v1.1": "DES370K Monomers",
        "SPICE DES370K Single Points Dataset v1.0": "DES370K Dimers",
        "SPICE DES370K Single Points Dataset Supplement v1.1": "DES370K Dimers",
        "SPICE PubChem Boron Silicon v1.0": "PubChem Boron Silicon",
        "SPICE Ion Pairs Single Points Dataset v1.2": "Ion Pairs",
    }
    __links__ = {"spice-2.0.0.hdf5": "https://zenodo.org/records/10835749/files/SPICE-2.0.0.hdf5?download=1"}

    def read_raw_entries(self):
        raw_path = p_join(self.root, "spice-2.0.0.hdf5")

        data = load_hdf5_file(raw_path)
        # Entry 40132 without positions, skip it
        # don't use parallelized here
        tmp = [read_record(data[mol_name], self) for i, mol_name in enumerate(tqdm(data)) if i != 40132]

        return tmp
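
`read_raw_entries` skips one record by its position in the iteration order rather than by name. The pattern can be sketched with a plain dictionary standing in for the HDF5 file; `read_all_but`, `fake_file`, and the lambda reader are hypothetical names used only for illustration:

```python
from typing import Any, Callable, List, Mapping, Set


def read_all_but(data: Mapping[str, Any], skip: Set[int], reader: Callable[[Any], Any]) -> List[Any]:
    """Apply `reader` to every entry of `data` except those at the skipped positions."""
    return [reader(data[name]) for i, name in enumerate(data) if i not in skip]


# A dict stands in for the HDF5 file; it iterates in insertion order.
fake_file = {"mol_a": {"n": 1}, "mol_b": {"n": 2}, "mol_c": {"n": 3}}
records = read_all_but(fake_file, skip={1}, reader=lambda rec: rec["n"])
```

Skipping by position only works if the container always iterates in the same order, so this approach is tied to one specific version of the file.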

SpiceVL2

Bases: SpiceV2

SpiceVL2 is an extension of the SpiceV2 dataset with additional semi-empirical GFN2-xTB and PM6 energy methods.

Usage:

from openqdc.datasets import SpiceVL2
dataset = SpiceVL2()

References

https://github.com/openmm/spice-dataset/releases/tag/2.0.0

https://github.com/openmm/spice-dataset

Source code in openqdc/datasets/potential/spice.py
class SpiceVL2(SpiceV2):
    """
    SpiceVL2 is an extension of the SpiceV2 dataset with additional semi-empirical GFN2-xTB and PM6 energy methods.

    Usage:
    ```python
    from openqdc.datasets import SpiceVL2
    dataset = SpiceVL2()
    ```

    References:
        https://github.com/openmm/spice-dataset/releases/tag/2.0.0\n
        https://github.com/openmm/spice-dataset
    """

    __name__ = "spice_vl2"

    __energy_methods__ = SpiceV2.__energy_methods__ + [PotentialMethod.GFN2_XTB, PotentialMethod.PM6]
    # one target name per added energy method, as two separate strings
    energy_target_names = SpiceV2.energy_target_names + ["GFN2", "PM6"]
    __force_mask__ = SpiceV2.__force_mask__ + [False, False]
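
The three class attributes above act as parallel lists: one entry per energy method. Python concatenates adjacent string literals, so a comma placed inside the quotes silently yields a single element instead of two. The sketch below shows that behavior together with a length check; `check_aligned` is a hypothetical helper, not an openqdc function:

```python
from typing import Sequence


def check_aligned(energy_methods: Sequence, target_names: Sequence, force_mask: Sequence) -> None:
    """Raise ValueError unless all three per-method lists have the same length."""
    lengths = (len(energy_methods), len(target_names), len(force_mask))
    if len(set(lengths)) != 1:
        raise ValueError(f"per-method lists differ in length: {lengths}")


# Adjacent string literals are joined into one string by the parser.
merged = ["GFN2," "PM6"]
separate = ["GFN2", "PM6"]

methods = ["wb97m", "gfn2_xtb", "pm6"]  # placeholder method labels
check_aligned(methods, ["dft_total_energy"] + separate, [True, False, False])
```

Calling `check_aligned` with `["dft_total_energy"] + merged` instead raises `ValueError`, because that list has only two entries.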

read_record(r, obj)

Read record from hdf5 file.

r : hdf5 record
obj : Spice class object used to grab subset and names

Source code in openqdc/datasets/potential/spice.py
def read_record(r, obj):
    """
    Read record from hdf5 file.
        r : hdf5 record
        obj : Spice class object used to grab subset and names
    """
    smiles = r["smiles"].asstr()[0]
    subset = r["subset"][0].decode("utf-8")
    n_confs = r["conformations"].shape[0]
    x = get_atomic_number_and_charge(dm.to_mol(smiles, remove_hs=False, ordered=True))
    positions = r["conformations"][:]

    res = dict(
        name=np.array([smiles] * n_confs),
        subset=np.array([obj.subset_mapping[subset]] * n_confs),
        energies=r[obj.energy_target_names[0]][:][:, None].astype(np.float64),
        forces=r[obj.force_target_names[0]][:].reshape(
            -1, 3, 1
        ),  # raw values are energy gradients; forces are their negative, and convert_forces applies the -1.0
        atomic_inputs=np.concatenate(
            (x[None, ...].repeat(n_confs, axis=0), positions), axis=-1, dtype=np.float32
        ).reshape(-1, 5),
        n_atoms=np.array([x.shape[0]] * n_confs, dtype=np.int32),
    )

    return res
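
The `atomic_inputs` array packs per-atom descriptors and coordinates into five columns. The sketch below reproduces that layout with NumPy on made-up data. It assumes, as inferred from the `reshape(-1, 5)` call, that `get_atomic_number_and_charge` returns two columns per atom (atomic number and charge). The water-like values are hypothetical.

```python
import numpy as np

n_confs, n_atoms = 2, 3

# Hypothetical (atomic number, charge) pairs for O, H, H.
x = np.array([[8, 0], [1, 0], [1, 0]], dtype=np.float32)

# Hypothetical coordinates: conformer 0 at the origin, conformer 1 shifted by 1.0.
positions = np.zeros((n_confs, n_atoms, 3), dtype=np.float32)
positions[1] += 1.0

# Repeat the per-atom descriptors for every conformer: (n_confs, n_atoms, 2).
tiled = x[None, ...].repeat(n_confs, axis=0)

# Join descriptors and coordinates, then flatten conformers into rows of 5 values.
atomic_inputs = np.concatenate((tiled, positions), axis=-1).astype(np.float32).reshape(-1, 5)
```

Each row holds `[atomic_number, charge, x, y, z]`, and rows for successive conformers follow one another, so `n_atoms` is needed to split the flat array back into conformers.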