Protein Fragments

MDDataset

Bases: ProteinFragments

MDDataset is a subset of the ProteinFragments dataset generated from molecular dynamics (MD) simulations run with the model of the referenced paper. Sampling was done with MD at room temperature (300 K) in gas phase or aqueous solution, depending on the subset:

Subsets

Polyalanine: All polyalanine systems are sampled in the gas phase. AceAla15Lys is a polyalanine peptide capped with an N-terminal acetyl group and a protonated lysine residue at the C-terminus; AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group and a C-terminal N-methyl amide group.

Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)

Usage:

from openqdc.datasets import MDDataset
dataset = MDDataset()

References

https://www.science.org/doi/10.1126/sciadv.adn4397

Source code in openqdc/datasets/potential/proteinfragments.py
class MDDataset(ProteinFragments):
    """
    MDDataset is a subset of the ProteinFragments dataset generated from
    molecular dynamics (MD) simulations run with the model of the
    referenced paper. Sampling was done with MD at room temperature
    (300 K) in gas phase or aqueous solution, depending on the subset:

    Subsets:
        Polyalanine:
            All polyalanine systems are sampled in the gas phase. AceAla15Lys is
            a polyalanine peptide capped with an N-terminal acetyl group
            and a protonated lysine residue at the C-terminus;
            AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group
            and a C-terminal N-methyl amide group\n
        Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)

    Usage:
    ```python
    from openqdc.datasets import MDDataset
    dataset = MDDataset()
    ```

    References:
        https://www.science.org/doi/10.1126/sciadv.adn4397
    """

    __name__ = "mddataset"

    __links__ = {
        f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
        for name in ["acala15nme_folding_clusters", "crambin", "minimahopping_acala15lysh", "minimahopping_acala15nme"]
    }
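The `__links__` comprehension above maps each local file name to its Zenodo download URL. As a minimal standalone sketch (the variable names `names` and `links` are introduced here for illustration only), the same expression can be evaluated outside the class to see which files it covers:

```python
# File stems copied from the __links__ comprehension of MDDataset.
names = [
    "acala15nme_folding_clusters",
    "crambin",
    "minimahopping_acala15lysh",
    "minimahopping_acala15nme",
]

# Same f-string pattern as in the class: "<name>.db" -> Zenodo download URL.
links = {
    f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
    for name in names
}

for filename, url in links.items():
    print(f"{filename} -> {url}")
```

Each key is the name under which the raw database is stored locally, which is also the name that `read_raw_entries` joins onto `root`.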

ProteinFragments

Bases: BaseDataset

ProteinFragments is a dataset whose data was generated with a top-down and a bottom-up approach:

Top-down

Fragments are generated by cutting out a spherical region around an atom (including solvent molecules) and saturating all dangling bonds. Sampling was done with molecular dynamics (MD) using a conventional force field (FF) at room temperature.

Bottom-up

Fragments are generated by constructing chemical graphs of one to eight non-hydrogen atoms. Multiple conformers per fragment were sampled with MD simulations at high temperatures or with normal mode sampling.
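The spherical cut in the top-down approach can be illustrated with a small distance filter. This is only a sketch of the selection step, not the procedure used to build the dataset: the function name `atoms_within_sphere`, the coordinates, and the radius are hypothetical, and the saturation of dangling bonds is left out entirely.

```python
import math

def atoms_within_sphere(positions, center_index, radius):
    """Return indices of atoms whose distance to the center atom is <= radius.

    Hypothetical helper for illustration; distances are in the same unit as
    the coordinates (the dataset itself uses angstrom).
    """
    center = positions[center_index]
    return [i for i, pos in enumerate(positions) if math.dist(pos, center) <= radius]

# Made-up coordinates: three atoms near the origin and one far away.
positions = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 3.0, 0.0),
    (5.0, 5.0, 5.0),
]

selected = atoms_within_sphere(positions, center_index=0, radius=3.5)
print(selected)  # [0, 1, 2]
```

A real fragment generator would additionally have to keep molecules intact and cap every bond that crosses the sphere boundary.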

Usage:

from openqdc.datasets import ProteinFragments
dataset = ProteinFragments()

References

https://www.science.org/doi/10.1126/sciadv.adn4397

Source code in openqdc/datasets/potential/proteinfragments.py
class ProteinFragments(BaseDataset):
    """
    ProteinFragments is a dataset whose data was generated
    with a top-down and a bottom-up approach:

    Top-down:
        Fragments are generated by cutting out a spherical
        region around an atom (including solvent molecules)
        and saturating all dangling bonds.
        Sampling was done with molecular dynamics (MD) using a
        conventional force field (FF) at room temperature.

    Bottom-up:
        Fragments are generated by constructing chemical graphs
        of one to eight non-hydrogen atoms.
        Multiple conformers per fragment were sampled with
        MD simulations at high temperatures or with normal mode sampling.


    Usage:
    ```python
    from openqdc.datasets import ProteinFragments
    dataset = ProteinFragments()
    ```

    References:
        https://www.science.org/doi/10.1126/sciadv.adn4397
    """

    __name__ = "proteinfragments"
    # PBE0/def2-TZVPP+MBD
    __energy_methods__ = [
        PotentialMethod.PBE0_MBD_DEF2_TZVPP,
    ]

    energy_target_names = [
        "PBE0+MBD/def2-TZVPP",
    ]

    __energy_unit__ = "ev"
    __distance_unit__ = "ang"
    __forces_unit__ = "ev/ang"
    __links__ = {
        f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
        for name in ["general_protein_fragments"]
    }

    @property
    def root(self):
        return p_join(get_local_cache(), "proteinfragments")

    @property
    def config(self):
        assert len(self.__links__) > 0, "No links provided for fetching"
        return dict(dataset_name="proteinfragments", links=self.__links__)

    @property
    def preprocess_path(self):
        path = p_join(self.root, "preprocessed", self.__name__)
        os.makedirs(path, exist_ok=True)
        return path

    def read_raw_entries(self):
        samples = []
        for name in self.__links__:
            raw_path = p_join(self.root, f"{name}")
            samples.extend(read_db(raw_path))
        return samples
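Because `MDDataset` subclasses `ProteinFragments` without overriding `root` in the code shown, both datasets appear to share one cache folder and differ only in the last path component, which comes from `__name__`. The sketch below mirrors that path logic under two assumptions: `p_join` behaves like `os.path.join`, and a temporary directory stands in for whatever `get_local_cache()` returns. The free function `preprocess_path` is a hypothetical stand-in for the property.

```python
import os
import tempfile

def preprocess_path(cache_dir: str, dataset_name: str) -> str:
    """Mimic the `root` and `preprocess_path` properties with an explicit cache dir."""
    root = os.path.join(cache_dir, "proteinfragments")  # same for both classes
    path = os.path.join(root, "preprocessed", dataset_name)
    os.makedirs(path, exist_ok=True)  # created on first access, as in the property
    return path

cache_dir = tempfile.mkdtemp()  # assumption: stand-in for get_local_cache()

fragments_dir = preprocess_path(cache_dir, "proteinfragments")  # ProteinFragments.__name__
md_dir = preprocess_path(cache_dir, "mddataset")                # MDDataset.__name__

print(os.path.relpath(fragments_dir, cache_dir))
print(os.path.relpath(md_dir, cache_dir))
```

Under these assumptions the raw `.db` files of both datasets sit side by side in the shared `proteinfragments` folder, while the preprocessed output is kept apart per dataset name.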