OpenQDC

Protein Fragments

`MDDataset` ¶

Bases: ProteinFragments

MDDataset is a subset of the ProteinFragments dataset that was generated from molecular dynamics simulations run with the model of the referenced work. Sampling was done with molecular dynamics at room temperature (300 K), in gas phase or in solvent depending on the subset:

Subsets

Polyalanine: All polyalanine systems are sampled in the gas phase. AceAla15Lys is a polyalanine peptide capped with an N-terminal acetyl group and a protonated lysine residue at the C-terminus; AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group and a C-terminal N-methyl amide group.

Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)

Usage:

from openqdc.datasets import MDDataset
dataset = MDDataset()

References

https://www.science.org/doi/10.1126/sciadv.adn4397

Source code in openqdc/datasets/potential/proteinfragments.py

class MDDataset(ProteinFragments):
    """
    MDDataset is a subset of the ProteinFragments dataset that was
    generated from molecular dynamics simulations run with the model
    of the referenced work.
    Sampling was done with molecular dynamics at room temperature (300 K),
    in gas phase or in solvent depending on the subset:

    Subsets:
        Polyalanine:
            All polyalanine systems are sampled in the gas phase. AceAla15Lys is
            a polyalanine peptide capped with an N-terminal acetyl group
            and a protonated lysine residue at the C-terminus;
            AceAla15Nme is a polyalanine peptide capped with an N-terminal acetyl group
            and a C-terminal N-methyl amide group\n
        Crambin: 46-residue protein crambin in aqueous solution (25,257 atoms)

    Usage:
    ```python
    from openqdc.datasets import MDDataset
    dataset = MDDataset()
    ```

    References:
        https://www.science.org/doi/10.1126/sciadv.adn4397
    """

    __name__ = "mddataset"

    __links__ = {
        f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
        for name in ["acala15nme_folding_clusters", "crambin", "minimahopping_acala15lysh", "minimahopping_acala15nme"]
    }
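To make the dictionary comprehension in `__links__` concrete, the following standalone sketch expands it outside the class. It reuses only the Zenodo URL pattern and file names shown in the source above; the variable names `names` and `links` are illustrative and are not OpenQDC attributes.

```python
# Standalone expansion of the __links__ comprehension shown above.
# `names` and `links` are illustrative names, not part of OpenQDC.
names = [
    "acala15nme_folding_clusters",
    "crambin",
    "minimahopping_acala15lysh",
    "minimahopping_acala15nme",
]

links = {
    f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
    for name in names
}

# Each key is a local file name; each value is the matching download URL.
for filename, url in links.items():
    print(filename, "->", url)
```

Because `MDDataset` overrides only `__name__` and `__links__`, everything else (units, energy method, `root`, `read_raw_entries`) is inherited from `ProteinFragments`.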

`ProteinFragments` ¶

Bases: BaseDataset

ProteinFragments is a dataset constructed from a subset of the data of the referenced work. The data was generated with a top-down and a bottom-up approach:

Top-down

Fragments are generated by cutting out a spherical region around an atom (including solvent molecules) and saturating all dangling bonds. Sampling was done with molecular dynamics (MD) using a conventional force field (FF) at room temperature.

Bottom-up

Fragments are generated by constructing chemical graphs of one to eight nonhydrogen atoms. Sampling of multiple conformers per fragment was done with MD simulations at high temperatures or normal mode sampling.
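The spherical cut in the top-down approach can be illustrated with a minimal sketch. This is not the procedure used to build the dataset: the function name `atoms_within_sphere`, the toy coordinates, and the 3.0 Å radius are all assumptions made for illustration, and the step of saturating dangling bonds is not covered.

```python
import math


def atoms_within_sphere(positions, center_index, radius):
    """Return indices of atoms whose distance to the center atom is <= radius."""
    center = positions[center_index]
    return [i for i, pos in enumerate(positions) if math.dist(pos, center) <= radius]


# Toy coordinates in angstrom (made up for illustration).
positions = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.5, 0.0),
    (4.0, 0.0, 0.0),
]

# Hypothetical cutoff radius; the value used for the dataset is not stated here.
selected = atoms_within_sphere(positions, center_index=0, radius=3.0)
print(selected)
```

A real fragment builder would additionally keep whole solvent molecules together and cap the cut bonds, which this sketch leaves out.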

Usage:

from openqdc.datasets import ProteinFragments
dataset = ProteinFragments()

References

https://www.science.org/doi/10.1126/sciadv.adn4397

Source code in openqdc/datasets/potential/proteinfragments.py

class ProteinFragments(BaseDataset):
    """
    ProteinFragments is a dataset constructed from a subset of the
    data of the referenced work. The data was generated with a
    top-down and a bottom-up approach:

    Top-down:
        Fragments are generated by cutting out a spherical
        region around an atom (including solvent molecules)
        and saturating all dangling bonds.
        Sampling was done with molecular dynamics (MD) using a
        conventional force field (FF) at room temperature.

    Bottom-up:
        Fragments are generated by constructing chemical graphs
        of one to eight nonhydrogen atoms.
        Sampling of multiple conformers per fragment was done with
        MD simulations at high temperatures or normal mode sampling.


    Usage:
    ```python
    from openqdc.datasets import ProteinFragments
    dataset = ProteinFragments()
    ```

    References:
        https://www.science.org/doi/10.1126/sciadv.adn4397
    """

    __name__ = "proteinfragments"
    # PBE0/def2-TZVPP+MBD
    __energy_methods__ = [
        PotentialMethod.PBE0_MBD_DEF2_TZVPP,
    ]

    energy_target_names = [
        "PBE0+MBD/def2-TZVPP",
    ]

    __energy_unit__ = "ev"
    __distance_unit__ = "ang"
    __forces_unit__ = "ev/ang"
    __links__ = {
        f"{name}.db": f"https://zenodo.org/records/10720941/files/{name}.db?download=1"
        for name in ["general_protein_fragments"]
    }

    @property
    def root(self):
        return p_join(get_local_cache(), "proteinfragments")

    @property
    def config(self):
        assert len(self.__links__) > 0, "No links provided for fetching"
        return dict(dataset_name="proteinfragments", links=self.__links__)

    @property
    def preprocess_path(self):
        path = p_join(self.root, "preprocessed", self.__name__)
        os.makedirs(path, exist_ok=True)
        return path

    def read_raw_entries(self):
        samples = []
        for name in self.__links__:
            raw_path = p_join(self.root, name)
            samples.extend(read_db(raw_path))
        return samples
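The path properties above can be traced with a small standalone sketch. It assumes that `p_join` behaves like `os.path.join` and replaces `get_local_cache()` with a temporary directory; the helper `dataset_paths` is hypothetical and exists only to show how `root`, `preprocess_path`, and the raw file paths relate.

```python
import os
import tempfile

# Hypothetical stand-in for get_local_cache(); here a throwaway temp directory.
cache_dir = tempfile.mkdtemp()


def dataset_paths(cache, dataset_name, link_names):
    """Mirror the root / preprocess_path / raw-path logic of the class above."""
    # root is hard-coded to "proteinfragments" in the source, so subclasses share it.
    root = os.path.join(cache, "proteinfragments")
    # preprocess_path depends on the class-level __name__, so it differs per dataset.
    preprocess_path = os.path.join(root, "preprocessed", dataset_name)
    os.makedirs(preprocess_path, exist_ok=True)
    # read_raw_entries joins root with each key of __links__.
    raw_paths = [os.path.join(root, name) for name in link_names]
    return root, preprocess_path, raw_paths


root, pre, raw = dataset_paths(cache_dir, "mddataset", ["crambin.db"])
print(pre)
print(raw[0])
```

Under these assumptions, `ProteinFragments` and `MDDataset` download their raw `.db` files into the same `proteinfragments` directory but write preprocessed data to separate subdirectories named after `__name__`.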