Comp6

COMP6

Bases: BaseDataset

COMP6 is a benchmark suite covering broad regions of biochemical and organic chemical space, developed for testing the ANI-1x potential. It is curated from six benchmark sets: S66x8, ANI-MD, GDB7to9, GDB10to13, DrugBank, and Tripeptides. Energies and forces for all non-equilibrium molecular conformations are calculated using the wB97x density functional with the 6-31G(d) basis set. The dataset also includes Hirshfeld charges and molecular dipoles.

Details of the benchmark sets are as follows:

S66x8: Consists of 66 dimeric systems involving hydrogen bonding, pi-pi stacking, London interactions, and mixed-influence interactions.

ANI Molecular Dynamics (ANI-MD): Forces from the ANI-1x potential are used to run 1 ns vacuum molecular dynamics simulations of 14 well-known drug molecules and 2 small proteins, with a 0.25 fs time step at 300 K using the Langevin thermostat. A random subsample of 128 frames from each 1 ns trajectory is selected, and reference DFT single-point calculations are performed to obtain energies and forces.

GDB7to9: Consists of 1500 molecules, 500 each with 7, 8, and 9 heavy atoms, subsampled from the GDB-11 dataset. The initial structures are randomly embedded into 3D space using RDKit and optimized with tight convergence criteria. Normal modes and force constants are computed using the reference DFT model. Finally, diverse normal mode sampling (DNMS) is carried out to generate non-equilibrium conformations.

GDB10to13: Consists of 3000 molecules: 500 molecules each with 10 and 11 heavy atoms are subsampled from GDB-11, and 1000 molecules each with 12 and 13 heavy atoms are subsampled from GDB-13. Non-equilibrium conformations are generated via DNMS.

Tripeptides: Consists of 248 random tripeptides. Structures are optimized in the same way as for GDB7to9.

DrugBank: Consists of 837 molecules subsampled from the original DrugBank database of real drug molecules. Structures are optimized in the same way as for GDB7to9.
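The figures quoted in the subset descriptions can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of openqdc: the variable names are hypothetical, and it assumes that every MD step is available as a candidate frame, which the description does not state.

```python
import random

# Molecules per heavy-atom count, as stated in the subset descriptions.
gdb7to9 = {7: 500, 8: 500, 9: 500}
gdb10to13 = {10: 500, 11: 500, 12: 1000, 13: 1000}

total_gdb7to9 = sum(gdb7to9.values())
total_gdb10to13 = sum(gdb10to13.values())

# ANI-MD: a 1 ns trajectory with a 0.25 fs time step.
trajectory_fs = 1_000_000  # 1 ns expressed in femtoseconds
time_step_fs = 0.25
n_steps = int(trajectory_fs / time_step_fs)

# Random subsample of 128 frames per trajectory.
# Assumption: every step is a candidate frame; the seed is arbitrary.
rng = random.Random(0)
frame_ids = sorted(rng.sample(range(n_steps), 128))

print(total_gdb7to9, total_gdb10to13, n_steps, len(frame_ids))
```

Under these assumptions, each 128-frame subsample is drawn from four million steps, so consecutive selected frames are typically far apart in simulation time.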

Usage:

from openqdc.datasets import COMP6
dataset = COMP6()

References

https://aip.scitation.org/doi/abs/10.1063/1.5023802

https://github.com/isayev/COMP6

S66x8: https://pubs.rsc.org/en/content/articlehtml/2016/cp/c6cp00688d

GDB-11: https://pubmed.ncbi.nlm.nih.gov/15674983/

GDB-13: https://pubmed.ncbi.nlm.nih.gov/19505099/

DrugBank: https://pubs.acs.org/doi/10.1021/ja902302h

Source code in openqdc/datasets/potential/comp6.py
class COMP6(BaseDataset):
    """
    COMP6 is a benchmark suite covering broad regions of biochemical and organic chemical space, developed for
    testing the ANI-1x potential. It is curated from six benchmark sets: S66x8, ANI-MD, GDB7to9, GDB10to13,
    DrugBank, and Tripeptides. Energies and forces for all non-equilibrium molecular conformations are calculated
    using the wB97x density functional with the 6-31G(d) basis set. The dataset also includes Hirshfeld charges and
    molecular dipoles.

    Details of the benchmark sets are as follows:
        S66x8: Consists of 66 dimeric systems involving hydrogen bonding, pi-pi stacking, London interactions, and
    mixed-influence interactions.\n
        ANI Molecular Dynamics (ANI-MD): Forces from the ANI-1x potential are used to run 1 ns vacuum molecular
    dynamics simulations of 14 well-known drug molecules and 2 small proteins, with a 0.25 fs time step at 300 K
    using the Langevin thermostat. A random subsample of 128 frames from each 1 ns trajectory is selected, and
    reference DFT single-point calculations are performed to obtain energies and forces.\n
        GDB7to9: Consists of 1500 molecules, 500 each with 7, 8, and 9 heavy atoms, subsampled from the GDB-11
    dataset. The initial structures are randomly embedded into 3D space using RDKit and optimized with tight
    convergence criteria. Normal modes and force constants are computed using the reference DFT model. Finally,
    diverse normal mode sampling (DNMS) is carried out to generate non-equilibrium conformations.\n
        GDB10to13: Consists of 3000 molecules: 500 molecules each with 10 and 11 heavy atoms are subsampled from
    GDB-11, and 1000 molecules each with 12 and 13 heavy atoms are subsampled from GDB-13. Non-equilibrium
    conformations are generated via DNMS.\n
        Tripeptides: Consists of 248 random tripeptides. Structures are optimized in the same way as for GDB7to9.\n
        DrugBank: Consists of 837 molecules subsampled from the original DrugBank database of real drug molecules.
    Structures are optimized in the same way as for GDB7to9.

    Usage:
    ```python
    from openqdc.datasets import COMP6
    dataset = COMP6()
    ```

    References:
        https://aip.scitation.org/doi/abs/10.1063/1.5023802\n
        https://github.com/isayev/COMP6\n
        S66x8: https://pubs.rsc.org/en/content/articlehtml/2016/cp/c6cp00688d\n
        GDB-11: https://pubmed.ncbi.nlm.nih.gov/15674983/\n
        GDB-13: https://pubmed.ncbi.nlm.nih.gov/19505099/\n
        DrugBank: https://pubs.acs.org/doi/10.1021/ja902302h
    """

    __name__ = "comp6"

    # Note: forces are stored as -grad(E)
    __energy_unit__ = "kcal/mol"
    __distance_unit__ = "ang"  # angstrom
    __forces_unit__ = "kcal/mol/ang"

    __energy_methods__ = [
        PotentialMethod.WB97X_6_31G_D,  # "wb97x/6-31g*",
        PotentialMethod.B3LYP_D3_BJ_DEF2_TZVP,  # "b3lyp-d3(bj)/def2-tzvp",
        PotentialMethod.B3LYP_DEF2_TZVP,  # "b3lyp/def2-tzvp",
        PotentialMethod.HF_DEF2_TZVP,  # "hf/def2-tzvp",
        PotentialMethod.PBE_D3_BJ_DEF2_TZVP,  # "pbe-d3(bj)/def2-tzvp",
        PotentialMethod.PBE_DEF2_TZVP,  # "pbe/def2-tzvp",
        PotentialMethod.SVWN_DEF2_TZVP,  # "svwn/def2-tzvp",
    ]

    energy_target_names = [
        "Energy",
        "B3LYP-D3M(BJ):def2-tzvp",
        "B3LYP:def2-tzvp",
        "HF:def2-tzvp",
        "PBE-D3M(BJ):def2-tzvp",
        "PBE:def2-tzvp",
        "SVWN:def2-tzvp",
    ]
    __force_mask__ = [True, False, False, False, False, False, False]

    force_target_names = [
        "Gradient",
    ]

    def __smiles_converter__(self, x):
        """Utility function to convert a string to SMILES; useful if the SMILES is
        encoded in a different format than its display format.
        """
        return "-".join(x.decode("ascii").split("_")[:-1])

    def read_raw_entries(self):
        samples = []
        for subset in ["ani_md", "drugbank", "gdb7_9", "gdb10_13", "s66x8", "tripeptides"]:
            raw_path = p_join(self.root, f"{subset}.h5.gz")
            samples += read_qc_archive_h5(raw_path, subset, self.energy_target_names, self.force_target_names)

        return samples
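The class attributes above are parallel lists: position i in `__force_mask__` says whether forces are available for the energy target at position i. The following standalone sketch copies those lists from the listing to show how they line up, and how `read_raw_entries` builds one archive path per subset. It does not import openqdc, and the `root` directory is a hypothetical placeholder.

```python
from os.path import join as p_join

# Copied from the class attributes in the listing above.
energy_target_names = [
    "Energy",
    "B3LYP-D3M(BJ):def2-tzvp",
    "B3LYP:def2-tzvp",
    "HF:def2-tzvp",
    "PBE-D3M(BJ):def2-tzvp",
    "PBE:def2-tzvp",
    "SVWN:def2-tzvp",
]
force_mask = [True, False, False, False, False, False, False]

# The two lists must stay the same length for the pairing to be meaningful.
assert len(energy_target_names) == len(force_mask)

# Energy targets that also have forces (here, only the first one).
with_forces = [name for name, has_forces in zip(energy_target_names, force_mask) if has_forces]

# One compressed HDF5 archive per subset, as in read_raw_entries.
subsets = ["ani_md", "drugbank", "gdb7_9", "gdb10_13", "s66x8", "tripeptides"]
root = "/path/to/comp6"  # hypothetical location of the raw files
raw_paths = [p_join(root, f"{subset}.h5.gz") for subset in subsets]

print(with_forces)
print(raw_paths[0])
```

Only the first energy target has a `True` mask entry, which matches the single name in `force_target_names` ("Gradient").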

__smiles_converter__(x)

Utility function to convert a string to SMILES; useful if the SMILES is encoded in a different format than its display format.

Source code in openqdc/datasets/potential/comp6.py
def __smiles_converter__(self, x):
    """Utility function to convert a string to SMILES; useful if the SMILES is
    encoded in a different format than its display format.
    """
    return "-".join(x.decode("ascii").split("_")[:-1])
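To make the string manipulation concrete, here is a standalone copy of the same expression applied to a made-up input. The byte string below is hypothetical and only illustrates the mechanics; the actual identifiers stored in the COMP6 archives may look different.

```python
def smiles_converter(x: bytes) -> str:
    """Standalone copy of the expression used in __smiles_converter__."""
    # Decode the bytes, split on underscores, drop the last field,
    # and join the remaining fields with hyphens.
    return "-".join(x.decode("ascii").split("_")[:-1])


example = b"ALA_GLY_SER_12"  # hypothetical stored identifier
converted = smiles_converter(example)
print(converted)  # ALA-GLY-SER

# An input without any underscore has a single field, which is dropped.
no_separator = smiles_converter(b"abc")
print(repr(no_separator))  # ''
```

Note that the last underscore-separated field is always discarded, so an input with no underscore yields an empty string.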