Alchemy

Bases: BaseDataset

Alchemy comprises 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database. Molecular properties are calculated using PySCF's implementation of the Kohn-Sham DFT method at the B3LYP level with the 6-31G(2df,p) basis set. The equilibrium geometry is optimized in three passes. First, OpenBabel parses the SMILES string and builds the Cartesian coordinates with MMFF94 force-field optimization. Second, HF/STO-3G generates a preliminary geometry. Third, the final geometry relaxation uses the B3LYP/6-31G(2df,p) model with the density fitting approximation for electron repulsion integrals. The auxiliary basis cc-pVDZ-jkfit is used in density fitting to build the Coulomb matrix and the HF exchange matrix.
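The final level of theory described above can be sketched in PySCF roughly as follows. This is a minimal single-point illustration, not the pipeline used to build Alchemy: the water geometry, the `LEVEL_OF_THEORY` dictionary, and the `single_point_energy` helper are hypothetical names introduced here, and the geometry-relaxation passes are omitted.

```python
# Settings taken from the dataset description above.
LEVEL_OF_THEORY = {
    "xc": "b3lyp",
    "basis": "6-31g(2df,p)",
    "auxbasis": "cc-pvdz-jkfit",
}

# Hypothetical input geometry (water, coordinates in angstrom); not an Alchemy molecule.
WATER = "O 0.000 0.000 0.000; H 0.000 0.757 0.587; H 0.000 -0.757 0.587"


def single_point_energy(atom_spec, settings=LEVEL_OF_THEORY):
    """Return the density-fitted Kohn-Sham total energy in hartree."""
    # Imported lazily so the settings can be inspected without PySCF installed.
    from pyscf import dft, gto

    mol = gto.M(atom=atom_spec, basis=settings["basis"])
    mf = dft.RKS(mol)  # restricted Kohn-Sham, suitable for closed-shell molecules
    mf.xc = settings["xc"]
    # Density fitting approximates the electron repulsion integrals with an auxiliary basis.
    mf = mf.density_fit(auxbasis=settings["auxbasis"])
    return mf.kernel()
```

Calling `single_point_energy(WATER)` requires PySCF to be installed and returns the total energy in hartree.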

Usage:

from openqdc.datasets import Alchemy
dataset = Alchemy()

Reference

https://arxiv.org/abs/1906.09427
https://alchemy.tencent.com/

Source code in openqdc/datasets/potential/alchemy.py
class Alchemy(BaseDataset):
    """
    Alchemy comprises 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.
    Molecular properties are calculated using PySCF's implementation of the Kohn-Sham DFT method at the B3LYP level
    with the 6-31G(2df,p) basis set. The equilibrium geometry is optimized in three passes. First, OpenBabel parses
    the SMILES string and builds the Cartesian coordinates with MMFF94 force-field optimization. Second, HF/STO-3G
    generates a preliminary geometry. Third, the final geometry relaxation uses the B3LYP/6-31G(2df,p) model with
    the density fitting approximation for electron repulsion integrals. The auxiliary basis cc-pVDZ-jkfit is used
    in density fitting to build the Coulomb matrix and the HF exchange matrix.

    Usage:
    ```python
    from openqdc.datasets import Alchemy
    dataset = Alchemy()
    ```

    Reference:
        https://arxiv.org/abs/1906.09427
        https://alchemy.tencent.com/
    """

    __name__ = "alchemy"

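    # NOTE: the docstring describes B3LYP/6-31G(2df,p) data, but the method registered here is wb97x/6-31g(d).
    __energy_methods__ = [
        PotentialMethod.WB97X_6_31G_D,  # "wb97x/6-31g(d)"
    ]

    energy_target_names = [
        "ωB97x:6-31G(d) Energy",
    ]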

    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __links__ = {"alchemy.zip": "https://alchemy.tencent.com/data/alchemy-v20191129.zip"}

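    def read_raw_entries(self):
        """Read the energy and the SDF geometry of every molecule listed in final_version.csv."""
        dir_path = p_join(self.root, "Alchemy-v20191129")
        full_csv = pd.read_csv(p_join(dir_path, "final_version.csv"))
        # The column header contains a literal newline; the values are U0 in hartree.
        energies = full_csv["U0\n(Ha, internal energy at 0 K)"].tolist()
        # "atom number" selects the atom_<n> subfolder; "gdb_idx" is the SDF file stem.
        atom_folder = full_csv["atom number"]
        gdb_idx = full_csv["gdb_idx"]
        idxs = full_csv.index.tolist()
        samples = []
        for i in tqdm(idxs):
            sdf_file = p_join(dir_path, f"atom_{atom_folder[i]}", f"{gdb_idx[i]}.sdf")
            energy = energies[i]
            samples.append(read_mol(sdf_file, energy))
        return samples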
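
To illustrate how `read_raw_entries` maps CSV rows to SDF files, here is a minimal standard-library sketch. The two CSV rows and their values are made up, and `list_entries` is a hypothetical helper; it uses `csv` instead of pandas and returns only the path and energy, without reading any molecule. Note that the energy column header contains a literal newline, so it must be quoted in the file.

```python
import csv
import io
from os.path import join as p_join

# Hypothetical two-row stand-in for final_version.csv; the values are made up.
RAW_CSV = (
    'gdb_idx,atom number,"U0\n(Ha, internal energy at 0 K)"\n'
    "1001,9,-401.5\n"
    "2002,10,-440.25\n"
)

# Same column name as in read_raw_entries, including the embedded newline.
ENERGY_COLUMN = "U0\n(Ha, internal energy at 0 K)"


def list_entries(csv_text, dir_path):
    """Return (sdf_path, energy) pairs, one per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    entries = []
    for row in reader:
        # Files are laid out as <dir_path>/atom_<atom number>/<gdb_idx>.sdf
        sdf_path = p_join(dir_path, f"atom_{row['atom number']}", f"{row['gdb_idx']}.sdf")
        entries.append((sdf_path, float(row[ENERGY_COLUMN])))
    return entries


entries = list_entries(RAW_CSV, "Alchemy-v20191129")
for path, energy in entries:
    print(path, energy)
```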