Bases: BaseDataset
Alchemy comprises 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.
Molecular properties are calculated using PySCF's implementation of the DFT Kohn-Sham method at the B3LYP level
with the basis set 6-31G(2df,p). The equilibrium geometry is optimized in three passes. First, OpenBabel is used
to parse the SMILES string and build the Cartesian coordinates with MMFF94 force field optimization. Second,
HF/STO-3G is used to generate the preliminary geometry. Third, for the final pass of geometry relaxation, the
B3LYP/6-31G(2df,p) model with the density fitting approximation for electron repulsion integrals is used. The
auxiliary basis cc-pVDZ-jkfit is employed in density fitting to build the Coulomb matrix and the HF exchange
matrix.
Usage:
from openqdc.datasets import Alchemy
dataset = Alchemy()
Reference:
https://arxiv.org/abs/1906.09427
https://alchemy.tencent.com/
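The class attributes in the source listing declare energies in hartree, distances in ångström, and forces in hartree/ångström. As a minimal sketch of what those units mean in practice, the snippet below converts an energy value into eV or kcal/mol using rounded standard conversion factors. The helper `convert_energy` and the constant names are hypothetical and are not part of openqdc; the library's own unit handling should be preferred where it applies.

```python
# Rounded standard conversion factors; names are illustrative, not openqdc API.
HARTREE_TO_EV = 27.211386
HARTREE_TO_KCAL_PER_MOL = 627.509474

# Multiplicative factors from hartree to each supported target unit.
_ENERGY_FACTORS = {
    "hartree": 1.0,
    "ev": HARTREE_TO_EV,
    "kcal/mol": HARTREE_TO_KCAL_PER_MOL,
}


def convert_energy(value_hartree: float, unit: str) -> float:
    """Convert an energy given in hartree to `unit` (case-insensitive)."""
    try:
        factor = _ENERGY_FACTORS[unit.lower()]
    except KeyError:
        raise ValueError(f"unsupported energy unit: {unit!r}") from None
    return value_hartree * factor


# Example: a (made-up) energy of -0.5 hartree expressed in other units.
energy_ev = convert_energy(-0.5, "eV")
energy_kcal = convert_energy(-0.5, "kcal/mol")
```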
Source code in openqdc/datasets/potential/alchemy.py
class Alchemy(BaseDataset):
    """
    Alchemy comprises 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database.
    Molecular properties are calculated using PySCF's implementation of the DFT Kohn-Sham method at the B3LYP level
    with the basis set 6-31G(2df,p). The equilibrium geometry is optimized in three passes. First, OpenBabel is used
    to parse the SMILES string and build the Cartesian coordinates with MMFF94 force field optimization. Second,
    HF/STO-3G is used to generate the preliminary geometry. Third, for the final pass of geometry relaxation, the
    B3LYP/6-31G(2df,p) model with the density fitting approximation for electron repulsion integrals is used. The
    auxiliary basis cc-pVDZ-jkfit is employed in density fitting to build the Coulomb matrix and the HF exchange
    matrix.

    Usage:
    ```python
    from openqdc.datasets import Alchemy
    dataset = Alchemy()
    ```

    Reference:
        https://arxiv.org/abs/1906.09427
        https://alchemy.tencent.com/
    """

    __name__ = "alchemy"

    __energy_methods__ = [
        PotentialMethod.WB97X_6_31G_D,  # "wb97x/6-31g(d)"
    ]

    energy_target_names = [
        "ωB97x:6-31G(d) Energy",
    ]

    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __links__ = {"alchemy.zip": "https://alchemy.tencent.com/data/alchemy-v20191129.zip"}

    def read_raw_entries(self):
        dir_path = p_join(self.root, "Alchemy-v20191129")
        full_csv = pd.read_csv(p_join(dir_path, "final_version.csv"))
        energies = full_csv["U0\n(Ha, internal energy at 0 K)"].tolist()
        atom_folder = full_csv["atom number"]
        gdb_idx = full_csv["gdb_idx"]
        idxs = full_csv.index.tolist()
        samples = []
        for i in tqdm(idxs):
            sdf_file = p_join(dir_path, f"atom_{atom_folder[i]}", f"{gdb_idx[i]}.sdf")
            energy = energies[i]
            samples.append(read_mol(sdf_file, energy))
        return samples
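`read_raw_entries` pairs each row of `final_version.csv` with an SDF file at `atom_<atom number>/<gdb_idx>.sdf` and with the `U0` energy from the same row. The sketch below reproduces only that path-and-energy pairing, using the standard-library `csv` module instead of pandas so it runs without the dataset. The function name `list_sdf_entries` and the two CSV rows are made up for illustration; only the column names and the directory layout are taken from the listing above.

```python
import csv
import io
from os.path import join as p_join

# Column header as it appears in the listing; note the embedded newline.
ENERGY_COLUMN = "U0\n(Ha, internal energy at 0 K)"


def list_sdf_entries(csv_file, dir_path):
    """Return (sdf_path, energy_in_hartree) pairs, one per CSV row."""
    entries = []
    for row in csv.DictReader(csv_file):
        sdf_file = p_join(dir_path, f"atom_{row['atom number']}", f"{row['gdb_idx']}.sdf")
        entries.append((sdf_file, float(row[ENERGY_COLUMN])))
    return entries


# Tiny stand-in for final_version.csv; the values are illustrative only.
demo_csv = io.StringIO(
    'gdb_idx,atom number,"U0\n(Ha, internal energy at 0 K)"\n'
    "1001,9,-400.5\n"
    "1002,10,-401.25\n"
)
demo_entries = list_sdf_entries(demo_csv, p_join("root", "Alchemy-v20191129"))
```

In the real method each pair is then passed to `read_mol`, which is defined elsewhere in the module and is not reproduced here.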