TMQM

Bases: BaseDataset

The tmQM dataset contains the geometries of a large transition metal-organic compound space covering a wide variety of organic ligands and 30 transition metals. It provides energy labels for 86,665 mononuclear complexes calculated at the TPSSh-D3BJ/def2-SVP DFT level of theory. Structures are first extracted from the Cambridge Structural Database and then optimized in the gas phase with the extended tight-binding GFN2-xTB method.

Usage:

from openqdc.datasets import TMQM
dataset = TMQM()

References

https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041

https://github.com/bbskjelstad/tmqm

Source code in openqdc/datasets/potential/tmqm.py
class TMQM(BaseDataset):
    """
    The tmQM dataset contains the geometries of a large transition metal-organic compound space covering a wide
    variety of organic ligands and 30 transition metals. It provides energy labels for 86,665 mononuclear complexes
    calculated at the TPSSh-D3BJ/def2-SVP DFT level of theory. Structures are first extracted from the Cambridge
    Structural Database and then optimized in the gas phase with the extended tight-binding GFN2-xTB method.

    Usage:
    ```python
    from openqdc.datasets import TMQM
    dataset = TMQM()
    ```

    References:
        https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041\n
        https://github.com/bbskjelstad/tmqm
    """

    __name__ = "tmqm"

    __energy_methods__ = [PotentialMethod.TPSSH_DEF2_TZVP]  # "tpssh/def2-tzvp"

    energy_target_names = ["TPSSh/def2TZVP level"]

    __energy_unit__ = "hartree"
    __distance_unit__ = "ang"
    __forces_unit__ = "hartree/ang"
    __links__ = {
        x: f"https://raw.githubusercontent.com/bbskjelstad/tmqm/master/data/{x}"
        for x in ["tmQM_X1.xyz.gz", "tmQM_X2.xyz.gz", "tmQM_y.csv", "Benchmark2_TPSSh_Opt.xyz"]
    }

    def read_raw_entries(self):
        # Map each CSD code to its electronic energy from the semicolon-separated label file.
        df = pd.read_csv(p_join(self.root, "tmQM_y.csv"), sep=";", usecols=["CSD_code", "Electronic_E"])
        e_map = dict(zip(df["CSD_code"], df["Electronic_E"]))
        raw_fnames = ["tmQM_X1.xyz", "tmQM_X2.xyz", "Benchmark2_TPSSh_Opt.xyz"]
        samples = []
        for fname in raw_fnames:
            # read_xyz receives the energy map along with each geometry file.
            samples += read_xyz(p_join(self.root, fname), e_map)

        return samples
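
The label lookup built at the start of `read_raw_entries` can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: it swaps `pd.read_csv` for the standard-library `csv` module so it runs without dependencies, and the CSD codes, energy values, the extra `Dispersion_E` column, and the helper name `build_energy_map` are all hypothetical placeholders rather than real tmQM data or openqdc API.

```python
import csv
import io

# Hypothetical two-row excerpt in the layout the loader expects:
# semicolon-separated, with CSD_code and Electronic_E columns.
# The codes and numbers below are made up for illustration.
SAMPLE_CSV = """CSD_code;Electronic_E;Dispersion_E
AAAAAA;-1234.5;-0.10
BBBBBB;-987.25;-0.20
"""


def build_energy_map(text):
    """Return {CSD_code: Electronic_E as float} from semicolon-separated CSV text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    # Only the two columns of interest are kept, mirroring usecols in the loader.
    return {row["CSD_code"]: float(row["Electronic_E"]) for row in reader}


e_map = build_energy_map(SAMPLE_CSV)
print(e_map["AAAAAA"])
```

Keying the energies by CSD code means each structure read from the geometry files can be matched to its label by identifier rather than by file order.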