How to Add a Dataset to OpenQDC¶
Do you think that OpenQDC is missing some important dataset? Do you think your dataset would be a good fit for OpenQDC? If so, you can contribute to OpenQDC by adding your dataset to the OpenQDC repository in two ways:
- Opening a PR to add a new dataset
- Request a new dataset through Google Form
OpenQDC PR Guidelines¶
Implement your dataset in the OpenQDC repository by following the guidelines below:
Dataset class¶
- The dataset class should be implemented in the
openqdc/datasets
directory. - The dataset class should inherit from the
openqdc.datasets.base.BaseDataset
class. - Add your
dataset.py
file to theopenqdc/datasets/potential
oropenqdc/datasets/interaction/
directory based on the type of energy. - Implement the following for your dataset:
- Add the metadata of the dataset:
- Docstrings for the dataset class. Docstrings should report links and references to the dataset. A small description and if possible, the sampling strategy used to generate the dataset.
__links__
: Dictionary of name and link to download the dataset.__name__
: Name of the dataset. This will create a folder with the name of the dataset in the cache directory.- The original units for the dataset
__energy_unit__
and__distance_unit__
. __force_mask__
: Boolean to indicate if the dataset has forces. Or if multiple forces are present. A list of booleans.__energy_methods__
: List of theQmMethod
methods present in the dataset.
read_raw_entries(self)
->List[Dict[str, Any]]
: Preprocess the raw dataset and return a list of dictionaries containing the data. For a better overview of the data format. Look at data storage. This data should have the following keys:atomic_inputs
: Atomic inputs of the molecule. numpy.Float32.name
: Atomic numbers of the atoms in the molecule. numpy.Object.subset
: Positions of the atoms in the molecule. numpy.Object.energies
: Energies of the molecule. numpy.Float64.n_atoms
: Number of atoms in the molecule. numpy.Int32forces
: Forces of the molecule. [Optional] numpy.Float32.
- Add the dataset import to the
openqdc/datasets/<type_of_dataset>/__init__.py
file and toopenqdc/__init__.py
.
- Add the metadata of the dataset:
Test the dataset¶
Try to run the openQDC CLI pipeline with the dataset you implemented.
Run the following command to download the dataset:
- Fetch the dataset files
openqdc fetch DATASET_NAME
- Preprocess the dataset
openqdc preprocess DATASET_NAME
- Load it on python and check if the dataset is correctly loaded.
from openqdc import DATASET_NAME ds=DATASET_NAME()
If the dataset is correctly loaded, you can open a PR to add the dataset to OpenQDC.
- Select for your PR the
dataset
label.
Our team will review your PR and provide feedback if necessary. If everything is correct, your dataset will be added to OpenQDC remote storage.
OpenQDC Google Form¶
Alternatively, you can ask the OpenQDC main development team to take care of the dataset upload for you. You can fill out the Google Form here
As the openQDC team will strive to provide a high quality curation and upload, please be patient as the team will need to review the dataset and carry out the necessary steps to ensure the dataset is uploaded correctly.