How to create an OnDiskDataset for link prediction

I have a dataset that I have converted to a heterogeneous graph in DGL. These are the sizes of my train, test and val graphs when I save them using dgl.save_graphs:

  • Train (approx. 150 GB)
  • Test (approx. 30 GB)
  • Val (approx. 12 GB)

I would like to create an OnDiskDataset instead, as described in the documentation, so that I can perform stochastic training using GraphBolt. I am following this page:

https://docs.dgl.ai/en/2.2.x/stochastic_training/ondisk_dataset_heterograph.html

A couple of things are unclear to me. First, if I create the dataset for link prediction, does the metadata.yaml file need to list every single edge type under train_set, test_set and val_set? In the documentation, I see the following under the link prediction task:

  - name: link_prediction
    num_classes: 10
    train_set:
      - type: "user:like:item"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(lp_train_like_seeds_path)}
      - type: "user:follow:user"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(lp_train_follow_seeds_path)}

It is not clear to me whether both edge types are listed only because that example performs link prediction on both user:like:item and user:follow:user edges. To be specific, if I only want to predict links for user:like:item, do I need to include user:follow:user as well?

Secondly, in the documentation above, the negative edges for link prediction are included (via the indexes) in the validation and test sets. Is there a way to include them in the training set as well? Can I just make the seeds for user:like:item contain both positive and negative edges, so that I do not need to perform negative sampling in the pipeline? That is, something like this:

  - name: link_prediction
    num_classes: 10
    train_set:
      - type: "user:like:item"
        data:
          - name: seeds
            format: numpy
            path: {os.path.basename(lp_train_like_seeds_path)}

No, you don't have to. You only need to include the edge types you want to predict.

There is no restriction on what the training/validation/test data should look like. So you only need to give the training set the same format as the validation/test sets, and remove the sample_uniform_negative call from the training data loader definition.
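
To make the second point concrete, here is a minimal sketch of how the training arrays for user:like:item could be precomputed so that positives and negatives sit in one seeds array. It assumes the validation/test layout from the linked page is positives first, then negatives, with a labels array (1 for positive, 0 for negative) and an indexes array tying each negative back to its positive edge. The function name build_lp_seeds and its parameters are hypothetical, not part of DGL, and the negatives are drawn uniformly at random purely for illustration.

```python
import numpy as np


def build_lp_seeds(pos_edges, num_dst_nodes, num_negs, rng):
    """Return (seeds, labels, indexes): positives first, then negatives.

    Hypothetical helper, not a DGL API. Negatives reuse the source node of
    each positive edge and pick random destinations, so a "negative" may
    occasionally coincide with a real edge.
    """
    pos_edges = np.asarray(pos_edges, dtype=np.int64)
    num_pos = pos_edges.shape[0]

    # Each positive edge contributes `num_negs` negatives with the same source.
    neg_srcs = np.repeat(pos_edges[:, 0], num_negs)
    neg_dsts = rng.integers(0, num_dst_nodes, size=num_pos * num_negs)
    neg_edges = np.stack([neg_srcs, neg_dsts], axis=1)

    # Shape (num_pos * (1 + num_negs), 2): one (src, dst) pair per row.
    seeds = np.concatenate([pos_edges, neg_edges])
    # 1 marks a positive edge, 0 marks a negative edge.
    labels = np.concatenate([np.ones(num_pos), np.zeros(num_pos * num_negs)])
    # Every negative carries the index of the positive edge it belongs to.
    indexes = np.concatenate(
        [np.arange(num_pos), np.repeat(np.arange(num_pos), num_negs)]
    )
    return seeds, labels, indexes


# Tiny made-up example: three positive edges, two negatives per positive.
rng = np.random.default_rng(0)
pos = np.array([[0, 3], [1, 4], [2, 5]])
seeds, labels, indexes = build_lp_seeds(pos, num_dst_nodes=10, num_negs=2, rng=rng)
```

Each array can then be written with np.save and referenced from train_set in metadata.yaml, with labels and indexes entries next to seeds as in the validation and test sets. With the negatives already on disk, the training data loader no longer needs the sample_uniform_negative step.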