Question on dataloader (link prediction)

Hi

I have a dataloader for a link prediction task as follows:

        sampler = dgl.dataloading.MultiLayerFullNeighborSampler(n_layers)
        sampler = dgl.dataloading.as_edge_prediction_sampler(
            sampler, exclude='reverse_types',
            reverse_etypes={'listened': 'listened-by', 'listened-by': 'listened'},
            negative_sampler=dgl.dataloading.negative_sampler.Uniform(10))

.......

    def train_dataloader(self):
        return dgl.dataloading.DataLoader(
            self.train_graph,
            self.train_idx,
            self.sampler,
            batch_size=self.batch_size,
            #drop_last=False,
            num_workers=0
        )

    def val_dataloader(self):
        return dgl.dataloading.DataLoader(
            self.valid_graph,
            self.val_idx,
            self.sampler,
            device=self.device,
            batch_size=self.batch_size,
            num_workers=0
        )

Based on this dataloader, how should the train_graph and valid_graph look?

Are the indices I provide to the dataloader the supervision edges? In that case, should the train_graph already contain some of the edges as structure, while the rest of the train edges are the indices given to the dataloader as supervision edges?

And for validation, would I have all the train edges present in the graph, keep the validation indices hidden, and do the same?

Right now my validation loss is lower than my train loss, which makes me think there is some kind of leakage. Please let me know whether what I described is correct so I can fix my implementation!

Kind regards,
Ece

  1. The train_graph contains the overall graph structure, including all positive training edges (positive validation edges are not included).
  2. For validation, you don’t need the edge sampler shown in your code. Stochastic Training of GNN for Link Prediction — DGL 1.2 documentation provides example evaluation/inference code that may help.
  3. Could you elaborate on your setting of train_graph, valid_graph, train_idx, and valid_idx so that I can analyze the reason for potential leakage?

Hi

So I have valid_graph, which is the full graph with all the nodes and all the edges.

Then I have train_graph, which has all the nodes but none of the validation edges.

The train_idx is essentially all the edge IDs of the train_graph.
The valid_idx contains only the validation edges (the ones removed from train_graph).

I use a separate dataloader for validation because I want to compute the loss for the validation edges too, and I need the positive and negative graphs (I use a max-margin loss).
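For reference, a max-margin (hinge) loss over the scores from the positive and negative graphs can be sketched in plain PyTorch. The function name and the assumption of k negatives per positive edge are mine, chosen to match the Uniform(k) negative sampler:

```python
import torch

def max_margin_loss(pos_score, neg_score, margin=1.0):
    # Hinge loss: push each positive edge's score above its negatives'
    # scores by at least `margin`.
    # Shapes: pos_score is (E,), neg_score is (E * k,) for k negatives.
    neg_score = neg_score.view(pos_score.shape[0], -1)   # (E, k)
    diff = margin - pos_score.unsqueeze(1) + neg_score   # broadcast to (E, k)
    return torch.clamp(diff, min=0).mean()

pos = torch.tensor([2.0, 0.5])
neg = torch.tensor([0.1, 0.2, 1.0, 1.5])  # 2 negatives per positive
loss = max_margin_loss(pos, neg)          # → 0.875
```

The same function can be applied to the scores from both the training and the validation minibatches, so the two losses are directly comparable.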

I keep getting a validation loss lower than the train loss and I can’t figure out why. The edges are split as I explained above.

Please let me know if you have any clue, and whether I might be computing the validation loss incorrectly with this dataloader.

Thanks

I found this link: Training GNN with Neighbor Sampling for Node Classification — DGL 0.8.2post1 documentation

I know this is node classification and my problem is link prediction, but there the two dataloaders use the same graph (which has all the nodes) and differ only in the indices of the train and validation nodes.

I also tried this, but then the train loss increases while the validation loss decreases.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.