Understanding the minibatching dataloader for link prediction

The minibatch link prediction tutorial creates the training dataloader as follows:

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
sampler = dgl.dataloading.as_edge_prediction_sampler(
    sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))
dataloader = dgl.dataloading.DataLoader(
    g, train_seeds, sampler,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=args.num_workers)
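For intuition about the negative_sampler.Uniform(5) part, here is a small pure-Python mock (not DGL's implementation; all names and values below are illustrative) of what uniform negative sampling does per minibatch: for every positive seed edge (u, v), draw 5 negative destinations uniformly at random over the node IDs.

```python
import random

def uniform_negative_sample(pos_edges, num_nodes, k=5, seed=0):
    """Mock of the idea behind dgl.dataloading.negative_sampler.Uniform(k):
    for each positive (u, v), draw k negatives (u, v') with v' uniform
    over all node IDs. Illustrative only, not the DGL code path."""
    rng = random.Random(seed)
    negatives = []
    for u, _v in pos_edges:
        for _ in range(k):
            negatives.append((u, rng.randrange(num_nodes)))
    return negatives

batch = [(0, 1), (2, 3)]  # seed (positive) edges in this minibatch
negatives = uniform_negative_sample(batch, num_nodes=10, k=5)
assert len(negatives) == 5 * len(batch)  # 5 negatives per positive edge
```

Note that plain uniform sampling may occasionally draw a true edge as a "negative"; DGL's uniform sampler has the same property.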

Proper transductive link prediction requires that test/val edges appear in neither supervision nor message passing during training, but from what I understand, the message-passing blocks generated by the above dataloader do contain the test/val edges of g?

I believe this PR is related to your issue: [Feature] Add an optional argument to always exclude given edge IDs in as_edge_prediction_sampler by BarclayII · Pull Request #4114 · dmlc/dgl · GitHub


Thank you.

What does exclude=self indicate in as_edge_prediction_sampler?
The docs say it excludes the edges in the current minibatch, and these edges are then passed to the sampler to be excluded during neighborhood sampling. But why would we want to exclude the minibatch edges we are training on, given that these minibatch edges come from train_seeds above?

From this answer, I see that exclude removes edges from message passing so that they are used only for supervision, but then which other edges end up in the message-passing set produced by the dataloader?

What does exclude=self indicate in as_edge_prediction_sampler?
The docs say it excludes the edges in the current minibatch, and these edges are then passed to the sampler to be excluded during neighborhood sampling. But why would we want to exclude the minibatch edges we are training on, given that these minibatch edges come from train_seeds above?

I think the motivation is that for the edges you want to predict in link prediction, they should not be visible in model computation. Otherwise, this is a kind of label leakage.
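A toy sketch of that leakage, using plain Python sets with illustrative edge IDs (not DGL objects): if the supervision edge stays in the message-passing graph, its endpoint is directly visible in one hop of aggregation; dropping it, as exclude=self does, removes that shortcut.

```python
# Toy illustration: if the supervision edge (0, 1) stays in the
# message-passing graph, node 0 aggregates directly from node 1,
# i.e. the model literally sees the edge it is asked to predict.
edges = {(0, 1), (1, 2), (2, 3), (0, 2)}
target = (0, 1)  # the edge whose existence we want to predict

def neighbors(u, edge_set):
    # One-hop out-neighbors of u under a given edge set.
    return {v for (s, v) in edge_set if s == u}

# With the target edge present, node 1 is visible from node 0: leakage.
assert 1 in neighbors(0, edges)

# Excluding the minibatch edge before sampling removes the shortcut.
mp_edges = edges - {target}
assert 1 not in neighbors(0, mp_edges)
```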

From this answer, I see that exclude removes edges from message passing so that they are used only for supervision, but then which other edges end up in the message-passing set produced by the dataloader?

I do not understand the question.

Yeah, so for exclude=self:

  1. it excludes all the edges sampled from train_seeds when building each batch's message-passing graphs, which avoids label leakage as you said.
  2. and these excluded edges are then used only for loss calculation when processing each batch.
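The two steps above could be sketched roughly as follows, using a hypothetical dot-product edge scorer and binary cross-entropy; the embeddings and edges are made-up illustrative values, not DGL code.

```python
import math

# Hypothetical 2-d node embeddings (illustrative values only).
emb = {0: (1.0, 0.2), 1: (0.9, 0.1), 2: (-0.5, 0.8), 3: (0.1, -0.7)}

def score(u, v):
    # Dot-product edge scorer over node embeddings.
    return sum(a * b for a, b in zip(emb[u], emb[v]))

def bce(logit, label):
    # Binary cross-entropy with a sigmoid applied to the raw score.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

pos_edges = [(0, 1)]          # the excluded minibatch (supervision) edges, label 1
neg_edges = [(0, 2), (0, 3)]  # sampled negatives, label 0

# The excluded edges enter only the loss, never the message passing.
loss = (sum(bce(score(u, v), 1) for u, v in pos_edges)
        + sum(bce(score(u, v), 0) for u, v in neg_edges)) / 3
assert loss > 0.0
```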

so, for the above dataloader code, the possible message-passing edges for that batch are all the other edges of the graph?
i.e., message-passing edges = (edges of the edge-induced subgraph for that batch's nodes, which are obtained from the sampled train_seeds eids) - (supervision edges sampled above for that minibatch from the train_seeds eids)

so, all these message-passing edges are then present in the MFGs of that batch.

while the pos_g and neg_g generated by the dataloader for that batch are edge-induced subgraphs containing only the positive and negative supervision edges, which are also exactly the edges excluded in 1 above.
Is my thinking ok?
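If that reading is right, the set relationship above can be sanity-checked with a toy set-algebra sketch (hypothetical edge tuples, not DGL objects):

```python
# Hypothetical edge sets for one minibatch (illustrative values only).
subgraph_edges = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}  # edges around the batch's nodes
supervision = {(0, 1), (2, 3)}                             # this minibatch's seed edges

# With exclude=self, the message-passing edges are everything else:
mp_edges = subgraph_edges - supervision
assert mp_edges == {(1, 2), (3, 0), (0, 2)}
assert mp_edges.isdisjoint(supervision)  # supervision edges never carry messages
```

(With a neighbor sampler the message-passing set is a sampled subset rather than literally "all other edges", but the disjointness property is the point here.)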

Are negative edges also excluded? @BarclayII

They are not excluded by default.

Thank you for the response.