Understanding the minibatching dataloader for link prediction

The minibatch link prediction tutorial creates the training dataloader as follows:

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
sampler = dgl.dataloading.as_edge_prediction_sampler(
    sampler, negative_sampler=dgl.dataloading.negative_sampler.Uniform(5))
dataloader = dgl.dataloading.DataLoader(
    g, train_seeds, sampler,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=args.num_workers)
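For intuition about the negative_sampler.Uniform(5) part, here is a small pure-Python mock (not DGL's implementation; all names and values below are illustrative) of what uniform negative sampling does per minibatch: for every positive seed edge (u, v), draw 5 negative destinations uniformly at random over the node IDs.

```python
import random

def uniform_negative_sample(pos_edges, num_nodes, k=5, seed=0):
    """Mock of the idea behind dgl.dataloading.negative_sampler.Uniform(k):
    for each positive (u, v), draw k negatives (u, v') with v' uniform
    over all node IDs. Illustrative only, not the DGL code path."""
    rng = random.Random(seed)
    negatives = []
    for u, _v in pos_edges:
        for _ in range(k):
            negatives.append((u, rng.randrange(num_nodes)))
    return negatives

batch = [(0, 1), (2, 3)]  # seed (positive) edges in this minibatch
negatives = uniform_negative_sample(batch, num_nodes=10, k=5)
assert len(negatives) == 5 * len(batch)  # 5 negatives per positive edge
```

Note that plain uniform sampling may occasionally draw a true edge as a "negative"; DGL's uniform sampler has the same property.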

Proper transductive link prediction requires that test/val edges appear in neither supervision nor message passing during training, but from what I understand, the message-passing blocks generated by the above dataloader do contain the test/val edges of g?

I believe this PR is related to your issue: [Feature] Add an optional argument to always exclude given edge IDs in as_edge_prediction_sampler by BarclayII · Pull Request #4114 · dmlc/dgl · GitHub


Thank you.

What does exclude=self indicate in as_edge_prediction_sampler?
The docs say it excludes the edges in the current minibatch, and these edges are then passed to the sampler to be excluded during neighborhood sampling. But why would we want to exclude the minibatch edges we are training on, given that these minibatch edges come from train_seeds above?

From this answer, I see that exclude removes edges from message passing so that they are used only for supervision, but then which other edges end up in the message-passing set produced by the dataloader?

What does exclude=self indicate in as_edge_prediction_sampler?
The docs say it excludes the edges in the current minibatch, and these edges are then passed to the sampler to be excluded during neighborhood sampling. But why would we want to exclude the minibatch edges we are training on, given that these minibatch edges come from train_seeds above?

I think the motivation is that for the edges you want to predict in link prediction, they should not be visible in model computation. Otherwise, this is a kind of label leakage.
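A toy sketch of that leakage, using plain Python sets with illustrative edge IDs (not DGL objects): if the supervision edge stays in the message-passing graph, its endpoint is directly visible in one hop of aggregation; dropping it, as exclude=self does, removes that shortcut.

```python
# Toy illustration: if the supervision edge (0, 1) stays in the
# message-passing graph, node 0 aggregates directly from node 1,
# i.e. the model literally sees the edge it is asked to predict.
edges = {(0, 1), (1, 2), (2, 3), (0, 2)}
target = (0, 1)  # the edge whose existence we want to predict

def neighbors(u, edge_set):
    # One-hop out-neighbors of u under a given edge set.
    return {v for (s, v) in edge_set if s == u}

# With the target edge present, node 1 is visible from node 0: leakage.
assert 1 in neighbors(0, edges)

# Excluding the minibatch edge before sampling removes the shortcut.
mp_edges = edges - {target}
assert 1 not in neighbors(0, mp_edges)
```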

From this answer, I see that exclude removes edges from message passing so that they are used only for supervision, but then which other edges end up in the message-passing set produced by the dataloader?

I do not understand the question.

Yeah, so for exclude=self:

  1. it excludes all the edges sampled from train_seeds when building each batch's message-passing graphs, which avoids label leakage as you said.
  2. and these excluded edges are then used only for loss calculation when processing each batch.
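The two steps above could be sketched roughly as follows, using a hypothetical dot-product edge scorer and binary cross-entropy; the embeddings and edges are made-up illustrative values, not DGL code.

```python
import math

# Hypothetical 2-d node embeddings (illustrative values only).
emb = {0: (1.0, 0.2), 1: (0.9, 0.1), 2: (-0.5, 0.8), 3: (0.1, -0.7)}

def score(u, v):
    # Dot-product edge scorer over node embeddings.
    return sum(a * b for a, b in zip(emb[u], emb[v]))

def bce(logit, label):
    # Binary cross-entropy with a sigmoid applied to the raw score.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

pos_edges = [(0, 1)]          # the excluded minibatch (supervision) edges, label 1
neg_edges = [(0, 2), (0, 3)]  # sampled negatives, label 0

# The excluded edges enter only the loss, never the message passing.
loss = (sum(bce(score(u, v), 1) for u, v in pos_edges)
        + sum(bce(score(u, v), 0) for u, v in neg_edges)) / 3
assert loss > 0.0
```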

so, for the above dataloader code, the possible message-passing edges for that batch are all the other edges of the graph?
i.e., message-passing edges = (edges of the edge-induced subgraph for that batch's nodes, which are obtained from the sampled train_seeds eids) - (supervision edges sampled above for that minibatch from the train_seeds eids)

so, all these message-passing edges are then present in the MFGs of that batch.

while the pos_g and neg_g generated by the dataloader for that batch are edge-induced subgraphs containing only the positive and negative supervision edges, which are also exactly the edges excluded in 1 above.
Is my thinking ok?
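If that reading is right, the set relationship above can be sanity-checked with a toy set-algebra sketch (hypothetical edge tuples, not DGL objects):

```python
# Hypothetical edge sets for one minibatch (illustrative values only).
subgraph_edges = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}  # edges around the batch's nodes
supervision = {(0, 1), (2, 3)}                             # this minibatch's seed edges

# With exclude=self, the message-passing edges are everything else:
mp_edges = subgraph_edges - supervision
assert mp_edges == {(1, 2), (3, 0), (0, 2)}
assert mp_edges.isdisjoint(supervision)  # supervision edges never carry messages
```

(With a neighbor sampler the message-passing set is a sampled subset rather than literally "all other edges", but the disjointness property is the point here.)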

Are negative edges also excluded? @BarclayII

They are not excluded by default.

Thank you for the response.