Problem of RGCN batch sampling

Great library!
I have a question about link_prediction.py in RGCN, specifically the code below:

g, node_id, edge_type, node_norm, data, labels = \
            utils.generate_sampled_graph_and_labels(
                train_data, args.graph_batch_size, args.graph_split_size,
                num_rels, adj_list, degrees, args.negative_sample)

g, node_id, edge_type, and node_norm describe the subgraph used for message passing (the encoding step). data and labels are used to compute the loss from the embeddings produced by message passing (the decoding step and the loss on this batch).

Why are the edges in g also included in ‘data’? This means the seen edges are used to predict those same seen edges, and the loss is propagated through them. I am confused about this. Thanks very much for your help!

Hi @Lee-zix,

That’s a great question. You are probably worried about ground-truth leakage. You are right that the seen edges in the graph structure are also positive samples to be predicted, but they account for only half of the positive samples (see here). My guess is that the R-GCN paper authors wanted the model to be able to predict both seen edges and unseen positive edges. But yes, half of the positive samples are leaked.

Personally, I don’t know what the best practice is. If you want to make sure there is no leakage at all, you can remove the seen edges and use only the remaining unseen edges as positive samples. The Star-GCN paper discusses the leakage issue (section 3.4), which you may want to check out.
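
As a rough illustration of the "remove the seen edges" option, here is a minimal sketch in plain NumPy. It is not the code in utils.py: the helper name split_seen_and_heldout and the toy triplets are made up for this example, and split_size is only assumed to play the same role as args.graph_split_size (the fraction of sampled edges placed in the message-passing graph).

```python
import numpy as np

def split_seen_and_heldout(triplets, split_size, rng):
    """Hypothetical helper (not part of the DGL example).

    Splits sampled (head, relation, tail) triplets into two disjoint sets:
    edges used to build the message-passing graph ("seen"), and edges kept
    only as positive samples for the loss ("held out").
    """
    num_edges = len(triplets)
    num_seen = int(num_edges * split_size)
    perm = rng.permutation(num_edges)        # shuffle edge indices once
    seen_idx = perm[:num_seen]               # edges that go into the graph
    heldout_idx = perm[num_seen:]            # edges never shown to the encoder
    return triplets[seen_idx], triplets[heldout_idx]

rng = np.random.default_rng(0)
# Toy (head, relation, tail) triplets, all distinct.
triplets = np.array([
    [0, 0, 1], [1, 0, 2], [2, 1, 3],
    [3, 1, 0], [0, 1, 2], [1, 1, 3],
])
graph_edges, positives = split_seen_and_heldout(triplets, 0.5, rng)

# No positive sample is also an edge of the message-passing graph.
seen = {tuple(t) for t in graph_edges.tolist()}
assert all(tuple(t) not in seen for t in positives.tolist())
```

With this kind of split, the graph would be built from graph_edges only, and negative sampling and the loss would use positives only. The trade-off is that each batch then has fewer positive samples to train on.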

Thanks very much for your reply! It helps a lot!