Link Prediction on Undirected Heterograph

Hi! I’m new to graph ML and GNNs, and I’m trying to perform link prediction with neighbor sampling on a bipartite, undirected transaction graph. It consists of credit card and merchant nodes, and the edges are transactions, as shown in the figure. The goal is to predict which credit cards and merchants should be connected. To make the heterograph undirected, I add the reverse edge like this:

dgl.heterograph({('card', 'transaction', 'merchant'): (card_nodes, merch_nodes), ('merchant', 'transaction-rev', 'card'): (merch_nodes, card_nodes)})

I’m trying to follow the tutorial “6.3 Training GNN for Link Prediction with Neighborhood Sampling” (DGL 0.6.1 documentation), but I have a few questions:

  1. I’m confused about whether I should predict just the transaction edge or both the transaction edge and its reverse. That is, should I pass only the ('card', 'transaction', 'merchant') edge type in train_eid_dict, or should I also pass the reverse edge type ('merchant', 'transaction-rev', 'card') to the EdgeDataLoader?

  2. Also, I need to do negative sampling since this is link prediction. I want the negative edges to be between a card and a random merchant that are not actually connected in the graph, and similarly between a merchant and a random card that are not connected. Will using dgl.dataloading.negative_sampler.Uniform accomplish this?

  3. I have also been going through the EdgeDataLoader docs and the examples there (dgl.dataloading, DGL 0.6.1 documentation), where they exclude the reverse edge in a heterograph. If I’m trying to do undirected link prediction, should I be excluding the reverse edge? Any help/guidance here would be really appreciated! Thanks!

  4. Finally, I want to do a train-validation-test split, but I’m not sure how to proceed. The DGL docs don’t seem to have an example of the split for link prediction with neighbor sampling. I suppose I could randomly sample 80% of the edges for training, 10% for validation and 10% for test, but I’m still confused about how to deal with the reverse edge for undirected link prediction. It also seems like I should be using g_sampling somehow for the validation and test sets, but I’m not sure how because there is no example in the DGL docs. Can someone please advise?

  1. How about training on one single edge type: merchant, paid, card? Just follow the example in the tutorial “5.3 Link Prediction” (DGL 0.6.1 documentation). Is there any difference between unidirectional and bidirectional in theory? Is adding the reverse edge really necessary?
  2. If both directions are required, which means we have two distinct edge types, could you train on each edge type one by one and add the scores together?

  1. My understanding is that if I only have the ('card', 'transaction', 'merchant') edge, then the card nodes won’t get any messages from merchant nodes and won’t get updated through the different layers of the R-GCN. Only the merchant nodes would receive messages in that case. I want both card and merchant nodes to receive messages. I did see the example “5.3 Link Prediction” (DGL 0.6.1 documentation), but it’s for a homogeneous graph and does full-graph training. My graph is quite large and will need neighbor sampling. I do like how they evaluate AUC on the test set in full-graph fashion. It would be nice to see how to evaluate AUC on the validation/test set when doing minibatch training.
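
On evaluating AUC with minibatches: AUC only needs a score for every positive pair and every negative pair, so one option is to collect the scores batch by batch and compute the metric once at the end. Below is a minimal pure-Python sketch of that idea, not the DGL tutorial's code. The helper name auc_from_scores and the score values are made up for illustration; in practice the scores would come from the trained model on the validation or test edges.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a random positive pair outscores
    a random negative pair (ties count as half a win)."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical per-minibatch scores: (positive scores, negative scores).
minibatches = [([0.9, 0.4], [0.1, 0.5]), ([0.8], [0.3, 0.2])]

pos_scores, neg_scores = [], []
for batch_pos, batch_neg in minibatches:
    # Accumulate scores across batches instead of averaging per-batch AUCs.
    pos_scores.extend(batch_pos)
    neg_scores.extend(batch_neg)

auc = auc_from_scores(pos_scores, neg_scores)
```

The pairwise loop is quadratic in the number of scores, so for a large test set a library routine such as sklearn.metrics.roc_auc_score on the concatenated scores and 0/1 labels is probably the better choice.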

So, in your case, you could train on a heterograph with bidirectional edges and predict the unidirectional edge (namely, one edge type) only. This way, both merchants and cards can receive messages from each other during training.
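
To make the point about message flow concrete, here is a small pure-Python sketch that counts how many messages each node would receive in one round of message passing. It does not use DGL; incoming_counts is a hypothetical helper and the three toy transactions are made up. With only the forward edge type no card appears as a destination, while adding the reverse edge type gives every card incoming messages too.

```python
def incoming_counts(edges_by_type):
    """Count incoming messages per (node type, node id) for one round.

    edges_by_type maps (src_type, edge_type, dst_type) to a list of
    (src_id, dst_id) pairs; a message flows from src to dst.
    """
    counts = {}
    for (_src_type, _etype, dst_type), pairs in edges_by_type.items():
        for _src, dst in pairs:
            key = (dst_type, dst)
            counts[key] = counts.get(key, 0) + 1
    return counts


# Toy data: three transactions between 2 cards and 2 merchants.
cards = [0, 0, 1]
merchants = [0, 1, 1]

forward_only = {
    ("card", "transaction", "merchant"): list(zip(cards, merchants)),
}
bidirectional = dict(forward_only)
bidirectional[("merchant", "transaction-rev", "card")] = list(zip(merchants, cards))

forward_counts = incoming_counts(forward_only)
both_counts = incoming_counts(bidirectional)

# Node types that receive at least one message in each setup.
forward_receivers = {ntype for ntype, _ in forward_counts}
both_receivers = {ntype for ntype, _ in both_counts}
```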

Below are replies to your questions:

  1. Predicting one edge type (namely, one direction) should be OK, as the reverse is the same. But both edge types should be passed into EdgeDataLoader so that information from both edge types can be used during training.
  2. Yes. negative_sampler can draw the negative samples, though in a large graph there is a very low probability that it samples a pair that is actually connected.
  3. Yes, reverse edges should be excluded. Just follow the examples in the dgl.dataloading page of the DGL 0.7 documentation.
  4. The dataset split is not related to neighbor sampling. g_sampling is for neighbor sampling and is not what you need here. You should split the dataset on the original unidirectional graph and then convert each split into a bidirectional graph, namely a heterograph.
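
Points 2 and 4 can be sketched in plain Python, without DGL, roughly as follows. The helper names (split_pairs, to_bidirectional, sample_negatives), the 80/10/10 fractions and the toy pairs are assumptions for illustration. The split is done on unique (card, merchant) pairs so that repeated transactions between the same pair cannot land in both training and test, which is a design choice and not something the thread prescribes. The negative sampler draws a random merchant per positive edge, which is how I understand dgl.dataloading.negative_sampler.Uniform to behave; the extra check that rejects pairs already in the graph is an addition on top of that.

```python
import random


def split_pairs(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle unique (card, merchant) pairs and split into train/val/test."""
    pairs = sorted(set(pairs))
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    n_test = int(len(pairs) * test_frac)
    val = pairs[:n_val]
    test = pairs[n_val:n_val + n_test]
    train = pairs[n_val + n_test:]
    return train, val, test


def to_bidirectional(pairs):
    """Build both edge types from one split, in the same (src, dst) layout
    as the dgl.heterograph call in the question."""
    cards = [c for c, _ in pairs]
    merchants = [m for _, m in pairs]
    return {
        ("card", "transaction", "merchant"): (cards, merchants),
        ("merchant", "transaction-rev", "card"): (merchants, cards),
    }


def sample_negatives(pos_pairs, num_merchants, known_pairs, k=1, seed=0, max_tries=100):
    """For each positive (card, merchant), draw up to k random merchants that
    are not linked to that card. A draw is skipped if max_tries is exhausted."""
    rng = random.Random(seed)
    negatives = []
    for card, _ in pos_pairs:
        for _ in range(k):
            for _attempt in range(max_tries):
                merchant = rng.randrange(num_merchants)
                if (card, merchant) not in known_pairs:
                    negatives.append((card, merchant))
                    break
    return negatives


# Toy graph: 10 cards, 10 merchants, 34 made-up transactions.
all_pairs = [(c, m) for c in range(10) for m in range(10) if (c + m) % 3 == 0]

train, val, test = split_pairs(all_pairs)
# Only training pairs (and their reverses) go into the training graph,
# so held-out edges cannot leak through either direction.
train_edges = to_bidirectional(train)

known = set(all_pairs)
val_neg = sample_negatives(val, num_merchants=10, known_pairs=known, k=2)
```

Because the reverse edges are generated from the same split, a validation or test pair never shows up in the training graph under either edge type.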