Link prediction questions: train-test, accuracy and prediction

ogggcar · July 22, 2021, 9:36am

I have implemented this link prediction model, but I have some doubts when fitting it to my graph: 5.3 Link Prediction — DGL 0.6.1 documentation

Why is it not divided between Training and Testing? If I wanted to, how would I apply it in the last step, which involves calculating the loss?

Let’s say I have these splits:

numero_nodes = g.num_nodes('ent')
n_train = int(numero_nodes * 0.8)
train_mask = torch.zeros(numero_nodes, dtype=torch.bool)
test_mask = torch.zeros(numero_nodes, dtype=torch.bool)
train_mask[:n_train] = True
test_mask[n_train:] = True
g.ndata['train_mask'] = train_mask
g.ndata['test_mask'] = test_mask

How can I use these splits when training and testing the model with the following code? (Is this even the right way to split train-test for link prediction?)

def compute_loss(pos_score, neg_score):
    # Margin loss
    n_edges = pos_score.shape[0]
    return (1 - pos_score.unsqueeze(1) + neg_score.view(n_edges, -1)).clamp(min=0).mean()

k = 5
model = Model(10, 20, 5, hetero_graph.etypes)
user_feats = hetero_graph.nodes['user'].data['feature']
item_feats = hetero_graph.nodes['item'].data['feature']
node_features = {'user': user_feats, 'item': item_feats}
opt = torch.optim.Adam(model.parameters())
for epoch in range(10):
    negative_graph = construct_negative_graph(hetero_graph, k, ('user', 'click', 'item'))
    pos_score, neg_score = model(hetero_graph, negative_graph, node_features, ('user', 'click', 'item'))
    loss = compute_loss(pos_score, neg_score)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(loss.item())

If I wanted to predict 2 relationship types at the same time, I understand that I would have to construct the negative graph to contain negative examples of both, but when using the model I see that I cannot pass a list like [(node_type, edge_type_1, node_type), (node_type, edge_type_2, node_type)]. Is there any way to do it?
At the moment I can only calculate loss and AUC. How can I also calculate the accuracy? Is there a predefined function somewhere?
Once the model is trained, how can I use it to predict on new graphs that don't contain those edges? Is there something like model.predict(new_graph)?

Thank you all.

VoVAllen · July 26, 2021, 7:25am

Hi,

For the first question, link prediction as shown in the guide doesn't need a training/test split. In the semi-supervised transductive node classification setting, only the labels of the test nodes are hidden during the training phase, while the topology and the nodes themselves are visible, so the topology is the same in the training and test phases. If you want to mask out some links to avoid information leakage, as in an inductive setting, you can use the edge_subgraph functions to construct a subgraph.
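
A minimal sketch of such an edge-level split, assuming a plain NumPy edge list stands in for the edges of one edge type; the helper name split_edge_ids and the toy data are hypothetical. The training edge IDs could then be handed to an edge_subgraph call to build the graph used for message passing.

```python
import numpy as np

def split_edge_ids(num_edges, train_frac=0.8, seed=0):
    """Shuffle edge IDs and split them into train and test sets (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_edges)
    n_train = int(num_edges * train_frac)
    return perm[:n_train], perm[n_train:]

# Toy edge list standing in for the (src, dst) arrays of one edge type.
src = np.array([0, 0, 1, 2, 3, 3, 4, 4, 5, 5])
dst = np.array([1, 2, 2, 3, 4, 5, 5, 0, 0, 1])

train_eids, test_eids = split_edge_ids(len(src))

# Message passing would use only the training edges; the held-out edges
# serve as positive examples at evaluation time.
train_src, train_dst = src[train_eids], dst[train_eids]
test_src, test_dst = src[test_eids], dst[test_eids]
```

Note that the split is over edges, not nodes, which is the usual unit for link prediction.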

For the second question, you can pass each edge type separately and sum the losses over the types.
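
As a rough illustration of summing per-type losses, here is a NumPy version of the margin loss from the question applied to two edge types. The edge type names and score values are made up; in the real training loop the scores would come from one model call per edge type, and the summed loss would be a tensor passed to backward().

```python
import numpy as np

def margin_loss(pos_score, neg_score):
    """Margin loss: each positive edge is compared with its k negative edges."""
    n_edges = pos_score.shape[0]
    margins = 1 - pos_score.reshape(n_edges, 1) + neg_score.reshape(n_edges, -1)
    return np.clip(margins, 0, None).mean()

# Hypothetical scores for two edge types, with k = 2 negatives per positive edge.
scores = {
    "click": (np.array([2.0, 0.5]), np.array([0.0, 1.0, 0.5, 0.5])),
    "dislike": (np.array([1.0]), np.array([1.0, -1.0])),
}

# One loss per edge type, added together into a single training objective.
total_loss = sum(margin_loss(pos, neg) for pos, neg in scores.values())
```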

For the third and fourth questions, pos_score holds the score for each edge. You can simply do something like (pos_score > 0.5).float().mean() to get the accuracy on the positive edges (the 0.5 threshold assumes the scores are probabilities, e.g. after a sigmoid). And inference is just calculating the pos_score.
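
A small sketch of that accuracy computation, written in NumPy with made-up scores. It assumes the predictor outputs raw dot-product scores (logits), so a sigmoid is applied before thresholding; if the model already outputs probabilities, the sigmoid would be skipped. It also reports the accuracy on positive examples alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores (logits) for real edges and for sampled negative edges.
pos_score = np.array([2.0, 0.3, -1.0, 1.5])
neg_score = np.array([-2.0, 0.4, -0.5, -3.0])

pos_pred = sigmoid(pos_score) > 0.5  # predicted "edge exists" for real edges
neg_pred = sigmoid(neg_score) > 0.5  # predicted "edge exists" for non-edges

pos_accuracy = pos_pred.mean()       # fraction of real edges recovered
neg_accuracy = (~neg_pred).mean()    # fraction of non-edges correctly rejected
accuracy = np.concatenate([pos_pred, ~neg_pred]).mean()  # overall accuracy
```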

ogggcar · July 26, 2021, 10:22am

Thank you so much, really helpful.

One last thing. What does that pos_score>0.5 mean?

Also, about the inference: the pos_score is a tensor, right? I don't fully understand how I could use it to predict new relations when passing a graph through the model. How could I get those predicted links as an output?

Thanks again.

VoVAllen · August 2, 2021, 8:33am

Link prediction is a binary classification problem, and the final score is a number between 0 and 1. A score >0.5 means the edge is predicted as positive.

After pos_score, neg_score = model(hetero_graph, negative_graph, node_features, ('user', 'click', 'item'))

pos_score should be a (n*n,) tensor where n is the number of nodes (this assumes the scored graph contains every node pair), and you can reshape it to (n, n) and apply >0.5 to get the adjacency matrix.
Something like (pos_score>0.5).view(n, n)
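
To make the reshape concrete, here is a tiny NumPy sketch with invented scores for n = 3. It assumes the scores are ordered row by row over all (src, dst) pairs; otherwise there is just one score per existing edge and the reshape does not apply.

```python
import numpy as np

n = 3
# Hypothetical scores for all n * n ordered node pairs, in row-major (src, dst) order.
pair_score = np.array([0.1, 0.9, 0.2,
                       0.7, 0.0, 0.6,
                       0.3, 0.4, 0.8])

adjacency = (pair_score > 0.5).reshape(n, n)  # boolean adjacency matrix
src, dst = np.nonzero(adjacency)              # predicted links as two parallel arrays
```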

ogggcar · August 3, 2021, 9:26am

Thank you so much. Really helpful.

ogggcar · August 8, 2021, 6:10pm

Sorry to bother you again. Coming back to the train-test split: could I then just use my training loss to evaluate the model? Would it be necessary to add AUC or accuracy during training? Thanks.

VoVAllen · August 9, 2021, 6:11am

You can add AUC and accuracy to select the hyperparameters. However, in the optimizing stage you cannot optimize them directly.

ogggcar · August 9, 2021, 8:04am

Thanks again. Any doc reference about the optimizing stage?

VoVAllen · August 9, 2021, 8:38am

I mean the normal loss.backward(); optimizer.step() stage. Nothing extra

ogggcar · August 15, 2021, 3:49pm

One more question. What about information leakage? If I don't mask past edges, could it be a problem? Concretely, which edges should I mask? Right now, using your accuracy formula, I get around 98% on new graphs, and I fear it is too high because of that. Does that make sense?
Thank you so much.

VoVAllen · August 16, 2021, 5:43am

It depends on your task. You can take a subgraph of the original graph as the training set and use the other part for validation. Some tasks, like node classification, can use link prediction as an auxiliary task, which doesn't need a split on the edges.

VoVAllen · August 16, 2021, 6:46am

Another point is that when you try to predict on the whole graph, most edges are negative (real edges are only a small portion of all the possible edges). Therefore accuracy may not reflect the real performance. You can try the AUC score instead.
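
A toy illustration of that gap, using invented scores: 2 real edges among 98 non-edges. The pairwise_auc helper is a hypothetical, hand-rolled stand-in for a library AUC function; it computes the probability that a random real edge outscores a random non-edge.

```python
import numpy as np

def pairwise_auc(pos_score, neg_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    diff = pos_score[:, None] - neg_score[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Hypothetical imbalanced case: 2 real edges among 98 non-edges.
pos_score = np.array([0.4, 0.05])  # neither real edge reaches the 0.5 threshold
neg_score = np.full(98, 0.1)       # all non-edges are scored low

labels = np.concatenate([np.ones(2), np.zeros(98)]).astype(bool)
preds = np.concatenate([pos_score, neg_score]) > 0.5

accuracy = (preds == labels).mean()        # high, although no real edge is found
auc = pairwise_auc(pos_score, neg_score)   # no better than random ranking
```

Here the accuracy comes out at 98% even though the model recovers none of the real edges, while the AUC is 0.5.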

ogggcar · August 17, 2021, 7:32am

Thanks again. Really helpful as usual.

And about the inference, you said it's the pos_score. But given a new graph with no links, how can I get a list of the predicted links, for example in a format like [start_nodes], [end_nodes] or something like that?

VoVAllen · August 17, 2021, 8:45am

You can create a graph with full edges (i.e. each pair of nodes has an edge between them), and then calculate the pos_score. The edges with pos_score > 0.5 should be kept as the inference result.
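
A rough sketch of that idea without building the full graph explicitly. It assumes a dot-product score between node embeddings and a sigmoid on top; the predict_links helper and the embeddings are hypothetical stand-ins for the output of the trained GNN on the new graph.

```python
import numpy as np

def predict_links(h_src, h_dst, threshold=0.5):
    """Score every (src, dst) pair by dot product and keep pairs above the threshold."""
    logits = h_src @ h_dst.T                # shape (n_src, n_dst), one score per pair
    prob = 1.0 / (1.0 + np.exp(-logits))    # map raw scores to (0, 1)
    src, dst = np.nonzero(prob > threshold)
    return src, dst, prob[src, dst]

# Hypothetical 2-d embeddings for 2 source nodes and 3 destination nodes.
h_user = np.array([[1.0, 0.0], [0.0, 1.0]])
h_item = np.array([[2.0, -1.0], [-1.0, 2.0], [-1.0, -1.0]])

start_nodes, end_nodes, scores = predict_links(h_user, h_item)
```

The result is in the [start_nodes], [end_nodes] format. For large graphs the number of pairs grows quadratically, so scoring in batches or keeping only the top-scoring pairs per node may be needed.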

ogggcar · August 17, 2021, 5:29pm

Thanks. Any code reference anywhere?
Also, is there any way of getting just the accuracy of the positive examples?

ogggcar · August 27, 2021, 5:33am

Sorry to bother, @VoVAllen

Could you please help me with the code for inference?

Thank you so much.

VoVAllen · August 27, 2021, 6:15am


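import numpy as np
import dgl

# Enumerate every ordered (src, dst) pair of node IDs
combinations = np.meshgrid(np.arange(num_nodes), np.arange(num_nodes))
src, dst = combinations[0].reshape(-1), combinations[1].reshape(-1)
g = dgl.graph((src, dst))  # A complete graph
# Note: the scores must be computed on the edges of g (all candidate pairs)
# for the mask below to line up with src and dst.
pos_score, neg_score = model(hetero_graph, negative_graph, node_features, ('user', 'click', 'item'))
# Flatten to 1-D and convert to NumPy so the mask can index src and dst
pos_idx = (pos_score > 0.5).reshape(-1).cpu().numpy()
g_new = dgl.graph((src[pos_idx], dst[pos_idx]))  # inferred graph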

This is just an example. I haven't run it.

ogggcar · September 6, 2021, 9:55am

Thanks.

I get the following error:

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

in:

g_new = dgl.graph(src[pos_idx], dst[pos_idx])

Also I have a doubt when adapting it to my code:

negativo = construct_negative_graph(grafo, 5, ('ent', 'link', 'ent'))
node_feats = grafo.ndata['Feats']
pos_score, neg_score = model(grafo, negativo, {'ent': node_feats}, ('ent', 'link', 'ent'))
pos_idx = pos_score > 0.5

How could I use this “pos_idx” without creating a new graph? My graph is a heterograph with 3 types of links, so I don't want to lose the info about the other links I'm not predicting.

Also, I see this pos_idx is a tensor with bool values. The thing is, if I use this on a new graph with no edges of this type, the pos_score is never going to be >0.5. Am I right?

Thanks again, as always.

VoVAllen · September 7, 2021, 6:37am

This is an algorithm question that I don't have an answer to. It's something you can explore, either by keeping the edges or by using the pos_score as the edge weight for further message passing on the graph.

Yes

For the error you met, this might be due to the shape of src or pos_idx; you can use tensor.reshape to give them the same shape.
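
A small NumPy reproduction of that shape mismatch, with made-up values. It assumes the scores carry a trailing dimension, i.e. shape (E, 1), which is one plausible cause of the IndexError; flattening the mask makes it match the 1-D src and dst arrays.

```python
import numpy as np

src = np.arange(4)
dst = np.arange(4)[::-1]
# Hypothetical scores with a trailing dimension, shape (4, 1).
pos_score = np.array([[0.9], [0.2], [0.7], [0.1]])

pos_idx = pos_score > 0.5        # shape (4, 1): src[pos_idx] raises IndexError
flat_idx = pos_idx.reshape(-1)   # shape (4,): matches src and dst
kept_src, kept_dst = src[flat_idx], dst[flat_idx]
```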

ogggcar · September 7, 2021, 6:51am

Thank you so much.

Mmm, I don't fully understand this. Let's say I do not have edges to keep. I want to use the model, trained on 300 heterographs with 3 types of links, to score new graphs which include just one of those 3 types of links, so I want to predict the other two types. With the pos_score, neg_score method I'm using, a graph with no links of that type will never even have a pos_score, since there aren't any positive examples. So I don't have the chance to keep those edges, since they don't exist.

Is there a way to fix this? Should I use another score predictor or another method? I'm not sure how to do it, but I may need to score every pair of nodes in this new graph and keep those with a higher score as the ones that have an edge, right?

Thanks again.

Any code help for this?