For link prediction inference, how can I score every pair of nodes from new unseen graphs with no positive edges?

ogggcar · September 10, 2021, 9:53am

Hi everyone,

I have successfully trained an R-GCN on a link prediction task following this tutorial: 5.3 Link Prediction — DGL 0.6.1 documentation. I trained my model on 300 graphs. The graphs and node embeddings are generated from natural-language texts in a corpus. The training edges are human-annotated. The graphs look like this:

Graph(num_nodes={'ent': 84},
      num_edges={('ent', 'link1', 'ent'): 67, ('ent', 'link2', 'ent'): 289, ('ent', 'link3', 'ent'): 62})

Score predictor code, as in the guide:

import torch.nn as nn
import dgl.function as fn

class HeteroDotProductPredictor(nn.Module):
    def forward(self, graph, h, etype):
        with graph.local_scope():
            graph.ndata['h'] = h['ent']  # h = node representations, keyed by node type
            graph.apply_edges(fn.u_dot_v('h', 'h', 'score'), etype=etype)
            return graph.edges[etype].data['score']

Model code:

class Model(nn.Module):
    def __init__(self, in_features, hidden_features, out_features, rel_names):
        super().__init__()
        self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
        self.pred = HeteroDotProductPredictor()
    def forward(self, g, neg_g, x, etype):
        h = self.sage(g, x)
        return self.pred(g, h, etype), self.pred(neg_g, h, etype)

My problem now is that I need to predict links in new graphs not seen during training. These new graphs are also generated from texts and contain embeddings extracted from them. They are obviously not annotated with the links I need to predict (link1 and link3). A new unseen graph would look like this:

Graphs: Graph(num_nodes={'ent': 122},
      num_edges={('ent', 'link1', 'ent'): 0, ('ent', 'link2', 'ent'): 486, ('ent', 'link3', 'ent'): 0})

Now, if I need to predict links in this new graph, which doesn’t have positive examples, what should I do? I cannot use the same method as during training, since these graphs don’t contain positive examples.

I guess I need to add something to my code/model, but I’m not sure what. My guess is that I need an “inference function” that scores every pair of nodes in the graph, so that I can take the pairs with a score above some threshold as the ones that should be connected by an edge. My problem is that I don’t know how to use the trained model for this purpose. Should I change my score predictor? No, right? Should I add an “inference function” to my model (maybe inside my Model class) that applies the score predictor to every pair of nodes, so I can take those above a threshold? How could I do this?

The output I’m looking for is the classic pair of src and dst node lists for a given edge type, something like: tensor([1,2,3,…]), tensor([2,3,1,…])

I really need to figure this out since it is the last step of my project, so any help would be really appreciated.

Thank you all so much.

mufeili · September 11, 2021, 11:40am

Based on your example, your unseen new graphs don’t have edges of edge type link1 and link3. However, they do have edges of edge type link2. I assume edges of link2 participate in message passing during training. In that case, you can still apply the trained model.

As you said, you can score each pair of nodes after the RGCN computes the node representations and take node pairs that exceed a score threshold. You will need to determine the threshold based on a held-out validation set. In other words, use a set of graphs with edges of type link1 and link3 unseen during training as the validation set to determine the threshold.
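
One possible way to pick that threshold is sketched below in plain PyTorch. It assumes you have already scored a set of candidate node pairs on the validation graphs and know which of them are annotated edges; the function name pick_threshold, the F1 criterion, and the evenly spaced candidate grid are illustrative choices, not something prescribed by DGL or the guide.

```python
import torch

def pick_threshold(scores, labels, num_candidates=101):
    """Return (threshold, f1) for the candidate threshold with the best F1.

    scores: 1-D float tensor, one score per validation node pair.
    labels: 1-D bool tensor, True where the annotated edge exists.
    """
    candidates = torch.linspace(scores.min().item(), scores.max().item(), num_candidates)
    best_threshold, best_f1 = candidates[0].item(), -1.0
    for t in candidates:
        predicted = scores > t
        tp = (predicted & labels).sum().item()    # predicted and annotated
        fp = (predicted & ~labels).sum().item()   # predicted but not annotated
        fn = (~predicted & labels).sum().item()   # annotated but missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest of tied thresholds
            best_threshold, best_f1 = t.item(), f1
    return best_threshold, best_f1

# Toy validation data: two non-edges with low scores, two edges with high scores.
val_scores = torch.tensor([0.1, 0.2, 0.8, 0.9])
val_labels = torch.tensor([False, False, True, True])
threshold, f1 = pick_threshold(val_scores, val_labels)
```

A separate threshold per edge type (one for link1, one for link3) is probably the safer choice, since the score distributions need not match.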

ogggcar · September 11, 2021, 4:49pm

Thank you so much.

Right now I wasn’t training the model on this link, so I understand I must run the training loop for all 3 links instead of just the 2 I’m predicting? If I don’t, I can’t run inference on the other two?

Could you please help me with the code to score every pair of nodes? Could it be done with an inference function?

Thank you again.

mufeili · September 12, 2021, 9:59am

Right now I wasn’t training the model on this link, so I understand I must run the training loop for all 3 links instead of just the 2 I’m predicting? If I don’t, I can’t run inference on the other two?

Since your training graphs have edges of all types, I assume you can train your model to predict edges of all types?

Could you please help me with the code to score every pair of nodes? Could it be done with an inference function?

You just need to take the dot product of pairs of node representations. See also this Stack Overflow thread.
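
As a rough illustration of that suggestion, the sketch below scores every ordered pair of nodes with a single matrix product and returns the (src, dst) index tensors of the pairs above a threshold. It uses plain PyTorch on a toy embedding matrix; the helper names score_all_pairs and edges_above_threshold are made up for this example, and h stands for the node representations your trained GNN outputs for node type 'ent'.

```python
import torch

def score_all_pairs(h):
    """Dot-product score for every ordered node pair: result[u, v] = <h[u], h[v]>."""
    return h @ h.t()

def edges_above_threshold(scores, threshold):
    """Return (src, dst) index tensors of pairs scoring above threshold, skipping u == v."""
    keep = scores > threshold
    keep &= ~torch.eye(scores.shape[0], dtype=torch.bool, device=scores.device)
    src, dst = keep.nonzero(as_tuple=True)
    return src, dst

h = torch.tensor([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # toy embeddings for 3 nodes
src, dst = edges_above_threshold(score_all_pairs(h), threshold=0.5)
print(src, dst)
```

Because the dot product is symmetric, every selected pair shows up in both directions, here as (0, 1) and (1, 0). The score matrix has num_nodes × num_nodes entries, which is small for graphs with around a hundred nodes.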

ogggcar · September 12, 2021, 4:30pm

Thanks again.

Yeah I know, but I should use my Score Predictor to calculate the dot product, since the score should be different for every type of edge and the Score Predictor uses that edge information, right? That’s my main question, I guess.

Otherwise, with a plain dot-product similarity between the feature vectors of two nodes, every pair of nodes would get the same score for every edge type, and that doesn’t make sense, right? Also, in my training set two nodes are related either by link1 or by link3, but never by both, so I absolutely need the score between two nodes to differ across edge types.

Related to that, how should I use the model to do this? Or are the embeddings the model generates all I should take from it?

Thank you so much.

mufeili · September 13, 2021, 3:39am

Yeah I know, but I should use my Score Predictor to calculate the dot product, since the score should be different for every type of edge and the Score Predictor uses that edge information, right? That’s my main question, I guess.

I guess the most straightforward solution is to apply an MLP to the final node representations to have different node representations for computing edge-type-specific predictions.

ogggcar · September 13, 2021, 6:27am

Shouldn’t this be a common task for heterographs when trying to predict new edges? I just cannot find the doc or other related posts, sorry.

I’m afraid I don’t know how to do that. Could you help me with the code, please?

Isn’t there another way to do it using just my score predictor? I was thinking of something like this: Running link prediction on disconnected nodes using `EdgeDataLoader` - #6 by BarclayII

But I don’t know how to implement it for my case.

Thank you so much.

mufeili · September 13, 2021, 8:43am

In the context of user guide 5.3, you can change HeteroDotProductPredictor to the code snippet below:

class MLP(nn.Module):
    def __init__(self, out_feats):
        self.layer = nn.Sequential(
                        nn.Linear(out_feats, out_feats),
                        nn.ReLU(),
                        nn.Linear(out_feats, out_feats)
                     )

    def forward(self, x):
        return self.layer(x)

class HeteroDotProductPredictor(nn.Module):
    def __init__(self, out_feats):
        self.etype_project = {'link1': MLP(out_feats), 'link2': MLP(out_feats), 'link3': MLP(out_feats)}

    def forward(self, graph, h, etype):
        # h contains the node representations for each node type computed from
        # the GNN defined in the previous section (Section 5.1).
        with graph.local_scope():
            graph.ndata['h'] = self.etype_project[etype](h)
            graph.apply_edges(fn.u_dot_v('h', 'h', 'score'), etype=etype)
            return graph.edges[etype].data['score']

ogggcar · September 13, 2021, 10:13am

Thank you so much.

One problem. Now when instantiating the model I get the following error from HeteroDotProductPredictor:

TypeError: __init__() missing 1 required positional argument: 'out_feats'

at:

self.pred = HeteroDotProductPredictor()

I have tried the following:

class Model(nn.Module):
    def __init__(self, in_features, hidden_features, out_features, rel_names):
        super().__init__()
        self.sage = RGCN(in_features, hidden_features, out_features, rel_names)
        self.pred = HeteroDotProductPredictor(out_feats=out_features)
    def forward(self, g, neg_g, x, etype):
        h = self.sage(g, x)
        return self.pred(g, h, etype), self.pred(neg_g, h, etype)

But I get:

AttributeError: cannot assign module before Module.__init__() call

How should I adapt my model to this?

And please, one last thing I don’t understand yet. In the end, how do I use these features and the new score predictor to compute the similarity of a pair of nodes for a given edge type? I guess all I need now is to extract those link-specific embeddings, but how? I understand that every node now has 3 different vectors, right? Or how does this new out_feats work?

mufeili · September 14, 2021, 4:09am

Sorry, you need to add super().__init__() in MLP and HeteroDotProductPredictor.
As you said, you now have 3 different embeddings per node corresponding to 3 edge types. You just need to take the dot product of pairs of node embeddings per edge type.

ogggcar · September 14, 2021, 6:57am

No need to apologize, please. Thank you so much!

I still get an error:

KeyError: ('ent', 'link1', 'ent')

in:

 graph.ndata['h'] = self.etype_project[etype](h)

Any idea why?

But how can I extract those edge specific feats?

mufeili · September 15, 2021, 5:45am

I still get an error:

KeyError: ('ent', 'link1', 'ent')

in:

graph.ndata['h'] = self.etype_project[etype](h)

Any idea why?

You can change each key 'linki' (i = 1, 2, 3) to the canonical edge type ('ent', 'linki', 'ent') in self.etype_project.

But how can I extract those edge specific feats?

As I said, you have 3 MLPs, one per edge type. After you project the node representations, you have 3 sets of node representations, one per edge type. When you take the dot product of the representations of nodes i and j corresponding to edge type k, you get a score for how likely it is that an edge of type k exists between i and j.
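
A minimal sketch of this idea in plain PyTorch follows; EdgeTypeProjector is a hypothetical helper, not a DGL class. One detail worth flagging: modules stored in a plain Python dict are not registered with the parent nn.Module, so their weights would be missing from model.parameters() and from the saved state dict. The sketch uses nn.ModuleDict instead, whose keys must be strings, so it is keyed by relation name and accepts the canonical tuple form only as a convenience.

```python
import torch
import torch.nn as nn

class EdgeTypeProjector(nn.Module):
    """One small MLP per relation name (hypothetical helper for illustration)."""
    def __init__(self, out_feats, rel_names):
        super().__init__()  # must run before any submodule is assigned
        # nn.ModuleDict registers the MLPs so their weights are trained and saved.
        self.project = nn.ModuleDict({
            rel: nn.Sequential(
                nn.Linear(out_feats, out_feats),
                nn.ReLU(),
                nn.Linear(out_feats, out_feats),
            )
            for rel in rel_names
        })

    def forward(self, h, etype):
        # Accept either 'link1' or the canonical ('ent', 'link1', 'ent') form.
        rel = etype[1] if isinstance(etype, tuple) else etype
        return self.project[rel](h)

    def score_all_pairs(self, h, etype):
        # result[i, j] = score for an edge of this type between nodes i and j.
        z = self(h, etype)
        return z @ z.t()

torch.manual_seed(0)
projector = EdgeTypeProjector(out_feats=4, rel_names=['link1', 'link2', 'link3'])
h = torch.randn(5, 4)  # stand-in for the GNN output h['ent']
scores_link1 = projector.score_all_pairs(h, 'link1')
scores_link3 = projector.score_all_pairs(h, ('ent', 'link3', 'ent'))
```

The same pair of nodes now receives a different score under link1 and link3, which is what the question about edge-type-specific scores was after.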

ogggcar · September 15, 2021, 7:22am

Hi. I got an error:

TypeError: linear(): argument 'input' (position 1) must be Tensor, not dict

but I solved it changing

graph.ndata['h'] = self.etype_project[etype](h)

for

graph.ndata['h'] = self.etype_project[etype](h['ent'])

Now it works perfectly. I made the same change in my old predictor, since my graph is a heterograph with just one node type. That makes sense, right? I got it from this post: Edge Classification with one node type - #5 by BarclayII

Sorry, but I still don’t know how to get these updated, edge-specific embeddings for a given node in an unseen graph. Remember that these new graphs don’t contain any link1 or link3 edges, so there are no positive edges as during training and nothing to compare against the negative ones. In short, I would need to:

1. pass the new unseen graph, which has no link1 or link3 edges, through the model to update its embeddings;
2. extract the new edge-specific embeddings.

Any code help, please? I don’t really know how to do it. If I could do something like this

updated_features = model(graph, etype)

everything would be solved, and then I could easily compute the similarity of every pair of nodes from the updated, edge-specific embeddings.

One more thing, and sorry for the long post, I just realized this: can dot-product similarity handle directionality? In my case, link1 must be directed and exclusive: if there is a link1 from u to v, there cannot be a link1 from v to u. But the dot product is the same in both directions, right? For link1 the score of (u, v) must differ from that of (v, u). For the other links it doesn’t matter. Can I solve this?

mufeili · September 16, 2021, 5:42am

Now it works perfectly. I made the same change in my old predictor, since my graph is a heterograph with just one node type. That makes sense, right? I got it from this post: Edge Classification with one node type - #5 by BarclayII

That sounds fine.

Any code help, please?

The code snippet you have now will also work for this case. All you need is to update the node representations with the GNN and then score pairs of nodes for each edge type. It doesn’t matter which of link1, link2, or link3 are present in the graph at test time.

can dot-product similarity handle directionality? In my case, link1 must be directed and exclusive: if there is a link1 from u to v, there cannot be a link1 from v to u. But the dot product is the same in both directions, right? For link1 the score of (u, v) must differ from that of (v, u). For the other links it doesn’t matter. Can I solve this?

You will then need a different score function. You cannot distinguish the edge direction by dot product. Perhaps @BarclayII can provide a suggestion on score function.
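
No specific score function is named in this thread, so the following is only one possible option, sketched in plain PyTorch: project each node once in its role as source and once in its role as target, and take the dot product between the two projections. The class name DirectedScorer and both layer names are invented for this example, and the scorer would have to be trained together with the GNN in place of the symmetric dot product.

```python
import torch
import torch.nn as nn

class DirectedScorer(nn.Module):
    """score(u, v) = <W_src h_u, W_dst h_v>, which generally differs from score(v, u)."""
    def __init__(self, feats):
        super().__init__()
        self.as_source = nn.Linear(feats, feats, bias=False)
        self.as_target = nn.Linear(feats, feats, bias=False)

    def forward(self, h):
        # Row index = source node u, column index = target node v.
        return self.as_source(h) @ self.as_target(h).t()

torch.manual_seed(0)
scorer = DirectedScorer(feats=4)
h = torch.randn(5, 4)  # stand-in for the GNN output h['ent']
scores = scorer(h)

# Exclusivity: keep (u, v) only if it passes the threshold AND beats the
# reverse direction, so (u, v) and (v, u) can never both be selected.
threshold = 0.0
one_way = (scores > threshold) & (scores > scores.t())
src, dst = one_way.nonzero(as_tuple=True)
```

The strict comparison against the transposed matrix also removes self-pairs, since a diagonal entry never exceeds itself.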

ogggcar · September 16, 2021, 6:01am

Thanks again, @mufeili

But what is the specific code snippet to update the node representations of new unseen graphs with the GNN model?

And before that, what is the code to extract the edge-specific embeddings I need to compute the edge-specific similarity between two nodes? Right now I only know feats_node_1 = g.ndata['feats'][1], which isn’t edge-specific.

It’s the code side of all this that I’m missing; I do understand the process.

mufeili · September 17, 2021, 3:41am

But what is the specific code snippet to update the node representations of new unseen graphs with the GNN model?

h = self.sage(g, x)

And before that, what is the code to extract the edge-specific embeddings I need to compute the edge-specific similarity between two nodes? Right now I only know feats_node_1 = g.ndata['feats'][1], which isn’t edge-specific.

self.etype_project[etype](h['ent'])

is edge-type specific.

ogggcar · September 18, 2021, 4:47pm

But this “self” code must be used in a method inside my model class, right? It is not the code I should use directly with a new graph, or is it? I mean, I cannot use h = self.sage(g, x) in, let’s say, a new Colab cell. Am I wrong?

mufeili · September 19, 2021, 9:50am

Assuming you have a trained model that you want to apply, you will need to save the learned model parameters so they can be loaded later. See this PyTorch tutorial.
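
For reference, the usual PyTorch pattern for this is to save the state dict and load it into a freshly constructed model. The sketch below uses an nn.Linear as a stand-in for the trained Model(...) from this thread, and the file name model_state.pt is arbitrary.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the trained Model(...) from the thread

# Save only the learned parameters, not the whole Python object.
path = os.path.join(tempfile.mkdtemp(), "model_state.pt")  # arbitrary file name
torch.save(model.state_dict(), path)

# Later (for example in a new notebook cell): rebuild the model with the
# same constructor arguments, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before scoring new graphs
```

The restored model must be built with the same arguments as the trained one (in_features, hidden_features, out_features, rel_names), otherwise load_state_dict reports mismatched keys or shapes.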

ogggcar · September 19, 2021, 3:12pm

Yes, I know this. What I mean is that, whether the model was just trained or loaded from disk, a self.something statement cannot be used outside the class definition, right?

I mean, let’s say I have a graph ‘g’ like the ones I described. If I wanted to pass it through the sage module, shouldn’t I do something like model.sage(g, x) instead of self.sage(g, x), which I think is only used when defining the class? Same with self.etype_project.

Maybe I’m wrong; sorry in that case.

I mean, once I have trained, saved and loaded my model, how can I apply it and the two previous snippets (or something similar) to a new graph? It’s the ‘self’ part that confuses me when I try to use it directly with a graph.

Thanks again.

mufeili · September 20, 2021, 7:13am

Yes, I know this. What I mean is that, whether the model was just trained or loaded from disk, a self.something statement cannot be used outside the class definition, right?

No, you can do

model = Model(...)
model.sage(...)
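
Putting the pieces of this thread together, an inference helper outside the class might look like the sketch below. Everything prefixed StandIn is a placeholder so the example runs without DGL: the encoder ignores the graph argument, and the projection lives at model.project, whereas in the thread's code it would be model.pred.etype_project. Only the access pattern (model.sage(...), model.project[...]) is the point here.

```python
import torch
import torch.nn as nn

class StandInEncoder(nn.Module):
    """Placeholder for RGCN: ignores the graph and returns {'ent': tensor}."""
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, g, x):
        return {'ent': self.linear(x['ent'])}

class StandInModel(nn.Module):
    """Placeholder for Model: a GNN attribute named `sage` plus per-relation heads."""
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.sage = StandInEncoder(in_feats, out_feats)
        self.project = nn.ModuleDict({'link1': nn.Linear(out_feats, out_feats)})

def predict_links(model, g, x, rel, threshold):
    """Score every node pair for relation `rel` and return (src, dst) above threshold."""
    model.eval()
    with torch.no_grad():
        h = model.sage(g, x)              # attribute access via the instance, no `self`
        z = model.project[rel](h['ent'])  # edge-type-specific embeddings
        scores = z @ z.t()
    keep = scores > threshold
    keep &= ~torch.eye(z.shape[0], dtype=torch.bool, device=z.device)  # drop u == v
    return keep.nonzero(as_tuple=True)

torch.manual_seed(0)
model = StandInModel(in_feats=8, out_feats=4)
x = {'ent': torch.randn(6, 8)}  # node features keyed by node type, as in the guide
src, dst = predict_links(model, g=None, x=x, rel='link1', threshold=0.0)
```

With the real model, g would be the unseen graph (containing only link2 edges) and x its node feature dictionary; no negative graph is needed at inference time because nothing is being compared against positives.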