Running link prediction on disconnected nodes using `EdgeDataLoader`

ozeki · April 26, 2021, 6:32pm

I am training a link prediction model using the tutorial in the docs based on the EdgeDataLoader which is working well. As in the tutorial, the model is trained with a negative sampler.

However, now I want to apply the model to find new connections between specific nodes. This means passing pairs of nodes (u,v) \notin E to the model which are not connected in the original graph. As far as I can tell, the EdgeDataLoader can’t handle this since it selects from existing edges (or edges in a negative graph).

Does it make sense to instead use a loader which iterates on pairs of nodes or is there something I am missing?

I suppose I could build a negative graph containing (u,v) for score computation, and select edges incident to u and v to send to the EdgeDataLoader to do the message passing. However, this seems sloppy.

Thanks for the help!

BarclayII · April 27, 2021, 1:20am

There are two ways to address this.

First, EdgeDataLoader has a g_sampling option. If you give that EdgeDataLoader will only sample neighbors from g_sampling instead of g. So you can build a test graph that contains the test edges for EdgeDataLoader to iterate over, and put the training graph in g_sampling to sample neighbors.

Another way is to first train the model with the normal EdgeDataLoader (i.e. without g_sampling), and then compute the representations of all the nodes before evaluation (either using NodeDataLoader or using exact offline inference like in the GraphSAGE example). Those node representations are then used for computing the link prediction scores for all the test edges.

hockeybro12 · May 26, 2021, 6:52pm

@BarclayII how can I do this? I know how to do the first part of getting the node representations.

When doing this, can I also get probabilities, like what are the 50 most likeliest nodes that could be connected to this particular node with this given edge type (I’m using RGCN)? Or would I have to train edge classification for that (and have a 0/1 classification task that predicts if the edge type I care about is there or not)?

BarclayII · May 31, 2021, 6:32am

Say you obtained the node representations for all nodes in the matrix h and you compute a score by dot product between incident nodes, then given a source node u you can compute the score between u and all other by:

score = h @ h[u]

and you can rank the scores to obtain the most likely nodes.

hockeybro12 · June 1, 2021, 4:05am

By @ you mean matrix multiplication right?

How does this compute the score by edge type? Because I have a RGCN. And the link prediction score for one edge type between two nodes may be different than for those same two nodes with a different edge type?

@BarclayII

And what do you mean by computing the dot product score between incident nodes?

BarclayII · June 1, 2021, 5:09am

Yes.

This is just an example of computing the likelihood of edge existence; the bigger the score, the more likely the edge exist between the two nodes.

In this case an example would be using a bilinear score function: say you want to compute the score between u and v with edge type r, then y = h_u^\top W_r h_v. where W_r is learnable parameter.

hockeybro12 · June 1, 2021, 5:11am

Shouldn’t W_r already be learned when I train the graph for link prediction? How do I use it then?

@BarclayII

BarclayII · June 1, 2021, 5:27am

Typically your task will be “given u and r, predict what is the most likely v”. If your score is bilinear then you can do what is just like the above, but replace the dot product with bilinear:

score = h[u] @ W[r] @ h.t()

hockeybro12 · June 1, 2021, 5:31am

@BarclayII How do I get W[r]? I also messaged you on Slack, I can post the solution here.

hockeybro12 · June 2, 2021, 4:48am

This can be solved by modifying the ScorePredictor when training the graph to instead use a Weight matrix (W[r] mentioned above):

class HeteroScorePredictor(nn.Module):

    def __init__(self, etypes, hidden_dims):
        super().__init__()
        etypes_to_use = {}
        for x in etypes:
            etypes_to_use[x[1]] = 1
        self.W = nn.ParameterDict({etype: nn.Parameter(torch.randn(hidden_dims, hidden_dims)) for etype in etypes_to_use})

    def score(self, edges):
        _, etype, _ = edges.canonical_etype
        s = edges.src['h'][:, None, :] @ self.W[etype][None, :, :] @ edges.dst['h'][:, :, None]
        return {'score' : s}

    def forward(self, edge_subgraph, x):
        with edge_subgraph.local_scope():
            edge_subgraph.ndata['h'] = x
            for etype in edge_subgraph.canonical_etypes:
                if edge_subgraph.num_edges(etype) <= 0:
                    continue
                edge_subgraph.apply_edges(self.score, etype=etype)
            return edge_subgraph.edata['score']

    def score_for_etype(self, e_1, all_embed, etype):
        e_1 = torch.unsqueeze(e_1, dim=0)
        return e_1 @ self.W[etype][None, :, :] @ all_embed.t()```

hockeybro12 · June 2, 2021, 4:54am

@BarclayII

When computing the score for an Edge connecting Node A to B, it seems to me that we are using the representation for Node A (edges.src['h'][:, None, :]) and the representation for Node B (edges.dst['h'][:, :, None]).

However, what if I also wanted to use the representation for a Node C, that was connected to Node A (so not just Node A’s embedding to compute the score between A and B, but also C). How would I store that feature appropriately, given that I am also using EdgeDataLoader to call this HeteroScorePredictor kind of like so: And how would I use it’s feature representation to compute the score?

pos_score = self.pred(g, x)
neg_score = self.pred(neg_g, x)

BarclayII · June 7, 2021, 2:49am

This doesn’t seem immediately clear to me. For instance, what if you have multiple Cs connecting to the same A? Which one would you select?

If you are computing some triplet-based loss (e.g. predict whether A and B are connected given A and C are connected), then:

EdgeDataLoader is not the right tool for you, since it can only iterate over edges (i.e. node pairs) instead of node triplets in your case. You will have to write your own DataLoader similar to EdgeDataLoader. The idea will be similar: iterate over the triplets and obtain a list of MFGs (blocks), but instead of returning a “positive subgraph” and “negative subgraph”, you just return the triplets in their place instead, because the triplets cannot really be represented as a graph.
Instead of using apply_edges, you will need to directly compute the score by yourself. For instance, say if your triplets are (A_id, B_id, C_id) of type A, B and C, then you can get the representations, like:

A_repr = blocks[-1].dstnodes['A'].data['h']
B_repr = blocks[-1].dstnodes['B'].data['h']
C_repr = blocks[-1].dstnodes['C'].data['h']
score = self.mlp(torch.cat([A_repr, B_repr, C_repr], 1))

hockeybro12 · June 7, 2021, 7:01am

It’s based on a dataset, so the dataset would tell me which one to select.

Is there no way to just get the neighbor of the node? Or get the exact node ID of the node I care about and use it’s feature (like you did in the code example)? Because from the dataset I know which C exactly I want to use and I know it is connected to Node A.

BarclayII · June 7, 2021, 7:09am

The neighbor sampling logic is the same. The difference is that EdgeDataLoader will return subgraphs containing the edges you wish to predict, which does not make sense given that you want to predict scores on node triplets. Essentially you will need a new DataLoader that iterates your triplets and has a collate function that takes in your triplets and produces the MFGs (blocks).

PinSAGE and GATNE-T has examples that uses custom DataLoaders.

hockeybro12 · June 7, 2021, 7:11am

Oh I see, you are basically saying that I need to have Node C in the graph as well since otherwise it’s representation won’t be updated through the back propagation? So I need to modify the data loader to include Node C also?

BarclayII · June 7, 2021, 7:17am

Correct.

(Minimum 20 characters limit)

hockeybro12 · June 9, 2021, 7:24pm

@BarclayII

I don’t necessarily want triplets, I just want to use Node C and Node A’s representation to compute the score between A and B. And I know what C should be beforehand. So I just need to incorporate that into the graph. But I still want to do the same standard Link Prediction training (so I need the positive and negative graph). The positive graph would always just have Node C in there also when it has A and B. Likewise, the negative graph would have a Node of Type C when it has a Node of type A and B, and the relation between them, so that particular edge type score could be computed.

To do this, should I be extending EdgeDataLoader then? I don’t think I would need to write my own DataLoader from scratch.

system · July 9, 2021, 7:24pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.