Negative Sampling for Link Prediction

lerachel9900 · August 11, 2020, 6:18pm

I’m using a GNN for link prediction on proteins. For example, my current graph is a directed, connected graph that runs from a beginning atom all the way to an ending atom. There is only one way to traverse the graph; you can think of it as a linked list. The dataset would contain thousands of protein graphs. I expect to train on multiple disconnected graphs and then test the model on unseen graphs (nodes only; the GNN does the linking).

In this case, there are only (# nodes - 1) true edges, whereas the number of negative edges is huge. I’m reading about EdgeSampler at https://docs.dgl.ai/en/0.4.x/api/python/sampler.html but am still unsure of the best settings for my case.
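
To make the imbalance concrete, here is a minimal sketch in plain Python (not the DGL EdgeSampler API). It assumes a directed chain with nodes numbered 0 to n - 1, and the helper names `chain_edges` and `sample_negatives` are hypothetical. For each true edge it draws `k` corrupted destinations uniformly at random, which is only one of several possible negative sampling strategies.

```python
import random

def chain_edges(num_nodes):
    """True (positive) edges of a directed chain 0 -> 1 -> ... -> num_nodes - 1."""
    return [(i, i + 1) for i in range(num_nodes - 1)]

def sample_negatives(num_nodes, positives, k, seed=0):
    """For each positive (u, v), draw up to k destinations w where (u, w) is not a true edge."""
    rng = random.Random(seed)
    positive_set = set(positives)
    negatives = []
    for u, _ in positives:
        # Every node except u itself and u's true successor is a valid negative destination.
        candidates = [w for w in range(num_nodes)
                      if w != u and (u, w) not in positive_set]
        picked = rng.sample(candidates, min(k, len(candidates)))
        negatives.extend((u, w) for w in picked)
    return negatives

num_nodes = 6
pos = chain_edges(num_nodes)
neg = sample_negatives(num_nodes, pos, k=2)

# Ordered pairs without self-loops: n * (n - 1); all but (n - 1) of them are negatives.
num_possible = num_nodes * (num_nodes - 1)
num_all_negatives = num_possible - len(pos)
print(len(pos), num_all_negatives, len(neg))
```

With n nodes there are (n - 1) positives and (n - 1)^2 negatives among ordered pairs, so the ratio grows linearly with chain length; sampling a fixed `k` per positive keeps each training batch balanced regardless of protein size.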

Does anyone have any suggestions for link prediction, especially strategies for negative sampling?

Thank you!

mufeili · August 13, 2020, 7:22pm

In terms of graph topology, you essentially want to predict links for chains. I don’t think GNNs will work well for chains, as there is little information in the graph structure. If I were to tackle this problem myself, I would probably start with generative models that generate chains from a random root.

lerachel9900 · August 13, 2020, 7:39pm

How about the RGCN example from DGL? I’ve been looking at the code.

My team builds a product that predicts the locations of all atoms in a protein from protein density maps with high accuracy. However, connecting all the atoms into a chain that matches the protein’s 1D sequence is difficult, and it is currently done manually (a non-ML approach). So I want to try training a GNN to connect the atoms for me.

With a generative model, can I also feed it a set of nodes (as atoms) and have it generate the links between those atoms? We also don’t know the beginning and end of the protein’s atom chain, so using a random root means the graph has to be undirected.

mufeili · August 14, 2020, 4:32am

RGCN requires multiple types of edges. Do you have that in your case? Even if you do have multiple relations, GNNs tend to be of little help for chains.

Since you have predicted locations of atoms, can we assume that only atoms that are close to each other can be connected? If so, the pairwise distances between atoms could be used as features.
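
As a rough illustration of the distance idea, the sketch below computes pairwise Euclidean distances from 3D coordinates and keeps only pairs within a cutoff as candidate edges. The function name `candidate_edges`, the toy coordinates, and the cutoff of 2.0 are all assumptions made for this example; a realistic cutoff would depend on which atoms are being linked.

```python
import math

def candidate_edges(coords, cutoff):
    """Return (i, j, distance) for every unordered pair of atoms within cutoff."""
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])  # Euclidean distance (Python 3.8+)
            if d <= cutoff:
                edges.append((i, j, d))
    return edges

# Toy coordinates (hypothetical): three atoms close together, one far away.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
edges = candidate_edges(coords, cutoff=2.0)
for i, j, d in edges:
    print(i, j, d)
```

The distance can be attached to each candidate pair as an edge feature. The same candidate set might also be used to restrict negative sampling to nearby non-bonded pairs, which tend to be harder negatives than random distant pairs, though that is a suggestion beyond what is stated in this thread.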

For generative models, you can sequentially pick, for each node, the next node to connect to it. You can refer to the sequence generation literature for ideas, e.g. GraphRNN.
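
The sequential picking step might look roughly like the sketch below. This is not GraphRNN: `grow_chain` is a hypothetical greedy loop, and the `score` argument stands in for whatever learned model would rate a candidate link. Here a simple negative-distance score is used purely as a placeholder so the example runs.

```python
import math

def grow_chain(coords, root, score):
    """Greedily extend a chain from root, always linking to the best-scoring unused node."""
    chain = [root]
    remaining = set(range(len(coords))) - {root}
    while remaining:
        current = chain[-1]
        # Pick the unused node whose link to the current chain end scores highest.
        nxt = max(remaining, key=lambda j: score(coords[current], coords[j]))
        chain.append(nxt)
        remaining.remove(nxt)
    return chain

def neg_distance(a, b):
    """Placeholder score: closer atoms score higher. A trained model would replace this."""
    return -math.dist(a, b)

# Toy coordinates (hypothetical): atoms on a line, listed out of order.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
chain = grow_chain(coords, root=0, score=neg_distance)
edges = list(zip(chain, chain[1:]))
print(chain, edges)
```

Note that this sketch grows the chain in one direction only, so it implicitly assumes the root is an end of the chain. With a random root in the middle, the chain would need to be extended from both ends.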