Link prediction - how to deal with new nodes in test set

Hello DGL!

I’m working on a link prediction problem with my own graph, where nodes are firms (e.g., IBM) and edges indicate employee turnover between them, so the graph is directed by nature.

My current concern is that some nodes unseen in a training set (e.g., 2010) pop up in a test set (e.g., 2011). In other words, my link prediction for 2011 is based on the graph in 2010.

More specifically, for example, there are 200 firms (nodes) on the graph in 2010, but say 20 new firms appeared in the 2011 graph, meaning there are 220 nodes in 2011.

The problem is that the training set and test set have different numbers of nodes (200 vs. 220), which prevents me from evaluating my model on the test set. (The 20 new firms have no prior “feature” vector, since this is predictive modeling.)

So, is there any way in DGL to deal with this concern?


Hi, what you have described is called inductive learning, where you train a model on one graph but evaluate it on another. First, make sure that your model supports the inductive setting, i.e., it shouldn’t have parameters tied to the training graph (e.g., node embeddings, normalizers based on statistics of the training graph, etc.). Second, your problem is more complicated because the newly appeared nodes do not have features. One solution I can think of is to assign zero features first and then use a diffusion transform (like the newly added SIGNDiffusion) to create better features. Once that is done, you should be able to apply your model to a test graph different from the one used for training.
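To make the "zero features, then diffuse" idea concrete, here is a minimal library-agnostic sketch in NumPy. It is not DGL's SIGNDiffusion implementation. The helper names `pad_new_nodes` and `diffuse`, the toy graph, and the choice of row normalization are all assumptions made for illustration.

```python
import numpy as np

def pad_new_nodes(feats, num_nodes):
    """Append zero rows so nodes unseen in training get placeholder features."""
    num_old, dim = feats.shape
    padded = np.zeros((num_nodes, dim), dtype=feats.dtype)
    padded[:num_old] = feats
    return padded

def diffuse(adj, feats, hops=1):
    """Concatenate [X, A_hat X, A_hat^2 X, ...], with A_hat the row-normalized adjacency."""
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1.0)  # nodes without in-neighbors keep an all-zero row
    outputs = [feats]
    for _ in range(hops):
        outputs.append(norm_adj @ outputs[-1])
    return np.concatenate(outputs, axis=1)

# Toy data (hypothetical): 3 known firms with 2-d features, plus 1 new firm (node 3).
train_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# adj[i, j] = 1 means node i receives a message from node j.
test_adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],  # the new firm receives from firms 0 and 2
], dtype=float)

test_feats = pad_new_nodes(train_feats, num_nodes=4)
diffused = diffuse(test_adj, test_feats, hops=1)
```

In this toy case the new firm's own feature block stays zero, while its one-hop block becomes the average of the features of firms 0 and 2. The diffusion step has no trainable parameters, so it can be recomputed on the test graph without touching the model.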

Hello minjie,

Thanks for your tips, and I’ll surely look into those. Just a quick question: after assigning zero features to the new nodes, can we just let those zero features be aggregated with messages from neighbors by the graph convolution layers, instead of using “SIGNDiffusion”?
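For concreteness, the alternative being asked about might look like the NumPy sketch below. It is a hand-written mean-aggregation layer, not a DGL module. The function name `mean_conv`, the fixed weight matrix (a stand-in for learned parameters), and the toy graphs are assumptions for illustration only.

```python
import numpy as np

def mean_conv(adj, feats, weight):
    """One mean-aggregation layer: average over in-neighbors and self, project, apply ReLU."""
    adj_self = adj + np.eye(adj.shape[0])  # self-loop so a node's own feature is included
    norm_adj = adj_self / adj_self.sum(axis=1, keepdims=True)
    return np.maximum(norm_adj @ feats @ weight, 0.0)

# Stand-in for a learned weight; its shape depends on feature sizes, not on node count.
weight = np.array([[1.0, 0.0, 0.5, -1.0],
                   [0.0, 1.0, 0.5, 1.0]])

# adj[i, j] = 1 means node i receives a message from node j.
train_adj = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
train_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Test graph: the same 3 firms plus 1 new firm (node 3) with zero features.
test_adj = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 0],
                     [1, 0, 1, 0]], dtype=float)
test_feats = np.vstack([train_feats, np.zeros((1, 2))])

train_out = mean_conv(train_adj, train_feats, weight)  # 3 nodes
test_out = mean_conv(test_adj, test_feats, weight)     # 4 nodes, same weight
```

Here the same weight is applied to both graphs, and the new firm's output is built from its neighbors' features with its own zero row diluting the average. Whether this is good enough compared with a separate diffusion step is exactly the open question, and the sketch does not settle it.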

Thanks again!

I was a little confused about your setup. If your test nodes do not have features, then I assume your training nodes should also not have features?