Train-test split for recommender systems metrics

Hi,
I am trying to build a recommender system using a GNN. As usual in ML tasks, I want to split my dataset into train, validation, and test sets.

Following the example of the WWW20 Hands-on Tutorial here, I am considering splitting based on the edge IDs.

import numpy as np

# Shuffle edge IDs, then split 80/10/10 into train/validation/test.
eids = np.random.permutation(g.number_of_edges())
train_eids = eids[:int(len(eids) * 0.8)]
valid_eids = eids[int(len(eids) * 0.8):int(len(eids) * 0.9)]
test_eids = eids[int(len(eids) * 0.9):]
# Keep only the training edges for message passing during training.
train_g = g.edge_subgraph(train_eids, preserve_nodes=True)

Then, when I train my model, it would be only on train_g:

...
loss = model(train_g, features, neg_sample_size)
... 

And when I report performance, it would be on the full graph, using only the test_eids:
acc = LPEvaluate(gconv_model, g, features, test_eids, neg_sample_size)

I have two questions regarding this setup.

  1. The ‘training signal’ in link prediction is the connectivity between two nodes, i.e. nodes connected by edges are similar, while nodes not connected by edges are dissimilar. However, when splitting the dataset in the way presented here, the ‘training signal’ is included in the graph g. The same goes for train_g: one might consider that the network is only learning to predict high scores for nodes that are already connected in the graph. Is this correct?

  2. When computing metrics for recommender systems, we try to replicate as much as possible what would happen in a real setting. However, in this case, we report accuracy using the full graph g. This full graph already includes the edges of the test set, i.e. ‘training signal’. This would mean that, in a user-item recommender system, we would already know which items a user will click on, and we could use that information to correctly predict those items. How can we ensure that the model does not have access to that information?

One could propose that inference is done on train_g directly, where there is no information about the test edges. However, this would create different settings for the training set (where the edges being scored are present in the graph) and the test set (where only ‘past’ edges are present in the graph), which might lead to poor generalization.

Thanks a lot in advance!

The ‘training signal’ in link prediction is the connectivity between two nodes, i.e. nodes connected by edges are similar, while nodes not connected by edges are dissimilar. However, when splitting the dataset in the way presented here, the ‘training signal’ is included in the graph g. The same goes for train_g: one might consider that the network is only learning to predict high scores for nodes that are already connected in the graph. Is this correct?

Yes, that’s possible. For the potential generalization issue, the following two approaches might help (a rough sketch is given after the list):

  1. Negative sampling
  2. Training based on neighbor sampling
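To make the negative-sampling point concrete, here is a minimal sketch (PyTorch; the names dot_score and max_margin_loss are illustrative, not from the tutorial) of how uniformly sampled negative edges provide the ‘dissimilar’ signal alongside the observed ‘similar’ edges:

import torch

def dot_score(h, src, dst):
    # Score a candidate edge by the dot product of its end-node embeddings.
    return (h[src] * h[dst]).sum(dim=1)

def max_margin_loss(h, pos_src, pos_dst, num_nodes, margin=1.0, k=5):
    # Positive pairs come from observed training edges; negative pairs use
    # k uniformly sampled destinations per edge, which is what tells the
    # model that most node pairs should score low.
    pos_score = dot_score(h, pos_src, pos_dst)                      # (E,)
    neg_src = pos_src.repeat_interleave(k)                          # (E*k,)
    neg_dst = torch.randint(0, num_nodes, (pos_src.shape[0] * k,))  # (E*k,)
    neg_score = dot_score(h, neg_src, neg_dst)
    # Each positive edge should beat its negatives by at least `margin`.
    return torch.clamp(neg_score - pos_score.repeat_interleave(k) + margin, min=0).mean()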

When computing metrics for recommender systems, we try to replicate as much as possible what would happen in a real setting. However, in this case, we report accuracy using the full graph g. This full graph already includes the edges of the test set, i.e. ‘training signal’. This would mean that, in a user-item recommender system, we would already know which items a user will click on, and we could use that information to correctly predict those items. How can we ensure that the model does not have access to that information?

For validation and test, you should use the graph with training edges only. More specifically, you can proceed as follows (see the sketch after the list):

  1. Update the representations of all nodes in the graph
  2. For each edge you want to score, pass the representations of its end nodes computed in 1 to some scoring function.
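Here is a minimal sketch of that two-step procedure, assuming a hypothetical gconv_model that maps a graph and input features to node representations:

import torch

@torch.no_grad()
def evaluate_edges(gconv_model, train_g, features, test_src, test_dst):
    gconv_model.eval()
    # 1. Message passing uses only the training edges, so no test-edge
    #    information leaks into the node representations.
    h = gconv_model(train_g, features)
    # 2. Score each held-out edge from the representations of its end nodes.
    return (h[test_src] * h[test_dst]).sum(dim=1)

The end nodes of the held-out edges can be looked up from their edge IDs, e.g. with g.find_edges(test_eids).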

Thanks for the answer @mufeili!

For validation and test, you should use the graph with training edges only. More specifically, you can proceed as follows:

This is what we implemented, thank you for the pointer.

We are using a max-margin loss and metrics like recall@K and precision@K. Currently, the loss decreases well on both the training and validation sets, and the metrics rise well on the training set but stay very low on the validation set. We had the intuition that the poor generalization was due to this:

different settings for the training set (where the edges being scored are present in the graph) and the test set (where only ‘past’ edges are present in the graph), which might lead to poor generalization.
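For context, the recall@K / precision@K computation we have in mind is roughly the following (a simplified, single-user sketch; the names are illustrative):

import numpy as np

def recall_precision_at_k(scores, relevant_items, k=10):
    # scores: predicted score for every item for one user (1-D array)
    # relevant_items: indices of the held-out items the user interacted with
    top_k = np.argsort(-scores)[:k]
    hits = len(set(top_k.tolist()) & set(relevant_items))
    recall = hits / max(len(relevant_items), 1)
    precision = hits / k
    return recall, precision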

Could you elaborate a bit more on the two ideas you proposed for tackling the generalization issue?

  1. Negative sampling

We are creating a positive and a negative graph with the EdgeDataLoader and a negative sampler. Is this what you had in mind?

  2. Training based on neighbor sampling

To create the blocks, we are using MultiLayerFullNeighborSampler. Is this what you meant?
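For reference, our training setup looks roughly like this (a sketch with placeholder batch size and number of negatives, following DGL’s EdgeDataLoader API):

import dgl

# Full-neighbor sampling for a 2-layer GNN; 5 uniform negatives per edge
# (both numbers are placeholders).
sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader = dgl.dataloading.EdgeDataLoader(
    train_g, train_eids, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=1024, shuffle=True, drop_last=False)

for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    # pos_graph / neg_graph hold the positive and sampled negative edges of
    # this batch; blocks carry the sampled neighborhoods for each GNN layer.
    pass  # forward pass, max-margin loss, optimizer step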

Again, thank you for your answers!

We are creating a positive and a negative graph with the EdgeDataLoader and a negative sampler. Is this what you had in mind?

Yes.

To create the blocks, we are using MultiLayerFullNeighborSampler. Is this what you meant?

Yes.

Have you tried other models specifically designed for recommender systems, e.g. GCMC and PinSAGE?

I was using a variant of PinSAGE, but I will now give the GCMC model a try. Thanks for the references!