Best way to do inference on test nodes

jedbl · September 25, 2020, 2:30pm

Hi,
I am building a recommender system using GNN and DGL. My graph is heterogeneous: it has 3 node types (‘user’, ‘item’, ‘sport’) and 6 relation types (user - buys - item, item - boughtby - user, user - practices - sport, etc.).

I split my dataset into training, validation and test sets. After training my model, I want to do inference on the entire test set. The test set includes a few hundred users and all the items and sports. I included all the items and sports because my metrics (mostly P@K) require computing the similarity between a given test user and all the items.

I am trying to get embeddings for all the nodes in my test set. I use a NodeDataLoader, to which I pass a dictionary containing the IDs of the few hundred test users and the IDs of all items and sports. I first create a ‘placeholder’ in which the new embeddings will be stored. Everything works well until the last batches of the NodeDataLoader. Here is what I did:

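# Everything after the sampler is forwarded to the underlying PyTorch DataLoader,
# so it is passed by keyword.
nodeloader_test = dgl.dataloading.NodeDataLoader(infer_g,
                                                 test_nids_dict,
                                                 sampler,
                                                 batch_size=batch_size,
                                                 shuffle=shuffle,
                                                 drop_last=drop_last,
                                                 num_workers=num_workers)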

y = {'user': torch.zeros(g.num_nodes('user'),out_dim), 
     'item':torch.zeros(g.num_nodes('item'),out_dim), 
     'sport':torch.zeros(g.num_nodes('sport'),out_dim)}

def inference(y, trained_model, nodeloader_test):
    # Switch to evaluation mode once, before iterating over the batches.
    trained_model.eval()
    with torch.no_grad():
        for input_nodes, output_nodes, blocks in nodeloader_test:
            input_features = blocks[0].srcdata['features']
            h = trained_model.get_repr(blocks, input_features)
            # Each bare except silently skips a node type whose assignment fails,
            # e.g. when that type is absent from h or from output_nodes.
            try:
                y['item'][output_nodes['item']] = h['item']
            except: pass
            try:
                y['user'][output_nodes['user']] = h['user']
            except: pass
            try:
                y['sport'][output_nodes['sport']] = h['sport']
            except: pass
    return y

With my try / except syntax, no errors are raised.

However, in the final batches, there was an output_nodes batch that included 4 users, but h contained no users. As a result, my final y has fewer users with actual node embeddings than there are users in my test set.

How would you recommend inferring embeddings for nodes in the test set?

If this helps, here is how I compute my test node ids and the graph on which I do inference.

eids = np.arange(g.number_of_edges('buys'))
test_eids = eids[int(len(eids) * (train_size+valid_size)):]
test_users, _ = g.find_edges(test_eids, etype=etype)
test_items = np.arange(g.number_of_nodes('item'))
test_sports = np.arange(g.number_of_nodes('sport'))
test_nids_dict = {'user':torch.unique(test_users).numpy(),'item':test_items, 'sport':test_sports}

infer_g = g.clone()
infer_g.remove_edges(np.append(valid_eids_dict['buys'], test_eids_dict['buys']), 'buys')
infer_g.remove_edges(np.append(valid_eids_dict['buys'], test_eids_dict['buys']), 'bought-by')

Thanks a lot in advance!

mufeili · September 27, 2020, 5:49pm

It sounds like you want to:

1. Update the representations of all test user nodes and all item nodes using a trained GNN.
2. Compute similarities between each test user node and all item nodes based on the node representations computed in 1).

If that’s the case, I would proceed as follows:

1. Rather than using NodeDataLoader for sampling-based computation, compute exact node representations for all test user nodes and all item nodes. Inference with sampled multi-layer blocks is very inefficient because much of the computation in the first few layers is repeated across batches. One example is the exact inference for a GraphSAGE trained with neighbor sampling. Your case is more complex, since you need to iterate over layers, node IDs, and node types. In addition, for the final layer you only need to compute representations for the test nodes.

2. Compute similarity scores by iterating over the test user nodes and all item nodes.
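
For the second step, here is a minimal sketch of the scoring and P@K computation once the embeddings are available. It is written in plain Python so it runs on its own. The dot product as the similarity measure, the helper names (dot, top_k_items, precision_at_k) and the bought dictionary of held-out purchases are all assumptions made for illustration; they are not part of DGL or of the original model.

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors (assumed measure)."""
    return sum(a * b for a, b in zip(u, v))


def top_k_items(user_emb, item_embs, k):
    """Return the IDs of the k items with the highest score for one user."""
    scores = [dot(user_emb, item_emb) for item_emb in item_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def precision_at_k(user_embs, item_embs, bought, k):
    """Mean P@K over the test users.

    user_embs: {user_id: embedding}, test users only
    item_embs: list of embeddings indexed by item ID (all items)
    bought:    {user_id: set of held-out item IDs} (hypothetical ground truth)
    """
    total = 0.0
    for uid, emb in user_embs.items():
        recommended = top_k_items(emb, item_embs, k)
        hits = sum(1 for item in recommended if item in bought[uid])
        total += hits / k
    return total / len(user_embs)


# Toy data: 2 test users, 3 items, 2-dimensional embeddings.
item_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
user_embs = {0: [2.0, 1.0], 1: [1.0, 3.0]}
bought = {0: {0, 2}, 1: {0}}

mean_p_at_2 = precision_at_k(user_embs, item_embs, bought, k=2)
```

With tensors like the y dictionary from the question, the same ranking can be obtained in one shot with torch.topk(y['user'][test_users] @ y['item'].T, k), again assuming dot-product similarity; test_users here stands for the tensor of test user IDs.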