Hi,
I am building a recommender system using a GNN and DGL. My graph is heterogeneous: I have 3 types of nodes ('user', 'item', 'sport') and 6 types of relations (user - buys - item, item - bought-by - user, user - practices - sport, etc.).
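For context, the graph is constructed roughly like this (the edge lists below are toy values just to illustrate the schema, and the relation names other than buys / bought-by / practices are placeholders for my remaining relations):

import dgl
import torch

# Toy edge lists only, to show the node and relation types (not my real data)
graph_data = {
    ('user', 'buys', 'item'):          (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ('item', 'bought-by', 'user'):     (torch.tensor([1, 2]), torch.tensor([0, 1])),
    ('user', 'practices', 'sport'):    (torch.tensor([0, 1]), torch.tensor([0, 0])),
    ('sport', 'practiced-by', 'user'): (torch.tensor([0, 0]), torch.tensor([0, 1])),
    # ... plus the remaining relation types
}
g = dgl.heterograph(graph_data)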
I split my dataset into training, validation, and test sets. After training my model, I want to run inference on the whole test set. The test set includes a few hundred users plus all the items and sports. I included all the items and sports because my metrics (mostly P@K) require computing the similarity between a given test user and every item.
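For reference, the P@K computation I have in mind is roughly this (user_emb and item_emb are the embeddings I am trying to produce below; k and true_items are just illustrative):

import torch

def precision_at_k(user_emb, item_emb, true_items, k=10):
    # Score one user against every item with a dot product,
    # then count how many of the top-k items the user actually bought.
    scores = item_emb @ user_emb            # shape: (num_items,)
    topk = torch.topk(scores, k).indices
    hits = len(set(topk.tolist()) & set(true_items))
    return hits / k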
I am trying to get embeddings for all the nodes in my test set. I use a NodeDataLoader with a dictionary containing the node IDs of those few hundred users and of all the items and sports. I first create a 'placeholder' tensor per node type into which the new embeddings will be written. Everything works well until the last batches of the NodeDataLoader. Here is what I did:
import dgl
import numpy as np
import torch

nodeloader_test = dgl.dataloading.NodeDataLoader(infer_g,
                                                 test_nids_dict,
                                                 sampler,
                                                 batch_size=batch_size,
                                                 shuffle=shuffle,
                                                 drop_last=drop_last,
                                                 num_workers=num_workers)
y = {'user': torch.zeros(g.num_nodes('user'), out_dim),
     'item': torch.zeros(g.num_nodes('item'), out_dim),
     'sport': torch.zeros(g.num_nodes('sport'), out_dim)}
def inference(y, trained_model, nodeloader_test):
    trained_model.eval()
    for input_nodes, output_nodes, blocks in nodeloader_test:
        with torch.no_grad():
            input_features = blocks[0].srcdata['features']
            h = trained_model.get_repr(blocks, input_features)
            # Write the embeddings of this batch into the placeholder for each
            # node type; the try/except skips node types missing from the batch.
            try:
                y['item'][output_nodes['item']] = h['item']
            except: pass
            try:
                y['user'][output_nodes['user']] = h['user']
            except: pass
            try:
                y['sport'][output_nodes['sport']] = h['sport']
            except: pass
    return y
With my try/except syntax, no errors get thrown.
However, in the final batches there was an output_nodes batch that included 4 users, while h contained no user embeddings at all. As a result, my final y has fewer users with actual node embeddings than there are users in my test set.
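A quick way to see the mismatch is something like this (a sketch reusing the loader and model from above):

trained_model.eval()
with torch.no_grad():
    for input_nodes, output_nodes, blocks in nodeloader_test:
        h = trained_model.get_repr(blocks, blocks[0].srcdata['features'])
        # Node types the loader asked embeddings for vs. the ones actually returned
        asked = {ntype: len(nids) for ntype, nids in output_nodes.items() if len(nids) > 0}
        got = {ntype: emb.shape[0] for ntype, emb in h.items()}
        if asked.keys() != got.keys():
            print('mismatch:', asked, got)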
How would you recommend inferring embeddings for nodes in the test set?
If it helps, here is how I compute my test node IDs and build the graph I run inference on.
eids = np.arange(g.number_of_edges('buys'))
test_eids = eids[int(len(eids) * (train_size + valid_size)):]
test_users, _ = g.find_edges(test_eids, etype='buys')
test_items = np.arange(g.number_of_nodes('item'))
test_sports = np.arange(g.number_of_nodes('sport'))
test_nids_dict = {'user': torch.unique(test_users).numpy(),
                  'item': test_items,
                  'sport': test_sports}

# Inference graph: remove the validation and test 'buys' edges (and their reverses)
# so the model cannot see the edges it is evaluated on.
infer_g = g.clone()
infer_g.remove_edges(np.append(valid_eids_dict['buys'], test_eids_dict['buys']), 'buys')
infer_g.remove_edges(np.append(valid_eids_dict['buys'], test_eids_dict['buys']), 'bought-by')
Thanks a lot in advance!