Batching test data for inference

Hi,
I am trying to build a recommender system using GNN and DGL. I have a graph with 3 types of nodes (user, item and sport), and 6 types of edges.

I train my model without problems using EdgeDataLoader. However, when I want to do inference (i.e. compute embeddings for all nodes in my test set) with NodeDataLoader, I run into a size mismatch: the number of ‘user’ output nodes is not the same as the number of ‘user’ dst_nodes in the last block returned by the dataloader.

valid_users, _ = g.find_edges(valid_eids, etype=etype)
valid_items = np.arange(g.number_of_nodes('item'))
valid_nids = {'user':valid_users.numpy(),'item':valid_items}

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader_test = dgl.dataloading.NodeDataLoader(g, valid_nids, sampler,
                                                 batch_size=32, shuffle=True, drop_last=False, num_workers=0)

for input_nodes, output_nodes, blocks in dataloader_test:
    print(blocks[1].num_dst_nodes('user'))
    print(output_nodes['user'].shape[0])

Output:
19
20

Thus, when I try to do message reduction (e.g. with fn.mean), I get the following error:

DGLError: Expected data to have 20 rows, got 19.
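For context, here is a minimal illustration of the kind of per-block mean aggregation where DGL enforces this row count (illustrative only; the toy graph and names like g_small are made up for this sketch and are not my actual model):

import dgl
import dgl.function as fn
import torch

# A toy block: 3 source nodes feed into 2 destination nodes.
g_small = dgl.graph((torch.tensor([0, 1, 2]), torch.tensor([0, 1, 1])))
block = dgl.to_block(g_small, dst_nodes=torch.tensor([0, 1]))

# Mean-aggregate source features onto destination nodes.
block.srcdata['h'] = torch.randn(block.num_src_nodes(), 4)
block.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_agg'))
h_dst = block.dstdata['h_agg']  # one row per destination node

# Assigning a feature tensor whose row count does not match
# num_dst_nodes() (or num_src_nodes()) raises the
# "Expected data to have N rows, got M" DGLError shown above.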

Why is there a mismatch between the number of dst_nodes of the last block and the number of output nodes? Any pointers or resources would help here.

Thanks a lot in advance!

Sounds like a bug in NodeDataLoader. Could you give me an example (preferably minimal) of g and valid_nids that makes it fail? The user/item/sport features do not seem to matter.

Also, does this error always happen?

No, the error does not always happen: with a smaller batch_size (around 8–32), it usually does not occur.

Here is the example. It cannot get much more minimal than this: when I try removing any more nodes, the error disappears. It seems the error only occurs on big graphs with big batch sizes. The CSV files corresponding to the interaction matrices are available here (since I cannot upload them in this reply).

import numpy as np
import pandas as pd
import dgl

user_item_df = pd.read_csv('user_item_sample.csv')
user_item_src = user_item_df.ctm_new_id.values
user_item_dst = user_item_df.pdt_new_id.values

item_sport_df = pd.read_csv('item_sport_sample.csv')
item_sport_src = item_sport_df.pdt_new_id.values
item_sport_dst = item_sport_df.spt_new_id.values

user_sport_df = pd.read_csv('user_sport_sample.csv')
user_sport_src = user_sport_df.ctm_new_id.values
user_sport_dst = user_sport_df.spt_new_id.values

g = dgl.heterograph({
          ('user', 'buys', 'item'): list(zip(user_item_src, user_item_dst)),
          ('item', 'bought-by', 'user'): list(zip(user_item_dst, user_item_src)),
          ('item', 'utilized-for', 'sport'): list(zip(item_sport_src, item_sport_dst)),
          ('sport', 'utilizes', 'item'): list(zip(item_sport_dst, item_sport_src)),
          ('user', 'practices', 'sport'): list(zip(user_sport_src, user_sport_dst)),
          ('sport', 'practiced-by', 'user'): list(zip(user_sport_dst, user_sport_src))
})

valid_eids = np.arange(g.number_of_edges('buys')) 
etype = ('user', 'buys', 'item')
valid_users, _ = g.find_edges(valid_eids, etype=etype) # if I use np.arange(g.number_of_nodes('user')) instead, the size mismatch disappears
valid_items = np.arange(g.number_of_nodes('item'))
valid_nids = {'user':valid_users.numpy(),'item':valid_items}

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
dataloader_test = dgl.dataloading.NodeDataLoader(g, valid_nids, sampler,
                                                 batch_size=128, shuffle=True, drop_last=False, num_workers=0)

for input_nodes, output_nodes, blocks in dataloader_test:
    print(blocks[1].num_dst_nodes('user') == output_nodes['user'].shape[0])

Thanks!

It seems that your valid_users contains duplicate node IDs. You need to remove the duplicates first with torch.unique.

We also missed documenting the requirement that the seed nodes must be unique. We will add a check to enforce it.
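For reference, the fix amounts to something like this sketch (reusing g, valid_eids, etype and sampler from the example above; illustrative and not tested against the sample CSVs):

import numpy as np
import torch

# Deduplicate the seed user IDs before building valid_nids;
# NodeDataLoader expects each seed node to appear only once.
valid_users, _ = g.find_edges(valid_eids, etype=etype)
valid_users = torch.unique(valid_users)

valid_items = np.arange(g.number_of_nodes('item'))
valid_nids = {'user': valid_users.numpy(), 'item': valid_items}

dataloader_test = dgl.dataloading.NodeDataLoader(
    g, valid_nids, sampler,
    batch_size=128, shuffle=True, drop_last=False, num_workers=0)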

Thanks @BarclayII, it worked!