Hello! I have been playing around with train_sampling.py and the Reddit dataset, and I am trying to write inference code. I intend to run inference on the full test set to recreate the results in the original GraphSAGE paper. Looking at train_sampling.py, it appears to evaluate on all nodes of the entire Reddit graph: nid in the dataloader is set to all nodes in the graph, th.arange(g.number_of_nodes()).
def inference(self, g, x, batch_size, device, args):
    dataloader = dgl.dataloading.NodeDataLoader(
        g,
        th.arange(g.number_of_nodes()),
        sampler,
        batch_size=args.batch_size,
        shuffle=True,
        drop_last=False,
        num_workers=args.num_workers)
It then keeps only the test-set nodes for the accuracy calculation:
return compute_acc(pred[val_nid], labels[val_nid])
This makes test-set inference (~48 s) much slower than a single training epoch (~21 s).
If we strictly want to do inference/evaluate on test set nodes, shouldn’t we set val_nid to:
th.nonzero(~(test_g.ndata['train_mask'] | test_g.ndata['val_mask']), as_tuple=True)[0]
# or
th.nonzero(test_g.ndata['test_mask'], as_tuple=True)[0]
and pass val_nid as nid to the dataloader instead of th.arange(g.number_of_nodes())? This brings the inference time down to ~12.6 s, but it also reduces test accuracy from 94.5% to 86.5%. (94.5% is what train_sampling.py achieves when inference runs over all nodes of the graph rather than only the test set, and it matches the test accuracy reported by the authors of the GraphSAGE paper.) Going one step further, i.e., setting test_g to:
test_g = g.subgraph(g.ndata['test_mask'])
instead of:
test_g = g
in inductive_split() of load_graph.py, we can further bring down inference time to ~5.5 sec with accuracy of 92.78%.
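To check my understanding of what the subgraph variant changes, here is a toy sketch in plain Python. It is not DGL code: `induced_subgraph` and the 5-node adjacency list are made up for illustration, and I am assuming that g.subgraph(mask) returns the node-induced subgraph, i.e. it keeps only edges whose two endpoints are both test nodes. Unlike DGL, the sketch does not relabel node IDs.

```python
# Toy illustration in plain Python; this is not DGL code.
def induced_subgraph(adj, keep):
    """Keep only the nodes in `keep` and the edges whose endpoints are both kept."""
    keep_set = set(keep)
    return {v: [u for u in adj[v] if u in keep_set] for v in keep}


# Hypothetical 5-node undirected graph stored as an adjacency list.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2, 4], 4: [2, 3]}
test_nodes = [3, 4]  # stand-in for the nodes selected by test_mask

sub = induced_subgraph(adj, test_nodes)
print("full-graph neighbours of node 3:", adj[3])
print("subgraph neighbours of node 3:  ", sub[3])
```

If that assumption holds, each test node aggregates over fewer neighbours in the subgraph run than in the full-graph run, so the 92.78% and 94.5% figures would not be measuring the same computation.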
Question:

What is the correct way to do inference strictly on the test set? I am trying to reproduce the results of the paper, where inference on the entire test set for GraphSAGE with the mean aggregator took ~1 s with a micro-F1 score of 94.5 (which is the same thing as test accuracy in train_sampling.py). I want to see how close the DGL implementation can come to the original TF implementation, so it is critical to perform inference strictly on the test-set nodes.

Also, can you explain the drop in accuracy when I set nid to the test-set nodes, using either of the following?
nid = th.nonzero(~(test_g.ndata['train_mask'] | test_g.ndata['val_mask']), as_tuple=True)[0]
or
nid = th.nonzero(test_g.ndata['test_mask'], as_tuple=True)[0]
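One possible explanation, which I have not confirmed: I am assuming inference() works layer by layer, writing each layer's output into a zero-initialised buffer and feeding that buffer to the next layer. If so, restricting nid to the test nodes would leave the hidden states of all non-test neighbours at zero. The toy sketch below is plain Python, not DGL code; `layerwise_inference`, the 4-node graph, and the "self plus mean of neighbours" layer are all made up for illustration.

```python
# Toy illustration in plain Python; this is not DGL code.
def layerwise_inference(adj, x, seeds, num_layers):
    """Layer-wise inference that only fills in outputs for the seed nodes."""
    for _ in range(num_layers):
        y = [0.0] * len(x)  # zero-initialised output buffer for this layer
        for v in seeds:
            nbrs = adj[v]
            # Hypothetical layer: own value plus the mean of the neighbours' values.
            y[v] = x[v] + sum(x[u] for u in nbrs) / len(nbrs)
        x = y  # the next layer reads this buffer, zeros included
    return x


# Hypothetical 4-node undirected graph; node 3 plays the role of the only test node.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
features = [1.0, 2.0, 3.0, 4.0]

full = layerwise_inference(adj, features, seeds=[0, 1, 2, 3], num_layers=2)
restricted = layerwise_inference(adj, features, seeds=[3], num_layers=2)
print("node 3, all nodes as seeds:  ", full[3])
print("node 3, only node 3 as seed: ", restricted[3])
```

In this toy, node 3 gets a different second-layer output in the two runs because its neighbours' first-layer outputs were never computed in the restricted run. If the real inference() behaves the same way, that alone could account for the fall to 86.5%.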
Setup:
I trained a model (train_sampling.py) for 10 epochs in inductive mode with the following settings:
argparser.add_argument('--dataset', type=str, default='reddit')
argparser.add_argument('--aggr', type=str, default='mean')
argparser.add_argument('--num-epochs', type=int, default=10)
argparser.add_argument('--num-hidden', type=int, default=128)
argparser.add_argument('--num-layers', type=int, default=2)
argparser.add_argument('--fan-out', type=str, default='10,25')
argparser.add_argument('--batch-size', type=int, default=512)
argparser.add_argument('--log-every', type=int, default=20)
argparser.add_argument('--eval-every', type=int, default=5)
argparser.add_argument('--lr', type=float, default=0.01)
argparser.add_argument('--dropout', type=float, default=0.0)
argparser.add_argument('--num-workers', type=int, default=4)
I then saved the trained model, loaded it in the inference code, and called the evaluate function.
Thanks!