MultiLayerNeighborSampler with a fixed number of input nodes

Hi,

I was wondering whether it is possible to use MultiLayerNeighborSampler in a way that keeps the number of input nodes fixed? I believe this is what is done in the original GraphSAGE paper: “In this work, we uniformly sample a fixed-size set of neighbors, instead of using full neighborhood sets in Algorithm 1, in order to keep the computational footprint of each batch fixed.”

Currently, the number of input nodes can change between minibatches with MultiLayerNeighborSampler:

import dgl
import torch
from dgl.data import citation_graph as citegrh

data = citegrh.load_cora()
graph = data[0]
# Rebuild the graph from its adjacency matrix (keeps the structure, drops features)
adj = graph.adj(scipy_fmt='coo')
graph = dgl.graph((adj.row, adj.col)).to('cuda')

train_mask = torch.BoolTensor(data.train_mask)
# Sample 3 in-neighbours per node at each of the two layers, with replacement
sampler = dgl.dataloading.MultiLayerNeighborSampler([3, 3], replace=True)
train_nids = torch.arange(0, graph.number_of_nodes())[train_mask].to('cuda')
dataloader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler,
    batch_size=32,
    shuffle=True,
    drop_last=False,
    num_workers=0)

loader_iter = iter(dataloader)

input_nodes, output_nodes, mfgs = next(loader_iter)
print(len(input_nodes))  # 258

input_nodes, output_nodes, mfgs = next(loader_iter)
print(len(input_nodes))  # 222

Does the sampled number of input nodes vary because some nodes have fewer incoming edges than the fanout you specify?

I don’t think so: MultiLayerNeighborSampler is used here with sampling with replacement; see dgl.dataloading — DGL 0.7.2 documentation. I think the variance actually comes from overlap between the neighbourhoods of different nodes.
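Here is a quick way to see the overlap directly. This is just a sketch using dgl.sampling.sample_neighbors, the per-layer sampling primitive, rather than the exact code path the dataloader takes:

import dgl
import torch
from dgl.data import citation_graph as citegrh

graph = citegrh.load_cora()[0]

# Sample 3 in-neighbours per seed, with replacement, for 32 seed nodes
seeds = torch.arange(0, 32)
frontier = dgl.sampling.sample_neighbors(graph, seeds, 3, replace=True)
src, _ = frontier.edges()

# 32 seeds x fanout 3 = 96 sampled edges (assuming every seed has at
# least one in-neighbour), but the number of *unique* source nodes is
# smaller whenever neighbourhoods overlap or replacement draws duplicates
print(frontier.number_of_edges())  # 96
print(len(torch.unique(src)))      # < 96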

Also, I think I misunderstood the GraphSAGE paper: they use a fixed-size set of neighbours per node (i.e., the fanout). However, because the neighbourhoods may overlap, the number of input nodes per minibatch won’t necessarily be the same. Thanks to Chang Liu for pointing this out (on Slack).
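To make that concrete: the fanout only gives an upper bound on the number of input nodes. A back-of-the-envelope check for the numbers above (batch size 32, fanouts [3, 3]):

batch_size, fanouts = 32, [3, 3]

# Before deduplication, each MFG's source set holds its destination
# nodes plus up to fanout sampled neighbours per destination node
upper_bound = batch_size
for f in fanouts:
    upper_bound *= 1 + f
print(upper_bound)  # 512

The observed 258 and 222 sit well below 512 because overlapping neighbourhoods are merged into a single set of input nodes.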

