Mini-batch at node level

Lgcsimoes · January 19, 2020, 2:31pm

I am learning about GCNs and DGL seems to be a very interesting framework to try multiple convolutional methods. My current objective is to run some models (such as GAT and GraphSAGE) on larger datasets (PPI and Reddit, for instance).

The PPI dataset contains multiple graphs, so mini-batching can be done at the graph level, as in https://github.com/dmlc/dgl/blob/master/examples/pytorch/gat/train_ppi.py

For Reddit in particular, I would need a node-level mini-batch implementation that splits the graph for each batch as follows:

1. Sample seed nodes for each mini-batch.
2. Sample the required neighbours of each seed node, based on the model requirements (for instance, a predefined number of sampled 2-hop neighbours).
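
For illustration, the two steps above could be sketched in plain Python as follows. This is only a minimal sketch under assumptions: the graph is a toy adjacency dict, the function names (`iter_seed_batches`, `sample_neighbors`) and the `fanouts` parameter are hypothetical, and it does per-node neighbour sampling with the standard library rather than calling any DGL sampler.

```python
import random


def iter_seed_batches(node_ids, batch_size, rng):
    """Step 1: yield shuffled mini-batches of seed node IDs."""
    ids = list(node_ids)
    rng.shuffle(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def sample_neighbors(adj, seeds, fanouts, rng):
    """Step 2: sample up to fanouts[k] neighbours per node at hop k + 1.

    Returns a list of layers: layers[0] is the seeds, and layers[k]
    holds every node needed k hops away from them (hypothetical layout).
    """
    layers = [list(seeds)]
    frontier = list(seeds)
    for fanout in fanouts:
        # Keep the frontier itself so each node can also use its own features.
        next_nodes = set(frontier)
        for node in frontier:
            neighbours = adj.get(node, [])
            k = min(fanout, len(neighbours))
            next_nodes.update(rng.sample(neighbours, k))
        frontier = sorted(next_nodes)
        layers.append(frontier)
    return layers


# Toy undirected graph as an adjacency dict (an assumption for this sketch).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
rng = random.Random(0)

batches = list(iter_seed_batches(adj.keys(), batch_size=2, rng=rng))
# Two hops, at most 2 sampled neighbours per node at each hop.
layers = sample_neighbors(adj, batches[0], fanouts=[2, 2], rng=rng)
```

A function like `sample_neighbors` could in principle be called from the `collate_fn` of a PyTorch `DataLoader` that iterates over seed node IDs, since `collate_fn` receives the list of samples that make up each batch.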

It seems to me that DGL’s LayerSampler would be enough for this task. However, I am not sure how to use the NodeFlow returned by LayerSampler in a PyTorch DataLoader for mini-batch training. Is there an example similar to what I need?

I don’t have much experience with PyTorch or DGL, so I appreciate your help on this issue!

BarclayII · January 20, 2020, 3:37am

Hi,

NodeFlow is essentially a computation dependency graph: performing message passing one block at a time, from the outermost layer inward, ultimately gives you the representations of the seed nodes.

An example of using NodeFlow for node-level minibatch training is at https://github.com/dmlc/dgl/blob/master/examples/pytorch/sampling/gcn_ns_sc.py
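
To make the block-by-block idea concrete, here is a rough sketch in plain Python. It does not use the NodeFlow API: the `blocks` layout, the `mean_aggregate` name, and the scalar features are all assumptions chosen to keep the example small, and mean aggregation stands in for whatever the model's layers actually compute.

```python
def mean_aggregate(blocks, features):
    """Propagate features one block at a time, from the outermost layer in.

    blocks:   list of dicts ordered from the outermost block to the seeds;
              each maps a destination node to the source nodes it reads from.
    features: dict mapping each outermost-layer node to a scalar feature.
    """
    h = dict(features)
    for block in blocks:
        # After this block, h only holds the block's destination nodes.
        h = {
            dst: sum(h[src] for src in srcs) / len(srcs)
            for dst, srcs in block.items()
        }
    return h


# Input features for the outermost layer (toy values).
features = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}

blocks = [
    {0: [0, 1], 1: [1, 2, 3]},  # outermost layer -> middle layer
    {0: [0, 1]},                # middle layer -> seed node 0
]

seed_repr = mean_aggregate(blocks, features)
```

After the first block, nodes 0 and 1 hold the means of their sampled inputs; after the second, only the seed node 0 remains, which is the sense in which the computation "leads to" the seed representations.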

Thanks.

Lgcsimoes · January 20, 2020, 7:30pm

Hi, thanks for your help. I will take a look at this example!

BTW, I noticed that you plan to implement some sampling improvements that would replace NodeFlow. I will keep an eye on that; it sounds like an interesting feature!