Distributed DGL training using MPI

For a personal project, I am doing distributed node classification using torch.distributed with the MPI backend. I have two questions about how to go about this:

  1. I want to do data-parallel training on disjoint partitions of the graph, i.e., each worker node in the cluster only sees its own partition and trains a local model before syncing after each epoch. How do I partition my graph data (say OGBN-Products) onto different machines? Is there a DGL API that supports this, or is partitioning manually the way to go?

  2. If I want to validate my model at the end of each epoch, what would be the way to go about this, given that I want to validate the model globally on the full validation set?

Hi @pranjaln,

  1. You can try the DGL graph partition API. Example usage can be found in https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/dist/partition_graph.py (see the partitioning sketch after this list).

  2. There are two options: (1) put all the validation data on a single machine and validate there, or (2) partition the validation data across all machines and aggregate the local validation results onto a single machine via torch.distributed communication (see the aggregation sketch after this list).
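
A minimal sketch of the partitioning step for (1), assuming DGL and the `ogb` package are installed; the number of parts and the output directory are placeholders, not values from the thread. It follows the same pattern as the linked partition_graph.py example: attach labels and split masks as node data so they are partitioned along with the graph structure, then call `dgl.distributed.partition_graph`.

```python
import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset

NUM_PARTS = 4  # assumption: one partition per worker machine

# Load OGBN-Products; dataset[0] returns (graph, labels).
dataset = DglNodePropPredDataset(name="ogbn-products")
graph, labels = dataset[0]
graph.ndata["label"] = labels.squeeze()

# Store the train/val/test splits as boolean node masks.
split_idx = dataset.get_idx_split()
for name, idx in zip(["train_mask", "val_mask", "test_mask"],
                     [split_idx["train"], split_idx["valid"], split_idx["test"]]):
    mask = torch.zeros(graph.num_nodes(), dtype=torch.bool)
    mask[idx] = True
    graph.ndata[name] = mask

# METIS-based partitioning; each part is written under out_path/.
dgl.distributed.partition_graph(
    graph,
    graph_name="ogbn-products",
    num_parts=NUM_PARTS,
    out_path="data/ogbn-products-parts",   # placeholder output directory
    balance_ntypes=graph.ndata["train_mask"],  # balance training nodes across parts
    balance_edges=True,
)
```

Each machine then loads only its own partition directory for local training.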
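And a minimal sketch of option (2), summing per-worker validation counts with `torch.distributed.all_reduce`; the model and dataloader names are placeholders, not from the original post. `all_reduce` leaves the global metric on every rank; a `dist.reduce` to rank 0 would match the "single machine" wording more literally. With the MPI backend, the counts are kept on CPU since GPU tensors require a CUDA-aware MPI build.

```python
import torch
import torch.distributed as dist

def validate(model, local_val_loader, device):
    """Compute global validation accuracy from local shards of the val set."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in local_val_loader:
            logits = model(inputs.to(device))
            preds = logits.argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    # Sum the local counts across all workers; every rank ends up with
    # the global totals, so each can compute the same global accuracy.
    counts = torch.tensor([correct, total], dtype=torch.float64)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()
```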

