Distributed DGL training using MPI

For a personal project, I am doing distributed node classification using torch.distributed with the MPI backend. I have two questions about how to go about this:

  1. I want to do data-parallel training on disjoint partitions of the graph, i.e., each worker node in the cluster only sees its own partition and trains a local model before syncing after each epoch. How do I partition my graph data (say OGBN-Products) onto different machines? Is there a DGL API that supports this, or is partitioning manually the way to go?

  2. If I want to validate my model at the end of each epoch, what would be the way to go about this, given that I want to validate the model globally on the full validation set?

Hi @pranjaln,

  1. You can try the DGL graph partition API. Example usage can be found in https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/dist/partition_graph.py (see the partitioning sketch after this list).

  2. There are two options: (1) put all the validation data on a single machine and validate there, or (2) partition the validation data across all machines and aggregate the local validation results onto a single machine via torch.distributed communication (see the aggregation sketch after this list).
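
A minimal sketch of the partitioning step for (1), assuming DGL and the `ogb` package are installed; the number of parts and the output directory are placeholders, not values from the thread. It follows the same pattern as the linked partition_graph.py example: attach labels and split masks as node data so they are partitioned along with the graph structure, then call `dgl.distributed.partition_graph`.

```python
import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset

NUM_PARTS = 4  # assumption: one partition per worker machine

# Load OGBN-Products; dataset[0] returns (graph, labels).
dataset = DglNodePropPredDataset(name="ogbn-products")
graph, labels = dataset[0]
graph.ndata["label"] = labels.squeeze()

# Store the train/val/test splits as boolean node masks.
split_idx = dataset.get_idx_split()
for name, idx in zip(["train_mask", "val_mask", "test_mask"],
                     [split_idx["train"], split_idx["valid"], split_idx["test"]]):
    mask = torch.zeros(graph.num_nodes(), dtype=torch.bool)
    mask[idx] = True
    graph.ndata[name] = mask

# METIS-based partitioning; each part is written under out_path/.
dgl.distributed.partition_graph(
    graph,
    graph_name="ogbn-products",
    num_parts=NUM_PARTS,
    out_path="data/ogbn-products-parts",   # placeholder output directory
    balance_ntypes=graph.ndata["train_mask"],  # balance training nodes across parts
    balance_edges=True,
)
```

Each machine then loads only its own partition directory for local training.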
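And a minimal sketch of option (2), summing per-worker validation counts with `torch.distributed.all_reduce`; the model and dataloader names are placeholders, not from the original post. `all_reduce` leaves the global metric on every rank; a `dist.reduce` to rank 0 would match the "single machine" wording more literally. With the MPI backend, the counts are kept on CPU since GPU tensors require a CUDA-aware MPI build.

```python
import torch
import torch.distributed as dist

def validate(model, local_val_loader, device):
    """Compute global validation accuracy from local shards of the val set."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in local_val_loader:
            logits = model(inputs.to(device))
            preds = logits.argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    # Sum the local counts across all workers; every rank ends up with
    # the global totals, so each can compute the same global accuracy.
    counts = torch.tensor([correct, total], dtype=torch.float64)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()
```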

