I am doing distributed node classification using torch.distributed with the MPI backend for a personal project. I have two questions about how to go about this:
- I want to do data-parallel training on disjoint partitions of the graph, i.e. each worker node in the cluster sees only its own partition and trains a local model before syncing after each epoch. How do I partition my graph data (say, OGBN-Products) across the different machines? Is there a DGL API that supports this, or is manually partitioning the way to go? (See the first sketch after this list for what I currently have in mind.)
- If I want to validate my model at the end of each epoch, how should I go about it? I want to validate the model globally, on the full validation set rather than just each worker's local shard. (See the second sketch below.)
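For the first question, the closest thing I've found is DGL's `dgl.distributed.partition_graph` / `dgl.distributed.load_partition` pair. Here's a rough sketch of what I have in mind (the graph name, partition count, and output path are placeholders I made up); does this match the intended workflow, or is manual partitioning still preferable?

```python
import dgl
import torch
import torch.distributed as dist
from ogb.nodeproppred import DglNodePropPredDataset

# Offline, one-time step on a single machine: load the full graph
# and write out disjoint partitions (METIS-based by default).
dataset = DglNodePropPredDataset(name='ogbn-products')
graph, labels = dataset[0]
graph.ndata['label'] = labels.view(-1)

# Store the validation split as a node mask so it survives partitioning
# (each partition then knows which of its nodes are validation nodes).
split_idx = dataset.get_idx_split()
val_mask = torch.zeros(graph.num_nodes(), dtype=torch.bool)
val_mask[split_idx['valid']] = True
graph.ndata['val_mask'] = val_mask

dgl.distributed.partition_graph(
    graph,
    graph_name='ogbn-products',  # placeholder
    num_parts=4,                 # one partition per worker
    out_path='partitions/',      # placeholder
    part_method='metis',
)

# On each worker (assumes torch.distributed is already initialized;
# see the second sketch): load only this rank's partition. The tuple
# returned by load_partition differs across DGL versions, but the
# local subgraph and its node features come first.
rank = dist.get_rank()
part = dgl.distributed.load_partition('partitions/ogbn-products.json', rank)
local_g, node_feats = part[0], part[1]
```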
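For the second question, the approach I'm leaning towards is: each worker evaluates on the validation nodes that fall inside its own partition, then the per-rank correct/total counts are summed with an `all_reduce` so every rank ends up with the metric over the global validation set. A minimal sketch, assuming each worker already has its local graph, features, labels, and the `val_mask` from the sketch above:

```python
import torch
import torch.distributed as dist

# One process per worker, launched with mpirun; MPI backend as above.
dist.init_process_group(backend='mpi')

@torch.no_grad()
def global_val_accuracy(model, local_g, feats, labels, val_mask):
    """Accuracy over the *global* validation set, identical on every rank."""
    model.eval()
    logits = model(local_g, feats)  # full-graph forward on the local partition
    preds = logits[val_mask].argmax(dim=1)

    correct = (preds == labels[val_mask]).sum().double()
    total = val_mask.sum().double()

    # Sum the per-rank counts across all workers; since the partitions
    # are disjoint, the summed counts cover the whole validation set.
    counts = torch.stack([correct, total])
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)
    return (counts[0] / counts[1]).item()
```

One thing I'm unsure about here: validation nodes near a partition boundary lose their cross-partition neighbors in a purely local forward pass. Is that acceptable in practice, or do I need the halo nodes that `partition_graph` can include (via `num_hops`)? Alternatively I could gather predictions to rank 0, but summing counts with `all_reduce` seemed simpler.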