Parallelizing DGL across compute nodes

With PyTorch, Ray Tune/Train can be used to connect multiple nodes into one cluster for parallel training. However, there seems to be little information on how to do this with DGL (there was a previous question about Ray with DGL). Is there an example of how to connect multiple compute nodes, with the corresponding DGL implementation?

For now, torch.distributed.launch is used to launch distributed training in DGL, and it works across multiple nodes. See more details here.
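For reference, a minimal sketch of that pattern is below. The script name train_dist.py, the gloo backend, and the two-node/one-process-per-node layout are illustrative assumptions, not DGL requirements; the actual model and training loop are left out.

```python
# train_dist.py -- a minimal sketch; model and training body are omitted.
#
# Launched on each node with torch.distributed.launch, e.g. for two nodes:
#   node 0: python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
#             --nproc_per_node=1 --master_addr=<node0-ip> --master_port=29500 train_dist.py
#   node 1: python -m torch.distributed.launch --nnodes=2 --node_rank=1 \
#             --nproc_per_node=1 --master_addr=<node0-ip> --master_port=29500 train_dist.py
import argparse
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker it spawns
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are set by the launcher,
    # so the default env:// rendezvous works without extra arguments.
    dist.init_process_group(backend="gloo")  # use "nccl" for multi-GPU
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} is up")
    # ... build the DGL model here and train with DistributedDataParallel ...

if __name__ == "__main__":
    main()
```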

Could you elaborate on your use case for parallelizing DGL across multiple compute nodes?

I’m using my university’s computing cluster, where using multiple compute nodes requires connecting them into a cluster via a resource such as Ray. I’m trying to use more than one compute node to speed up dgl.nn.GraphConv training within a DGCNN.
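For context, the model is roughly shaped like the following. This is a simplified stand-in with placeholder sizes and a mean-node readout, not a full DGCNN (which uses a SortPooling readout):

```python
# An illustrative stack of dgl.nn.GraphConv layers for graph-level prediction;
# a real DGCNN would differ (e.g. SortPooling readout). This is only a sketch.
import torch.nn as nn
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv

class ToyGCN(nn.Module):
    def __init__(self, in_feats, hidden_feats, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, hidden_feats)
        self.classify = nn.Linear(hidden_feats, n_classes)

    def forward(self, g, feats):
        h = F.relu(self.conv1(g, feats))
        h = F.relu(self.conv2(g, h))
        with g.local_scope():
            g.ndata["h"] = h
            hg = dgl.mean_nodes(g, "h")  # average node features per graph
        return self.classify(hg)
```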

Are you training on CPU or GPU, and with torch.nn.parallel.DistributedDataParallel?
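In case it helps, wrapping the model in DistributedDataParallel looks roughly like this. It is a sketch that assumes the process group from the launch script above has already been initialized; ToyGCN and the feature sizes are placeholders:

```python
# Assumes dist.init_process_group(...) has already run (see the launch sketch
# above); ToyGCN and the sizes below are illustrative.
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = ToyGCN(in_feats=16, hidden_feats=64, n_classes=2)

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.cuda(), device_ids=[local_rank])  # one process per GPU
else:
    ddp_model = DDP(model)  # CPU training; pair this with the "gloo" backend
```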

Are you working on node/link prediction on a single large graph or graph prediction on multiple graphs?
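The answer matters because the distributed recipe differs. For graph prediction over many graphs, one common pattern is to shard the dataset across workers, which DGL's GraphDataLoader supports via a use_ddp flag. A rough sketch (the MUTAG dataset is just a stand-in for your own data, and the process group must already be initialized):

```python
# Per-worker dataset sharding for graph prediction; MUTAG is a stand-in
# dataset, and dist.init_process_group(...) must have been called already.
from dgl.data import GINDataset
from dgl.dataloading import GraphDataLoader

dataset = GINDataset("MUTAG", self_loop=True)
loader = GraphDataLoader(dataset, use_ddp=True, batch_size=32, shuffle=True)

for epoch in range(5):
    loader.set_epoch(epoch)  # keep shuffling consistent across workers
    for batched_graph, labels in loader:
        pass  # forward/backward with the DDP-wrapped model goes here
```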
