Parallelizing DGL across compute nodes

With PyTorch, Ray Tune/Train can be used to connect multiple nodes into one cluster for parallel training. However, there seems to be little information on how to do this with DGL (there was a previous question about Ray with DGL). Is there an example of how to connect multiple compute nodes, with the corresponding DGL implementation?

For now, torch.distributed.launch is used to launch distributed training in DGL, and it works across multiple nodes. See more details here.
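For reference, a minimal sketch of that pattern is below. The script name train_dist.py, the gloo backend, and the two-node/one-process-per-node layout are illustrative assumptions, not DGL requirements; the actual model and training loop are left out.

```python
# train_dist.py -- a minimal sketch; model and training body are omitted.
#
# Launched on each node with torch.distributed.launch, e.g. for two nodes:
#   node 0: python -m torch.distributed.launch --nnodes=2 --node_rank=0 \
#             --nproc_per_node=1 --master_addr=<node0-ip> --master_port=29500 train_dist.py
#   node 1: python -m torch.distributed.launch --nnodes=2 --node_rank=1 \
#             --nproc_per_node=1 --master_addr=<node0-ip> --master_port=29500 train_dist.py
import argparse
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker it spawns
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are set by the launcher,
    # so the default env:// rendezvous works without extra arguments.
    dist.init_process_group(backend="gloo")  # use "nccl" for multi-GPU
    print(f"worker {dist.get_rank()} of {dist.get_world_size()} is up")
    # ... build the DGL model here and train with DistributedDataParallel ...

if __name__ == "__main__":
    main()
```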

Could you elaborate on your use case for parallelizing DGL across multiple compute nodes?

I’m using my university’s computing cluster, where using multiple compute nodes requires connecting them into a cluster via a resource such as Ray. I’m trying to use more than one compute node to speed up dgl.nn.GraphConv training within a DGCNN.
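For context, the model is roughly shaped like the following. This is a simplified stand-in with placeholder sizes and a mean-node readout, not a full DGCNN (which uses a SortPooling readout):

```python
# An illustrative stack of dgl.nn.GraphConv layers for graph-level prediction;
# a real DGCNN would differ (e.g. SortPooling readout). This is only a sketch.
import torch.nn as nn
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv

class ToyGCN(nn.Module):
    def __init__(self, in_feats, hidden_feats, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, hidden_feats)
        self.classify = nn.Linear(hidden_feats, n_classes)

    def forward(self, g, feats):
        h = F.relu(self.conv1(g, feats))
        h = F.relu(self.conv2(g, h))
        with g.local_scope():
            g.ndata["h"] = h
            hg = dgl.mean_nodes(g, "h")  # average node features per graph
        return self.classify(hg)
```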

Are you training on CPU or GPU, and with torch.nn.parallel.DistributedDataParallel?
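In case it helps, wrapping the model in DistributedDataParallel looks roughly like this. It is a sketch that assumes the process group from the launch script above has already been initialized; ToyGCN and the feature sizes are placeholders:

```python
# Assumes dist.init_process_group(...) has already run (see the launch sketch
# above); ToyGCN and the sizes below are illustrative.
import os
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = ToyGCN(in_feats=16, hidden_feats=64, n_classes=2)

if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.cuda(), device_ids=[local_rank])  # one process per GPU
else:
    ddp_model = DDP(model)  # CPU training; pair this with the "gloo" backend
```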

Are you working on node/link prediction on a single large graph or graph prediction on multiple graphs?
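The answer matters because the distributed recipe differs. For graph prediction over many graphs, one common pattern is to shard the dataset across workers, which DGL's GraphDataLoader supports via a use_ddp flag. A rough sketch (the MUTAG dataset is just a stand-in for your own data, and the process group must already be initialized):

```python
# Per-worker dataset sharding for graph prediction; MUTAG is a stand-in
# dataset, and dist.init_process_group(...) must have been called already.
from dgl.data import GINDataset
from dgl.dataloading import GraphDataLoader

dataset = GINDataset("MUTAG", self_loop=True)
loader = GraphDataLoader(dataset, use_ddp=True, batch_size=32, shuffle=True)

for epoch in range(5):
    loader.set_epoch(epoch)  # keep shuffling consistent across workers
    for batched_graph, labels in loader:
        pass  # forward/backward with the DDP-wrapped model goes here
```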
