Run distributed training on GPU

ruisizhang123 · January 8, 2022, 7:36pm

Hi Guys,
I’m currently working on distributed training on the OGB-Products dataset with 4 GPUs, following the tutorial here: Distributed Node Classification — DGL 0.7.2 documentation.

However, I just found that the graph and model are on the CPU. When I tried to move them to the GPU using .to("cuda"), I got an error saying that DistGraph doesn’t support GPU. Is there a good way to do distributed graph training on GPU?

Thanks for your help!

ruisizhang123 · January 8, 2022, 10:56pm

Just fixed the bug. Basically you need to follow the code here: dgl/examples/pytorch/graphsage/experimental at 4889c5782290f1990c924fbea14ba904a3248231 · dmlc/dgl · GitHub.
Also you need to change the code here: dgl/train_dist.py at 4889c5782290f1990c924fbea14ba904a3248231 · dmlc/dgl · GitHub
to

batch_inputs = blocks[0].srcdata['features'].to(device)
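
For readers who want to see why that one-line change matters, here is a minimal PyTorch-only sketch of a training step in which the mini-batch is moved to the model's device before the forward pass. It does not use DGL: the tensors `batch_inputs` and `batch_labels` and the `nn.Linear` model are hypothetical stand-ins for the sampled block features, the seed-node labels, and the GraphSAGE model in the linked example.

```python
import torch
import torch.nn as nn

# Use a GPU when one is visible; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for what the sampler yields per mini-batch.
# Both tensors are created on the CPU, like features read from the graph.
batch_inputs = torch.randn(8, 16)
batch_labels = torch.randint(0, 4, (8,))

# Stand-in for the real model (assumption: any nn.Module behaves the same way).
model = nn.Linear(16, 4).to(device)
loss_fn = nn.CrossEntropyLoss()

# Move the mini-batch to the model's device before the forward pass.
# Skipping this step when the model is on a GPU raises a device-mismatch error.
batch_inputs = batch_inputs.to(device)
batch_labels = batch_labels.to(device)

batch_pred = model(batch_inputs)
loss = loss_fn(batch_pred, batch_labels)
loss.backward()
```

Only the per-batch tensors are moved here; nothing about the full graph is touched.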

zihao · January 9, 2022, 9:22pm

@ruisizhang123 this is not a bug.

DistDGL v1 was designed for the case where the whole graph cannot fit into GPU memory. It loads only the sampled subgraph and the corresponding node/edge features onto the GPU; the full node embeddings and graph structure are stored on the CPU and updated asynchronously.
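
The memory layout described above can be illustrated with a small PyTorch-only sketch. This is not DistDGL code: `full_features`, `load_batch_features`, and the sizes are hypothetical names and values chosen for illustration, and the random `sampled_ids` stand in for the output of a neighbor sampler.

```python
import torch

# Use a GPU when one is visible; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical full feature table. It is created on the CPU and never moved,
# mirroring the case where the whole graph does not fit in GPU memory.
num_nodes, feat_dim = 1000, 32
full_features = torch.randn(num_nodes, feat_dim)

def load_batch_features(node_ids, device):
    """Gather rows for the sampled nodes on the CPU, then copy only those rows."""
    return full_features[node_ids].to(device)

# Stand-in for the node IDs a neighbor sampler would return for one mini-batch.
sampled_ids = torch.randint(0, num_nodes, (64,))
batch_feats = load_batch_features(sampled_ids, device)
```

With this layout, GPU memory use scales with the mini-batch size rather than with the number of nodes in the graph.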

If you are working in a single-machine multi-GPU setting, you are supposed to follow this tutorial, where you don’t need to use DistGraph.

ruisizhang123 · January 10, 2022, 2:43am

Thanks for your reply! I’m working on distributed training. The example code in the tutorial didn’t move the subgraph to CUDA, and I had a hard time fixing the problem. I think the distributed training tutorial code here (Distributed Node Classification — DGL 0.7.2 documentation) might be a little confusing.

VoVAllen · January 10, 2022, 6:21am

Please follow the example at dgl/train_dist.py at master · dmlc/dgl · GitHub

system · February 11, 2022, 10:58am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.