Hi DGL community,
Recently I tried to run distributed training (dgl/train_dist.py at master · dmlc/dgl · GitHub) on a 4-GPU machine.
With 1 GPU and with 2 GPUs it worked fine. However, when I tried 4 GPUs, I got the following error:
"dgl._ffi.base.DGLError: Cannot assign node feature 'h' on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device."
Can someone please advise on what might be the reason?
All the best,