Error when running train_dist.py with multiple GPUs (single node)

Hi DGL community,

Recently I was trying to run distributed training (dgl/train_dist.py at master · dmlc/dgl · GitHub) on a 4-GPU machine.

When I run with 1 GPU or 2 GPUs, it works fine. However, when I try 4 GPUs, I get the following error:
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

Can someone please advise on what might be causing this?
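
If I read the error correctly, it is complaining that a graph on cuda:0 is being assigned a feature tensor that lives on cuda:1. Below is a minimal sketch of that mismatch in isolation; the toy graph and tensor are only illustrative, not taken from train_dist.py:

```python
import dgl
import torch

# Illustrative graph and feature tensor (not from train_dist.py):
# the graph lives on cuda:0 while the features live on cuda:1.
g = dgl.graph(([0, 1, 2], [1, 2, 3])).to('cuda:0')
h = torch.randn(g.num_nodes(), 16, device='cuda:1')

# This assignment raises the same DGLError as above, because the
# feature tensor and the graph sit on different devices.
# g.ndata['h'] = h

# Moving the graph (or the tensor) so both share one device avoids it.
g = g.to('cuda:1')
g.ndata['h'] = h
```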

All the best,
zyu

Hi,

What is your detailed configuration? How did you launch the job? Was it with our launch.py?

Thank you for your reply!

Yes, I launched it with launch.py on an AWS g4dn.12xlarge instance.

DGL 0.7.1
Python 3.7
PyTorch 1.9.1
OS: Linux
DGL installed from conda

--num_trainers 4 \
--num_samplers 4 \
--num_servers 1 \
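
In case it matters, my understanding is that each of the 4 trainer processes picks its own GPU and then has to keep the sampled graph and the feature tensors on that same device. Here is a rough sketch of that pattern; the LOCAL_RANK variable and the modulo mapping are my assumptions, not taken from launch.py or train_dist.py:

```python
import os
import torch

# Hypothetical per-process device selection for 4 trainer processes
# sharing one 4-GPU node. LOCAL_RANK is an assumed environment variable;
# the actual scripts may derive the rank differently.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    device = torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

# Each sampled block/graph and its feature tensors must then be moved to
# this same device before features are assigned, e.g.
# blocks = [b.to(device) for b in blocks]; feats = feats.to(device)
```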