Error when running train_dist.py with multiple GPUs (single node)

Hi DGL community,

Recently I have been trying to run distributed training (dgl/train_dist.py at master · dmlc/dgl · GitHub) on a single machine with 4 GPUs.

Running with 1 GPU or with 2 GPUs works fine. However, when I try 4 GPUs, it fails with the following error:
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.
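
For reference, here is a minimal standalone snippet (my own illustration, not code from train_dist.py) that reproduces the same DGLError. In this isolated case the fix is just a .to() call, but I don't see where the device mismatch comes from in the distributed script:

```python
import dgl
import torch

# Standalone illustration of the same DGLError: the graph lives on cuda:0
# while the feature tensor lives on cuda:1.
g = dgl.graph(([0, 1, 2], [1, 2, 3])).to('cuda:0')
h = torch.randn(g.num_nodes(), 16, device='cuda:1')

# g.ndata['h'] = h             # raises: Cannot assign node feature "h" on device cuda:1 ...
g.ndata['h'] = h.to(g.device)  # copying to the graph's device makes the assignment work
```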

Can someone please advise on what might be the reason?

All the best,
zyu

Hi,

What is your detailed configuration? How did you launch the job? Was it with our launch.py?

Thank you for your reply!

Yes, I ran it with launch.py on an AWS g4dn.12xlarge instance.

DGL 0.7.1
Python 3.7
torch 1.9.1
OS: Linux
DGL installed from conda

--num_trainers 4 \
--num_samplers 4 \
--num_servers 1 \
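
My understanding of what this implies (a rough sketch in my own words, not the actual train_dist.py logic) is that each of the 4 trainer processes is mapped to its own GPU, roughly like:

```python
import torch
import torch.distributed as dist

# Illustrative sketch of per-trainer device selection (names and details are
# my assumption, not the actual train_dist.py code): with --num_trainers 4 on
# one node, trainer ranks 0..3 map to cuda:0..cuda:3, and every graph/block
# and tensor a trainer touches should live on that one device.
def trainer_device(num_gpus):
    rank = dist.get_rank()  # assumes torch.distributed has been initialized
    return torch.device('cuda', rank % num_gpus)
```

So the error looks like a block or graph object ends up on cuda:0 inside a trainer that is otherwise working on cuda:1.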

Discussed at use GPU to train GraphSAGE: Cannot assign node feature "h" on device cuda:0 to a graph on device cpu. Call DGLGraph.to() to copy the graph to the same device. · Issue #3422 · dmlc/dgl · GitHub
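
The linked issue covers the same device-mismatch error. The general pattern that avoids it in a mini-batch training loop (a hedged sketch; the exact fix in the issue and in train_dist.py may differ) is to move the sampled blocks to the trainer's device first and then copy the fetched features and labels to that same device:

```python
def train_one_epoch(g, model, dataloader, loss_fcn, optimizer, device):
    # Hedged sketch of a device-consistent mini-batch loop; names such as
    # 'features' and 'labels' and the model signature are illustrative.
    for input_nodes, seeds, blocks in dataloader:
        # Move the sampled message-flow blocks to this trainer's GPU first...
        blocks = [block.to(device) for block in blocks]
        # ...then copy the corresponding features and labels to the same GPU.
        batch_inputs = g.ndata['features'][input_nodes].to(device)
        batch_labels = g.ndata['labels'][seeds].to(device)
        batch_pred = model(blocks, batch_inputs)
        loss = loss_fcn(batch_pred, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```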
