Error when running train_dist.py with multiple GPUs (single node)

Hi DGL community,

Recently I was trying to run distributed training (dgl/train_dist.py at master · dmlc/dgl · GitHub) on a 4-GPU machine.

When I run with 1 GPU or 2 GPUs, it works fine. However, when I try 4 GPUs, I get the following error:
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cuda:1 to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

Can someone please advise on what might be causing this?
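
If I read the error correctly, it is complaining that a graph on cuda:0 is being assigned a feature tensor that lives on cuda:1. Below is a minimal sketch of that mismatch in isolation; the toy graph and tensor are only illustrative, not taken from train_dist.py:

```python
import dgl
import torch

# Illustrative graph and feature tensor (not from train_dist.py):
# the graph lives on cuda:0 while the features live on cuda:1.
g = dgl.graph(([0, 1, 2], [1, 2, 3])).to('cuda:0')
h = torch.randn(g.num_nodes(), 16, device='cuda:1')

# This assignment raises the same DGLError as above, because the
# feature tensor and the graph sit on different devices.
# g.ndata['h'] = h

# Moving the graph (or the tensor) so both share one device avoids it.
g = g.to('cuda:1')
g.ndata['h'] = h
```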

All the best,
zyu

Hi,

What is your detailed configuration? How did you launch the job? Was it with our launch.py?

Thank you for your reply!

Yes, I launched it with launch.py on an AWS g4dn.12xlarge instance.

DGL 0.7.1
Python 3.7
PyTorch 1.9.1
OS: Linux
DGL installed from conda

--num_trainers 4 \
--num_samplers 4 \
--num_servers 1 \
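
In case it matters, my understanding is that each of the 4 trainer processes picks its own GPU and then has to keep the sampled graph and the feature tensors on that same device. Here is a rough sketch of that pattern; the LOCAL_RANK variable and the modulo mapping are my assumptions, not taken from launch.py or train_dist.py:

```python
import os
import torch

# Hypothetical per-process device selection for 4 trainer processes
# sharing one 4-GPU node. LOCAL_RANK is an assumed environment variable;
# the actual scripts may derive the rank differently.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
if torch.cuda.is_available():
    device = torch.device(f"cuda:{local_rank % torch.cuda.device_count()}")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

# Each sampled block/graph and its feature tensors must then be moved to
# this same device before features are assigned, e.g.
# blocks = [b.to(device) for b in blocks]; feats = feats.to(device)
```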