Hi, I’m new to DGL.
I need to handle PyTorch CNN layers and DGL graph layers at the same time, and I’m having trouble parallelizing my model. I have tried both DataParallel and DistributedDataParallel.
class MyModel(nn.Module):
    def __init__(self, ...):
        ...
        self.gconv = MPNNGNN(Nf_NODE, Nf_EDGE, node_out_feats=N_NODE_OUT)
        ...

    def forward(self, x, bg):
        # bg is the DGL graph
        bg_n = bg.ndata['x']
        bg_e = torch.nn.functional.one_hot(bg.edata['x'].long()).float()
        frag_out = self.gconv(bg, bg_n, bg_e)
        ...
        return out
I want each GPU to run the forward pass above independently, and every GPU must see the same graph. In other words, I don’t want the graph to be split up during parallelization.
So I wrote my code as below:
model = MyModel().cuda(device_ids[0])
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=device_ids)
...
for epoch in range(start_epoch, end_epoch):
    for data, a_dx in training_generator:
        data = data.to(device=device, non_blocking=True)   # not the graph, never mind
        a_dx = a_dx.to(device=device, non_blocking=True)   # not the graph, never mind
        # bg is the graph
        bg = Glib.return_graph()                            # call the already-prepared graph
        bg = bg.to(device=device, non_blocking=True)
        energy = model(coord, bg)
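For context, `device` and `device_ids` are set near the top of the script, roughly like the sketch below (the exact values here are my simplification; the real ones come from command-line arguments):

import torch

# Simplified stand-in for my actual argument parsing:
# device_ids lists every visible GPU, and `device` is the first one,
# which is where the data and the graph get copied in the loop above.
device_ids = [0, 1, 2, 3]
device = torch.device(f"cuda:{device_ids[0]}")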
However, I got this error message:
energy = model(coord,bg)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "ddp_test.py", line 522, in forward
frag_out=self.gconv(bg,bg_n,bg_e)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dngusdnr1/hmap/develop/se3/ddp/prac_graph.py", line 74, in forward
node_feats = self.project_node_feats(node_feats) # (V, node_out_feats)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
return F.linear(input, self.weight, self.bias)
File "/home/dngusdnr1/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking arugment for argument mat1 in method wrapper_addmm)
I get the same error message when I use nn.DataParallel: “RuntimeError: Expected all tensors to be on the same device …”
What should I do?
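From reading the DDP tutorials, I suspect each spawned process is supposed to pin itself to one GPU and copy the same graph onto that GPU, instead of every process using one shared `device`. Below is a rough sketch of what I am thinking of trying; the LOCAL_RANK handling is my assumption from the standard DDP examples, not code from my current script:

import os
import torch
import torch.distributed as dist

# Hypothetical per-process setup (my guess, not my current code):
# each DDP process owns exactly one GPU and replicates the graph there.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

model = MyModel().to(device)
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], find_unused_parameters=True
)

for data, a_dx in training_generator:
    data = data.to(device, non_blocking=True)
    a_dx = a_dx.to(device, non_blocking=True)
    bg = Glib.return_graph()   # the same prepared graph in every process
    bg = bg.to(device)         # replicated onto this process's GPU, not split
    energy = model(coord, bg)  # coord is prepared from data as in my real loop

Is that the right way to keep an identical graph on every GPU under DDP, or is there a DGL-specific way to do this?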