Problem when running R-GCN in parallel

Hi,
I am using the PyTorch version of the relational GCN example (https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn), and an error occurred when I fed in the data.

It happened in the file 'model.py', line 47:

h = layer(g, h, r, norm)

File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 180, in forward
g.update_all(self.message_func, fn.sum(msg='msg', out='h'))
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/graph.py", line 2747, in update_all
Runtime.run(prog)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/runtime.py", line 11, in run
exe.run()
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/ir/executor.py", line 204, in run
udf_ret = fn_data(src_data, edge_data, dst_data)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/scheduler.py", line 949, in _mfunc_wrapper
return mfunc(ebatch)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 133, in basis_message_func
msg = utils.bmm_maybe_select(edges.src['h'], weight, edges.data['type'])
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/utils.py", line 91, in bmm_maybe_select
return th.bmm(A.unsqueeze(1), BB).squeeze()
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:486

I used two GPUs for training.
PyTorch version: 1.2.0
DGL version: 0.4.2

Maybe there is some problem with running it in parallel?
Could someone help me with this?
Thanks a lot!

Can you provide more information about your code?
Are you running RGCN with multiple GPUs?

For multi-GPU training, please refer to this PR: https://github.com/dmlc/dgl/pull/1143

I think the problem you encountered is something like this: the node features are on cuda:0 while the edge features are on cuda:1.
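
A quick way to confirm (just a sketch; the variable names below stand for whatever tensors your script actually passes into the RGCN layers) is to print the device of each input and check that they all match:

    # All of these must report the same device, e.g. cuda:0;
    # otherwise th.bmm() inside bmm_maybe_select raises the error above.
    print(node_feat.device)   # node features
    print(edge_type.device)   # per-edge relation types
    print(edge_norm.device)   # edge norm, if you use one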

Thanks! I think that's the problem I encountered.

When using multiple GPUs, you need to handle the graph features carefully and make sure they are moved to the correct devices.

In my code, I use

model = torch.nn.DataParallel(model)

to enable multi-GPU training.

And since I input a batch of graph features during training, I modified the forward() function in the BaseRGCN class:

def forward(self, inputs):
    # inputs: a batch of node feature tensors, one per sample
    out = torch.clone(inputs)
    for _batch in range(inputs.shape[0]):
        # run the two RGCN layers on the shared graph for each sample in the batch
        temp = self.layers[0](self.g, inputs[_batch], self.edge_type, self.edge_norm)
        out[_batch] = self.layers[1](self.g, temp, self.edge_type, self.edge_norm)
    return out

And I use the following code to move the edge type to the GPU:

edge_type = edge_type.cuda()

It works on one GPU, but the error above occurs when using multiple GPUs.
Do you have any suggestions for modifying the code?
Thanks.

For edge_type = edge_type.cuda(), I think you may need to specify the device ID.
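
For example (a minimal sketch; cuda:0 is just an assumption, use whichever device actually holds your node features):

    device = torch.device('cuda:0')      # the device that holds the node features
    edge_type = edge_type.to(device)     # keep the edge types on the same device
    edge_norm = edge_norm.to(device)     # and the edge norm as well, if you use one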

Hello, did you modify the code? I wasn't aware that the R-GCN example in DGL supports multi-GPU training.

For anyone who is interested in multi-GPU training, please look at the newly added example for training GraphSAGE. Extending it to RGCN should be straightforward: replace the SAGEConv module with a RelGraphConv module. We are also working on a step-by-step tutorial. Please stay tuned.
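
As a rough illustration (a sketch only, not the final tutorial; names such as in_feats, out_feats, num_rels and num_bases are placeholders), the per-layer change would look roughly like this:

    import torch.nn.functional as F
    from dgl.nn.pytorch import RelGraphConv

    # Where the GraphSAGE example constructs a SAGEConv layer, build a RelGraphConv
    # instead, and pass the per-edge relation types (and optional norm) in forward:
    layer = RelGraphConv(in_feats, out_feats, num_rels,
                         regularizer='basis', num_bases=num_bases,
                         activation=F.relu)
    h = layer(g, h, edge_type, edge_norm)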