Problem when running R-GCN in parallel

Hi,
I am using the relational GCN code (PyTorch version) from https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn, and an error occurred when I fed in the data.

It happened in the file 'model.py', line 47:

h = layer(g, h, r, norm)

File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 180, in forward
g.update_all(self.message_func, fn.sum(msg='msg', out='h'))
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/graph.py", line 2747, in update_all
Runtime.run(prog)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/runtime.py", line 11, in run
exe.run()
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/ir/executor.py", line 204, in run
udf_ret = fn_data(src_data, edge_data, dst_data)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/scheduler.py", line 949, in _mfunc_wrapper
return mfunc(ebatch)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 133, in basis_message_func
msg = utils.bmm_maybe_select(edges.src['h'], weight, edges.data['type'])
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/utils.py", line 91, in bmm_maybe_select
return th.bmm(A.unsqueeze(1), BB).squeeze()
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:486

I used two GPUs to train.
PyTorch version: 1.2.0
DGL version: 0.4.2

Maybe there is some problem with running it in parallel?
Could someone help me with this?
Thanks a lot!

Can you provide more information about your code?
Are you running RGCN with multiple GPUs?

For multi-GPU training, please refer to this PR: https://github.com/dmlc/dgl/pull/1143

I think the problem you encountered is something like: node_features are on cuda:0 while the edge features are on cuda:1.

Thanks! I think that's the problem I encountered.

You should handle the graph features carefully when using multiple GPUs, moving each tensor to the device on which it is consumed.
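
For example, something along these lines (a minimal sketch; feats, edge_type, edge_norm, layer, and g are placeholders for your own variables):

import torch

# Keep everything the layer touches on one device: the node features,
# the per-edge relation types, and the edge norms.
device = torch.device('cuda:0')
feats = feats.to(device)
edge_type = edge_type.to(device)
edge_norm = edge_norm.to(device)
h = layer(g, feats, edge_type, edge_norm)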

In my code, I use

model = torch.nn.DataParallel(model)

to enable multi-GPU training.

And I feed in a batch of graph features during training, so I modified the 'forward()' function in the class 'BaseRGCN':

def forward(self, inputs):
    out = torch.clone(inputs)
    # Run both RGCN layers on each set of node features in the batch;
    # the graph, edge types, and edge norms are shared module state.
    for _batch in range(inputs.shape[0]):
        temp = self.layers[0](self.g, inputs[_batch], self.edge_type, self.edge_norm)
        out[_batch] = self.layers[1](self.g, temp, self.edge_type, self.edge_norm)
    return out

And I use the following code to move the edge types onto the GPU:

edge_type = edge_type.cuda()

It works on one GPU, but the error above occurs when using multiple GPUs.
Is there any suggestion on how to modify the code?
Thanks.

For edge_type = edge_type.cuda(), I think you may need to specify the device ID.
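
For example (a sketch; dev_id is a placeholder for the target GPU index):

import torch

# Pin the tensor to a specific GPU instead of the current default device.
dev_id = 0
edge_type = edge_type.cuda(dev_id)
# or, equivalently:
edge_type = edge_type.to(torch.device('cuda', dev_id))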

Hello, did you modify the code? I'm not aware that R-GCN in DGL supports multi-GPU training.

For anyone who is interested in multi-GPU training, please look at this newly-added example for training GraphSAGE. Extending it to RGCN should be straightforward by replacing the SAGEConv module with a RelGraphConv module. We are also working on a step-by-step tutorial. Please stay tuned.
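
Roughly, the swap could look like this (a sketch of the suggested replacement, not the official example; all sizes and names are placeholders):

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import RelGraphConv

# Two-layer RGCN in the shape of the GraphSAGE example, with each
# SAGEConv replaced by a RelGraphConv.
class RGCN(nn.Module):
    def __init__(self, in_feat, hid_feat, out_feat, num_rels, num_bases):
        super().__init__()
        self.layer1 = RelGraphConv(in_feat, hid_feat, num_rels,
                                   regularizer='basis', num_bases=num_bases,
                                   activation=F.relu)
        self.layer2 = RelGraphConv(hid_feat, out_feat, num_rels,
                                   regularizer='basis', num_bases=num_bases)

    def forward(self, g, x, etypes, norm=None):
        # Unlike SAGEConv, RelGraphConv also needs the per-edge
        # relation types (and optionally per-edge norms).
        h = self.layer1(g, x, etypes, norm)
        return self.layer2(g, h, etypes, norm)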

I am trying to do model parallelism using torch.distributed, as my RGCN model is large. I also faced this problem: I put g_node_feature and the entire g on different devices, and got a RuntimeError saying that a variable is on the wrong device. I am wondering, is it mandatory that the graph be placed on a single device? Thanks!

Is the provided tutorial for data parallelism or model parallelism? If it is for model parallelism, then model = DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id) in the tutorial should not set device_ids and output_device, according to PyTorch's documentation.
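
For reference, my understanding of the two setups from the PyTorch docs (a sketch; model and dev_id are placeholders):

import torch
from torch.nn.parallel import DistributedDataParallel

# Data parallelism: one process per GPU, the whole model on a single
# device per process.
model = model.to(torch.device('cuda', dev_id))
model = DistributedDataParallel(model, device_ids=[dev_id],
                                output_device=dev_id)

# Model parallelism: the module itself spans several devices, so the
# PyTorch docs require device_ids and output_device to be left unset.
model = DistributedDataParallel(model)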

Yes, the graph and the corresponding tensors need to be on the same device.

This is for data parallelism.

Do you happen to have a tutorial on model parallelism? Thanks!

So far we have not tried model parallelism, so there isn't a tutorial for it.

Thanks! I am wondering, if I have a large RGCN model (a graph with ~28,000 nodes and ~39,000 edges of 2 types), how can I fit it into memory? I am running the model on a 16 GB V100 and always get a CUDA out-of-memory error.

Have you tried sampling-based training? You can find some examples here for mini-batch training.
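
To give a flavor of it (a minimal sketch using the dgl.dataloading API, which is only available in DGL releases newer than the 0.4.2 mentioned above; g, train_nids, feats, labels, model, and opt are placeholders):

import dgl
import torch
import torch.nn.functional as F

# Sample a fixed number of neighbors per layer instead of doing
# full-graph updates, so only small blocks live on the GPU at a time.
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 10])
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    blocks = [b.to(torch.device('cuda:0')) for b in blocks]
    batch_feats = feats[input_nodes].to('cuda:0')
    batch_labels = labels[output_nodes].to('cuda:0')
    loss = F.cross_entropy(model(blocks, batch_feats), batch_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()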

28k nodes and 39k edges should not exhaust GPU memory in my experience, even with full-graph updates. How many layers do you have? And what is the size of your features?
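
As a rough sanity check (back-of-envelope only; the feature size of 200 here is an assumption for illustration, plug in your real one):

# fp32 tensors take 4 bytes per element.
num_nodes, num_edges, feat_size = 28_000, 39_000, 200  # feat_size assumed
node_bytes = num_nodes * feat_size * 4  # node features: ~21 MiB per layer
edge_bytes = num_edges * feat_size * 4  # per-edge messages: ~30 MiB per layer
print(node_bytes / 2**20, edge_bytes / 2**20)  # both tiny next to 16 GB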

I have two GCN layers, and the node feature size is 200. I have a stacked model, and the RGCN is one part of it.

Thanks for the example! The README file says that "Currently, the example only support training RGCN graphs with no input features." I am wondering what "input features" refers to here; are they node features?