Problem when running R-GCN in parallel

Hi,
I am using the relational GCN code (PyTorch version) from https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn, and an error occurred when I fed in the data.

It happened in the file 'model.py', line 47:

h = layer(g, h, r, norm)

File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 180, in forward
g.update_all(self.message_func, fn.sum(msg='msg', out='h'))
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/graph.py", line 2747, in update_all
Runtime.run(prog)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/runtime.py", line 11, in run
exe.run()
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/ir/executor.py", line 204, in run
udf_ret = fn_data(src_data, edge_data, dst_data)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/runtime/scheduler.py", line 949, in _mfunc_wrapper
return mfunc(ebatch)
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 133, in basis_message_func
msg = utils.bmm_maybe_select(edges.src['h'], weight, edges.data['type'])
File "/research/dept6/yhlong/venv/lib64/python3.6/site-packages/dgl/nn/pytorch/utils.py", line 91, in bmm_maybe_select
return th.bmm(A.unsqueeze(1), BB).squeeze()
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:486

I used two GPUs to train.
PyTorch version: 1.2.0
DGL version: 0.4.2

Maybe there is some problem with running it in parallel?
Could someone help me with this?
Thanks a lot!

Can you provide more information about your code?
Are you running RGCN with multiple GPUs?

For multi-GPU training, please refer to this PR: https://github.com/dmlc/dgl/pull/1143

I think the problem you encountered is something like: node_features are on cuda:0 while the edge features are on cuda:1.

Thanks! I think that's the problem I encountered.

You should handle the graph features carefully when using multiple GPUs, moving each tensor to the device on which it is consumed.
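
For example, something along these lines (a minimal sketch; feats, edge_type, edge_norm, layer, and g are placeholders for your own variables):

import torch

# Keep everything the layer touches on one device: the node features,
# the per-edge relation types, and the edge norms.
device = torch.device('cuda:0')
feats = feats.to(device)
edge_type = edge_type.to(device)
edge_norm = edge_norm.to(device)
h = layer(g, feats, edge_type, edge_norm)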

In my code, I use

model = torch.nn.DataParallel(model)

to enable multi-GPU training.

And I feed in a batch of graph features during training, so I modified the 'forward()' function in the class 'BaseRGCN':

def forward(self, inputs):
    out = torch.clone(inputs)
    # Run both RGCN layers on each set of node features in the batch;
    # the graph, edge types, and edge norms are shared module state.
    for _batch in range(inputs.shape[0]):
        temp = self.layers[0](self.g, inputs[_batch], self.edge_type, self.edge_norm)
        out[_batch] = self.layers[1](self.g, temp, self.edge_type, self.edge_norm)
    return out

And I use the following code to move the edge types onto the GPU:

edge_type = edge_type.cuda()

It works on one GPU, but the error above occurs when using multiple GPUs.
Is there any suggestion on how to modify the code?
Thanks.

For edge_type = edge_type.cuda(), I think you may need to specify the device ID.
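
For example (a sketch; dev_id is a placeholder for the target GPU index):

import torch

# Pin the tensor to a specific GPU instead of the current default device.
dev_id = 0
edge_type = edge_type.cuda(dev_id)
# or, equivalently:
edge_type = edge_type.to(torch.device('cuda', dev_id))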

Hello, did you modify the code? I'm not aware that R-GCN in DGL supports multi-GPU training.

For anyone who is interested in multi-GPU training, please look at this newly-added example for training GraphSAGE. Extending it to RGCN should be straightforward by replacing the SAGEConv module with a RelGraphConv module. We are also working on a step-by-step tutorial. Please stay tuned.
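
Roughly, the swap could look like this (a sketch of the suggested replacement, not the official example; all sizes and names are placeholders):

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import RelGraphConv

# Two-layer RGCN in the shape of the GraphSAGE example, with each
# SAGEConv replaced by a RelGraphConv.
class RGCN(nn.Module):
    def __init__(self, in_feat, hid_feat, out_feat, num_rels, num_bases):
        super().__init__()
        self.layer1 = RelGraphConv(in_feat, hid_feat, num_rels,
                                   regularizer='basis', num_bases=num_bases,
                                   activation=F.relu)
        self.layer2 = RelGraphConv(hid_feat, out_feat, num_rels,
                                   regularizer='basis', num_bases=num_bases)

    def forward(self, g, x, etypes, norm=None):
        # Unlike SAGEConv, RelGraphConv also needs the per-edge
        # relation types (and optionally per-edge norms).
        h = self.layer1(g, x, etypes, norm)
        return self.layer2(g, h, etypes, norm)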

I am trying to do model parallelism using torch.distributed, as my RGCN model is large. I also faced this problem: I put g_node_feature and the entire g on different devices, and got a RuntimeError saying that a variable is on the wrong device. I am wondering, is it mandatory that the graph be placed on a single device? Thanks!

Is the provided tutorial for data parallelism or model parallelism? If it is for model parallelism, then model = DistributedDataParallel(model, device_ids=[dev_id], output_device=dev_id) in the tutorial should not set device_ids and output_device, according to PyTorch's documentation.
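
For reference, my understanding of the two setups from the PyTorch docs (a sketch; model and dev_id are placeholders):

import torch
from torch.nn.parallel import DistributedDataParallel

# Data parallelism: one process per GPU, the whole model on a single
# device per process.
model = model.to(torch.device('cuda', dev_id))
model = DistributedDataParallel(model, device_ids=[dev_id],
                                output_device=dev_id)

# Model parallelism: the module itself spans several devices, so the
# PyTorch docs require device_ids and output_device to be left unset.
model = DistributedDataParallel(model)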

Yes, the graph and the corresponding tensors need to be on the same device.

This is for data parallelism.

Do you happen to have a tutorial on model parallelism? Thanks!

So far we have not tried model parallelism, so there isn't a tutorial for it.

Thanks! I am wondering, if I have a large RGCN model (a graph with ~28,000 nodes and ~39,000 edges of 2 types), how can I fit it into memory? I am running the model on a 16 GB V100 and always get a CUDA out-of-memory error.

Have you tried sampling-based training? You can find some examples here for mini-batch training.
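
To give a flavor of it (a minimal sketch using the dgl.dataloading API, which is only available in DGL releases newer than the 0.4.2 mentioned above; g, train_nids, feats, labels, model, and opt are placeholders):

import dgl
import torch
import torch.nn.functional as F

# Sample a fixed number of neighbors per layer instead of doing
# full-graph updates, so only small blocks live on the GPU at a time.
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 10])
dataloader = dgl.dataloading.NodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    blocks = [b.to(torch.device('cuda:0')) for b in blocks]
    batch_feats = feats[input_nodes].to('cuda:0')
    batch_labels = labels[output_nodes].to('cuda:0')
    loss = F.cross_entropy(model(blocks, batch_feats), batch_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()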

28k nodes and 39k edges should not exhaust GPU memory in my experience, even with full-graph updates. How many layers do you have? And what is the size of your features?
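
As a rough sanity check (back-of-envelope only; the feature size of 200 here is an assumption for illustration, plug in your real one):

# fp32 tensors take 4 bytes per element.
num_nodes, num_edges, feat_size = 28_000, 39_000, 200  # feat_size assumed
node_bytes = num_nodes * feat_size * 4  # node features: ~21 MiB per layer
edge_bytes = num_edges * feat_size * 4  # per-edge messages: ~30 MiB per layer
print(node_bytes / 2**20, edge_bytes / 2**20)  # both tiny next to 16 GB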

I have two GCN layers, and the node feature size is 200. I have a stacked model, and the RGCN is one part of it.

Thanks for the example! The README file says that "Currently, the example only support training RGCN graphs with no input features." I am wondering what "input features" refers to here; are they node features?