For the original RGCN implementation, the authors claim that running on CPU is faster than on GPU because GPU support for sparse operations was not essential at the time. I am wondering, for DGL's implementation, will the speed differ much between CPU and GPU when using RGCN on a large graph (nodes: 30,000, edges: 40,000)? Thanks.
The RGCN paper was written when deep learning frameworks were not ready for sparse computations.
DGL has optimized the commonly used sparse operations, so training an RGCN on GPU is much faster than training it on CPU.
See Table 3 in the DGL design paper: https://arxiv.org/pdf/1909.01315.pdf. RGCN on GPU is 30-40x faster than on CPU for the ogbn-proteins graph.
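If you want to check on your own hardware, here is a minimal timing sketch (not the paper's benchmark setup); the feature size (64) and number of relations (10) are made-up values, while the graph size matches the one in your question:

```python
# Rough CPU-vs-GPU timing sketch for one RelGraphConv layer.
# Sizes are illustrative; only the node/edge counts come from the question.
import time
import torch
import dgl
from dgl.nn.pytorch import RelGraphConv

num_nodes, num_edges, num_rels = 30000, 40000, 10
g = dgl.rand_graph(num_nodes, num_edges)          # random graph of that size
feat = torch.randn(num_nodes, 64)                 # node features (made up)
etypes = torch.randint(0, num_rels, (num_edges,)) # random edge types
conv = RelGraphConv(64, 64, num_rels, regularizer='basis', num_bases=4)

devices = ['cpu'] + (['cuda'] if torch.cuda.is_available() else [])
for device in devices:
    g_d, feat_d, etypes_d = g.to(device), feat.to(device), etypes.to(device)
    conv_d = conv.to(device)
    conv_d(g_d, feat_d, etypes_d)   # warm-up run (excludes one-time overhead)
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        conv_d(g_d, feat_d, etypes_d)
    if device == 'cuda':
        torch.cuda.synchronize()    # wait for async CUDA kernels to finish
    print(device, (time.time() - start) / 10, 'sec per forward pass')
```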
Thanks for your reply! I am also wondering whether my graph is relatively big, as I got an out-of-memory issue when trying to train an RGCN model on 30,000 nodes and 40,000 edges on a 16 GB GPU. Thanks!
Note that there is a low_mem flag in our RelGraphConv module. Set it to True and see if the OOM issue still exists.
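For reference, a minimal sketch of constructing the layer with that flag (the flag exists in DGL 0.5.x-era releases; all sizes below are illustrative):

```python
# Hedged sketch: enabling low_mem on RelGraphConv (DGL 0.5.x-era API).
from dgl.nn.pytorch import RelGraphConv

conv = RelGraphConv(
    in_feat=64,           # input feature size (illustrative)
    out_feat=64,          # output feature size (illustrative)
    num_rels=10,          # number of relation types (illustrative)
    regularizer='basis',
    num_bases=4,
    low_mem=True,         # process edges per relation type instead of
                          # gathering a per-edge weight tensor, trading
                          # speed for a smaller peak memory footprint
)
```

Roughly, the low-memory path applies each relation's weight matrix to its own edges in turn, so it avoids materializing the large per-edge weight tensor that usually causes this kind of OOM.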
I think so, but by default the low_mem option is not activated.
When I activated low_mem in the RGCN example, it gave me this error message:

```
dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [23:40:23] /home/coulombc/wheels_builder/tmp.30875/python-3.8/dgl/src/array/cuda/utils.cu:19: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: invalid device ordinal
```
I am using python==3.8, torch==1.7, cuda==10.2
If low_mem is off, it returns an OOM error:

```
RuntimeError: CUDA out of memory. Tried to allocate 6.08 GiB (GPU 0; 15.78 GiB total capacity; 9.77 GiB already allocated; 4.61 GiB free; 9.92 GiB reserved in total by PyTorch)
```
Sorry, but DGL has a compatibility problem with PyTorch 1.7.
A patch version will be released this week; before then, you can try the DGL nightly build via:

```
pip install --pre dgl-cu102
```
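After installing, a quick sanity check of the stack (a minimal sketch; it just prints whatever versions are actually installed) is:

```python
# Print the installed DGL, PyTorch, and CUDA versions to confirm
# the environment matches what the patch/nightly build expects.
import dgl
import torch

print("dgl:", dgl.__version__)
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("gpu available:", torch.cuda.is_available())
```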
Thanks! I downgraded torch to 1.6 and it works!