PinSage: out of memory with CUDA 11.0

Hi,
I am trying to train a new PinSage model on my own dataset (10M edges, 0.8M item nodes with features). However, I noticed that the GPU memory usage increases slowly during training, and the whole procedure always fails after about ten epochs. Has anyone hit the same issue? I would appreciate your help, thanks.

Here is the error message:
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/home/dolphinfs_lilifeng/anaconda3/envs/myEnv/lib/python3.7/site-packages/dgl/sparse.py", line 233, in _gspmm
    arg_e_nd)
  File "dgl/_ffi/_cython/./function.pxi", line 293, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 239, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [15:27:43] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:114: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: out of memory

Looks like a memory leak problem. Could you check whether some tensors are not released during training?
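For example, a minimal sketch of logging GPU memory once per epoch (the log_cuda_memory helper is just for illustration, not part of the PinSage example): if the "allocated" figure keeps growing across epochs, something is holding references to GPU tensors.
"
import torch

def log_cuda_memory(tag):
    # Hypothetical helper, not from the PinSage example.
    # memory_allocated(): bytes currently held by live tensors.
    # memory_reserved(): bytes held by the caching allocator.
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print("{}: allocated={:.1f} MiB, reserved={:.1f} MiB".format(tag, allocated, reserved))

# Call once per epoch (or every N batches) during training.
log_cuda_memory("after epoch")
"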


I did not change any code from the master branch of the PinSage example in DGL.

I see. Which DGL and PyTorch versions did you use?


Hoo! I got it!
My code had a bug: accumulating the loss tensor here caused the GPU memory leak.
"
total_loss += loss
batch_id_ep = batch_id_ep + 1
if batch_id % 301 == 300:
    print("######## batch_id:{} total_loss:{} loss:{}".format(batch_id, total_loss / batch_id_ep, loss))
"

The issue disappeared after I changed the code to this:
"
total_loss += float(loss)
batch_id_ep = batch_id_ep + 1
if batch_id % 301 == 300:
    print("######## batch_id:{} total_loss:{} loss:{}".format(batch_id, total_loss / batch_id_ep, float(loss)))
"

Thanks czkkkkkk haha


Usually loss.item() is used to convert a tensor into a plain Python value (i.e., moved to the CPU).
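For reference, a minimal sketch of the difference (the toy model and data below are just for illustration, not the PinSage example): accumulating the raw loss tensor keeps every batch's autograd graph reachable, while loss.item() (or float(loss)) stores only a Python float.
"
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup just to illustrate the accumulation pattern; not the PinSage model.
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

total_loss = 0.0
for batch_id in range(100):
    x = torch.randn(32, 16)
    y = torch.randn(32, 1)
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    # loss.item() copies the scalar to the CPU and drops the reference to the
    # autograd graph, so each batch's graph can be freed. Writing
    # "total_loss += loss" instead would keep every graph reachable and
    # steadily grow memory (GPU memory when training on CUDA).
    total_loss += loss.item()

print("mean loss: {:.4f}".format(total_loss / 100))
"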


This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.