DGL 0.7: NCCL init error when using NVLink V100

lixusign · October 8, 2021, 8:43am

Hi. When I use NVLink V100 GPUs with NCCL 2.8.4 and store the embedding on 8 GPUs, the following error occurs:

NCCL cannot be initialized in the NCCL init code.

With NVLink V100, NCCL 2.8.4, and the embedding stored on 4 GPUs, everything works.

With PCIe V100, NCCL 2.8.4, and the embedding stored on 8 GPUs, everything works too.

Can anybody help?

When I set NCCL_DEBUG=INFO, there is a warning: "init.cc:902 NCCL WARN Cuda failure 'out of memory'"
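For anyone reproducing this, here is a minimal sketch of turning on the NCCL log from Python. `NCCL_DEBUG` is a real NCCL environment variable; the helper name `enable_nccl_debug` is only illustrative, and the sketch assumes the variable is set in every worker process before the NCCL communicator is created.

```python
import os


def enable_nccl_debug(level="INFO"):
    """Set NCCL_DEBUG for the current process.

    NCCL reads this variable when it initializes, so call this
    before any communicator is created (hypothetical helper name).
    """
    os.environ["NCCL_DEBUG"] = level
    return os.environ["NCCL_DEBUG"]


# Call this at the top of each worker, before any NCCL setup.
enable_nccl_debug("INFO")
```

Exporting the variable in the shell before launching the script should have the same effect.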

```
  File "/usr/local/lib64/python3.6/site-packages/dgl/cuda/nccl.py", line 75, in __init__
    self._handle = _CAPI_DGLNCCLCreateComm(size, rank, unique_id.get())
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:34:16] /opt/dgl/src/runtime/cuda/nccl_api.cu:556: NCCLError: ncclCommInitRank(&comm_, size_, id, rank_) failed with error: 1
Stack trace:
  [bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb9e46b52bf]
  [bt] (1) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::cuda::NCCLCommunicator::NCCLCommunicator(int, int, ncclUniqueId)+0x139) [0x7fb9e50de129]
  [bt] (2) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc74738) [0x7fb9e50de738]
  [bt] (3) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc75094) [0x7fb9e50df094]
  [bt] (4) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fb9e493a598]
  [bt] (5) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x167a3) [0x7fb9e40317a3]
  [bt] (6) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x16acb) [0x7fb9e4031acb]
  [bt] (7) /usr/lib64/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x90) [0x7fbd79255640]
  [bt] (8) /usr/lib64/libpython3.6m.so.1.0(+0x151bec) [0x7fbd792febec
```

VoVAllen · October 8, 2021, 9:02am

Hi,

This seems to be a bug. Can you raise an issue on the DGL repo and include your DGL version? Thanks.

lixusign · October 8, 2021, 9:17am

Thanks for the reply.

How can I raise an issue on the DGL repo? I am using the dgl-cu110 pip package with torch 1.7 and NCCL 2.8.4.

Also, is there a problem with this line in nccl_api.cu: "NCCL_CALL(ncclCommInitRank(&comm_, size_, id, rank_));"?
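On the "failed with error: 1" part of the message: the number is an NCCL `ncclResult_t` value. Here is a rough lookup sketch, assuming the enum order in `nccl.h` for the 2.8 series as I understand it (newer releases append more values). The dictionary and the function name `describe_nccl_error` are illustrative, not part of DGL or NCCL.

```python
# Assumed ncclResult_t values for NCCL 2.8 (taken from nccl.h as I
# understand it; verify against the header of your installed version).
NCCL_RESULT_NAMES = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
}


def describe_nccl_error(code):
    """Return the enum name for a numeric NCCL result code (hypothetical helper)."""
    return NCCL_RESULT_NAMES.get(code, "unknown NCCL result code %d" % code)


print(describe_nccl_error(1))
```

Under that assumption, code 1 is an unhandled CUDA error, which agrees with the "Cuda failure 'out of memory'" warning in the NCCL log. That would point at a CUDA call inside NCCL's initialization failing rather than at the `ncclCommInitRank` call site itself.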

VoVAllen · October 8, 2021, 9:23am

You can raise it on GitHub, in the issue tracker of the DGL repository (you need to sign in).
It seems that NCCL is not properly configured. The people who work on this part will answer your question in the issue.

lixusign · October 8, 2021, 9:31am

Thanks for the reply.
In my tests, this problem happens when some of the GPUs are on different NUMA nodes.

My `nvidia-smi topo -m` output is as follows:
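A small sketch for collecting the GPU topology matrix from Python, in case it helps others compare setups. It assumes `nvidia-smi` is on the PATH and simply returns `None` otherwise. The function name `gpu_topology_matrix` is illustrative.

```python
import shutil
import subprocess


def gpu_topology_matrix():
    """Return the text printed by `nvidia-smi topo -m`, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # text mode; works on Python 3.6
    )
    if result.returncode != 0:
        return None
    return result.stdout


matrix = gpu_topology_matrix()
print(matrix if matrix is not None else "nvidia-smi not available")
```

The matrix lists the link type between each GPU pair together with CPU affinity, which is where GPUs attached to different NUMA nodes show up.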

system · November 7, 2021, 9:32am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.