DGL 0.7: NCCL init error when using NVLink V100

Hi. When I use NVLink V100 GPUs with NCCL 2.8.4 and store the embedding across 8 GPUs, the following error occurs:

NCCL fails to initialize in the NCCL init code.

With NVLink V100, NCCL 2.8.4, and the embedding stored across 4 GPUs, everything works.

With PCIe V100, NCCL 2.8.4, and the embedding stored across 8 GPUs, everything works too.

Can anybody help?

When I set `NCCL_DEBUG=INFO`, there is a warning: "init.cc:902 NCCL WARN Cuda failure 'out of memory'".
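For anyone trying to reproduce this, here is a minimal sketch of turning on NCCL logging from Python instead of the shell. The helper name `enable_nccl_debug` is hypothetical and not part of DGL or NCCL. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables. The sketch assumes it runs in every training process before the NCCL communicator is created, since NCCL reads these variables at initialization.

```python
import os


def enable_nccl_debug(level="INFO", subsys=None):
    """Set NCCL logging variables for the current process (hypothetical helper).

    level:  value for NCCL_DEBUG, e.g. "WARN" or "INFO".
    subsys: optional value for NCCL_DEBUG_SUBSYS, e.g. "INIT" to focus on
            communicator initialization messages.
    Returns the NCCL_* variables visible to this process after the update.
    """
    os.environ["NCCL_DEBUG"] = level
    if subsys is not None:
        os.environ["NCCL_DEBUG_SUBSYS"] = subsys
    return {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}


if __name__ == "__main__":
    # Call this before any code that initializes NCCL, in each worker process.
    print(enable_nccl_debug("INFO", subsys="INIT"))
```

Child processes started after the call inherit the variables, so setting them in the launcher before spawning workers should also work.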

```
  File "/usr/local/lib64/python3.6/site-packages/dgl/cuda/nccl.py", line 75, in __init__
    self._handle = _CAPI_DGLNCCLCreateComm(size, rank, unique_id.get())
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:34:16] /opt/dgl/src/runtime/cuda/nccl_api.cu:556: NCCLError: ncclCommInitRank(&comm_, size_, id, rank_) failed with error: 1
Stack trace:
  [bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb9e46b52bf]
  [bt] (1) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::cuda::NCCLCommunicator::NCCLCommunicator(int, int, ncclUniqueId)+0x139) [0x7fb9e50de129]
  [bt] (2) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc74738) [0x7fb9e50de738]
  [bt] (3) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc75094) [0x7fb9e50df094]
  [bt] (4) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fb9e493a598]
  [bt] (5) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x167a3) [0x7fb9e40317a3]
  [bt] (6) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x16acb) [0x7fb9e4031acb]
  [bt] (7) /usr/lib64/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x90) [0x7fbd79255640]
  [bt] (8) /usr/lib64/libpython3.6m.so.1.0(+0x151bec) [0x7fbd792febec
```

Hi,

This seems to be a bug. Could you raise an issue in the DGL repo and include your DGL version? Thanks.

Thanks for the reply.

How can I raise an issue in the DGL repo? I am using the dgl-cu110 pip package with torch 1.7 and NCCL 2.8.4.

Also, is there any problem with the line `NCCL_CALL(ncclCommInitRank(&comm_, size_, id, rank_));` in nccl_api.cu?

You can open an issue on the DGL GitHub repository (you need to be signed in to GitHub).
It seems that NCCL is not properly configured. We'll have the people working on this part answer your question in the issue.

Thanks for the reply.
In my tests, this problem happens when some of the GPUs are on different NUMA nodes.

The output of `nvidia-smi topo -m` on my machine is as follows: