Hi, when I use NVLink V100 GPUs with NCCL 2.8.4 and store the embedding across 8 GPUs, NCCL fails to initialize in the NCCL init code (full traceback below).

With the same NVLink V100 setup and NCCL 2.8.4, storing the embedding across 4 GPUs works fine.

With PCIe V100 GPUs and NCCL 2.8.4, storing the embedding across 8 GPUs works fine too.

Can anybody help?
When I set NCCL_DEBUG=INFO, I see this warning: "init.cc:902 NCCL WARN Cuda failure 'out of memory'"
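Since the warning mentions CUDA running out of memory, one way to narrow this down is to check how much memory is free on each GPU right before the NCCL communicator is created. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names (`parse_gpu_memory`, `free_mib_per_gpu`, `query_gpu_memory`) are hypothetical and not part of DGL or NCCL.

```python
import subprocess

# Real nvidia-smi query flags: one CSV row per GPU, values in MiB, no header.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def free_mib_per_gpu(rows):
    """Map GPU index to free memory in MiB (total minus used)."""
    return {index: total - used for index, used, total in rows}


def query_gpu_memory():
    """Run nvidia-smi and return the parsed rows (needs an NVIDIA driver)."""
    result = subprocess.run(
        QUERY,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6-compatible spelling of text=True
    )
    return parse_gpu_memory(result.stdout)
```

Calling `free_mib_per_gpu(query_gpu_memory())` in each process just before the communicator is created would show whether some GPUs are already nearly full in the 8-GPU NVLink case. This only helps confirm or rule out a real memory shortage. It does not by itself explain why the PCIe machine works.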
```
File "/usr/local/lib64/python3.6/site-packages/dgl/cuda/nccl.py", line 75, in __init__
self._handle = _CAPI_DGLNCCLCreateComm(size, rank, unique_id.get())
File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:34:16] /opt/dgl/src/runtime/cuda/nccl_api.cu:556: NCCLError: ncclCommInitRank(&comm_, size_, id, rank_) failed with error: 1
Stack trace:
[bt] (0) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fb9e46b52bf]
[bt] (1) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::cuda::NCCLCommunicator::NCCLCommunicator(int, int, ncclUniqueId)+0x139) [0x7fb9e50de129]
[bt] (2) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc74738) [0x7fb9e50de738]
[bt] (3) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(+0xc75094) [0x7fb9e50df094]
[bt] (4) /usr/local/lib64/python3.6/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fb9e493a598]
[bt] (5) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x167a3) [0x7fb9e40317a3]
[bt] (6) /usr/local/lib64/python3.6/site-packages/dgl/_ffi/_cy3/core.cpython-36m-x86_64-linux-gnu.so(+0x16acb) [0x7fb9e4031acb]
[bt] (7) /usr/lib64/libpython3.6m.so.1.0(_PyObject_FastCallDict+0x90) [0x7fbd79255640]
[bt] (8) /usr/lib64/libpython3.6m.so.1.0(+0x151bec) [0x7fbd792febec