Hi, I ran into a bug when training OGB-Products on an 8-node cluster.
With a small batch size, e.g., 1024, the last worker finishes the first epoch and produces correct output, but all the other workers fail with the following error message:
** On entry to cusparseCreateCsr() dimension mismatch: nnz > rows * cols
Traceback (most recent call last):
File "baseline/graphsage/train_dist_sage.py", line 392, in <module>
main(args)
File "baseline/graphsage/train_dist_sage.py", line 349, in main
run(args, device, data)
File "baseline/graphsage/train_dist_sage.py", line 250, in run
batch_pred = model(blocks, batch_inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "baseline/graphsage/train_dist_sage.py", line 88, in forward
h = layer(block, h)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/conv/sageconv.py", line 231, in forward
graph.update_all(msg_fn, fn.mean('m', 'neigh'))
File "/usr/local/lib/python3.6/dist-packages/dgl/heterograph.py", line 4686, in update_all
ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
File "/usr/local/lib/python3.6/dist-packages/dgl/core.py", line 283, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/usr/local/lib/python3.6/dist-packages/dgl/core.py", line 258, in invoke_gspmm
z = op(graph, x)
File "/usr/local/lib/python3.6/dist-packages/dgl/ops/spmm.py", line 170, in func
return gspmm(g, 'copy_lhs', reduce_op, x, None)
File "/usr/local/lib/python3.6/dist-packages/dgl/ops/spmm.py", line 64, in gspmm
lhs_data, rhs_data)
File "/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/sparse.py", line 307, in gspmm
return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
File "/usr/local/lib/python3.6/dist-packages/torch/cuda/amp/autocast_mode.py", line 217, in decorate_fwd
return fwd(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/sparse.py", line 87, in forward
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
File "/usr/local/lib/python3.6/dist-packages/dgl/sparse.py", line 162, in _gspmm
arg_e_nd)
File "/usr/local/lib/python3.6/dist-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/usr/local/lib/python3.6/dist-packages/dgl/_ffi/base.py", line 64, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [11:01:50] /opt/dgl/src/array/cuda/spmm.cu:233: Check failed: e == CUSPARSE_STATUS_SUCCESS: CUSPARSE ERROR: 3
Stack trace:
[bt] (0) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f4884d1f32f]
[bt] (1) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(void dgl::aten::cusparse::CusparseCsrmm2<float, long>(DLContext const&, dgl::aten::CSRMatrix const&, float const*, float const*, float*, int)+0x160) [0x7f4885925300]
[bt] (2) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(void dgl::aten::SpMMCsr<2, long, 32>(std::string const&, std::string const&, dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0xdc) [0x7f488596e95c]
[bt] (3) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x2633) [0x7f4884e7e593]
[bt] (4) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(+0x6b6838) [0x7f4884e8a838]
[bt] (5) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(+0x6b6dd1) [0x7f4884e8add1]
[bt] (6) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f4885423538]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f498946ddae]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f498946d71f]
However, things work fine with a large batch size, e.g., 16384.
Could anyone give me a suggestion on what might be causing this?
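For reference, the dataloader in my script is set up roughly like the sketch below. It is simplified from the standard DGL distributed GraphSAGE example, so the configuration values (ip config path, graph name, partition config, fan-outs) are placeholders rather than my exact settings; only the batch size is what I actually vary.

import numpy as np
import torch as th
import dgl

# Placeholder values -- the real script reads these from argparse.
IP_CONFIG = 'ip_config.txt'
GRAPH_NAME = 'ogbn-products'
PART_CONFIG = 'data/ogbn-products.json'
FANOUTS = [5, 10, 15]
BATCH_SIZE = 1024  # 1024 triggers the cuSPARSE error; 16384 works

dgl.distributed.initialize(IP_CONFIG)
th.distributed.init_process_group(backend='gloo')
g = dgl.distributed.DistGraph(GRAPH_NAME, part_config=PART_CONFIG)

# Each worker gets its own slice of the training nodes.
train_nid = dgl.distributed.node_split(g.ndata['train_mask'],
                                       g.get_partition_book())

def sample_blocks(seeds):
    """Multi-layer neighbor sampling on the distributed graph."""
    seeds = th.LongTensor(np.asarray(seeds))
    blocks = []
    for fanout in FANOUTS:
        frontier = dgl.distributed.sample_neighbors(g, seeds, fanout, replace=True)
        block = dgl.to_block(frontier, seeds)
        seeds = block.srcdata[dgl.NID]
        blocks.insert(0, block)
    return blocks

dataloader = dgl.distributed.DistDataLoader(
    dataset=train_nid.numpy(),
    batch_size=BATCH_SIZE,
    collate_fn=sample_blocks,
    shuffle=True,
    drop_last=False)

Each worker then iterates over this dataloader and calls model(blocks, batch_inputs), which is the line where the cuSPARSE error above is raised.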