Environment
DGL Version (e.g., 1.0): 0.5.2
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.6
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): source
Build command you used (if compiling from source):
mkdir build
cd build
cmake -DUSE_CUDA=ON …
make -j4
cd …/python
python setup.py install
Python version: 3.6
CUDA/cuDNN version (if applicable): 10.2
GPU models and configuration (e.g. V100): V100
Any other relevant information:
My traceback is as follows:
Traceback (most recent call last):
File "train/train_dist_trainer.py", line 348, in <module>
main(args)
File "train/train_dist_trainer.py", line 290, in main
run(args, device, data)
File "train/train_dist_trainer.py", line 226, in run
batch_pred = model(blocks, batch_inputs)
File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "train/train_dist_trainer.py", line 82, in forward
h = layer(block, h)
File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/nn/pytorch/conv/sageconv.py", line 192, in forward
graph.update_all(fn.copy_src('h', 'm'), fn.mean('m', 'neigh'))
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/heterograph.py", line 4501, in update_all
ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
z = op(graph, x)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 170, in func
return gspmm(g, 'copy_lhs', reduce_op, x, None)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 64, in gspmm
lhs_data, rhs_data)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/sparse.py", line 157, in _gspmm
arg_e_nd)
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 190, in __call__
ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/base.py", line 62, in check_call
raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [19:08:26] /sources/dgl/src/array/cuda/coo_sort.cu:160: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: no kernel image is available for execution on the device
Stack trace:
[bt] (0) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f88698ce9ff]
[bt] (1) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(std::pair<bool, bool> dgl::aten::impl::COOIsSorted<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x252) [0x7f886a1179d3]
[bt] (2) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOIsSorted(dgl::aten::COOMatrix)+0x1e3) [0x7f88698b3603]
[bt] (3) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0xb4) [0x7f886a11509f]
[bt] (4) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x3f3) [0x7f88698b22c3]
[bt] (5) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetInCSR(bool) const+0x300) [0x7f886a0976f0]
[bt] (6) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetCSCMatrix(unsigned long) const+0x16) [0x7f886a097a66]
[bt] (7) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::HeteroGraph::GetCSCMatrix(unsigned long) const+0x23) [0x7f8869fca693]
[bt] (8) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x1cb9) [0x7f88699d6059]
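The error "no kernel image is available for execution on the device" generally indicates that the loaded binary contains no code for the GPU's compute capability. The sketch below illustrates that compatibility rule in plain Python. It is only an illustration under stated assumptions: the function name `kernel_image_available` and the example architecture lists are hypothetical, and the architectures actually compiled into this `libdgl.so` are not known from the report.

```python
def kernel_image_available(capability, compiled_archs):
    """Return True if a build containing `compiled_archs` can run on a GPU
    with the given (major, minor) compute capability.

    Rules applied (general CUDA compatibility rules, not DGL-specific):
      * "sm_XY" (binary code) runs on devices with the same major version
        and a minor version >= Y.
      * "compute_XY" (PTX) can be JIT-compiled for any device whose
        capability is >= X.Y.
    Entries that do not match these two plain forms are ignored.
    """
    major, minor = capability
    for arch in compiled_archs:
        kind, _, digits = arch.partition("_")
        if len(digits) < 2 or not digits.isdigit():
            continue  # skip anything that is not sm_XY / compute_XY
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


# V100 has compute capability 7.0.
V100 = (7, 0)

# Hypothetical arch lists, for illustration only.
print(kernel_image_available(V100, ["sm_35", "sm_50", "sm_60"]))  # False
print(kernel_image_available(V100, ["sm_60", "sm_70"]))           # True
print(kernel_image_available(V100, ["compute_60"]))               # True (PTX JIT)
```

On the machine itself, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, which can be passed as `capability`.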