DGL 0.5.2 built from source: error when running the distributed demo


Environment
DGL Version (e.g., 1.0): 0.5.2
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.6
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): source
Build command you used (if compiling from source):

mkdir build
cd build
cmake -DUSE_CUDA=ON …
make -j4
cd …/python
python setup.py install

Python version: 3.6
CUDA/cuDNN version (if applicable): 10.2
GPU models and configuration (e.g. V100): V100
Any other relevant information:


My traceback is as follows:

Traceback (most recent call last):
  File "train/train_dist_trainer.py", line 348, in <module>
    main(args)
  File "train/train_dist_trainer.py", line 290, in main
    run(args, device, data)
  File "train/train_dist_trainer.py", line 226, in run
    batch_pred = model(blocks, batch_inputs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "train/train_dist_trainer.py", line 82, in forward
    h = layer(block, h)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/nn/pytorch/conv/sageconv.py", line 192, in forward
    graph.update_all(fn.copy_src('h', 'm'), fn.mean('m', 'neigh'))
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/heterograph.py", line 4501, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
    z = op(graph, x)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 170, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 64, in gspmm
    lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/sparse.py", line 157, in _gspmm
    arg_e_nd)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [19:08:26] /sources/dgl/src/array/cuda/coo_sort.cu:160: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: no kernel image is available for execution on the device
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f88698ce9ff]
  [bt] (1) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(std::pair<bool, bool> dgl::aten::impl::COOIsSorted<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x252) [0x7f886a1179d3]
  [bt] (2) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOIsSorted(dgl::aten::COOMatrix)+0x1e3) [0x7f88698b3603]
  [bt] (3) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0xb4) [0x7f886a11509f]
  [bt] (4) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x3f3) [0x7f88698b22c3]
  [bt] (5) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetInCSR(bool) const+0x300) [0x7f886a0976f0]
  [bt] (6) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetCSCMatrix(unsigned long) const+0x16) [0x7f886a097a66]
  [bt] (7) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::HeteroGraph::GetCSCMatrix(unsigned long) const+0x23) [0x7f8869fca693]
  [bt] (8) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x1cb9) [0x7f88699d6059]

Installing dgl-cu101 with pip works fine. The source build also compiles without errors, but at runtime it reports that no CUDA kernel image is available. Why?
I need help.

Does your PyTorch work fine?

Yes, I tested PyTorch on its own with the following code and it works:


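import numpy as np
import torch
from torchvision import models

# Confirm that PyTorch can see the GPU
print(torch.cuda.is_available())

# Random float32 batch of two 224x224 RGB images.
# astype() converts the values; assigning to image.dtype would only reinterpret the raw bytes.
image = np.random.random(size=[2, 3, 224, 224]).astype('float32')

image_tensor = torch.from_numpy(image).cuda()

# Run a pretrained ResNet-50 forward pass on the GPU
model = models.resnet50(pretrained=True)
model = model.cuda()

out = model(image_tensor)
print(out)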

My NVIDIA driver is 450.51.06.
My CUDA version is V10.2.89.

import torch

print(torch.backends.cudnn.enabled)

True

print(torch.__version__)

1.6.0

print(torch.cuda.is_available())

True

device = torch.device('cuda')

print(torch.cuda.get_device_properties(device))

_CudaDeviceProperties(name='Tesla V100-PCIE-16GB', major=7, minor=0, total_memory=16160MB, multi_processor_count=80)

print(torch.tensor([1.0, 2.0]).cuda())

tensor([1., 2.], device='cuda:0')

Are you using conda or pure python(pip) for your environment?

I use pure Python (pip).

My installed Python packages are as follows:
absl-py (0.10.0)
cachetools (4.1.1)
certifi (2020.6.20)
chardet (3.0.4)
decorator (4.4.2)
dgl (0.6)
future (0.18.2)
google-auth (1.22.1)
google-auth-oauthlib (0.4.1)
grpcio (1.32.0)
idna (2.10)
importlib-metadata (2.0.0)
joblib (0.17.0)
littleutils (0.2.2)
Markdown (3.3)
networkx (2.5)
numpy (1.19.2)
oauthlib (3.1.0)
ogb (1.2.3)
outdated (0.2.0)
pandas (1.1.3)
pathlib (1.0.1)
Pillow (7.2.0)
pip (9.0.3)
protobuf (3.13.0)
pyasn1 (0.4.8)
pyasn1-modules (0.2.8)
pyinstrument (3.2.0)
pyinstrument-cext (0.2.2)
python-dateutil (2.8.1)
pytz (2020.1)
requests (2.24.0)
requests-oauthlib (1.3.0)
rsa (4.6)
scikit-learn (0.23.2)
scipy (1.5.2)
setuptools (50.3.0)
six (1.15.0)
tensorboard (2.3.0)
tensorboard-plugin-wit (1.7.0)
threadpoolctl (2.1.0)
torch (1.6.0)
torchvision (0.7.0)
tqdm (4.50.2)
urllib3 (1.25.10)
Werkzeug (1.0.1)
wheel (0.35.1)
zipp (3.3.0)

My yum packages are as follows:

cuda-cudart-10-2-10.2.89-1
cuda-compat-10-2
cuda-libraries-10-2-10.2.89-1
cuda-nvtx-10-2-10.2.89-1
cuda-npp-10-2-10.2.89-1
libcublas10-10.2.2.89-1
cuda-nvml-dev-10-2-10.2.89-1
cuda-command-line-tools-10-2-10.2.89-1
cuda-cudart-dev-10-2-10.2.89-1
cuda-libraries-dev-10-2-10.2.89-1
cuda-minimal-build-10-2-10.2.89-1
cuda-nvprof-10-2-10.2.89-1
cuda-npp-dev-10-2-10.2.89-1
libcublas-devel-10.2.2.89-1
ethtool less libibverbs-devel make net-tools nload libsecret
numactl numactl-devel patch sed sysstat unzip vim-enhanced which wget zlib-devel

This is weird; we have never seen this problem before. V100 works well on my side under CUDA 10.2. Do you have multiple CUDA versions on your machine? Our released library is compiled for SM architectures up to 7.5, which should be fine for V100. How was your CUDA driver installed?

Only CUDA 10.2 is installed in the Docker container:

cat /usr/local/cuda/version.txt
CUDA Version 10.2.89

The nvidia-smi output for the driver is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   33C    P0    30W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

When I use the dgl-cu102 binary package, the size of libdgl.so is:
232028896 Oct 12 16:46 libdgl.so

When I compile DGL from source with CUDA 10.2, the size of libdgl.so is:
115488280 Oct 12 16:40 libdgl.so

Why is there such a huge difference?

I get the same error with CUDA 10.1.

When compiling, it shows "Found CUDA arch 5.2 5.2".
What does this mean?

The build also shows "NVCC extra flags: -gencode arch=compute_52,code=sm_52". Is that suitable for CUDA 10.x?
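
For reference, a `-gencode arch=compute_XY,code=sm_XY` flag asks nvcc to embed machine code (SASS) for exactly that GPU architecture, while `code=compute_XY` would embed PTX instead. Below is a minimal sketch of how to read the flag from the build log; `parse_gencode` is a hypothetical helper written for illustration and is not part of DGL or nvcc.

```python
import re

def parse_gencode(flags):
    """Return (virtual_arch, code_target) pairs found in an nvcc flag string.

    Hypothetical helper for illustration only; it handles the simple
    `arch=compute_XY,code=<single target>` form shown in the build log.
    """
    pattern = re.compile(r"-gencode\s+arch=(compute_\d+),code=((?:sm|compute)_\d+)")
    return pattern.findall(flags)

flags = "-gencode arch=compute_52,code=sm_52"
targets = parse_gencode(flags)
print(targets)

# code=sm_XY embeds SASS for that architecture; code=compute_XY embeds PTX
sass = [code for _, code in targets if code.startswith("sm_")]
ptx = [code for _, code in targets if code.startswith("compute_")]
print("SASS targets:", sass)
print("PTX targets:", ptx)
```

So the flag itself is valid for CUDA 10.x, but it targets compute capability 5.2 only and embeds no PTX.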

Sorry, I built from source with CUDA arch 5.2, but I run the algorithm on Tesla V100 cards, so I think this is the problem.
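
That diagnosis matches how CUDA binary compatibility works: SASS built for sm_XY runs only on devices with the same major version and a minor version of at least Y, while embedded PTX can be JIT-compiled for any device of equal or higher capability. Here is a rough sketch of that check, using the (7, 0) capability reported for the V100 earlier in this thread; `has_kernel_image` is a hypothetical helper, not a DGL or CUDA API.

```python
def has_kernel_image(device_cc, sass_archs, ptx_archs=()):
    """Return True if a device with compute capability `device_cc` can run
    code built for the given SASS and PTX architectures.

    Hypothetical helper; all arguments are (major, minor) tuples
    or sequences of such tuples.
    """
    major, minor = device_cc
    # SASS is binary compatible only within the same major version,
    # and only when the device minor version is at least the built one
    for s_major, s_minor in sass_archs:
        if s_major == major and s_minor <= minor:
            return True
    # PTX can be JIT-compiled for any device at least as new as the virtual arch
    for p_arch in ptx_archs:
        if tuple(p_arch) <= (major, minor):
            return True
    return False

v100 = (7, 0)
# Build from this thread: SASS for sm_52 only, no PTX
print(has_kernel_image(v100, sass_archs=[(5, 2)]))
# A build that also targets sm_70
print(has_kernel_image(v100, sass_archs=[(5, 2), (7, 0)]))
```

Under these rules an sm_52-only build has nothing the V100 can execute, which is consistent with the "no kernel image is available" error.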

So I want to ask: how do I set an arch list for DGL at compile time, the way PyTorch allows?
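
For comparison, PyTorch reads the `TORCH_CUDA_ARCH_LIST` environment variable in a form such as `"5.2;7.0+PTX"`. The sketch below shows how a list in that format maps onto nvcc `-gencode` flags. `arch_list_to_gencode` is a hypothetical helper for illustration, and this thread does not confirm which option DGL's CMake build uses to select architectures.

```python
def arch_list_to_gencode(arch_list):
    """Translate an arch list such as "5.2;7.0+PTX" into nvcc -gencode flags.

    Hypothetical helper mirroring the TORCH_CUDA_ARCH_LIST format;
    entries may be separated by semicolons or spaces.
    """
    flags = []
    for entry in arch_list.replace(";", " ").split():
        want_ptx = entry.endswith("+PTX")
        number = entry[:-len("+PTX")] if want_ptx else entry
        major, minor = number.split(".")
        cc = major + minor
        # SASS for this exact architecture
        flags.append(f"-gencode arch=compute_{cc},code=sm_{cc}")
        if want_ptx:
            # PTX so that newer GPUs can JIT-compile the kernels
            flags.append(f"-gencode arch=compute_{cc},code=compute_{cc}")
    return flags

for flag in arch_list_to_gencode("5.2;7.0+PTX"):
    print(flag)
```

For a V100, the important part is that the resulting flags include `code=sm_70`, or PTX for an architecture no newer than 7.0.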

Thanks, I have resolved this problem.