DGL 0.5.2 compiled from source: error when running the distributed demo

lixusign · October 10, 2020, 11:17am

Environment
DGL Version (e.g., 1.0): 0.5.2
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.6
OS (e.g., Linux): Linux
How you installed DGL (conda, pip, source): source
Build command you used (if compiling from source):

mkdir build
cd build
cmake -DUSE_CUDA=ON ..
make -j4
cd ../python
python setup.py install

Python version: 3.6
CUDA/cuDNN version (if applicable): 10.2
GPU models and configuration (e.g. V100): V100
Any other relevant information:

My traceback is as follows:

Traceback (most recent call last):
  File "train/train_dist_trainer.py", line 348, in <module>
    main(args)
  File "train/train_dist_trainer.py", line 290, in main
    run(args, device, data)
  File "train/train_dist_trainer.py", line 226, in run
    batch_pred = model(blocks, batch_inputs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "train/train_dist_trainer.py", line 82, in forward
    h = layer(block, h)
  File "/usr/local/lib64/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/nn/pytorch/conv/sageconv.py", line 192, in forward
    graph.update_all(fn.copy_src('h', 'm'), fn.mean('m', 'neigh'))
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/heterograph.py", line 4501, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
    z = op(graph, x)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 170, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/ops/spmm.py", line 64, in gspmm
    lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/sparse.py", line 157, in _gspmm
    arg_e_nd)
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/_ffi/base.py", line 62, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [19:08:26] /sources/dgl/src/array/cuda/coo_sort.cu:160: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA kernel launch error: no kernel image is available for execution on the device
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f88698ce9ff]
  [bt] (1) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(std::pair<bool, bool> dgl::aten::impl::COOIsSorted<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0x252) [0x7f886a1179d3]
  [bt] (2) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOIsSorted(dgl::aten::COOMatrix)+0x1e3) [0x7f88698b3603]
  [bt] (3) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DLDeviceType)2, long>(dgl::aten::COOMatrix)+0xb4) [0x7f886a11509f]
  [bt] (4) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x3f3) [0x7f88698b22c3]
  [bt] (5) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetInCSR(bool) const+0x300) [0x7f886a0976f0]
  [bt] (6) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::UnitGraph::GetCSCMatrix(unsigned long) const+0x16) [0x7f886a097a66]
  [bt] (7) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::HeteroGraph::GetCSCMatrix(unsigned long) const+0x23) [0x7f8869fca693]
  [bt] (8) /usr/local/lib/python3.6/site-packages/dgl-0.6-py3.6-linux-x86_64.egg/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x1cb9) [0x7f88699d6059]

lixusign · October 12, 2020, 1:05am

Installing dgl-cu101 with pip works. Compiling from source also succeeds, but at run time it reports that no CUDA kernel image is available. Why?
I need help.

VoVAllen · October 12, 2020, 4:41am

Does your PyTorch work fine?

lixusign · October 12, 2020, 5:33am

Yes, I tested PyTorch alone with the following code and it works:

import numpy as np
import torch
from torchvision import models

print(torch.cuda.is_available())

# Random batch of two 224x224 RGB images, converted to float32 for the model.
image = np.random.random(size=[2, 3, 224, 224]).astype('float32')

image_tensor = torch.from_numpy(image).cuda()

model = models.resnet50(pretrained=True)
model = model.cuda()

out = model(image_tensor)
print(out)

lixusign · October 12, 2020, 5:39am

My NVIDIA driver is 450.51.06.
CUDA is V10.2.89.

lixusign · October 12, 2020, 6:12am

import torch

print(torch.backends.cudnn.enabled)

True

print(torch.__version__)

1.6.0

print(torch.cuda.is_available())

True

device = torch.device('cuda')

print(torch.cuda.get_device_properties(device))

_CudaDeviceProperties(name='Tesla V100-PCIE-16GB', major=7, minor=0, total_memory=16160MB, multi_processor_count=80)

print(torch.tensor([1.0, 2.0]).cuda())

tensor([1., 2.], device='cuda:0')
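
The major=7, minor=0 fields in the device properties are the GPU's compute capability (7.0 for a V100), which is the value a CUDA build has to target. As a rough sketch, the snippet below turns a compute capability into the matching nvcc -gencode option. The helper names gencode_flag and local_capability are made up for this illustration; torch.cuda.get_device_capability is the real PyTorch call, and the code falls back to the V100 values from this thread if PyTorch or a GPU is not available.

```python
def gencode_flag(major, minor):
    """Build the nvcc -gencode option for one compute capability."""
    arch = f"{major}{minor}"
    return f"-gencode arch=compute_{arch},code=sm_{arch}"


def local_capability():
    """Return (major, minor) of GPU 0, or None if PyTorch or a GPU is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# Assumption: fall back to the V100 values (7, 0) reported in this thread.
capability = local_capability() or (7, 0)
print(gencode_flag(*capability))
```

Comparing this string with the "NVCC extra flags" line printed during the build is a quick way to see whether the build targets the GPU that will run the code.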

VoVAllen · October 12, 2020, 6:59am

Are you using conda or pure python(pip) for your environment?

lixusign · October 12, 2020, 7:20am

I use pure Python (pip).

My installed packages are as follows:
absl-py (0.10.0)
cachetools (4.1.1)
certifi (2020.6.20)
chardet (3.0.4)
decorator (4.4.2)
dgl (0.6)
future (0.18.2)
google-auth (1.22.1)
google-auth-oauthlib (0.4.1)
grpcio (1.32.0)
idna (2.10)
importlib-metadata (2.0.0)
joblib (0.17.0)
littleutils (0.2.2)
Markdown (3.3)
networkx (2.5)
numpy (1.19.2)
oauthlib (3.1.0)
ogb (1.2.3)
outdated (0.2.0)
pandas (1.1.3)
pathlib (1.0.1)
Pillow (7.2.0)
pip (9.0.3)
protobuf (3.13.0)
pyasn1 (0.4.8)
pyasn1-modules (0.2.8)
pyinstrument (3.2.0)
pyinstrument-cext (0.2.2)
python-dateutil (2.8.1)
pytz (2020.1)
requests (2.24.0)
requests-oauthlib (1.3.0)
rsa (4.6)
scikit-learn (0.23.2)
scipy (1.5.2)
setuptools (50.3.0)
six (1.15.0)
tensorboard (2.3.0)
tensorboard-plugin-wit (1.7.0)
threadpoolctl (2.1.0)
torch (1.6.0)
torchvision (0.7.0)
tqdm (4.50.2)
urllib3 (1.25.10)
Werkzeug (1.0.1)
wheel (0.35.1)
zipp (3.3.0)

lixusign · October 12, 2020, 7:39am

My yum packages are as follows:

cuda-cudart-10-2-10.2.89-1
cuda-compat-10-2
cuda-libraries-10-2-10.2.89-1
cuda-nvtx-10-2-10.2.89-1
cuda-npp-10-2-10.2.89-1
libcublas10-10.2.2.89-1
cuda-nvml-dev-10-2-10.2.89-1
cuda-command-line-tools-10-2-10.2.89-1
cuda-cudart-dev-10-2-10.2.89-1
cuda-libraries-dev-10-2-10.2.89-1
cuda-minimal-build-10-2-10.2.89-1
cuda-nvprof-10-2-10.2.89-1
cuda-npp-dev-10-2-10.2.89-1
libcublas-devel-10.2.2.89-1
ethtool less libibverbs-devel make net-tools nload libsecret
numactl numactl-devel patch sed sysstat unzip vim-enhanced which wget zlib-devel

VoVAllen · October 12, 2020, 7:50am

This is weird; we have never seen this problem before. V100 works fine on my side with CUDA 10.2. Do you have multiple CUDA versions on your machine? Our library is compiled for SM architectures up to 7.5, which should cover V100. How was your CUDA driver installed?

lixusign · October 12, 2020, 8:07am

Only CUDA 10.2 is installed in the Docker container.

cat /usr/local/cuda/version.txt
CUDA Version 10.2.89

The nvidia-smi output is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   33C    P0    30W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

lixusign · October 13, 2020, 3:38am

When I use the dgl-cu102 binary package, the size of libdgl.so is:
232028896 Oct 12 16:46 libdgl.so

When I compile DGL from source with CUDA 10.2, the size of libdgl.so is:
115488280 Oct 12 16:40 libdgl.so

Why is there such a huge difference?

lixusign · October 13, 2020, 9:46am

CUDA 10.1 gives the same error.

lixusign · October 13, 2020, 10:32am

During compilation, it shows "Found CUDA arch 5.2 5.2".
What does this mean?

lixusign · October 13, 2020, 10:42am

The build shows "NVCC extra flags: -gencode arch=compute_52,code=sm_52". Is that suitable for CUDA 10.x?

lixusign · October 13, 2020, 12:13pm

Sorry, I built from source with CUDA arch 5.2, but I run the algorithm on Tesla V100 cards, so I think this is the problem.

So I want to ask: how do I set ARCH_LIST for DGL when compiling, as in PyTorch?
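
The mismatch between a build for arch 5.2 and a V100 (compute capability 7.0) can be illustrated with a small check. This is only a sketch of the general CUDA compatibility rule, not DGL code: machine code built with code=sm_XY runs only on devices with the same major version and an equal or higher minor version, while embedded PTX (code=compute_XY) can be JIT-compiled for any equal or newer device. The function names parse_gencode and can_run are hypothetical, and the parser only handles the simple flag form printed by the build log, not the bracketed code=[...] form.

```python
import re

# Matches one nvcc option of the form "-gencode arch=compute_XY,code=sm_XY"
# (or code=compute_XY when PTX is embedded instead of machine code).
GENCODE_RE = re.compile(r"-gencode\s+arch=compute_(\d+),code=(sm|compute)_(\d+)")


def parse_gencode(flags):
    """Return a list of (kind, major, minor) targets found in an nvcc flag string."""
    targets = []
    for _virtual, kind, real in GENCODE_RE.findall(flags):
        # The last digit is the minor version, the rest is the major version.
        targets.append((kind, int(real[:-1]), int(real[-1])))
    return targets


def can_run(flags, major, minor):
    """Check whether code built with `flags` has a usable image for a device."""
    for kind, t_major, t_minor in parse_gencode(flags):
        if kind == "sm":
            # Native machine code only runs within the same major version,
            # on devices whose minor version is not lower.
            if t_major == major and t_minor <= minor:
                return True
        elif (t_major, t_minor) <= (major, minor):
            # Embedded PTX can be JIT-compiled for any equal or newer device.
            return True
    return False


build_flags = "-gencode arch=compute_52,code=sm_52"
print(can_run(build_flags, 7, 0))  # False: no image for a 7.0 device
print(can_run(build_flags + " -gencode arch=compute_70,code=sm_70", 7, 0))  # True
```

With only sm_52 machine code and no PTX in the library, a 7.0 device has nothing it can load, which matches the "no kernel image is available for execution on the device" error in the traceback.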

lixusign · October 13, 2020, 1:05pm

Thanks, I have resolved this problem.
