[Error][PyTorch 1.11] cudaErrorCudartUnloading when using DGL on a GPU

anton-sturluson · March 11, 2022, 9:08pm

I’m not able to move any graph to the GPU.

GPU: A100
CUDA version: 11.3 (confirmed with nvcc --version)
PyTorch version: 1.11.0
Python version: tried 3.8 and 3.9
Package manager: miniconda
Command used to install DGL: conda install -c dglteam dgl-cuda11.3

PyTorch on its own works fine on the GPU.

Failing code and error:

import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset

torch.cuda.is_available()
# True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataset = DglNodePropPredDataset('ogbn-arxiv')
# dataset[0] is a (graph, labels) pair; moving the graph to the GPU raises the error below
dataset[0][0].to(device)

DGLError: [21:03:58] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:97: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: the provided PTX was compiled with an unsupported toolchain.
Stack trace:
  [bt] (0) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f3b93fde57f]
  [bt] (1) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::AllocDataSpace(DLContext, unsigned long, unsigned long, DLDataType)+0x108) [0x7f3b944b6828]
  [bt] (2) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext)+0x351) [0x7f3b94324b51]
  [bt] (3) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::CopyTo(DLContext const&, void* const&) const+0xc7) [0x7f3b9435ed87]
  [bt] (4) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dgl::UnitGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&, void* const&)+0x2ff) [0x7f3b9447985f]
  [bt] (5) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(dgl::HeteroGraph::CopyTo(std::shared_ptr<dgl::BaseHeteroGraph>, DLContext const&, void* const&)+0x109) [0x7f3b94370979]
  [bt] (6) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(+0x6aec89) [0x7f3b9437dc89]
  [bt] (7) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f3b94302ea8]
  [bt] (8) /home/miniconda3/envs/gnn-p38/lib/python3.8/site-packages/dgl/_ffi/_cy3/core.cpython-38-x86_64-linux-gnu.so(+0x16fb9) [0x7f3b93c32fb9]

minjie · March 12, 2022, 3:15am

@BarclayII this seems to be an issue with the latest PyTorch (1.11). @anton-sturluson, could you please let us know your DGL version?

BarclayII · March 14, 2022, 6:22am

And also the CUDA driver version?

anton-sturluson · March 14, 2022, 3:56pm

DGL version: dgl-cuda11.3 0.8.0post1
Driver version: 460.73.01
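
For anyone gathering the same details for a report, here is a minimal sketch that reads installed package versions from Python's package metadata. The candidate names in CANDIDATES are assumptions: the DGL distribution name depends on how it was installed, so adjust the list to match your environment. The driver version is not covered here; it has to come from the NVIDIA tooling.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical candidate names; the DGL distribution name may differ per install method.
CANDIDATES = ["torch", "dgl", "dgl-cu113", "ogb"]

if __name__ == "__main__":
    for name, version in installed_versions(CANDIDATES).items():
        print(f"{name}: {version or 'not found'}")
```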

anton-sturluson · March 14, 2022, 4:06pm

I can confirm that the same code works with PyTorch 1.10.2 and the same DGL/CUDA versions.

anton-sturluson · March 15, 2022, 3:09pm

One weird thing I noticed is that I get the same error with PyTorch 1.10.2 when using an A100. Using a V100 works fine, though. I confirmed this on GCP.

BarclayII · March 21, 2022, 5:57am

Could you try updating your driver? 460.73 seems a bit too old (mine is 510.47.03 and it seems to work on my A100).
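
The message "the provided PTX was compiled with an unsupported toolchain" usually points to a driver that is older than the CUDA toolkit the library was built with. Below is a minimal sketch for checking a driver version against a minimum. The value 465.19.01 is an assumption taken from the minimum Linux driver that NVIDIA's release notes list for CUDA 11.3; verify it against the table for your own toolkit. The helper names are hypothetical, and query_driver_version relies on nvidia-smi being on the PATH.

```python
import shutil
import subprocess

# Assumption: minimum Linux driver listed for the CUDA 11.3 toolkit in NVIDIA's
# release notes. Check the table for your own toolkit version.
MIN_DRIVER_CUDA_11_3 = "465.19.01"


def parse_version(text):
    """Turn a dotted version such as '460.73.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_at_least(current, minimum=MIN_DRIVER_CUDA_11_3):
    """Return True if the driver version is at or above the minimum."""
    return parse_version(current) >= parse_version(minimum)


def query_driver_version():
    """Ask nvidia-smi for the driver version; return None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if result.returncode == 0 and lines else None
```

With the versions mentioned in this thread, driver_at_least("460.73.01") returns False and driver_at_least("510.47.03") returns True.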

anton-sturluson · April 14, 2022, 5:13pm

I was finally able to upgrade my NVIDIA driver, and it works fine with the A100 and PyTorch 1.11.0.
