HeteroGraphConv not supported by CUDA?

Hey there,

I’m trying to create a graph classifier for heterogeneous graphs in Google Colab, using CUDA 10.1 and the PyTorch backend with DGL 0.5.
When the forward function is called, I get this error:

DGLError: [07:58:12] /opt/dgl/src/array/array.cc:606: Operator COOGetRowNNZ does not support cuda device.

This happens when I try to use HeteroGraphConv.
Is it not supported, or do you have any suggestions on what I should check?

Thanks in advance!
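For context, my model looks roughly like this (simplified: the real layer sizes and readout differ, and the real forward also returns intermediate representations):

    import torch.nn as nn
    import torch.nn.functional as F
    import dgl
    import dgl.nn as dglnn

    class HeteroClassifier(nn.Module):
        def __init__(self, in_dim, hidden_dim, n_classes, rel_names):
            super().__init__()
            # One GraphConv per relation, combined by HeteroGraphConv.
            self.conv1 = dglnn.HeteroGraphConv(
                {rel: dglnn.GraphConv(in_dim, hidden_dim) for rel in rel_names},
                aggregate='sum')
            self.conv2 = dglnn.HeteroGraphConv(
                {rel: dglnn.GraphConv(hidden_dim, hidden_dim) for rel in rel_names},
                aggregate='sum')
            self.classify = nn.Linear(hidden_dim, n_classes)

        def forward(self, g):
            h = g.ndata['feat']  # dict: node type -> feature tensor
            # Apply graph convolution and activation.
            h = self.conv1(g, h)
            h = {key: F.relu(h[key]) for key in h.keys()}
            h = self.conv2(g, h)
            with g.local_scope():
                g.ndata['h'] = h
                # Mean readout per node type, summed into one graph embedding.
                hg = 0
                for ntype in g.ntypes:
                    hg = hg + dgl.mean_nodes(g, 'h', ntype=ntype)
                return self.classify(hg)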

Here is the full stack trace:

---> 89     res, reps = model(all_graphs)
     90     pred = res.argmax(1)
     91     train_acc = (pred == torch.tensor(all_labels)).float().mean()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

<ipython-input-13-55b264750139> in forward(self, g)
     34         h = g.ndata['feat'] 
     35         # Apply graph convolution and activation.
---> 36         h = self.conv1(g, h)
     37         h = {key: F.relu(h[key]) for key in h.keys()}
     38         h = self.conv2(g, h)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/hetero.py in forward(self, g, inputs, mod_args, mod_kwargs)
    172                     inputs[stype],
    173                     *mod_args.get(etype, ()),
--> 174                     **mod_kwargs.get(etype, {}))
    175                 outputs[dtype].append(dstdata)
    176         rsts = {}

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/conv/graphconv.py in forward(self, graph, feat, weight)
    248             feat_src, feat_dst = expand_as_pair(feat, graph)
    249             if self._norm == 'both':
--> 250                 degs = graph.out_degrees().float().clamp(min=1)
    251                 norm = th.pow(degs, -0.5)
    252                 shp = norm.shape + (1,) * (feat_src.dim() - 1)

/usr/local/lib/python3.6/dist-packages/dgl/heterograph.py in out_degrees(self, u, etype)
   3267         if F.as_scalar(F.sum(self.has_nodes(u_tensor, ntype=srctype), dim=0)) != len(u_tensor):
   3268             raise DGLError('u contains invalid node IDs')
-> 3269         deg = self._graph.out_degrees(etid, utils.prepare_tensor(self, u, 'u'))
   3270         if isinstance(u, numbers.Integral):
   3271             return F.as_scalar(deg)

/usr/local/lib/python3.6/dist-packages/dgl/heterograph_index.py in out_degrees(self, etype, v)
    595         """
    596         return F.from_dgl_nd(_CAPI_DGLHeteroOutDegrees(
--> 597             self, int(etype), F.to_dgl_nd(v)))
    598 
    599     def adjacency_matrix(self, etype, transpose, ctx):

/usr/local/lib/python3.6/dist-packages/dgl/_ffi/_ctypes/function.py in __call__(self, *args)
    188         check_call(_LIB.DGLFuncCall(
    189             self.handle, values, tcodes, ctypes.c_int(num_args),
--> 190             ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
    191         _ = temp_args
    192         _ = args

/usr/local/lib/python3.6/dist-packages/dgl/_ffi/base.py in check_call(ret)
     60     """
     61     if ret != 0:
---> 62         raise DGLError(py_str(_LIB.DGLGetLastError()))
     63 
     64 

DGLError: [07:58:12] /opt/dgl/src/array/array.cc:606: Operator COOGetRowNNZ does not support cuda device.
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x22) [0x7f0743867e02]
  [bt] (1) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::aten::COOGetRowNNZ(dgl::aten::COOMatrix, dgl::runtime::NDArray)+0xc9) [0x7f074385bb19]
  [bt] (2) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::UnitGraph::COO::OutDegrees(unsigned long, dgl::runtime::NDArray) const+0x102) [0x7f07440bd9c2]
  [bt] (3) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::UnitGraph::OutDegrees(unsigned long, dgl::runtime::NDArray) const+0x61) [0x7f07440b6241]
  [bt] (4) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::HeteroGraph::OutDegrees(unsigned long, dgl::runtime::NDArray) const+0x45) [0x7f0743fdffb5]
  [bt] (5) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(+0xf4ab95) [0x7f0743fedb95]
  [bt] (6) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(DGLFuncCall+0x52) [0x7f0743f6e232]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f07b31a2dae]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f07b31a271f]

Of course it supports CUDA. Would you mind checking the allowed formats of your graph by printing g.formats() (assuming your graph is g)?
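For reference, a quick sketch of the formats API (DGL 0.5; g is assumed to be your heterograph):

    print(g.formats())                    # which sparse formats are created / allowed
    g = g.formats(['coo', 'csr', 'csc'])  # returns a clone that allows all three formats
    g.create_formats_()                   # or materialize all allowed formats in place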

Thanks for answering!

I get: {'created': ['coo'], 'not created': ['csr', 'csc']}

But I just tried with the brand-new 0.5.1 and I get a different exception:
RuntimeError: CUDA error: an illegal memory access was encountered

with the stack trace:

---> 90     res, reps = model(all_graphs)
     91     pred = res.argmax(1)
     92     train_acc = (pred == torch.tensor(all_labels)).float().mean()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

<ipython-input-11-0ce66a694343> in forward(self, g)
     35         # Apply graph convolution and activation.
     36         print(g.formats())
---> 37         h = self.conv1(g, h)
     38         h = {key: F.relu(h[key]) for key in h.keys()}
     39         h = self.conv2(g, h)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/hetero.py in forward(self, g, inputs, mod_args, mod_kwargs)
    172                     inputs[stype],
    173                     *mod_args.get(etype, ()),
--> 174                     **mod_kwargs.get(etype, {}))
    175                 outputs[dtype].append(dstdata)
    176         rsts = {}

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/conv/graphconv.py in forward(self, graph, feat, weight)
    280 
    281             if self._norm != 'none':
--> 282                 degs = graph.in_degrees().float().clamp(min=1)
    283                 if self._norm == 'both':
    284                     norm = th.pow(degs, -0.5)

/usr/local/lib/python3.6/dist-packages/dgl/heterograph.py in in_degrees(self, v, etype)
   3179         etid = self.get_etype_id(etype)
   3180         if is_all(v):
-> 3181             v = self.dstnodes(dsttype)
   3182         v_tensor = utils.prepare_tensor(self, v, 'v')
   3183         deg = self._graph.in_degrees(etid, v_tensor)

/usr/local/lib/python3.6/dist-packages/dgl/view.py in __call__(self, ntype)
     43         return F.copy_to(F.arange(0, self._graph._graph.number_of_nodes(ntid),
     44                                   dtype=self._graph.idtype),
---> 45                          self._graph.device)
     46 
     47 class HeteroNodeDataView(MutableMapping):

/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/tensor.py in copy_to(input, ctx, **kwargs)
    111         if ctx.index is not None:
    112             th.cuda.set_device(ctx.index)
--> 113         return input.cuda(**kwargs)
    114     else:
    115         raise RuntimeError('Invalid context', ctx)

RuntimeError: CUDA error: an illegal memory access was encountered

Hi,

What’s your torch version? Could you try updating to torch>=1.5?

Hi,

Using torch==1.6.0

Hi,

It seems to be a DGL bug. Could you share the code with us to debug? You can use a private message in the forum if needed.

Yeah, I removed the sensitive parts and uploaded the ipynb file to Drive so it can be opened with Colab:

Thanks for your help! :pray:

Hi,

What do your graph.bin and graph_info look like? We need basic information to create a mock graph to run.

So graph_info is a dictionary with two entries:

  • dataset_id - a list of strings, one per graph, indicating the data source each graph came from.
  • relations - a list of strings with the unique relation names that exist in the graphs.

Regarding graph.bin: the labels dictionary contains a torch tensor with binary labels.

Unfortunately, I can’t upload the graphs themselves :frowning:, but what information about them will you need? I’ll be happy to elaborate where I can.

Hi,

I made a mock dataset but couldn’t reproduce the error you met.

Since you are using IPython, you can use the %debug magic to get more context information. Could you print out g.edges() at the error? It might be related to our save/load utility functions.
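For example, run this in the cell right after the failing one (a hypothetical ipdb session; graph is the local variable name inside GraphConv.forward):

    %debug
    # At the ipdb prompt, walk the stack and inspect the graph, e.g.:
    #   ipdb> up                                            # move toward your own frames
    #   ipdb> p graph.device
    #   ipdb> p graph.edges(etype=graph.canonical_etypes[0])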

Here is my mock notebook:
https://colab.research.google.com/drive/1QAeecntAv0dJQmvuHjkfzU6BzeQNi-UU#scrollTo=N5BL53mRVaJX

Could you also try adding CUDA_LAUNCH_BLOCKING=1 as an environment variable and see whether the error occurs at the same place?
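In Colab that would be something like this, in the very first cell (it must be set before CUDA is initialized):

    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # synchronous launches give accurate stack traces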

So thank you for the code example :slight_smile:

I missed the fact that I also need to do:
model = model.to("cuda")

Now it’s working! Thank you so much!
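For anyone hitting the same error, the pattern that works for me is roughly this (simplified; assuming all_graphs is my batched heterograph and model is the classifier from above):

    import torch

    device = torch.device('cuda')
    all_graphs = all_graphs.to(device)  # moves the graph structure and its node/edge features
    model = model.to(device)            # moves the model parameters
    res, reps = model(all_graphs)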


Okay, that makes sense, but I do think DGL should throw friendlier debugging information; RuntimeError: CUDA error: an illegal memory access was encountered is not informative at all.