CUDA Index Error from apply_edges

I keep getting a CUDA index error from the apply_edges function.

/opt/conda/conda-bld/pytorch_1573049305765/work/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [415,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1573049305765/work/aten/src/THC/THCReduceAll.cuh line=327 error=59 : device-side assert triggered
    g.apply_edges(u_dot_ve('k_', 'r_', 'q_', 'qrk_'))
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/heterograph.py", line 2704, in apply_edges
    Runtime.run(prog)
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/runtime/runtime.py", line 11, in run
    exe.run()
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/runtime/ir/executor.py", line 204, in run
    udf_ret = fn_data(src_data, edge_data, dst_data)
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/runtime/scheduler.py", line 972, in _mfunc_wrapper
    return mfunc(ebatch)
  File "/home/keyit/code/alan-framework/modules/semantic_parser/model/graph_encoders.py", line 840, in func
    u = es.src[src_field]
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/utils.py", line 285, in __getitem__
    return self._fn(key)
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/frame.py", line 655, in <lambda>
    return utils.LazyDict(lambda key: self._frame[key][rows], keys=self.keys())
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/frame.py", line 96, in __getitem__
    user_idx = idx.tousertensor(F.context(self.data))
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/utils.py", line 105, in tousertensor
    self._user_tensor_data[ctx] = F.copy_to(data, ctx)
  File "/home/keyit/anaconda3/envs/alan/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 95, in copy_to
    return input.cuda()
RuntimeError: CUDA error: device-side assert triggered

Function To Apply

import torch

def u_dot_ve(src_field, edg_field, dst_field, out_field):
    # Edge UDF: for each edge, compute sum((dst + edge) * src) over the last
    # (feature) dimension and store the result under out_field.
    def func(es):
        v = es.dst[dst_field]    # destination-node features
        e = es.data[edg_field]   # edge features
        u = es.src[src_field]    # source-node features
        ve = v + e
        ve_u = ve * u
        ve_u_sum = torch.sum(ve_u, dim=-1, keepdim=True)
        return {out_field: ve_u_sum}
    return func

Call apply_edges

g = dgl.to_homo(hg)

num_nodes = g.number_of_nodes()
num_edges = g.number_of_edges()

k = self.k_w(x).view(num_nodes, self.num_heads, -1).contiguous()
q = self.q_w(x).view(num_nodes, self.num_heads, -1).contiguous()
v = self.v_w(x).view(num_nodes, self.num_heads, -1).contiguous()
r = r.view(num_edges, self.num_heads, -1).contiguous()

g.ndata['k_'] = k
g.ndata['q_'] = q
g.ndata['v_'] = v
g.edata['r_'] = r

# --- Compute attention score ---
# k * (q + r) --> qkr (es, hs, 1)
g.apply_edges(u_dot_ve('k_', 'r_', 'q_', 'qrk_'))

Please help me debug this error. I checked k, r, and q, which all look good to me.

Could anyone give a hint what might cause this issue?
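
One generic way to narrow down a device-side assert like this (a debugging sketch, not specific to DGL) is to rerun with CUDA_LAUNCH_BLOCKING=1 so the assert points at the offending Python line, or to repeat the same apply_edges call with CPU tensors, where an out-of-range index raises a plain IndexError with the bad value. The snippet below assumes the same hg, k, q, v, and r as in the code above.

# Hypothetical debugging sketch: run apply_edges on CPU tensors so an
# out-of-range index raises an IndexError instead of a CUDA assert.
# (Alternatively, launch the original script with CUDA_LAUNCH_BLOCKING=1.)
g_cpu = dgl.to_homo(hg)
g_cpu.ndata['k_'] = k.cpu()
g_cpu.ndata['q_'] = q.cpu()
g_cpu.ndata['v_'] = v.cpu()
g_cpu.edata['r_'] = r.cpu()
g_cpu.apply_edges(u_dot_ve('k_', 'r_', 'q_', 'qrk_'))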

Can you provide a minimal block of code that others can run to reproduce the error?

Hi,

I cannot reproduce the error on my side. Could you provide the shapes of q, k, v, and r? Also, are there any duplicate edges in your graph, or any edges that were removed (by remove_edges)?

Thanks for helping out.
The shapes of q, k, and v are all (256, 8, 32); r is (17081, 8, 32).
There are no duplicate edges, and no edges were removed from the graph.

hg is a batched heterogeneous graph.
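
A small consistency check along these lines (a sketch, reusing the g, k, q, v, and r from the snippet above) can confirm whether the feature sizes and edge endpoints actually match the homogeneous graph; the "srcIndex < srcSelectDimSize" assert usually means some edge endpoint indexes past the end of a node feature tensor.

# Sketch: verify that feature tensors and edge endpoints agree with the graph.
num_nodes = g.number_of_nodes()
num_edges = g.number_of_edges()

assert k.shape[0] == num_nodes, (k.shape[0], num_nodes)
assert q.shape[0] == num_nodes, (q.shape[0], num_nodes)
assert v.shape[0] == num_nodes, (v.shape[0], num_nodes)
assert r.shape[0] == num_edges, (r.shape[0], num_edges)

# Every edge endpoint must be a valid node id, otherwise gathering source or
# destination features inside apply_edges indexes out of range.
src, dst = g.edges()
assert int(src.max()) < num_nodes, "source node id out of range"
assert int(dst.max()) < num_nodes, "destination node id out of range"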

Hi, I cannot reproduce your error. Could you provide a minimal reproducible example? My code snippet is below:

import dgl
import torch as th
import numpy as np
from scipy import sparse as spsp
import torch

def u_dot_ve(src_field, edg_field, dst_field, out_field):
    def func(es):
        v = es.dst[dst_field]
        e = es.data[edg_field]
        u = es.src[src_field]
        ve = v + e
        ve_u = ve * u
        ve_u_sum = torch.sum(ve_u, dim=-1, keepdim=True)
        return {out_field: ve_u_sum}
    return func



def load_random_graph():
    n_nodes = 256
    n_edges = 17081

    row = np.random.RandomState(6657).choice(n_nodes, n_edges)
    col = np.random.RandomState(6657).choice(n_nodes, n_edges)
    # row = np.arange(n_nodes)
    # col = np.arange(n_nodes)
    spm = spsp.coo_matrix((np.ones(len(row)), (row, col)), shape=(n_nodes, n_nodes))
    g = dgl.graph(spm)

    return g

g = load_random_graph()

num_nodes = g.number_of_nodes()
num_edges = g.number_of_edges()

k = th.randn((256, 8, 32)).to("cuda")
q = th.randn((256, 8, 32)).to("cuda")
v = th.randn((256, 8, 32)).to("cuda")
r = th.randn((17081, 8, 32)).to("cuda")

g.ndata['k_'] = k
g.ndata['q_'] = q
g.ndata['v_'] = v
g.edata['r_'] = r

# --- Compute attention score ---
# k * (q + r) --> qkr (es, hs, 1)
g.apply_edges(u_dot_ve('k_', 'r_', 'q_', 'qrk_'))
print(g.edata['qrk_'])

Thanks for the help. I found the problem: the node indices in the edges did not align with the graph's node indices. After fixing this, the CUDA error went away.

When I created the heterogeneous graph, some node indices in the edge lists were larger than the upper bound of the graph's node indices, but I didn't see any warning or error. Adding validation for such misalignment when creating a graph would make debugging much easier.
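
A minimal sketch of such a check, done by hand before calling dgl.heterograph (the edge dictionary and the per-type node counts below are hypothetical placeholders):

# Hypothetical pre-construction check: every endpoint in the edge lists must
# refer to an existing node of its type.
edge_dict = {
    ('user', '+1', 'movie'): [(0, 0), (0, 1), (1, 0)],
    ('user', '-1', 'movie'): [(2, 1)],
}
num_nodes_per_type = {'user': 3, 'movie': 2}  # assumed node counts per type

for (src_type, _, dst_type), pairs in edge_dict.items():
    for u, v in pairs:
        assert u < num_nodes_per_type[src_type], \
            "src index %d out of range for node type '%s'" % (u, src_type)
        assert v < num_nodes_per_type[dst_type], \
            "dst index %d out of range for node type '%s'" % (v, dst_type)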

Hi,

How did you create the graph? It seems like a bug on our side.

I was using lists of node-index pairs, like:

ratings = dgl.heterograph(
    {('user', '+1', 'movie') : [(0, 0), (0, 1), (1, 0)],
     ('user', '-1', 'movie') : [(2, 1)]})

Hi,

I cannot reproduce the error with the example you provided above. Could you provide more details so that we can locate the error? Thanks!