Strange runtime errors for SDDMM kernel

I ran into the following error while benchmarking the G-SDDMM kernel (log below). Can anyone help me fix it?

The bug seems to be pretty random. It can happen at the beginning of the iterations or in the middle, or it may not happen at all.

I already do a full CUDA synchronize between iterations, so I don’t think it is a synchronization issue on the CUDA device.

I also made sure the CUDA toolkit version matches the one PyTorch requires, and I rebuilt DGL with that CUDA toolkit. However, the error still exists.

Here is the software setup. Backend: PyTorch 1.11.0. CUDA 11.3. DGL: df7a612 (built from source). OS: Ubuntu 18.04. It is running on a single GPU.

Appreciate any help.

Error log:

Traceback (most recent call last):
  File "test.py", line 46, in <module>
    test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/ops/sddmm.py", line 75, in gsddmm
    g._graph, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 766, in gsddmm
    return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 311, in forward
    out = _gsddmm(gidx, op, X, Y, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/sparse.py", line 505, in _gsddmm
    lhs_target, rhs_target)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [00:19:40] /home/Software/dgl/src/array/./check.h:57: Check failed: gdim[uev_idx[i]] == arrays[i]->shape[0] (9999 vs. 10000) : Expect U_data to have size 9999 on the first dimension, but got 10000

Here is the code I am using to benchmark.

import dgl
import torch as th
import dgl.ops as F
import time
import csv

cpu_dry_run = gpu_dry_run = 3
cpu_benchmark_run = gpu_benchmark_run = 10

f = open('./result_dgl_test.csv', 'w')
writer = csv.writer(f)
writer.writerow(["feature size","total nodes num", "num edges","CPU Time","GPU Time"])

feature_size_list = [1,2,4,8,16,32,64,100,128,200,256,300,400,512,1024]

nodes_num_list = [1000,10000,100000]
edge_num_factor_list = [1,2,3,4,5,6,7,8,10]


for edge_num_factor in edge_num_factor_list:
    for nodes_num in nodes_num_list:
        for feature_size in feature_size_list:
        
            # print(feature_size)
            src = th.randint(nodes_num,(nodes_num*edge_num_factor,))
            dst = th.randint(nodes_num,(nodes_num*edge_num_factor,))

            g = dgl.graph((src, dst))  

            y = th.arange(1, feature_size*nodes_num+1).float().view(nodes_num, feature_size).requires_grad_()

            # measure the performance on cuda.
            g_cuda = g.to(th.device('cuda:0'))

            y_cuda = y.to(th.device('cuda:0'))
            th.cuda.synchronize()
            for i in range(gpu_dry_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            th.cuda.synchronize()
            start = time.time()
            for i in range(gpu_benchmark_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            th.cuda.synchronize()
            end = time.time()
            gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run)
            print("gpu",gpu_time_per_iteration*1000," ms")

            for i in range(cpu_dry_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")

            start = time.time()
            for i in range(cpu_benchmark_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")
            end = time.time()
            cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run)
            print("cpu: ",cpu_time_per_iteration*1000," ms")
            writer.writerow([feature_size,nodes_num,g.num_edges(), cpu_time_per_iteration*1000,gpu_time_per_iteration*1000])

f.close()

The error you hit is caused by an unexpected num_nodes in the graph.

Please specify num_nodes explicitly, e.g. g = dgl.graph((src, dst), num_nodes=nodes_num).

If not given, num_nodes will be the largest node ID plus 1 from the data argument. Refer to the dgl.graph page of the DGL 0.8.1 documentation for more details.

The way you create src/dst does not guarantee that the expected num_nodes is inferred.
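
To make the inference rule concrete, here is a minimal sketch in plain Python. The helper inferred_num_nodes is hypothetical: it only mimics the documented rule (largest node ID plus 1) and is not DGL's actual implementation.

```python
def inferred_num_nodes(src, dst):
    """Hypothetical helper mimicking the documented rule: largest node ID plus 1."""
    return max(max(src), max(dst)) + 1

nodes_num = 5            # number of rows in the feature tensor y
src = [0, 1, 2, 3]       # node ID 4 never appears ...
dst = [1, 3, 0, 2]       # ... in either endpoint list

num_nodes_inferred = inferred_num_nodes(src, dst)
mismatch = num_nodes_inferred != nodes_num

# The graph would be built with fewer nodes than y has rows,
# which is the situation the shape check in the error log rejects.
print(num_nodes_inferred, nodes_num, mismatch)
```

Passing num_nodes=nodes_num to dgl.graph removes the dependence on which IDs happen to be sampled.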

Thanks! That makes sense. The random node arrays (src and dst) may not include the largest node ID, so the data array (y) may have more rows than the graph has nodes.

After explicitly specifying the number of nodes, the error goes away!
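
For anyone wondering why the failure looked random: assuming every endpoint is drawn independently and uniformly from [0, nodes_num), as th.randint is used in the script, the largest ID is missing from all 2 * nodes_num * edge_num_factor draws with probability (1 - 1/nodes_num)^(2 * nodes_num * edge_num_factor), which is roughly e^(-2 * edge_num_factor). A rough sketch of that estimate follows; the helper name prob_max_id_missing is made up for illustration.

```python
import math

def prob_max_id_missing(nodes_num, edge_num_factor):
    """Chance that ID nodes_num - 1 appears in neither src nor dst.

    Assumes independent, uniform draws for every endpoint (hypothetical helper).
    """
    draws = 2 * nodes_num * edge_num_factor   # src and dst together
    return (1.0 - 1.0 / nodes_num) ** draws

for k in (1, 2, 5):
    p = prob_max_id_missing(10000, k)
    approx = math.exp(-2 * k)
    print(f"edge_num_factor={k}: p ~ {p:.4f} (e^-{2 * k} ~ {approx:.4f})")
```

Under that assumption the mismatch is fairly likely for small edge_num_factor and becomes rare as the factor grows, which is consistent with the error showing up only in some iterations.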