Trouble shooting/potential bugs: gsddmm operator have same runtime on GPU

I am trying to do some performance measurements on gsddmm kernel. However, I found the time reported by dgl.ops.gsddmm kernel is the same for running on GPU regardless of number of nodes, number of edges and node’s feature dimensions.

On CPU, I can see performance variation across number of nodes, number of edges and node’s feature dimensions. I am speculating that there is a bug in implementing the GSDDMM kernel for GPU. I attached some performance result at the end. Any help on this issue is appreciated!

Here is the setup of the software. backend: pytorch 1.9.0. CUDA 11.2. DGL: df7a612 (build from source). OS: Ubuntu 16.04. It is running on a single GPU.

Here is the code I am using to benchmark.

import dgl
import torch as th
import dgl.ops as F
import time
import csv

cpu_dry_run = gpu_dry_run=3
cpu_benchmark_run =gpu_benchmark_run = 10



feature_size_list = [1,2,4,8,16,32,64,100,128,200,256,300,400,512,1024]

nodes_num_list = [100,1000,10000,100000]
edge_num_factor_list = [10,16,20,32,64]


for edge_num_factor in edge_num_factor_list:
    for nodes_num in nodes_num_list:
        for feature_size in feature_size_list:
        
            # print(feature_size)
            src = th.randint(nodes_num,(nodes_num*edge_num_factor,))
            dst = th.randint(nodes_num,(nodes_num*edge_num_factor,))

            g = dgl.graph((src, dst))  

            y = th.arange(1, feature_size*nodes_num+1).float().view(nodes_num, feature_size).requires_grad_()

            # measure the performance on cuda.
            g_cuda = g.to(th.device('cuda:0'))

            y_cuda = y.to(th.device('cuda:0'))

            for i in range(gpu_dry_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")

            start = time.time()
            for i in range(gpu_benchmark_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")

            end = time.time()
            gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run)
            print("gpu",gpu_time_per_iteration*1000," ms")

            for i in range(cpu_dry_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")

            start = time.time()
            for i in range(cpu_benchmark_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")
            end = time.time()
            cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run)
            print("cpu: ",cpu_time_per_iteration*1000," ms")

Moreover, sometime, I run into the following errors. But I have no idea where it comes.

Traceback (most recent call last):
  File "test.py", line 46, in <module>
    test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/ops/sddmm.py", line 75, in gsddmm
    g._graph, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 766, in gsddmm
    return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 311, in forward
    out = _gsddmm(gidx, op, X, Y, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/sparse.py", line 505, in _gsddmm
    lhs_target, rhs_target)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [00:19:40] /home/Software/dgl/src/array/./check.h:57: Check failed: gdim[uev_idx[i]] == arrays[i]->shape[0] (9999 vs. 10000) : Expect U_data to have size 9999 on the first dimension, but got 10000

Also, I understand that DGL use Foreign Function Interface to call C++/CUDA kernel underneath. Will the execution of SDDMM kernel is long enough to make overhead of FFI less than 1%?

Performance data. Columns are: feature size, number of nodes, number of edges, CPU time(ms), GPU time(ms). It seems that the GPU time is the same. I also tried different number of nodes and edges. The GPU kernel time is the same.

Feature size, number of nodes, number of edges, CPU time, GPU time        
1	100000	6400000	7.543158531	0.037384033
2	100000	6400000	10.58959961	0.037407875
4	100000	6400000	16.30623341	0.037384033
8	100000	6400000	22.30067253	0.037121773
16	100000	6400000	33.11469555	0.037932396
32	100000	6400000	121.0481405	0.037240982
64	100000	6400000	234.2772961	0.036215782
100	100000	6400000	376.5076637	0.036454201
128	100000	6400000	489.2184258	0.036096573
200	100000	6400000	715.1563883	0.03657341
256	100000	6400000	822.497654	0.036239624
300	100000	6400000	950.1194	0.036382675
400	100000	6400000	1111.607933	0.03657341
512	100000	6400000	1176.187634	0.036287308
1024	100000	6400000	1912.621617	0.036501884

Thanks!

All GPU kernels are launched asynchronously so using time.time() will not be accurate. Please check out PyTorch’s guide on how to measure cuda operators: CUDA semantics — PyTorch 1.11.0 documentation

1 Like

Thanks a lot! However, there are some other runtime error exists. (Now I can measure the time correctly by adding torch.cuda.synchronize() between the start timer and end timer.)

The following errors still exist. (log attached) Do you have any suggestion to fix it?

This bugs seems to be pretty random. It can happen at the beginning of the iteration or in the middle of the iteration. Or, it may not happen at all.

I already did a full synchronize between the iterations. So, I don’t think it is related to the sync. issue on the CUDA device.

I also make sure the CUDA toolkit version is matching with the Pytorch CUDA toolkit requirement. I also build the DGL using the CUDA toolkit again. However, the error still exist.

Appreciate any help.

Traceback (most recent call last):
  File "test.py", line 46, in <module>
    test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/ops/sddmm.py", line 75, in gsddmm
    g._graph, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 766, in gsddmm
    return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 311, in forward
    out = _gsddmm(gidx, op, X, Y, lhs_target, rhs_target)
  File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/sparse.py", line 505, in _gsddmm
    lhs_target, rhs_target)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [00:19:40] /home/Software/dgl/src/array/./check.h:57: Check failed: gdim[uev_idx[i]] == arrays[i]->shape[0] (9999 vs. 10000) : Expect U_data to have size 9999 on the first dimension, but got 10000

Here is my current code:

import dgl
import torch as th
import dgl.ops as F
import time
import csv

cpu_dry_run = gpu_dry_run=3
cpu_benchmark_run =gpu_benchmark_run = 10

f = open('./result_dgl_test.csv', 'w')
writer = csv.writer(f)
writer.writerow(["feature size","total nodes num", "num edges","CPU Time","GPU Time"])

feature_size_list = [1,2,4,8,16,32,64,100,128,200,256,300,400,512,1024]

nodes_num_list = [1000,10000,100000]
edge_num_factor_list = [1,2,3,4,5,6,7,8,10]


for edge_num_factor in edge_num_factor_list:
    for nodes_num in nodes_num_list:
        for feature_size in feature_size_list:
        
            # print(feature_size)
            src = th.randint(nodes_num,(nodes_num*edge_num_factor,))
            dst = th.randint(nodes_num,(nodes_num*edge_num_factor,))

            g = dgl.graph((src, dst))  

            y = th.arange(1, feature_size*nodes_num+1).float().view(nodes_num, feature_size).requires_grad_()

            # measure the performance on cuda.
            g_cuda = g.to(th.device('cuda:0'))

            y_cuda = y.to(th.device('cuda:0'))
            th.cuda.synchronize()
            for i in range(gpu_dry_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            th.cuda.synchronize()
            start = time.time()
            for i in range(gpu_benchmark_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            th.cuda.synchronize()
            end = time.time()
            gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run)
            print("gpu",gpu_time_per_iteration*1000," ms")

            for i in range(cpu_dry_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")

            start = time.time()
            for i in range(cpu_benchmark_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")
            end = time.time()
            cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run)
            print("cpu: ",cpu_time_per_iteration*1000," ms")
            writer.writerow([feature_size,nodes_num,g.num_edges(), cpu_time_per_iteration*1000,gpu_time_per_iteration*1000])

f.close()

Thanks in advance!

I think this is the root cause. pls check here: Strange runtime errors for SDDMM kernel - #2 by Rhett-Ying