I am trying to do some performance measurements on gsddmm kernel. However, I found the time reported by dgl.ops.gsddmm kernel is the same for running on GPU regardless of number of nodes, number of edges and node’s feature dimensions.
On CPU, I can see performance variation across number of nodes, number of edges and node’s feature dimensions. I am speculating that there is a bug in implementing the GSDDMM kernel for GPU. I attached some performance result at the end. Any help on this issue is appreciated!
Here is the setup of the software. backend: pytorch 1.9.0. CUDA 11.2. DGL: df7a612 (build from source). OS: Ubuntu 16.04. It is running on a single GPU.
Here is the code I am using to benchmark.
import dgl
import torch as th
import dgl.ops as F
import time
import csv
cpu_dry_run = gpu_dry_run=3
cpu_benchmark_run =gpu_benchmark_run = 10
feature_size_list = [1,2,4,8,16,32,64,100,128,200,256,300,400,512,1024]
nodes_num_list = [100,1000,10000,100000]
edge_num_factor_list = [10,16,20,32,64]
for edge_num_factor in edge_num_factor_list:
for nodes_num in nodes_num_list:
for feature_size in feature_size_list:
# print(feature_size)
src = th.randint(nodes_num,(nodes_num*edge_num_factor,))
dst = th.randint(nodes_num,(nodes_num*edge_num_factor,))
g = dgl.graph((src, dst))
y = th.arange(1, feature_size*nodes_num+1).float().view(nodes_num, feature_size).requires_grad_()
# measure the performance on cuda.
g_cuda = g.to(th.device('cuda:0'))
y_cuda = y.to(th.device('cuda:0'))
for i in range(gpu_dry_run):
test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
start = time.time()
for i in range(gpu_benchmark_run):
test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
end = time.time()
gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run)
print("gpu",gpu_time_per_iteration*1000," ms")
for i in range(cpu_dry_run):
test=F.gsddmm(g,"dot",y,y,"u","v")
start = time.time()
for i in range(cpu_benchmark_run):
test=F.gsddmm(g,"dot",y,y,"u","v")
end = time.time()
cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run)
print("cpu: ",cpu_time_per_iteration*1000," ms")
Moreover, sometime, I run into the following errors. But I have no idea where it comes.
Traceback (most recent call last):
File "test.py", line 46, in <module>
test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/ops/sddmm.py", line 75, in gsddmm
g._graph, op, lhs_data, rhs_data, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 766, in gsddmm
return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 311, in forward
out = _gsddmm(gidx, op, X, Y, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/sparse.py", line 505, in _gsddmm
lhs_target, rhs_target)
File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [00:19:40] /home/Software/dgl/src/array/./check.h:57: Check failed: gdim[uev_idx[i]] == arrays[i]->shape[0] (9999 vs. 10000) : Expect U_data to have size 9999 on the first dimension, but got 10000
Also, I understand that DGL use Foreign Function Interface to call C++/CUDA kernel underneath. Will the execution of SDDMM kernel is long enough to make overhead of FFI less than 1%?
Performance data. Columns are: feature size, number of nodes, number of edges, CPU time(ms), GPU time(ms). It seems that the GPU time is the same. I also tried different number of nodes and edges. The GPU kernel time is the same.
Feature size, number of nodes, number of edges, CPU time, GPU time
1 100000 6400000 7.543158531 0.037384033
2 100000 6400000 10.58959961 0.037407875
4 100000 6400000 16.30623341 0.037384033
8 100000 6400000 22.30067253 0.037121773
16 100000 6400000 33.11469555 0.037932396
32 100000 6400000 121.0481405 0.037240982
64 100000 6400000 234.2772961 0.036215782
100 100000 6400000 376.5076637 0.036454201
128 100000 6400000 489.2184258 0.036096573
200 100000 6400000 715.1563883 0.03657341
256 100000 6400000 822.497654 0.036239624
300 100000 6400000 950.1194 0.036382675
400 100000 6400000 1111.607933 0.03657341
512 100000 6400000 1176.187634 0.036287308
1024 100000 6400000 1912.621617 0.036501884
Thanks!