I run into the following error when benchmarking the G-SDDMM kernel (log attached). Is there any way to fix it?
The bug seems to be pretty random: it can happen at the beginning of an iteration, in the middle of one, or not at all.
I already do a full CUDA synchronize between iterations, so I don't think it is a synchronization issue on the CUDA device.
I also made sure the CUDA toolkit version matches the one PyTorch requires, and I rebuilt DGL against that toolkit. However, the error still exists.
Here is the software setup: backend PyTorch 1.11.0, CUDA 11.3, DGL df7a612 (built from source), OS Ubuntu 18.04, running on a single GPU.
Appreciate any help.
Error log:
Traceback (most recent call last):
File "test.py", line 46, in <module>
test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/ops/sddmm.py", line 75, in gsddmm
g._graph, op, lhs_data, rhs_data, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 766, in gsddmm
return GSDDMM.apply(gidx, op, lhs_data, rhs_data, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/torch/cuda/amp/autocast_mode.py", line 219, in decorate_fwd
return fwd(*args, **kwargs)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 311, in forward
out = _gsddmm(gidx, op, X, Y, lhs_target, rhs_target)
File "/home/anaconda3/envs/online/lib/python3.7/site-packages/dgl-0.9-py3.7-linux-x86_64.egg/dgl/sparse.py", line 505, in _gsddmm
lhs_target, rhs_target)
File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [00:19:40] /home/Software/dgl/src/array/./check.h:57: Check failed: gdim[uev_idx[i]] == arrays[i]->shape[0] (9999 vs. 10000) : Expect U_data to have size 9999 on the first dimension, but got 10000
Here is the code I am using to benchmark:
import dgl
import torch as th
import dgl.ops as F
import time
import csv

cpu_dry_run = gpu_dry_run = 3
cpu_benchmark_run = gpu_benchmark_run = 10

f = open('./result_dgl_test.csv', 'w')
writer = csv.writer(f)
writer.writerow(["feature size", "total nodes num", "num edges", "CPU Time", "GPU Time"])

feature_size_list = [1, 2, 4, 8, 16, 32, 64, 100, 128, 200, 256, 300, 400, 512, 1024]
nodes_num_list = [1000, 10000, 100000]
edge_num_factor_list = [1, 2, 3, 4, 5, 6, 7, 8, 10]

for edge_num_factor in edge_num_factor_list:
    for nodes_num in nodes_num_list:
        for feature_size in feature_size_list:
            # Build a random graph and a node feature matrix.
            src = th.randint(nodes_num, (nodes_num * edge_num_factor,))
            dst = th.randint(nodes_num, (nodes_num * edge_num_factor,))
            g = dgl.graph((src, dst))
            y = th.arange(1, feature_size * nodes_num + 1).float() \
                  .view(nodes_num, feature_size).requires_grad_()

            # Measure the performance on CUDA.
            g_cuda = g.to(th.device('cuda:0'))
            y_cuda = y.to(th.device('cuda:0'))
            th.cuda.synchronize()
            for i in range(gpu_dry_run):
                test_cuda = F.gsddmm(g_cuda, "dot", y_cuda, y_cuda, "u", "v")
            th.cuda.synchronize()
            start = time.time()
            for i in range(gpu_benchmark_run):
                test_cuda = F.gsddmm(g_cuda, "dot", y_cuda, y_cuda, "u", "v")
            th.cuda.synchronize()
            end = time.time()
            gpu_time_per_iteration = (end - start) / float(gpu_benchmark_run)
            print("gpu:", gpu_time_per_iteration * 1000, "ms")

            # Measure the performance on CPU.
            for i in range(cpu_dry_run):
                test = F.gsddmm(g, "dot", y, y, "u", "v")
            start = time.time()
            for i in range(cpu_benchmark_run):
                test = F.gsddmm(g, "dot", y, y, "u", "v")
            end = time.time()
            cpu_time_per_iteration = (end - start) / float(cpu_benchmark_run)
            print("cpu:", cpu_time_per_iteration * 1000, "ms")

            writer.writerow([feature_size, nodes_num, g.num_edges(),
                             cpu_time_per_iteration * 1000, gpu_time_per_iteration * 1000])
f.close()
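One thing I wonder about, in case it is relevant: `dgl.graph((src, dst))` infers the node count from the largest endpoint id in the edge list, so if the random `src`/`dst` happen to never draw id `nodes_num - 1`, the graph would end up with one node fewer than the feature matrix has rows, which would match the "9999 vs. 10000" in the log and also explain the randomness. A plain-Python sketch of that inference (no DGL needed; `inferred_num_nodes` is just my illustration, not a DGL API):

```python
import random

def inferred_num_nodes(src, dst):
    # When no explicit node count is given, the count is
    # inferred from the edges as (largest endpoint id) + 1.
    return max(max(src), max(dst)) + 1

nodes_num = 10000
random.seed(0)
# Sample random endpoints the same way th.randint does: ids in [0, nodes_num).
src = [random.randrange(nodes_num) for _ in range(nodes_num)]
dst = [random.randrange(nodes_num) for _ in range(nodes_num)]

# If id nodes_num - 1 was never drawn, the inferred count falls short
# of the feature matrix's first dimension (nodes_num rows).
print(inferred_num_nodes(src, dst), "vs", nodes_num)
```

If that is indeed the cause, pinning the count with `dgl.graph((src, dst), num_nodes=nodes_num)` would presumably sidestep it, but I have not confirmed this.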