Inconsistency of GPU-CPU transfer time between NvProf and measurement in python

I try to do some benchmarking of the SDDMM operator on GPU and CPU. However, I observe some inconsistency between the data transfer time measurement done in python and NvProf. (data transfer time between host/cpu and devices/gpu)

Here is the problem: Based on the python measurement, the Host to Device transfer time should be 123.78ms(transfer graph)+672.80ms(transfer feature vector). However, the time reported by NvProf suggest the Host to Device data transfer time is 672.59ms, which is the total time for Host to Device data transfer time even it consists of three calls. Obviously, the graph transfer time is not accounted.

On the other hand, the Device to Host data seems align with each other. The measurement done in python suggest the time is 1784.44ms + 7.0600ms. The NvProf report it as 1.78915s, which is the total time for Host to Device data transfer time even it consists of three calls. It seems both methods agree with each other.

The kernel execution time more or less agree with each other. 2.42782s/13 = 186.77ms. The NvProf agree with the measurement in python.

As such, I would like to know which one is the most accurate one that report the data transfer time? Is there any bug in the benchmarking code that makes the data transfer time inaccurate?

Any help is appreciated!

I attach the benchmarking code, time reported by python, and measurement reported by NvProf.

Here is the benchmarking code.

import dgl
import torch 
import dgl.ops as F
import time
import csv

cpu_dry_run = gpu_dry_run=3
cpu_benchmark_run =gpu_benchmark_run = 10

f = open('./result_dgl_test.csv', 'w')
writer = csv.writer(f)

feature_size_list = [10240]
nodes_idx_list = [100000]
node_num_factor_list = [10]
# choose node ID from nodes_num
writer.writerow(["feature size","total nodes num", "num edges","CPU Time","GPU Time"])

for node_num_factor in node_num_factor_list:
    for nodes_idx in nodes_idx_list:
        for feature_size in feature_size_list:
            # nodes ID range from 0 to nodes_num -1
            # the actual number of nodes is the nodes_num*edge_num_factor

            src = torch.randint(nodes_idx,(nodes_idx*node_num_factor,))
            dst = torch.randint(nodes_idx,(nodes_idx*node_num_factor,))
            # construct the graph
            g = dgl.graph((src, dst))  
            print(g.device)

            y=torch.randn(g.num_nodes(), feature_size)

            # transfer the graph
            torch.cuda.synchronize()
            start_graph_transfer = time.time()
            g_cuda = g.to('cuda:0')
            torch.cuda.synchronize()
            end_graph_transfer = time.time()
            print("graph HtoD time: ",(end_graph_transfer - start_graph_transfer)*1000,"ms")


            # transfer the feature vector
            start_feature_transfer = time.time()
            y_cuda = y.to('cuda:0')
            torch.cuda.synchronize()
            end_feature_transfer = time.time()
            print("feature HtoD time: ",(end_feature_transfer - start_feature_transfer)*1000,"ms")

            # do some dry run
            for i in range(gpu_dry_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            # synchronize 
            torch.cuda.synchronize()

            start = time.time()
            for i in range(gpu_benchmark_run):
                test_cuda=F.gsddmm(g_cuda,"dot",y_cuda,y_cuda,"u","v")
            torch.cuda.synchronize()
            end = time.time()
            gpu_time_per_iteration = (end - start)/float(gpu_benchmark_run)

            print("gpu",gpu_time_per_iteration*1000," ms")


            start_result_transfer = time.time()
            y_host=y_cuda.to(torch.device('cpu'))
            torch.cuda.synchronize()
            end_result_transfer = time.time()
            print("feature DtoH time: ",(end_result_transfer - start_result_transfer)*1000,"ms")

            start_graph_post_transfer = time.time()
            g_host=g_cuda.to(torch.device('cpu'))
            torch.cuda.synchronize()
            end_graph_post_transfer = time.time()
            print("graph DtoH time: ",(end_graph_post_transfer - start_graph_post_transfer)*1000,"ms")

            for i in range(cpu_dry_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")

            start = time.time()
            for i in range(cpu_benchmark_run):
                test=F.gsddmm(g,"dot",y,y,"u","v")
            end = time.time()
            cpu_time_per_iteration = (end - start)/float(cpu_benchmark_run)
            print("cpu: ",cpu_time_per_iteration*1000," ms")

            writer.writerow([feature_size,g.num_nodes(),g.num_edges(), cpu_time_per_iteration*1000,gpu_time_per_iteration*1000])

f.close()

Output of python measurement

graph HtoD time:  123.78549575805664 ms
feature HtoD time:  672.8084087371826 ms
gpu 186.7778778076172  ms
feature DtoH time:  1784.4436168670654 ms
graph DtoH time:  7.060050964355469 ms
cpu:  2687.334632873535  ms

output from NvProf:
$nvprof --openacc-profiling off python test.py

==20523== Profiling application: python test.py
==20523== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   49.65%  2.42782s        13  186.76ms  186.73ms  186.80ms  void dgl::aten::cuda::SDDMMCooTreeReduceKernel<long, float, bool=0, bool=0, int=0, int=2>(float const *, float const , dgl::aten::cuda::SDDMMCooTreeReduceKernel<long, float, bool=0, bool=0, int=0, int=2>*, long const *, long const , long const , long, long, long, long, long const *, long const , long, long, long)
                   36.59%  1.78915s         3  596.38ms  2.4019ms  1.78375s  [CUDA memcpy DtoH]
                   13.76%  672.59ms         3  224.20ms  1.3035ms  669.99ms  [CUDA memcpy HtoD]
                    0.00%  100.23us        13  7.7100us  7.3920us  8.0640us  void at::native::vectorized_elementwise_kernel<int=4, at::native::FillFunctor<float>, at::detail::Array<char*, int=1>>(int, float, at::native::FillFunctor<float>)
      API calls:   60.48%  3.96078s         7  565.83ms  3.5270us  1.86660s  cudaDeviceSynchronize
                   39.45%  2.58347s         6  430.58ms  1.2656ms  1.78431s  cudaMemcpyAsync
                    0.04%  2.8964ms         3  965.46us  171.75us  2.5392ms  cudaMalloc
                    0.01%  381.72us         2  190.86us  184.74us  196.97us  cuDeviceTotalMem
                    0.00%  196.90us       199     989ns      92ns  44.063us  cuDeviceGetAttribute
                    0.00%  192.80us        26  7.4150us  4.5920us  29.041us  cudaLaunchKernel
                    0.00%  191.61us         4  47.902us  4.2050us  174.83us  cudaStreamSynchronize
                    0.00%  161.05us         2  80.524us  79.216us  81.832us  cudaGetDeviceProperties
                    0.00%  65.128us       365     178ns      91ns  3.5310us  cuGetProcAddress
                    0.00%  60.072us       171     351ns     193ns  2.4380us  cudaGetDevice
                    0.00%  34.584us         2  17.292us  16.233us  18.351us  cuDeviceGetName
                    0.00%  12.732us         3  4.2440us  1.9840us  8.1440us  cudaStreamIsCapturing
                    0.00%  11.261us         4  2.8150us  1.5570us  5.4280us  cudaSetDevice
                    0.00%  7.3280us        29     252ns     162ns     684ns  cudaGetLastError
                    0.00%  5.8970us         1  5.8970us  5.8970us  5.8970us  cuDeviceGetPCIBusId
                    0.00%  2.1270us         1  2.1270us  2.1270us  2.1270us  cuInit
                    0.00%  1.6290us         4     407ns     114ns  1.0420us  cuDeviceGetCount
                    0.00%  1.0850us         3     361ns     154ns     704ns  cudaGetDeviceCount
                    0.00%     996ns         3     332ns      94ns     561ns  cuDeviceGet
                    0.00%     589ns         3     196ns     104ns     369ns  cuDevicePrimaryCtxGetState
                    0.00%     454ns         2     227ns     187ns     267ns  cuDeviceGetUuid
                    0.00%     146ns         1     146ns     146ns     146ns  cuDriverGetVersion

Thanks!