Any suggestions on accelerating the mailbox of DGL?


I used DGL to implement a graph neural network, but found that the speed is not fast enough. So I carefully profiled the time cost of each part of my message/reduce functions. Here is the code snippet:

I tested it on both GPU and CPU for the same graph. Here are the results:

1. GPU mode

total time : 2.816583
Device Time : 0.020433 (0.725460%)
Mailbox Time : 0.868750 (30.844124%)
Edge Label Repeat Time : 0.049158 (1.745301%)
Edge Label Reverse Time : 0.038511 (1.367280%)
Sum Time 1 : 0.046140 (1.638171%)
Sum Time 2 : 0.048657 (1.727534%)
Sum Time 3 : 0.110803 (3.933964%)
Sum Time 4 : 0.060176 (2.136503%)
Post Time 1 : 1.522448 (54.053018%)
Post Time 2 : 0.051505 (1.828646%)

2. CPU mode

total time : 1.796776
Device Time : 0.025445 (1.416160%)
Mailbox Time : 0.089155 (4.961953%)
Edge Label Repeat Time : 0.036036 (2.005593%)
Edge Label Reverse Time : 0.028948 (1.611098%)
Sum Time 1 : 0.027066 (1.506338%)
Sum Time 2 : 0.028965 (1.612054%)
Sum Time 3 : 0.879028 (48.922498%)
Sum Time 4 : 0.542981 (30.219728%)
Post Time 1 : 0.110166 (6.131292%)
Post Time 2 : 0.028987 (1.613288%)


It seems that when training on GPU, fetching data from the nodes (e.g., `nodes.mailbox['h']` in Lines 10-12 and Line 36) is the most time-consuming part, but on CPU it is not. Do you have any suggestions for improving the speed of this part when using GPU?


Btw, I just noticed that DGL v0.2 has been released. Does it include any speed-ups or improvements for this part? (Not quite sure about this.)


Sorry for the late reply, I’ll look into this.


@zihao Hi Zihao, any suggestions on this issue? Thanks a lot.


Thanks for your patience.
First, I’m not sure whether your profiling is correct. In my experience, to get a correct time measurement on GPU you should call torch.cuda.synchronize() before recording the time, because CUDA kernels are launched asynchronously.
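To make the synchronization point concrete, here is a minimal timing helper sketch. The `timed` wrapper and `_cuda` flag are my own names, not part of DGL or PyTorch; the import is guarded so the sketch also runs on a machine without PyTorch.

```python
import time

try:
    import torch  # optional: only needed for the CUDA path
    _cuda = torch.cuda.is_available()
except ImportError:  # the sketch still runs without PyTorch installed
    _cuda = False

def timed(fn):
    """Return (result, seconds), synchronizing around the call on GPU.

    Without torch.cuda.synchronize(), CUDA kernels are merely queued and
    the wall-clock time mostly measures Python-side launch overhead,
    not the actual compute.
    """
    if _cuda:
        torch.cuda.synchronize()   # drain previously queued kernels
    t0 = time.perf_counter()
    out = fn()
    if _cuda:
        torch.cuda.synchronize()   # wait until this call's kernels finish
    return out, time.perf_counter() - t0
```

On CPU the two synchronize calls are no-ops, which is consistent with the CPU profile above looking more trustworthy than the GPU one.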


Plus, what kind of graph are you working on? If you write your own message and reduce functions instead of using the built-in ones, DGL by default uses degree bucketing to do auto-batching.
That said, if the node degrees in your graph have high variance, the computation may not be as efficient as you expect. Training Tree-LSTM is fast because all nodes in the graph have the same degree.
We can always accelerate special cases by writing custom CUDA kernels, but we can never cover all kinds of applications (degree bucketing is general enough, but not efficient in some cases, as I mentioned above). It would help if you told us what kind of graph you are dealing with and what kind of message/reduce functions you would like to use. We could discuss by email if it’s not convenient to share them in public.
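The degree-bucketing idea above can be sketched in plain Python (the function name and shapes are illustrative, not DGL internals): nodes are grouped by in-degree, and the user-defined reduce function runs once per group.

```python
from collections import defaultdict

def degree_buckets(in_degrees):
    """Group node ids by in-degree.

    Degree bucketing invokes the user-defined reduce function once per
    bucket, batching each bucket's mailbox into a single tensor of shape
    (num_nodes_in_bucket, degree, feat_dim).  High degree variance means
    many small buckets, i.e. many small, inefficient kernel launches.
    """
    buckets = defaultdict(list)
    for node, deg in enumerate(in_degrees):
        if deg > 0:                # zero-in-degree nodes receive no messages
            buckets[deg].append(node)
    return dict(buckets)

# A Tree-LSTM-like graph (uniform degree) needs one reduce call;
# a skewed graph needs one call per distinct degree.
print(degree_buckets([2, 2, 2, 2]))      # {2: [0, 1, 2, 3]}  -> 1 call
print(degree_buckets([1, 3, 7, 2, 1]))   # 4 buckets          -> 4 calls
```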


Thanks a lot for your reply, @zihao. I am using the built-in functions (e.g., mailbox, etc.). The code I posted is actually my reduce function.

The input graphs of my project have varying degrees; unlike a tree, the node degrees are not the same across nodes. Perhaps that is the reason.


Actually, by “built-in” functions I mean the functions defined in dgl.function (e.g. dgl.function.sum, dgl.function.src_mul_edge). DGL calls more efficient kernels instead of doing degree bucketing when it detects certain combinations of built-in functions.
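The payoff of those fused kernels can be illustrated with NumPy as a stand-in (the function names and dense adjacency matrix are mine, purely for illustration): a per-node Python-loop reduce versus a single sparse-matrix-vector-style product, which is what the built-in copy/sum combinations compile down to.

```python
import numpy as np

def reduce_by_buckets(adj, h):
    """UDF-style reduce: gather each node's incoming messages and sum
    them one destination node at a time (what degree bucketing batches)."""
    out = np.zeros_like(h)
    for v in range(adj.shape[0]):
        srcs = np.nonzero(adj[:, v])[0]   # sources with an edge into v
        if len(srcs):
            out[v] = h[srcs].sum(axis=0)
    return out

def reduce_fused(adj, h):
    """Built-in-style reduce: one matrix product replaces all the
    per-bucket calls; adj[u, v] == 1 means an edge u -> v."""
    return adj.T @ h

adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
h = np.arange(6, dtype=float).reshape(3, 2)   # one feature row per node
assert np.allclose(reduce_by_buckets(adj, h), reduce_fused(adj, h))
```

Both paths compute the same sums; the fused path just does it in one kernel launch instead of one launch per bucket.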


Hi @jiayouwyhit. Thanks for the questions. I want to emphasize again the most important points that @zihao has mentioned.

Regarding the difference between the CPU and GPU profiling results: as you know, PyTorch operators execute asynchronously on GPU, which means the GPU profiling results may not reflect the actual bottleneck. Looking at your CPU results, most of the time is spent in U_iou and U_f.

Besides, the way DGL batches reduce_func over multiple nodes is to analyze the graph structure and batch together the nodes with the same in-degree. That also means that if the nodes in your graph have different in-degrees, reduce_func will be invoked multiple times, once per in-degree. In your profiling code (I am not sure which profiler you are using), I see you call profiler.start without profiler.stop. Since reduce_func can be invoked multiple times in one round of message passing, the profiling results may also be affected by this, especially for asynchronous GPU operators.
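One way to handle the multiple-invocation issue is an accumulating timer with paired start/stop calls. This is a hypothetical sketch (the `Accumulator` class is mine, not the profiler mentioned above): a single start() with no stop() captures at most one bucket's invocation, whereas summing paired intervals captures all of them.

```python
import time

class Accumulator:
    """Hypothetical accumulating timer for code invoked once per
    degree bucket: pair every start() with a stop() and sum the
    intervals, rather than starting once and never stopping."""
    def __init__(self):
        self.total = 0.0
        self._t0 = None

    def start(self):
        self._t0 = time.perf_counter()

    def stop(self):
        self.total += time.perf_counter() - self._t0
        self._t0 = None

timer = Accumulator()
for _ in range(3):          # e.g. three degree buckets in one round
    timer.start()
    sum(range(1000))        # stand-in for one reduce_func invocation
    timer.stop()
print(timer.total)          # total across all invocations
```

On GPU this still needs the synchronize-before-reading-the-clock discipline discussed earlier in the thread.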