Any suggestions on accelerating the mailbox and node.data of DGL?


#1

I used DGL to implement a graph neural network, but found that it is not fast enough. So I carefully checked the time cost of each part of my message reduce function. Here is the code snippet:

https://pastebin.com/JTUkjCZn

I tested it on both GPU and CPU for the same graph. Here are the results:

=====
1. GPU mode
total time : 2.816583

Device Time : 0.020433 (0.725460%)

Mailbox Time : 0.868750 (30.844124%)

Edge Label Repeat Time : 0.049158 (1.745301%)

Edge Label Reverse Time : 0.038511 (1.367280%)

Sum Time 1 : 0.046140 (1.638171%)

Sum Time 2 : 0.048657 (1.727534%)

Sum Time 3 : 0.110803 (3.933964%)

Sum Time 4 : 0.060176 (2.136503%)

Post Time 1 : 1.522448 (54.053018%)

Post Time 2 : 0.051505 (1.828646%)

2. CPU Mode:

totaltime : 1.796776

Device Time : 0.025445 (1.416160%)

Mailbox Time : 0.089155 (4.961953%)

Edge Label Repeat Time : 0.036036 (2.005593%)

Edge Label Reverse Time : 0.028948 (1.611098%)

Sum Time 1 : 0.027066 (1.506338%)

Sum Time 2 : 0.028965 (1.612054%)

Sum Time 3 : 0.879028 (48.922498%)

Sum Time 4 : 0.542981 (30.219728%)

Post Time 1 : 0.110166 (6.131292%)

Post Time 2 : 0.028987 (1.613288%)

=======

It seems that when training on GPU, fetching data from the nodes (e.g., “nodes.mailbox[‘h’]” and “nodes.data” in Lines 10-12, and Line 36) is the most time-consuming part, but on CPU it is not. Do you have any suggestions for speeding up this part on GPU?


#2

By the way, I just noticed that DGL v0.2 has been released. Does it include any speedups or improvements for this part? (I'm not sure about this.)


#3

Sorry for the late reply. I’ll look into this.


#4

@zihao Hi Zihao, any suggestions on this issue? Thanks a lot.


#5

Thanks for your patience.
First, I’m not sure whether your profiling is correct. In my experience, to get correct timings on GPU, you need to call torch.cuda.synchronize() before you record the time.
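To make the synchronization point concrete, here is a minimal sketch of a timing helper. The names `timed` and `totals` are made up for this illustration and are not part of DGL or of the original code. The helper calls torch.cuda.synchronize() before reading the clock on both sides of the timed region, and it accumulates time per label so that a function invoked several times is still counted correctly. It assumes PyTorch is installed for GPU timing and falls back to plain wall-clock timing otherwise.

```python
import time
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # keeps the sketch runnable without PyTorch
    torch = None
    _HAS_CUDA = False


def _sync():
    """Wait for all queued GPU kernels to finish (no-op on CPU-only setups)."""
    if _HAS_CUDA:
        torch.cuda.synchronize()


@contextmanager
def timed(name, totals):
    """Add the wall-clock time of the enclosed block to totals[name]."""
    _sync()  # do not charge earlier, still-running kernels to this block
    start = time.perf_counter()
    try:
        yield
    finally:
        _sync()  # make sure this block's own kernels have finished
        totals[name] = totals.get(name, 0.0) + time.perf_counter() - start


# Hypothetical usage: the label mirrors one of the entries in the report above.
totals = {}
for _ in range(3):
    with timed("Mailbox Time", totals):
        sum(range(10000))  # stand-in for the real work
```

Without the two synchronization calls, a timer around a GPU operation mostly measures how long it takes to queue the kernel, and the real cost shows up at whichever later line happens to wait for the result.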


#6

Also, what kind of graph are you working on? If you write your own message and reduce functions instead of using built-in ones, DGL by default uses degree bucketing for auto-batching.
That said, if the node degrees in your graph have high variance, the computation might not be as efficient as you expect. Training Tree-LSTM is fast because all nodes in the graph have the same degree.
There are always ways to accelerate special cases, such as writing custom CUDA kernels, but we can never cover every kind of application (degree bucketing is general enough, but inefficient in some cases, as I mentioned above). It would help if you told us what kind of graph you are dealing with and what kind of message/reduce functions you would like to use. We could discuss by email if it’s not convenient to show them in public.
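As a rough illustration of why degree variance matters for degree bucketing, the sketch below groups nodes by in-degree in plain Python. It is not DGL's implementation; the function `degree_buckets` and the two toy graphs are invented for this example. Each bucket corresponds to one batched call of a user-defined reduce function.

```python
from collections import defaultdict


def degree_buckets(in_edges):
    """Group node ids by in-degree.

    in_edges maps a node id to the list of source node ids that send it messages.
    """
    buckets = defaultdict(list)
    for node, sources in in_edges.items():
        buckets[len(sources)].append(node)
    return dict(buckets)


# Tree-like graph: every receiving node has exactly two incoming edges.
uniform = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
# Irregular graph: every receiving node has a different in-degree.
skewed = {0: [1], 1: [0, 2], 2: [0, 1, 3], 3: [0, 1, 2, 4]}

print(len(degree_buckets(uniform)))  # 1 bucket: one large batched reduce call
print(len(degree_buckets(skewed)))   # 4 buckets: four small reduce calls
```

With many distinct in-degrees, the work is split into many small batches, and on GPU the per-call overhead can dominate the actual computation.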


#7

Thanks a lot for your reply, @zihao. I am using the built-in functions (e.g., mailbox, etc.). The code I posted is actually my reduce function.

The input graphs in my project have varying node degrees; unlike a tree, the nodes do not all have the same degree. Perhaps that is the reason.


#8

Actually, by “built-in” functions I mean the functions defined in dgl.function (e.g. dgl.function.sum, dgl.function.src_mul_edge). DGL calls more efficient kernels instead of doing degree bucketing if it detects certain combinations of built-in functions.
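To see why a built-in reducer such as a sum does not need degree bucketing, consider the conceptual sketch below. It is plain Python, not DGL's kernel, and the function name `sum_by_destination` is invented for this example. A sum over incoming messages can be computed in a single pass over the edge list, whatever the in-degree of each node is.

```python
def sum_by_destination(num_nodes, edges, feat):
    """Sum source features into each destination node.

    edges is a list of (src, dst) pairs; feat holds one float per node.
    """
    out = [0.0] * num_nodes
    for src, dst in edges:
        out[dst] += feat[src]  # one accumulation per edge, no grouping by degree
    return out


# Node 0 has in-degree 1, node 1 has in-degree 2, node 2 has in-degree 3,
# and node 3 receives no messages.
edges = [(1, 0), (0, 1), (2, 1), (0, 2), (1, 2), (3, 2)]
feat = [1.0, 2.0, 3.0, 4.0]
result = sum_by_destination(4, edges, feat)
print(result)  # [2.0, 4.0, 7.0, 0.0]
```

A reduce function written by hand is opaque to the library, so it cannot be rewritten into a single edge-wise pass like this, which is presumably why the general degree-bucketing path is used for it.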


#9

Hi @jiayouwyhit. Thanks for the questions. I want to emphasize again the most important points that @zihao has mentioned.

On the difference between the CPU and GPU profiling results: as you know, PyTorch operators are executed asynchronously on GPU, which means the GPU profiling results may not reflect the actual bottleneck. Looking at your CPU results, most of the time is spent in U_iou and U_f.

Besides, DGL batches reduce_func over multiple nodes by analyzing the graph structure and batching nodes with the same in-degree together. This means that if the nodes in the graph have different in-degrees, reduce_func will be invoked multiple times, once for each distinct in-degree. In your profiling code, although I am not sure which profiler you are using, I see you call profiler.start without profiler.stop. Since reduce_func could be invoked multiple times in one message passing round, I don’t know whether the profiling results are also affected by this, especially with asynchronous GPU operators.