Comparison of DistDGL execution time for 2 and 4 nodes

Cow-Kite · May 25, 2024, 11:36am

I’m trying to implement distributed learning using DistDGL. I followed the README.md at this address verbatim: dgl/examples/distributed/graphsage at master · dmlc/dgl · GitHub
While comparing distributed learning on two and four nodes, I have a question about the time taken for each train epoch.

This is the train epoch information of two nodes

This is the train epoch information of four nodes

When there are 4 nodes, the “Mean step time” takes about twice as long as when there are 2 nodes.
Since the batch size is the same, it seems that the “Mean step time” should be similar, but the results are not.
Why do I see these results? Is there a communication problem?

tycc · May 28, 2024, 9:06am

From the figures you posted, the batch size is the same and the number of steps is halved, but the step time has increased from 0.1 to 0.2. The epoch time and the later time breakdown do not appear to change significantly.

If we look at this distributed system as a whole, the batch size setting is the same, but going from two machines to four doubles the effective batch size for the system. Each step then needs twice as much computation overall, so it seems reasonable for the step time to double as well.
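To make that arithmetic concrete, here is a minimal Python sketch. The training-set size and the per-trainer batch size are made-up numbers, and one trainer per machine is assumed. Only the 0.1 and 0.2 step times come from the figures in this thread.

```python
import math


def steps_per_epoch(num_train_samples, batch_size_per_trainer, num_trainers):
    """Steps each trainer runs per epoch when the training set is split evenly."""
    global_batch_size = batch_size_per_trainer * num_trainers
    return math.ceil(num_train_samples / global_batch_size)


# Hypothetical values, not taken from the thread.
NUM_TRAIN_SAMPLES = 200_000
BATCH_SIZE_PER_TRAINER = 1_000

# Mean step times read off the figures in this thread (2 vs. 4 machines).
MEAN_STEP_TIME = {2: 0.1, 4: 0.2}

for num_trainers, step_time in MEAN_STEP_TIME.items():
    steps = steps_per_epoch(NUM_TRAIN_SAMPLES, BATCH_SIZE_PER_TRAINER, num_trainers)
    epoch_time = steps * step_time
    print(f"{num_trainers} trainers: global batch "
          f"{BATCH_SIZE_PER_TRAINER * num_trainers}, {steps} steps, "
          f"epoch time ~{epoch_time:.1f}")
```

With these assumed numbers the step count halves while the step time doubles, so the product (the epoch time) stays roughly the same, which is consistent with the epoch time not changing much in the figures.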

Also, the time in the picture is twice as long or even a little bit more, so I think maybe there’s a communication problem. Can you give some communication metrics for reference?
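One rough way to narrow this down, sketched in plain Python: wrap each phase of the training step in a timer and compare the per-phase means between the 2-node and 4-node runs. `PhaseTimer` and the phase names are hypothetical helpers for illustration, not part of DGL, and the `time.sleep` calls only stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        # Note: on a GPU, work is launched asynchronously, so the device
        # would need to be synchronized before reading the clock.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def means(self):
        """Mean seconds per call for every phase seen so far."""
        return {name: self.totals[name] / self.counts[name] for name in self.totals}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("sample"):
        time.sleep(0.001)  # placeholder for sampling and remote feature fetch
    with timer.phase("forward"):
        time.sleep(0.001)  # placeholder for the model's forward pass
print(timer.means())
```

If the phase that involves remote data grows much more than the others when moving to four nodes, that would point toward communication rather than computation.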

Cow-Kite · May 30, 2024, 8:42am

Can you explain this in more detail?
I don’t understand why the time would increase on the grounds that the computation doubles with 4 nodes compared to 2 nodes.
A simple way to think about it is that the overall size of the dataset is the same whether you train on 2 or 4 nodes.

tycc · May 30, 2024, 1:53pm

I have two different machines available for training and the results are shown in the figure

An experiment I did yesterday:

num_epochs is 1.
There is a clear difference in the forward time.

Today I repeated the experiment on one of the nodes:

num_epochs is 1

num_epochs is 2

I can’t understand the change in the forward time.

system · June 29, 2024, 1:53pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.