I’m trying to implement distributed training using DistDGL. I followed the README.md at dgl/examples/distributed/graphsage at master · dmlc/dgl · GitHub verbatim.
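For context, the setup is the one from that README; between the two runs I believe the only things that change are the number of graph partitions produced by partition_graph.py and the contents of ip_config.txt. As a sketch (placeholder addresses, one machine per line, which is the format DistDGL expects), the four-node ip_config.txt looks like the following, and the two-node run uses only the first two lines:

```
192.168.0.11
192.168.0.12
192.168.0.13
192.168.0.14
```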
While comparing distributed training on two nodes versus four nodes, I have a question about the time taken per training epoch.
Here is the train epoch information from the two-node run:
Here is the train epoch information from the four-node run:
When there are 4 nodes, the “Mean step time” takes about twice as long as when there are 2 nodes.
Since the batch size (the --batch_size given to each trainer) is the same in both runs, I expected the “Mean step time” to be similar, but it is not.
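To make my expectation concrete, here is the back-of-the-envelope arithmetic I have in mind; the training-set size and the per-trainer batch size of 1000 are assumptions based on the README’s ogbn-products setup, not numbers from my logs:

```python
# Expectation sketch: each trainer handles the same per-step batch regardless of the
# number of machines, so step time should stay roughly flat while steps/epoch shrink.
# The numbers below are assumptions (ogbn-products train split, --batch_size 1000).
train_nodes = 196_615     # approx. ogbn-products training nodes (assumed)
batch_size = 1_000        # per-trainer batch size from the README command (assumed)

for num_machines in (2, 4):
    trainers = num_machines               # 1 trainer per machine, as in the README launch
    global_batch = trainers * batch_size  # seed nodes processed across the cluster per step
    steps_per_epoch = train_nodes // global_batch
    print(f"{num_machines} machines: global batch {global_batch}, "
          f"~{steps_per_epoch} steps/epoch, {batch_size} seeds per trainer per step")
```

So on four machines I expected roughly half the steps per epoch at a similar time per step; instead the time per step itself roughly doubles.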
Why do I see these results? Could this be a communication problem between the machines?