[DistDGL] Should we split batch_size in a distributed setting if we want results equivalent to training on a single node? And how?

Assume we train GraphSAGE on a single machine with the following settings: batch_size = B, lr = LR.

Then for DistDGL with n nodes (machines), what is the correct setting for batch_size and lr if I want results equivalent to single-machine training? Should it be batch_size = B/n and lr = LR? (Refer to this discussion for PyTorch distributed: Should we split batch_size according to ngpu_per_node when DistributedDataparallel - #2 by mrshenli - distributed - PyTorch Forums.)

Also, there are other related parameters when using DistDGL's launch.py, including num_trainers, num_samplers, and num_servers. Would they affect the question above (e.g., should the batch size also be divided by num_trainers)?

To make it clearer, batch_size is the argument passed into DistDataLoader. See here
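For context, a minimal sketch of where that batch_size ends up in a per-trainer DistDGL training script (the graph name, fan-outs, and mask key below are illustrative, not from the original post):

```python
import dgl

# Runs in every trainer process, after dgl.distributed.initialize(...)
# and torch.distributed.init_process_group(...) have been called.
batch_size = 128  # per-trainer batch size; the value under discussion

g = dgl.distributed.DistGraph('ogbn-products')       # illustrative graph name
train_nids = dgl.distributed.node_split(
    g.ndata['train_mask'], g.get_partition_book())   # training nodes assigned to this trainer

sampler = dgl.dataloading.NeighborSampler([10, 25])  # illustrative fan-outs
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler,
    batch_size=batch_size,  # each trainer draws mini-batches of this size
    shuffle=True, drop_last=False)
```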

If you don't change the learning rate, you probably need to reduce the per-trainer batch size to B/n when you increase the number of trainers to n.

However, you can always use a larger learning rate instead: try increasing it linearly with n, or with the square root of n.
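A small sketch of those two scaling heuristics (the helper function is illustrative, not a DGL API):

```python
import math

def scaled_lr(base_lr: float, n: int, rule: str = 'linear') -> float:
    """Heuristics for scaling the learning rate with n trainers when the
    per-trainer batch size stays at B (effective global batch = n * B)."""
    if rule == 'linear':
        return base_lr * n              # linear scaling rule
    if rule == 'sqrt':
        return base_lr * math.sqrt(n)   # square-root scaling rule
    raise ValueError(f'unknown rule: {rule}')

# e.g. LR = 0.003 on one machine -> 0.012 with 4 trainers under the linear rule
```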

Thanks for your reply. Since num_trainers is the number of trainer processes per machine, it plays the same role as num_nodes here? So I should reduce the batch size to B / (num_nodes * num_trainers) if I don't change the learning rate, right?
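For concreteness, the arithmetic under that reading (num_trainers counted per machine, so the total trainer count is the product; the numbers are made up):

```python
B = 1024             # single-machine batch size
num_nodes = 4        # machines
num_trainers = 2     # trainer processes per machine

world_size = num_nodes * num_trainers      # 8 trainer processes in total
per_trainer_batch_size = B // world_size   # 1024 // 8 = 128, with lr = LR unchanged
```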
