Assume we train GraphSAGE on a single machine with the following settings: batch_size = B and lr = LR.
Then, for DistDGL with n nodes (machines), what is the correct setting of batch_size and lr if I want results equivalent to single-machine training? Should it be batch_size = B/n and lr = LR? (Refer to this discussion for PyTorch distributed: Should we split batch_size according to ngpu_per_node when DistributedDataparallel - #2 by mrshenli - distributed - PyTorch Forums.)
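For concreteness, here is a minimal sketch of the two settings I am comparing (the numbers and variable names are placeholders for illustration, not actual DGL arguments):

```python
# Single-machine baseline (placeholder values).
B = 1024    # batch size on one machine
LR = 0.01   # learning rate

# DistDGL with n machines: is this the right per-machine setting so that
# the effective global batch per optimizer step is still B?
n = 4
per_machine_batch_size = B // n   # 256 per machine, 4 * 256 = 1024 globally
per_machine_lr = LR               # keep the learning rate unchanged?
```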
Also, there are other related parameters when using launch.py in DistDGL, including num_trainers, num_samplers, and num_servers. Would they affect the question above (e.g. should the batch size also be divided by num_trainers)?
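In other words, if --num_trainers spawns several trainer processes per machine, I am wondering whether the division should use the total number of trainer processes, along these lines (again, just illustrating the question, not a recommendation):

```python
B, n = 1024, 4       # same placeholders as above: global batch size, machines
num_trainers = 4     # trainer processes per machine (launch.py --num_trainers)

# Total trainer processes = n * num_trainers; should each one get this value?
per_trainer_batch_size = B // (n * num_trainers)   # 1024 // 16 = 64 per trainer?
```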
To make it clearer, batch_size here is the argument that is passed into DistDataLoader. See here.
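Roughly, my training script follows the DistDGL GraphSAGE example and creates the loader like this (a simplified excerpt; it assumes launch.py has already started the servers and samplers and the graph was partitioned beforehand, and names like 'ip_config.txt', 'graph_name', and sample_blocks are placeholders):

```python
import dgl
from dgl.distributed import DistDataLoader

# Connect to the distributed graph started by launch.py.
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('graph_name')
train_nid = dgl.distributed.node_split(g.ndata['train_mask'])

def sample_blocks(seeds):
    # Placeholder collate function that turns a batch of seed nodes into
    # message-flow-graph blocks via distributed neighbor sampling.
    ...

# `batch_size` below is the value my question is about: should each trainer
# process pass the full B here, or B divided by the number of trainers?
dataloader = DistDataLoader(
    dataset=train_nid.numpy(),
    batch_size=1024,        # B, or B / n?
    collate_fn=sample_blocks,
    shuffle=True,
    drop_last=False)
```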