I understand that loss.backward(), when training with DDP in DistDGL, overlaps compute and communication by synchronizing gradients across all workers in buckets, i.e., an all-reduce is launched for each bucket of gradients as soon as it has been computed. Is there a way to break down the time spent in these two phases: compute (gradient calculation) and communication (syncing gradients via all-reduce)?
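For context, here is a minimal sketch (assuming a plain PyTorch DDP training step, nothing DistDGL-specific; `model`, `loss_fn`, `batch`, and `labels` are placeholders) of how one might use torch.profiler to separate the backward compute kernels from the NCCL all-reduce kernels that DDP launches per bucket:

```python
# A minimal sketch, assuming a CUDA model already wrapped in DistributedDataParallel.
# `model`, `loss_fn`, `batch`, and `labels` are placeholders, not DistDGL APIs.
import torch
from torch.profiler import profile, ProfilerActivity

def profile_backward(model, loss_fn, batch, labels):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        loss = loss_fn(model(batch), labels)
        loss.backward()  # gradient compute and per-bucket all-reduce overlap here
    # In the resulting table, communication shows up as NCCL all-reduce entries
    # (names vary by version, e.g. "ncclKernel_AllReduce..." or "nccl:all_reduce");
    # the remaining backward ops/kernels are the gradient computation.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))
```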
This is really about DDP itself, so I think you could refer to the official PyTorch documentation (Distributed communication package - torch.distributed — PyTorch 2.0 documentation) or file a post on the PyTorch forum.
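For example, DistributedDataParallel documents a no_sync() context manager that skips the gradient all-reduce; comparing a backward pass under no_sync() with a normal one gives a rough first-order estimate. A sketch, assuming `ddp_model` is a CUDA model wrapped in DDP and `compute_loss()` (a placeholder) runs one forward pass and returns the loss:

```python
# Rough sketch: the difference between a synced and an unsynced backward pass
# approximates only the *exposed* (non-overlapped) all-reduce time, since DDP
# overlaps communication with gradient computation where it can.
import contextlib
import time
import torch

def timed_backward(ddp_model, compute_loss, sync):
    ddp_model.zero_grad(set_to_none=True)
    ctx = contextlib.nullcontext() if sync else ddp_model.no_sync()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with ctx:
        compute_loss().backward()
    torch.cuda.synchronize()
    return time.perf_counter() - t0

# exposed_comm ≈ timed_backward(m, f, sync=True) - timed_backward(m, f, sync=False)
```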
I have created a topic on the PyTorch forum. Thank you.
Great. Could you paste the link here so other community members can learn from it too?
Of course. No one has responded to it yet, but here’s the link for anyone to follow up - https://discuss.pytorch.org/t/breaking-down-compute-and-communication-in-loss-backward/182526