Breaking down compute and communication in DistDGL training

I understand that loss.backward(), when training with DDP in DistDGL, overlaps compute and communication by sharing gradients among all workers in a bucketed way, i.e. the all-reduce for a bucket starts as soon as the gradients in that bucket have been computed. Is there a way to break down the time taken by these two phases: compute (gradient calculation) and communication (syncing gradients via all-reduce)?
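For context, the kind of breakdown I am after could in principle come from torch.profiler, which records the NCCL all-reduce kernels that DDP launches alongside the backward compute kernels. A rough sketch (model, loader, loss_fn and optimizer are placeholders for the usual DDP training setup, not DistDGL-specific code):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile, ProfilerActivity

# model, loader, loss_fn, optimizer: the usual per-rank objects (placeholders)
ddp_model = DDP(model)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for step, (x, y) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()   # gradient compute overlapped with bucketed all-reduce
        optimizer.step()
        if step == 10:    # profile only a handful of steps
            break

# NCCL all-reduce kernels contain "nccl" in their name; the remaining CUDA time
# is (mostly) forward/backward compute.
events = prof.key_averages()
print(events.table(sort_by="cuda_time_total", row_limit=20))
comm = sum(e.cuda_time_total for e in events if "nccl" in e.key.lower())
total = sum(e.cuda_time_total for e in events)
print(f"comm kernels: {comm / 1e3:.1f} ms, other CUDA time: {(total - comm) / 1e3:.1f} ms")
```

Since DDP runs the all-reduce on a separate CUDA stream, summed kernel times are not a clean wall-clock split; exporting a trace with prof.export_chrome_trace("trace.json") and opening it in chrome://tracing makes the overlap between the compute stream and the NCCL stream visible.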

This is really about DDP, so I think you could refer to the official PyTorch docs, e.g. Distributed communication package - torch.distributed — PyTorch 2.0 documentation, or file a post on the PyTorch forum.

I have created a topic on the PyTorch forum. Thank you.

Great. Could you paste the link here so other community members can learn from it too?

Of course. No one has responded to it yet, but here's the link for anyone who wants to follow up: https://discuss.pytorch.org/t/breaking-down-compute-and-communication-in-loss-backward/182526

