Question about synchronization in training RGCN on multi-GPU

Hi, I have a question about training RGCN models on a single machine with multiple GPUs (see entity_classify_mp.py). The code uses DistributedDataParallel to synchronize gradients across processes. Why does it still need th.distributed.barrier()?

My understanding is that th.distributed.barrier() waits for all processes to finish their work. Since DistributedDataParallel only syncs gradients once every process has finished its backward pass, that step already ensures all processes have completed. If so, why is th.distributed.barrier() still needed?
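For context, here is a minimal sketch of how I understand the gradient averaging to work in a generic DDP training loop (the function and variable names are illustrative, not the actual code in entity_classify_mp.py):

```python
import torch as th

def train_one_epoch(model, dataloader, optimizer, loss_fn, device):
    # model is assumed to be wrapped in DistributedDataParallel
    for x, y in dataloader:
        logits = model(x.to(device))
        loss = loss_fn(logits, y.to(device))
        optimizer.zero_grad()
        loss.backward()    # DDP all-reduces (averages) gradients across ranks here
        optimizer.step()   # every rank then applies the same averaged update
```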

When this code was developed, PyTorch still required a manual barrier to ensure that gradient synchronization had finished. That requirement was later removed and backward() was made blocking. Apart from this legacy reason, the barrier also ensures that the evaluation on proc#0 finishes before the next iteration starts.
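Roughly, the per-epoch pattern looks like the sketch below (illustrative names only; `train_one_epoch` and `evaluate` are hypothetical helpers, not functions from the example):

```python
import torch as th

def run(proc_id, model, train_loader, valid_loader, optimizer, loss_fn, device, num_epochs):
    for epoch in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer, loss_fn, device)
        if proc_id == 0:
            evaluate(model, valid_loader, device)  # only proc#0 runs evaluation
        th.distributed.barrier()  # other ranks wait here until proc#0 finishes evaluating
```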

Thanks Minjie! I think it makes more sense that this barrier is there to ensure the evaluation on proc#0 finishes, rather than to synchronize processes for averaging gradients.

If this barrier were meant to synchronize processes for gradient averaging, it would have to be placed inside the dataloader loop. But in the code it sits outside the dataloader iteration, inside the epoch loop, which means processes only synchronize after each epoch finishes. Is it necessary to synchronize after every epoch?

It should not be necessary.
