Save model during distributed training

I’m following the tutorial for training GraphSAGE on multiple GPUs on a single machine.
I know that in single-GPU mode we use torch.save(..) to save the model.
Since this tutorial runs on multiple GPUs, how can we aggregate the model weights from the GPUs and save them in a single model file?

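There’s no difference here, and nothing needs to be aggregated. In data-parallel training, loss.backward() synchronizes the gradients (not the weights) across the GPUs. Every process starts from the same weights and applies the same optimizer step, so the weights stay identical on each GPU.
You can do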

if proc_id == 0:
  torch.save(...)

so that only the first process writes the model file.
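
To make that a bit more concrete, here is a minimal sketch. It assumes the model may be wrapped in torch.nn.parallel.DistributedDataParallel and that proc_id is the process rank, as in the snippet above. The helper names unwrap and save_on_first_process are made up for illustration and are not part of the tutorial.

```python
import torch
from torch.nn.parallel import DistributedDataParallel


def unwrap(model):
    """Return the underlying module if the model is wrapped in DDP."""
    # DDP keeps the original model under .module; saving that state_dict
    # avoids the "module." prefix on every parameter name.
    if isinstance(model, DistributedDataParallel):
        return model.module
    return model


def save_on_first_process(model, path, proc_id):
    """Write the weights to `path` from process 0 only (hypothetical helper).

    Returns True if this process wrote the file, False otherwise.
    """
    if proc_id != 0:
        return False
    torch.save(unwrap(model).state_dict(), path)
    return True
```

In the training loop this would be called as save_on_first_process(model, "model.pt", proc_id), where the file name is just a placeholder. If the other processes need to read the file right afterwards, a torch.distributed.barrier() after the call is one way to make them wait until it has been written.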
