Save model during distributed training

navmarri · February 17, 2021, 7:11am

I’m following the tutorial for training GraphSAGE on multiple GPUs in a single instance.
I know that in single-GPU mode we use torch.save(..) to save the model.
Since this tutorial runs on multiple GPUs, how can we aggregate the model weights from the GPUs and save them in a single model file?

VoVAllen · February 18, 2021, 8:47am

There’s no difference here. With DistributedDataParallel, loss.backward() synchronizes (averages) the gradients across the GPUs, so every process applies the same optimizer update and the weights stay identical on each GPU. There is nothing to aggregate.
You can do

if proc_id == 0:
    # every process holds the same weights, so one copy is enough
    torch.save(...)

to save the model from the first process only.
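For reference, here is a slightly fuller sketch of the same idea. It is only an illustration under some assumptions: the helper name save_on_rank0 and the path argument are hypothetical and do not come from the tutorial, and it assumes the model may be wrapped in DistributedDataParallel, which keeps the original network in its .module attribute.

```python
import torch


def save_on_rank0(model: torch.nn.Module, path: str, proc_id: int) -> bool:
    """Write the model weights from process 0 only.

    Hypothetical helper for illustration. Returns True if a file was written.
    """
    if proc_id != 0:
        # The other processes hold identical weights, so they skip the write.
        return False
    # If the model is a DistributedDataParallel wrapper, unwrap it so the saved
    # keys have no "module." prefix and load directly into a plain model.
    to_save = model.module if hasattr(model, "module") else model
    torch.save(to_save.state_dict(), path)
    return True
```

If the other processes need to read the file right afterwards, calling torch.distributed.barrier() after the save is one way to make them wait until process 0 has finished writing.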
