Questions about Training EdgeDataLoader with Heterograph RGCN for Multi-GPU

I have managed to train a Heterograph RGCN for link prediction, and it runs perfectly on a single GPU. However, my graph is big and training takes a while (low batch size with many nodes), so I'm looking to run the training on multiple GPUs (I don't need to run validation or the test set on multiple GPUs).

I saw a couple of examples and have some questions.

  1. The second example uses a multiprocessing queue to store the validation and test logits, but the first one doesn't. Is the queue necessary, and if so, why doesn't the first example use it? Can I skip the queue if I'm OK running the whole test dataset on a single GPU?
  2. How can I set up EdgeDataLoader with the Heterograph?
    In the example, it shows:
    n_edges = g.number_of_edges()
    train_seeds = np.arange(n_edges)
    if n_gpus > 0:
        # split the edge IDs evenly across GPUs; the last slice may be shorter
        num_per_gpu = (train_seeds.shape[0] + n_gpus - 1) // n_gpus
        start = proc_id * num_per_gpu
        end = min((proc_id + 1) * num_per_gpu, train_seeds.shape[0])
        train_seeds = train_seeds[start:end]
    # now use train_seeds for EdgeDataLoader

This won't work for a heterograph, because EdgeDataLoader needs to handle the multiple edge types via its train_eid_dict.
I currently use EdgeDataLoader in my heterograph like so:

    train_eid_dict = {
        # key by the canonical edge type; passing the full triple to num_edges
        # avoids ambiguity when an etype name appears between several node-type pairs
        canonical_etype: torch.arange(g.num_edges(canonical_etype), dtype=torch.int64).to(torch.device('cuda'))
        for canonical_etype in g.canonical_etypes
    }
    dataloader = dgl.dataloading.EdgeDataLoader(
        g, train_eid_dict, sampler,
        negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
        batch_size=args.batch_size, shuffle=True, drop_last=False,
        pin_memory=True, num_workers=args.num_workers)

How can I create a similar train_eid_dict that does the appropriate indexing for the heterograph to run on distributed training?

Yes. (@classicsong correct me if I’m wrong)

Something like this should partition the dictionary:

    eid_dict = {k: torch.chunk(v, num_gpus)[gpu_id] for k, v in eid_dict.items()}
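A minimal, self-contained sketch of that partitioning pattern (the edge-type names and edge counts below are made up for illustration; in practice the dict would come from `g.canonical_etypes` and `g.num_edges` as above):

```python
import torch

# Hypothetical per-canonical-etype edge counts, standing in for a real heterograph.
num_edges_per_etype = {
    ('user', 'follows', 'user'): 10,
    ('user', 'clicks', 'item'): 7,
}
num_gpus = 4

# Full per-type edge ID dict, as you would pass to EdgeDataLoader on one GPU.
eid_dict = {etype: torch.arange(n) for etype, n in num_edges_per_etype.items()}

def partition_eids(eid_dict, num_gpus, gpu_id):
    # torch.chunk splits each ID tensor into num_gpus pieces (the last one may
    # be shorter); each process keeps only the slice for its own gpu_id.
    # Caveat: if a type has fewer edges than num_gpus, chunk returns fewer
    # pieces and the [gpu_id] index can go out of range.
    return {k: torch.chunk(v, num_gpus)[gpu_id] for k, v in eid_dict.items()}

per_gpu_dicts = [partition_eids(eid_dict, num_gpus, i) for i in range(num_gpus)]
```

Concatenating the per-GPU slices for any edge type recovers the full ID range, so every edge is trained on by exactly one process.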

Thanks @BarclayII , that works!

I had another question, which may be more multi-GPU-training related. I understand that I must do all my evaluation on process_id 0 (one GPU), but where should I save and load my trained Torch models? Must I also do that only on process_id 0, or can I do it on all the processes?

I would suggest saving the model on process 0. Otherwise all the processes will likely write the model to the same file, which may or may not cause problems. Loading is fine on all the processes.
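A minimal sketch of that pattern (the `proc_id` variable and checkpoint path are assumptions taken from the thread's setup, not a fixed API):

```python
import os
import tempfile
import torch
import torch.nn as nn

def save_on_rank0(model, proc_id, path):
    # Only process 0 writes the checkpoint, so the processes never race
    # on the same file.
    if proc_id == 0:
        torch.save(model.state_dict(), path)

model = nn.Linear(4, 2)  # stand-in for the trained RGCN
path = os.path.join(tempfile.gettempdir(), 'model.pt')
save_on_rank0(model, proc_id=0, path=path)

# Any process can load the same checkpoint into its own model replica.
replica = nn.Linear(4, 2)
replica.load_state_dict(torch.load(path))
```

In a real multi-process run you would also synchronize (e.g. with `torch.distributed.barrier()`) between the save on rank 0 and the loads on the other ranks, so no process reads a half-written file.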


Ok, thanks, that makes sense!