I have managed to train a Heterograph RGCN for Link Prediction and it runs perfectly on a single GPU. However, my graph is big and the training takes a while (low batch size with many nodes), so I’m looking to run the training on multiple GPUs (I don’t need to run the validation or test set on multiple GPUs).
I saw a couple of examples (https://github.com/dmlc/dgl/blob/8bbc84e04fe33a5459b85ad09e92b5a78325f214/examples/pytorch/graphsage/train_sampling_unsupervised.py) and (https://github.com/dmlc/dgl/blob/1e3fcc7c5309eb3a6f61c5d03ce7f76b2843003f/examples/pytorch/rgcn/entity_classify_mp.py) and have some questions.
- The second example uses a multiprocessing queue to store the validation and test logits, and the first one doesn’t. Is the queue necessary, and if so, why doesn’t the first example use it? Can I skip the queue if I’m OK with running the test dataset entirely on a single GPU?
- How can I set up EdgeDataLoader with the Heterograph?
In the example, it shows:
n_edges = g.number_of_edges()
train_seeds = np.arange(n_edges)
if n_gpus > 0:
    num_per_gpu = (train_seeds.shape[0] + n_gpus - 1) // n_gpus
    train_seeds = train_seeds[
        proc_id * num_per_gpu :
        (proc_id + 1) * num_per_gpu
        if (proc_id + 1) * num_per_gpu < train_seeds.shape[0]
        else train_seeds.shape[0]
    ]
# now use train_seeds for EdgeDataLoader
This won’t work for a heterograph, because EdgeDataLoader needs the multiple edge types handled in its train_eid_dict.
I currently use EdgeDataLoader in my heterograph like so:
train_eid_dict = {
    canonical_etype: torch.arange(
        g.num_edges(canonical_etype), dtype=torch.int64
    ).to(torch.device('cuda'))
    for canonical_etype in g.canonical_etypes
}
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_eid_dict, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=args.num_workers,
)
How can I create a similar train_eid_dict that does the appropriate per-process indexing for the heterograph to run on distributed training?
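In case it helps clarify what I’m after, here’s a rough sketch of what I’m imagining (untested, and the helper names are mine): slice each edge type’s ID range the same way the homogeneous example slices train_seeds, so each process gets a disjoint chunk per edge type. In real code I’d build the per-type counts from {cet: g.num_edges(cet) for cet in g.canonical_etypes} and wrap each slice in torch.arange before passing it to EdgeDataLoader:

```python
def split_seed_range(n_edges, proc_id, n_gpus):
    # Same ceiling-division split as the homogeneous example:
    # process `proc_id` owns edge IDs in [start, end).
    num_per_gpu = (n_edges + n_gpus - 1) // n_gpus
    start = proc_id * num_per_gpu
    end = min((proc_id + 1) * num_per_gpu, n_edges)
    return start, end

def split_eid_dict(num_edges_per_etype, proc_id, n_gpus):
    # num_edges_per_etype maps each canonical etype to its edge count,
    # e.g. {cet: g.num_edges(cet) for cet in g.canonical_etypes}.
    # Returns plain ranges here; each would become
    # torch.arange(start, end, dtype=torch.int64) in practice.
    return {
        cet: range(*split_seed_range(n, proc_id, n_gpus))
        for cet, n in num_edges_per_etype.items()
    }
```

For example, with 10 edges of one type and 4 GPUs, process 0 would get IDs 0–2 and process 3 would get ID 9.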