Hi all
I’m reading through the documentation on distributed training for a node classification setting. My dataset consists of M independent graphs (there are no edges between graphs) – each graph g_{i} has its own feature matrix x_{i} and ground-truth node labels y_{i}. All training graphs are stored as a single Python list of DGL graph objects.
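For concreteness, here's roughly how the data is built – the graph sizes, feature dimension, and number of classes below are placeholder values, not my real ones:

```python
import dgl
import torch

M = 100  # number of independent graphs (placeholder)
graphs = []
for i in range(M):
    n = torch.randint(10, 50, (1,)).item()        # nodes in g_i
    src = torch.randint(0, n, (4 * n,))
    dst = torch.randint(0, n, (4 * n,))
    g = dgl.graph((src, dst), num_nodes=n)
    g.ndata['feat'] = torch.randn(n, 16)          # feature matrix x_i
    g.ndata['label'] = torch.randint(0, 3, (n,))  # node labels y_i
    graphs.append(g)
```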
I’m currently training my model on n mini-batches on a single machine. Each mini-batch b_{i}, for i = 1:n, consists of |b_{i}| independent graphs, and the mini-batches are resampled at the start of every epoch. Training is currently distributed across all cores of the single machine, with each worker iterating over its mini-batches and the gradients being accumulated (a sketch of the loop is below). I’m interested in using a cluster of K machines for training instead. How can I go about doing this?
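Here's a minimal sketch of my current single-machine loop – the GCN model and hyperparameters are simplified stand-ins for my actual setup. `dgl.batch` merges each mini-batch of independent graphs into one disconnected graph:

```python
import dgl
import dgl.nn as dglnn
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

class GCN(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        # allow_zero_in_degree: the random placeholder graphs above may
        # contain nodes with no incoming edges
        self.conv1 = dglnn.GraphConv(in_feats, hidden, allow_zero_in_degree=True)
        self.conv2 = dglnn.GraphConv(hidden, n_classes, allow_zero_in_degree=True)

    def forward(self, g, x):
        h = F.relu(self.conv1(g, x))
        return self.conv2(g, h)

model = GCN(16, 32, 3)
opt = torch.optim.Adam(model.parameters())

# shuffle=True resamples the mini-batches at the start of every epoch
loader = DataLoader(graphs, batch_size=8, shuffle=True, collate_fn=dgl.batch)

for epoch in range(10):
    for bg in loader:
        logits = model(bg, bg.ndata['feat'])
        loss = F.cross_entropy(logits, bg.ndata['label'])
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Is wrapping this in `torch.nn.parallel.DistributedDataParallel` with a `DistributedSampler` over the graph list the intended way to scale it out to K machines, or does DGL provide something more specific for this case? I was imagining something like the following (untested; reuses `GCN` and `graphs` from above):

```python
import dgl
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# launched via torchrun on each of the K machines
dist.init_process_group('gloo')
ddp_model = DDP(GCN(16, 32, 3))
opt = torch.optim.Adam(ddp_model.parameters())

sampler = DistributedSampler(graphs, shuffle=True)
loader = DataLoader(graphs, batch_size=8, sampler=sampler, collate_fn=dgl.batch)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle the split across workers each epoch
    for bg in loader:
        logits = ddp_model(bg, bg.ndata['feat'])
        loss = F.cross_entropy(logits, bg.ndata['label'])
        opt.zero_grad()
        loss.backward()       # DDP averages gradients across workers here
        opt.step()
```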
Thanks
Kristian