Hi all
I’m reading through the documentation on distributed training for a node classification setting. My dataset consists of M independent graphs (there are no edges between graphs) – each graph g_{i} has its own feature matrix x_{i} and ground-truth node labels y_{i}. All training graphs are stored as a single Python list of DGL graph objects.
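For concreteness, here's roughly how the data is built – the graph sizes, feature dimension, and number of classes below are placeholder values, not my real ones:

```python
import dgl
import torch

M = 100  # number of independent graphs (placeholder)
graphs = []
for i in range(M):
    n = torch.randint(10, 50, (1,)).item()        # nodes in g_i
    src = torch.randint(0, n, (4 * n,))
    dst = torch.randint(0, n, (4 * n,))
    g = dgl.graph((src, dst), num_nodes=n)
    g.ndata['feat'] = torch.randn(n, 16)          # feature matrix x_i
    g.ndata['label'] = torch.randint(0, 3, (n,))  # node labels y_i
    graphs.append(g)
```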
I’m currently training my model on n mini-batches on a single machine. Each mini-batch b_{i}, for i = 1:n, consists of |b_{i}| independent graphs, and the mini-batches are resampled at the start of every epoch. Training is currently distributed across all cores of the single machine, with each worker iterating over its mini-batches and the gradients being accumulated (a sketch of the loop is below). I’m interested in using a cluster of K machines for training instead. How can I go about doing this?
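Here's a minimal sketch of my current single-machine loop – the GCN model and hyperparameters are simplified stand-ins for my actual setup. `dgl.batch` merges each mini-batch of independent graphs into one disconnected graph:

```python
import dgl
import dgl.nn as dglnn
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

class GCN(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        # allow_zero_in_degree: the random placeholder graphs above may
        # contain nodes with no incoming edges
        self.conv1 = dglnn.GraphConv(in_feats, hidden, allow_zero_in_degree=True)
        self.conv2 = dglnn.GraphConv(hidden, n_classes, allow_zero_in_degree=True)

    def forward(self, g, x):
        h = F.relu(self.conv1(g, x))
        return self.conv2(g, h)

model = GCN(16, 32, 3)
opt = torch.optim.Adam(model.parameters())

# shuffle=True resamples the mini-batches at the start of every epoch
loader = DataLoader(graphs, batch_size=8, shuffle=True, collate_fn=dgl.batch)

for epoch in range(10):
    for bg in loader:
        logits = model(bg, bg.ndata['feat'])
        loss = F.cross_entropy(logits, bg.ndata['label'])
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Is wrapping this in `torch.nn.parallel.DistributedDataParallel` with a `DistributedSampler` over the graph list the intended way to scale it out to K machines, or does DGL provide something more specific for this case? I was imagining something like the following (untested; reuses `GCN` and `graphs` from above):

```python
import dgl
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# launched via torchrun on each of the K machines
dist.init_process_group('gloo')
ddp_model = DDP(GCN(16, 32, 3))
opt = torch.optim.Adam(ddp_model.parameters())

sampler = DistributedSampler(graphs, shuffle=True)
loader = DataLoader(graphs, batch_size=8, sampler=sampler, collate_fn=dgl.batch)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle the split across workers each epoch
    for bg in loader:
        logits = ddp_model(bg, bg.ndata['feat'])
        loss = F.cross_entropy(logits, bg.ndata['label'])
        opt.zero_grad()
        loss.backward()       # DDP averages gradients across workers here
        opt.step()
```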
Thanks
Kristian