Dose DGL support distributed training & inference?


Hi, I am new to DGL and I am really interested in how does DGL deal with large graphs?

For large graphs which may not fit in the memory of one single machine, does DGL support store graphs in a distributed way? And does DGL support distributed training & inference?


Distributed training with giant graph is our next major planning. We hope we could bring this out soon!


For the current version, I think it is possible for synchronous distributed training, by locally sub-sampling a subgraph on each device/machine, and reuse the parameter server to aggregate the gradients.
The sub-sampling trick works for training.
For inference, I’m not really sure.
Since mxnet (not sure for pytorch) does not really support model partition/parallelism, real distributed training without sub-sampling requires heavy communication in both forward and backward step, which will need heavy hacking.


I think mini-batch training is probably still preferred, even in the graph neural networks. We’ll provide users many different sampling strategies to experiment subgraph training in the distributed training setting.