Dose DGL support distributed training & inference?


Hi, I am new to DGL and I am really interested in how does DGL deal with large graphs?

For large graphs which may not fit in the memory of one single machine, does DGL support store graphs in a distributed way? And does DGL support distributed training & inference?



Distributed training with giant graph is our next major planning. We hope we could bring this out soon!



For the current version, I think it is possible for synchronous distributed training, by locally sub-sampling a subgraph on each device/machine, and reuse the parameter server to aggregate the gradients.
The sub-sampling trick works for training.
For inference, I’m not really sure.
Since mxnet (not sure for pytorch) does not really support model partition/parallelism, real distributed training without sub-sampling requires heavy communication in both forward and backward step, which will need heavy hacking.



I think mini-batch training is probably still preferred, even in the graph neural networks. We’ll provide users many different sampling strategies to experiment subgraph training in the distributed training setting.



What would the devs recommend for training with something like Ray on a large cluster? Have each actor train/run inference on a single GPU and average/combine weights when desired? I know that’s probably a very niche use case.



The distributed graph training is still under construction. Could you elaborate more about your specific case, such as graph size, node numbers, edge numbers and feature dimension? Also I think it would be better if you could start a new post, which is easier for other people to follow up.