Does DGL support distributed training & inference?

Hi, I am new to DGL and I am really interested in how DGL deals with large graphs.

For large graphs that may not fit in the memory of a single machine, does DGL support storing the graph in a distributed way? And does DGL support distributed training & inference?

Distributed training on giant graphs is the next major item on our roadmap. We hope to bring it out soon!

For the current version, I think synchronous distributed training is possible: locally sub-sample a subgraph on each device/machine and reuse the parameter server to aggregate the gradients.
The sub-sampling trick works for training.
For inference, I'm not really sure.
Since MXNet (not sure about PyTorch) does not really support model partitioning/parallelism, real distributed training without sub-sampling requires heavy communication in both the forward and backward steps, which would need heavy hacking.

I think mini-batch training is probably still preferred, even for graph neural networks. We will provide many different sampling strategies so users can experiment with subgraph training in the distributed setting.
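A rough sketch of what I mean, assuming PyTorch and using torch.distributed.all_reduce as a stand-in for the parameter-server aggregation (the model, subgraph, and loss names are placeholders, not actual DGL APIs):

```python
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, subgraph, features, labels):
    """One synchronous step: the worker trains on its own locally
    sampled subgraph, then gradients are averaged across workers."""
    optimizer.zero_grad()
    logits = model(subgraph, features)   # placeholder GNN forward pass
    loss = loss_fn(logits, labels)
    loss.backward()

    # Average gradients across all workers; this plays the role of the
    # parameter-server aggregation described above.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
    return loss.item()

# Each worker would first call dist.init_process_group("gloo" or "nccl"),
# sample its own subgraph locally, and then call train_step per iteration.
```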

What would the devs recommend for training with something like Ray on a large cluster? Have each actor train/run inference on a single GPU and average/combine weights when desired? I know that’s probably a very niche use case.
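To make it concrete, something like this rough sketch is what I have in mind (the Worker class, train_epoch, and the plain nn.Linear standing in for a DGL model are all just placeholders):

```python
import ray
import torch
import torch.nn as nn

@ray.remote(num_gpus=1)
class Worker:
    """Each actor trains on its own GPU and reports its weights."""
    def __init__(self):
        # Placeholder model; a DGL-based GNN would go here.
        self.model = nn.Linear(16, 2)

    def train_epoch(self):
        # ... run one epoch over a locally sampled subgraph here ...
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

    def set_weights(self, state):
        self.model.load_state_dict(state)

ray.init()
workers = [Worker.remote() for _ in range(4)]

for epoch in range(10):
    # Collect weights from every actor after a local epoch.
    states = ray.get([w.train_epoch.remote() for w in workers])
    # Naively average the (float) parameters and broadcast them back.
    avg = {k: torch.stack([s[k] for s in states]).mean(dim=0)
           for k in states[0]}
    ray.get([w.set_weights.remote(avg) for w in workers])
```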

Distributed graph training is still under construction. Could you elaborate on your specific case, such as graph size, number of nodes, number of edges, and feature dimension? Also, I think it would be better to start a new post, which would make it easier for other people to follow up.

Hi team, when will distributed graph training with PyTorch be released?

And are there any design docs about the distributed training? Thanks!

Any progress on distributed training using Ray?