Does DGL support distributed training & inference?

#1

Hi, I am new to DGL and I am really interested in how DGL deals with large graphs.

For large graphs that may not fit in the memory of a single machine, does DGL support storing the graph in a distributed way? And does DGL support distributed training & inference?

#2

Distributed training on giant graphs is the next major item on our roadmap. We hope to bring it out soon!

#3

With the current version, I think synchronous distributed training is possible: locally sub-sample a subgraph on each device/machine, then reuse a parameter server to aggregate the gradients.
The sub-sampling trick works for training.
For inference, I’m not really sure.
Since MXNet (not sure about PyTorch) does not really support model partitioning/parallelism, true distributed training without sub-sampling would require heavy communication in both the forward and backward passes, which would need heavy hacking.
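
For example, a minimal sketch of this pattern with a PyTorch backend could look roughly like the code below. All-reduce stands in for the parameter server, and `full_graph`, `features`, `labels`, `train_nids`, the batch size, and the tiny GCN are placeholders, not DGL's distributed API:

```python
# Assumes torch.distributed has already been initialized (one process per
# GPU/machine) and that each worker can access the full graph and features.
import torch
import torch.distributed as dist
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv


class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden)
        self.conv2 = GraphConv(hidden, n_classes)

    def forward(self, g, x):
        h = F.relu(self.conv1(g, x))
        return self.conv2(g, h)


def train_step(full_graph, features, labels, train_nids, model, optimizer):
    # 1. Locally sub-sample: induce a subgraph around a random batch of
    #    training nodes (a stand-in for a real sampling strategy).
    batch_nids = train_nids[torch.randperm(len(train_nids))[:1024]]
    subg = dgl.node_subgraph(full_graph, batch_nids)
    sub_feats = features[subg.ndata[dgl.NID]]
    sub_labels = labels[subg.ndata[dgl.NID]]

    # 2. Forward/backward on the local subgraph only.
    logits = model(subg, sub_feats)
    loss = F.cross_entropy(logits, sub_labels)
    optimizer.zero_grad()
    loss.backward()

    # 3. Synchronously average gradients across workers; all-reduce plays
    #    the role of the parameter server here.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    return loss.item()
```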

#4

I think mini-batch training is probably still preferred, even for graph neural networks. We’ll provide users with many different sampling strategies to experiment with subgraph training in the distributed setting.
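
To give a concrete picture, a single-machine sampling-based mini-batch loop might look roughly like the sketch below, assuming a DGL version that ships the `dgl.dataloading` module; the fan-outs, batch size, and a `model` that consumes the sampled blocks are illustrative assumptions. In a distributed setting, each worker would run the same loop over its own partition:

```python
import torch
import torch.nn.functional as F
import dgl
import dgl.dataloading


def train_one_epoch(g, features, labels, train_nids, model, optimizer):
    # Sample a fixed number of neighbors per GNN layer (fan-outs are arbitrary).
    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
    loader = dgl.dataloading.NodeDataLoader(
        g, train_nids, sampler, batch_size=1024, shuffle=True, drop_last=False)

    for input_nodes, output_nodes, blocks in loader:
        x = features[input_nodes]      # features of the sampled receptive field
        y = labels[output_nodes]       # labels of the seed nodes
        logits = model(blocks, x)      # model consumes the sampled blocks
        loss = F.cross_entropy(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```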

#5

What would the devs recommend for training with something like Ray on a large cluster? Have each actor train/run inference on a single GPU and average/combine weights when desired? I know that’s probably a very niche use case.
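
Something like the hypothetical pattern below is what I have in mind (the model and per-epoch training are just placeholders, not a DGL or Ray recommendation):

```python
import ray
import torch


@ray.remote(num_gpus=1)
class Trainer:
    def __init__(self):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # Placeholder model standing in for a real GNN.
        self.model = torch.nn.Linear(16, 2).to(device)
        self.optimizer = torch.optim.Adam(self.model.parameters())

    def train_epoch(self):
        # Local subgraph / mini-batch training for one epoch would go here.
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

    def set_weights(self, state_dict):
        self.model.load_state_dict(state_dict)


ray.init()
trainers = [Trainer.remote() for _ in range(4)]   # one actor per GPU

for epoch in range(10):
    # Each actor trains independently; the driver then averages the weights
    # and broadcasts them back (simple synchronous model averaging).
    states = ray.get([t.train_epoch.remote() for t in trainers])
    averaged = {k: torch.stack([s[k] for s in states]).mean(dim=0)
                for k in states[0]}
    ray.get([t.set_weights.remote(averaged) for t in trainers])
```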

#6

Distributed graph training is still under construction. Could you elaborate on your specific case, such as graph size (numbers of nodes and edges) and feature dimension? Also, I think it would be better if you could start a new post, which is easier for other people to follow up on.
