A question about distributed sampler

Thanks for your excellent work.
I have a question about the distributed sampler. It seems that DGL stores the graph on one machine but uses samplers on different machines to sample from it. I don’t understand why this improves performance; won’t it require a lot of network communication?

In our experience, the bottleneck in large-scale graph neural network training is not the computation but the graph sampling, and the distributed sampler aims to accelerate that part.
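Roughly, the pattern is a producer/consumer pipeline: sampler workers keep a queue of pre-sampled minibatches filled while the trainer consumes them, so sampling latency is hidden behind compute. Here is a minimal sketch of that pattern (illustrative only, not DGL's actual distributed sampler API; all names here are made up):

```python
# Illustrative sketch only -- not DGL's actual distributed sampler API.
# Sampler workers keep a queue of pre-sampled minibatches filled so the
# trainer rarely waits on sampling.
import multiprocessing as mp
import random

def sampler_worker(queue, num_batches, seed_nodes, fanout=10):
    """Pretend sampler: picks seed nodes and 'samples' their neighborhoods."""
    for _ in range(num_batches):
        seeds = random.sample(seed_nodes, 1000)
        # In DGL this would be a sampled subgraph; here it is just a stub.
        subgraph = {"seeds": seeds, "fanout": fanout}
        queue.put(subgraph)
    queue.put(None)  # signal completion

def trainer(queue):
    """Pretend trainer: consumes minibatches as they become available."""
    while True:
        batch = queue.get()
        if batch is None:
            break
        # forward/backward pass on the sampled subgraph would happen here
        print(f"training on a minibatch with {len(batch['seeds'])} seeds")

if __name__ == "__main__":
    q = mp.Queue(maxsize=8)          # bounded queue of prefetched minibatches
    nodes = list(range(100_000))     # hypothetical node IDs
    p = mp.Process(target=sampler_worker, args=(q, 5, nodes))
    p.start()
    trainer(q)
    p.join()
```

In the real setup the sampler workers can live on other machines and send the sampled structures over the network instead of a local queue, but the producer/consumer shape is the same.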

@zhengda1936 could you please provide more details?

Yes, I agree with you.
But my question is: since DGL stores the graph on one machine, the distributed samplers have to do the sampling on that machine and then send the samples to the training machines. Won’t these transfers produce a lot of network communication?

Sorry for the late reply. The sampled results are usually just graph structures (stored in CSR format), and their storage size is pretty small. We found that transferring the graph structure is usually not the bottleneck.
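For a rough sense of the size (back-of-envelope, illustrative numbers rather than a measured benchmark): a two-hop neighbor-sampled minibatch with 1,000 seeds and fanout 10 is on the order of 100k edges, and the CSR structure for that is only a couple of megabytes.

```python
# Back-of-envelope estimate (illustrative numbers, not a DGL benchmark):
# size of a CSR-encoded subgraph for a 2-hop neighbor-sampled minibatch.
seeds = 1_000   # seed nodes per minibatch
fanout = 10     # neighbors sampled per node per hop
hops = 2

nodes = seeds * sum(fanout ** h for h in range(hops + 1))      # upper bound on sampled nodes
edges = seeds * sum(fanout ** h for h in range(1, hops + 1))   # upper bound on sampled edges

int_bytes = 8                             # assume 64-bit node/edge IDs
indptr_bytes = (nodes + 1) * int_bytes    # CSR row-pointer array
indices_bytes = edges * int_bytes         # CSR column-index array

total_mb = (indptr_bytes + indices_bytes) / 2**20
print(f"~{nodes:,} nodes, ~{edges:,} edges, ~{total_mb:.1f} MB per minibatch")
# -> ~111,000 nodes, ~110,000 edges, ~1.7 MB per minibatch
```

Note that this counts only the structure; node features are typically gathered separately and dominate the traffic if they are shipped with every minibatch.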

I’m still a bit confused as well. If DGL stores the graph on only one machine, does that mean the distributed samplers are multiple processes running on that same machine, all sampling from the same graph memory? Or is the graph stored on one machine, with multiple machines communicating over the network to access and sample it?

This is interesting… do you have any more information on this, such as how you benchmarked it? I’m also curious whether the set of subgraphs to be sampled could simply be precomputed and stored ahead of time; I think the Cluster-GCN implementation does this. See the sketch below for what I have in mind.
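A rough sketch of what I mean by storing them ahead of time (hypothetical file layout, not Cluster-GCN's or DGL's actual code; real Cluster-GCN uses METIS partitioning, whereas this splits nodes randomly for brevity):

```python
# Sketch of precomputing fixed partitions ahead of time (Cluster-GCN style).
# Hypothetical file layout; nodes are split randomly here for brevity,
# while the real Cluster-GCN implementation forms clusters with METIS.
import numpy as np

def precompute_clusters(num_nodes, num_clusters, path="clusters.npz"):
    perm = np.random.permutation(num_nodes)
    clusters = np.array_split(perm, num_clusters)
    np.savez(path, *clusters)                  # saved as arr_0, arr_1, ...

def load_cluster(cluster_id, path="clusters.npz"):
    with np.load(path) as data:
        return data[f"arr_{cluster_id}"]       # node IDs of one precomputed cluster

precompute_clusters(num_nodes=1_000_000, num_clusters=100)
batch_nodes = load_cluster(0)
print(f"cluster 0 has {batch_nodes.size} nodes")
```

Of course, this only works when the minibatch subgraphs are fixed across epochs, as in Cluster-GCN; neighbor sampling draws a fresh subgraph every iteration, which I assume is why it is done on the fly instead.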