I have seen some papers that use pipeline parallelism to overlap sampling and computation. Is this feature included in DGL? I have also looked at pipeline parallelism in ordinary machine learning, which splits mini-batches into micro-batches; there are examples of this in PyTorch. Is it the same thing as pipeline parallelism in GNNs?
Hi @gamdwk, for your questions:
- Is this feature included with DGL?
  Yes, DGL does overlap sampling and computation.
- Is it the same thing as the pipeline parallelism in GNN?
  No. In ordinary ML, pipeline parallelism usually means splitting the model into several stages that form a pipeline, so that different stages can process different micro-batches at the same time. Overlapping sampling with computation is different, because sampling is not part of the model.
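To make the model-pipeline idea concrete, here is a minimal sketch of an idealized forward-only schedule. It is not taken from PyTorch or DGL: the function name `pipeline_schedule` is made up for illustration, and it assumes every stage takes exactly one time step per micro-batch.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return, for each time step, the micro-batch index each stage works on.

    Illustrative assumption: stage s can start micro-batch m at step m + s,
    i.e. as soon as stage s - 1 has finished it. None means the stage is idle.
    """
    num_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            m = t - s  # micro-batch that reaches stage s at step t
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    # 3 model stages, 4 micro-batches: one row per time step, one column per stage.
    for t, row in enumerate(pipeline_schedule(3, 4)):
        print(t, row)
```

In the middle rows all stages are busy with different micro-batches, which is the overlap that model pipeline parallelism aims for. Sampling does not appear anywhere in this schedule, which is why it is a different mechanism from overlapping sampling with training.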
Hello, about pipeline parallelism in DGL: I think DGL overlaps sampling and computation only when using CPU sampling (if the sampler option `use_prefetch_thread` is True). If GPU sampling is used, `use_prefetch_thread` must be False, so can DGL overlap GPU sampling and computation? If so, how does DGL achieve that? Could you please show me the code snippet?
Looking forward to your reply!
Can anyone help? Otherwise I will create a new topic.
Sorry for the late reply! As far as I know, with the DGL dataloader you can spawn different processes for training and sampling, so I think computation and sampling can be overlapped, for example:
Process A : computing batch 1
Process B : sampling batch 2.
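The two-worker picture above can be sketched with a bounded queue. This is only an illustration of the prefetching pattern, not DGL's implementation: `sample_batch`, `train_step`, and `run_overlapped` are hypothetical stand-ins, and a background thread is used instead of a separate process to keep the example short.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the batch stream


def sample_batch(i):
    # Hypothetical stand-in for neighbor sampling: returns fake node IDs.
    return list(range(i, i + 4))


def train_step(batch):
    # Hypothetical stand-in for the forward/backward pass: returns a fake loss.
    return sum(batch)


def run_overlapped(num_batches, prefetch=2):
    """Sample in a background thread while the main thread trains."""
    q = queue.Queue(maxsize=prefetch)  # bounds how far sampling may run ahead

    def producer():
        for i in range(num_batches):
            q.put(sample_batch(i))  # blocks while the queue is full
        q.put(_SENTINEL)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    losses = []
    while True:
        batch = q.get()  # batch i + 1 may already be sampled at this point
        if batch is _SENTINEL:
            break
        losses.append(train_step(batch))
    worker.join()
    return losses


if __name__ == "__main__":
    print(run_overlapped(3))
```

While the main thread is inside `train_step` for one batch, the producer can already be preparing the next one. How much real overlap this gives depends on whether the two sides compete for the same resource.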
Thanks for your reply!
Yes, I agree. But I think this feature only works when the `use_prefetch_thread` option of the DGL dataloader is True, and with GPU sampling it cannot be True.
So with CPU sampling, yes; but with GPU sampling, DGL cannot overlap computing and sampling. Am I wrong?
For a single process, yes.
But consider this perspective: there may be several processes involved in both computation and sampling. One computing process could obtain sampling results from another sampling process, implying a significant degree of overlap between them.
Thanks for your reply!
In fact, I think this kind of overlapping on the GPU can be implemented with CUDA streams. I tried to find code for that in the DGL source, but could not find any.
So to confirm again: has DGL not implemented this kind of overlapping on the GPU?
When a single GPU does both computing and sampling, the only way to overlap them is to use different CUDA streams; whether DGL does this needs to be double-checked. Otherwise, there is always some overlap.
In addition, CUDA streams are supported, but for the case where sampling runs on the CPU and computing runs on the GPU. See https://github.com/dmlc/dgl/blob/c298223f5de09e9ee265cf6a9fb9145b692b4c5b/python/dgl/dataloading/dataloader.py#L842 for more details.
Thanks for your help
Confirmed: there is no kernel overlapping between sampling and computation. Some additional input:
- Even using multiple CUDA streams cannot guarantee parallelism, since the SM resources could all be occupied by one task.
- Both tasks are computation-intensive, so overlapping them may bring little benefit.
Thanks for your detailed reply!