Hello everyone! I'm doing some research on sampling and DGL streams. I enabled TensorAdaptor, set use_alternate_streams=True
in the DGL DataLoader, and used GPU/CPU sampling with feature prefetching in NeighborSampler:
sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])
# GPU sampling: move the graph and the seed nodes to the GPU
graph = graph.to(device)
train_nids = train_nids.to(device)
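For context, the rest of the setup looks roughly like this (a minimal sketch; the batch size and the other keyword values are placeholders, not my exact script):

import dgl
import torch

device = torch.device('cuda')

# Sketch of the DataLoader described above; batch_size/shuffle are placeholders.
dataloader = dgl.dataloading.DataLoader(
    graph,                       # already on `device` for GPU sampling
    train_nids,                  # seed nodes, also on `device`
    sampler,
    device=device,               # produce the sampled blocks on the GPU
    use_alternate_streams=True,  # the non-default stream behaviour in question
    batch_size=1024,
    shuffle=True,
    drop_last=False,
)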
Observation: DGL creates a non-default stream for every epoch (i.e., every time the DGL DataLoader is initialized). I trained for 4 epochs, and the profiler shows 4 non-default streams:
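For reference, the observation comes from a loop of roughly this shape (a hedged sketch using torch.profiler; the model/optimizer code is omitted):

import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for epoch in range(4):
        # A fresh DataLoader iterator starts here each epoch, which is where
        # the alternate (non-default) stream gets created.
        for input_nodes, output_nodes, blocks in dataloader:
            x = blocks[0].srcdata['feat']
            y = blocks[-1].dstdata['label']
            # ... forward / backward / optimizer step ...
prof.export_chrome_trace('trace.json')  # the trace shows one extra stream per epoch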
Now I have some questions:
Q1: First, I found that after CPU sampling with prefetch, DGL does the index select first and then the feature transfer:
No problem there: slice the features on the host, then transfer the slice.
However, after GPU sampling with prefetch, DGL does the feature transfer first and then the index select:
So why is there such a difference?
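To make the ordering difference concrete, here is my understanding written as plain PyTorch (a sketch only; the tensor sizes and exactly which tensor gets copied are placeholders, not DGL's internal code):

import torch

device = torch.device('cuda')
feat = torch.randn(100_000, 128)          # node features kept in CPU memory
idx = torch.randint(0, 100_000, (4096,))  # input nodes of a sampled block

# CPU sampling + prefetch: (1) index select on the host, (2) transfer the slice.
rows_cpu_path = feat[idx].to(device)

# GPU sampling + prefetch, as observed: (1) feature transfer, (2) index select
# kernel on the device using the GPU-resident indices.
staged = feat.to(device)
rows_gpu_path = staged[idx.to(device)]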
Q2: The documentation says:
That's true: when doing GPU sampling, the feature transfer (first) and the index select kernel (then) both run on a non-default stream (stream 14 in my trace). But I found that DGL now uses pageable host memory for the feature transfer:
and the memory cannot be pinned, maybe because pin_prefetcher cannot be True when using GPU sampling:
As far as I know, pageable host memory causes asynchronous CUDA memcpy operations (e.g. cudaMemcpyAsync) to block the host and effectively execute synchronously. So what is the point of using a separate stream and cudaMemcpyAsync here? I don't think it brings any benefit, because the CUDA stream overlaps with nothing in this case.
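For what it's worth, this is the kind of toy check I have in mind (a standalone PyTorch sketch, not DGL code), comparing how long the host is held up by an "async" copy from pageable vs. pinned memory:

import time
import torch

device = torch.device('cuda')
stream = torch.cuda.Stream()

pageable = torch.randn(64 * 1024 * 1024)  # ~256 MB ordinary (pageable) host tensor
pinned = pageable.clone().pin_memory()    # same data, page-locked

def host_blocking_ms(src):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with torch.cuda.stream(stream):              # issue the copy on a side stream
        dst = src.to(device, non_blocking=True)  # cudaMemcpyAsync under the hood
    t1 = time.perf_counter()                     # how long the host was held up
    torch.cuda.synchronize()
    return (t1 - t0) * 1000

print('pageable: %.1f ms host blocking' % host_blocking_ms(pageable))
print('pinned:   %.1f ms host blocking' % host_blocking_ms(pinned))
# On my understanding, the pageable copy holds the host for roughly the whole
# transfer, so the non-default stream overlaps nothing; the pinned copy returns
# almost immediately and could overlap other work.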