DGL CPU/GPU sampling and CUDA stream usage

Hello everyone! I am doing some research on sampling and DGL's CUDA stream usage. I enabled TensorAdaptor, set use_alternate_streams=True in the DGL DataLoader, and used CPU/GPU sampling with feature prefetching in the NeighborSampler:

sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])

# GPU sampling
graph = graph.to(device)
train_nids = train_nids.to(device)
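
For reference, the DataLoader was built roughly like this. This is a minimal sketch, not my exact settings: the batch size, epoch count, and loop body are placeholders.

# Minimal sketch of the DataLoader setup (placeholder argument values).
dataloader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler,
    device=device,               # sampled outputs (ids, MFGs) live on this device
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=0,               # GPU sampling requires num_workers=0
    use_alternate_streams=True,  # prefetching runs on a non-default CUDA stream
)

for epoch in range(4):
    for input_nodes, output_nodes, blocks in dataloader:
        ...                      # forward/backward pass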

Observation: DGL creates a new non-default stream every epoch (i.e. every time the DGL DataLoader is re-initialized). I trained for 4 epochs, and the profiler shows 4 non-default streams.

Now I have some questions:
Q1: First, I found that after CPU sampling with prefetch, DGL does the index select first and then the feature transfer. No problem there: slice the features, then transfer the slice. However, after GPU sampling with prefetch, DGL does the feature transfer first and then the index select.

So why is there such a difference?

Q2: The documentation says these prefetching operations run on a non-default (alternate) stream.

That is true: with GPU sampling, the feature transfer (first) and the index select kernel (then) both run on a non-default stream (stream 14 in my profile). But I found that DGL currently uses pageable host memory for the feature transfer, and it cannot be pinned, probably because pin_prefetcher cannot be True when GPU sampling is used.

As far as I know, pageable host memory causes asynchronous CUDA memcpy operations (e.g. cudaMemcpyAsync) to block and execute synchronously with respect to the host. So what is the point of using a separate stream and cudaMemcpyAsync here? I think it does no good, because the CUDA stream then has nothing to overlap with.
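
To make my concern concrete, here is a standalone PyTorch sketch (not DGL code; the tensor sizes are arbitrary) of the difference I mean between copying from pageable and from pinned host memory on a non-default stream:

import torch

device = torch.device('cuda')
copy_stream = torch.cuda.Stream()

pageable = torch.randn(100_000, 128)               # ordinary pageable host tensor
pinned = torch.randn(100_000, 128).pin_memory()    # page-locked host tensor

with torch.cuda.stream(copy_stream):
    # From pageable memory the "async" copy still synchronizes with the host,
    # so issuing it on a separate stream overlaps nothing.
    a = pageable.to(device, non_blocking=True)
    # From pinned memory the copy is truly asynchronous and can overlap
    # with kernels running on the default stream.
    b = pinned.to(device, non_blocking=True)

torch.cuda.current_stream().wait_stream(copy_stream)  # order later work after the copies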

GPU sampling by default assumes that the features are already on the GPU as well. What happens under the hood is that, if you build the graph on the CPU first and then call .to('cuda'), the features are not copied until they are first accessed. That might be why it appears to do the feature transfer first and then the index select.

use_alternate_streams should always be used together with pin_prefetcher. If pin_prefetcher is False, I don't think use_alternate_streams will do anything.
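
In other words, for CPU sampling the configuration that actually benefits from the alternate stream would look roughly like this (a sketch; the argument values are placeholders):

# Graph and seed ids stay on the CPU; only the sliced features get copied to the GPU.
dataloader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler,      # graph and train_nids on CPU
    device=device,                   # copy the sampled outputs to this device
    batch_size=1024,
    shuffle=True,
    num_workers=0,
    pin_prefetcher=True,             # stage the sliced features in pinned memory
    use_alternate_streams=True,      # so the async H2D copies can actually overlap
)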

Thanks! After thinking about your reply, I made a summary:

CPU sampling with feature prefetch (see the sketch after this list):

  1. do CPU sampling to get the MFGs (subgraphs)
  2. recursively process every LazyFeature declared in the DGL sampler:
    • index select (feature slice) on CPU
    • transfer of the sliced features
  3. transfer of the input nodes, seed nodes, and subgraph structure (MFGs)
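
Roughly the manual equivalent of what I think the prefetcher does in this case (a conceptual sketch, not the actual DGL internals; dataloader_cpu is a hypothetical DataLoader that does CPU sampling and returns everything on the CPU):

# CPU sampling + prefetch, written out by hand:
for input_nodes, output_nodes, blocks in dataloader_cpu:   # step 1: MFGs produced on CPU
    x = graph.ndata['feat'][input_nodes]                   # step 2a: index select (slice) on CPU
    y = graph.ndata['label'][output_nodes]
    x = x.to(device, non_blocking=True)                    # step 2b: transfer only the slices
    y = y.to(device, non_blocking=True)
    blocks = [b.to(device) for b in blocks]                # step 3: transfer the MFG structure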

CPU sampling without feature prefetch (sketch below):

  1. do CPU sampling to get the MFGs (subgraphs)
  2. transfer of the input nodes, seed nodes, and subgraph structure (MFGs)
  3. feature slice on CPU and transfer, deferred until the feature data are first used
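
Without prefetch, the slicing ends up in user code, something like the following (same hypothetical dataloader_cpu as in the sketch above):

# CPU sampling without prefetch: slice and copy the features only when they are needed.
for input_nodes, output_nodes, blocks in dataloader_cpu:
    blocks = [b.to(device) for b in blocks]                # step 2: transfer the MFGs
    x = graph.ndata['feat'][input_nodes].to(device)        # step 3: slice on CPU, copy only the slice
    y = graph.ndata['label'][output_nodes].to(device)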

GPU sampling with feature prefetch (sketch below):

  1. transfer of the train IDs and graph structure
  2. do GPU sampling to get the MFGs (subgraphs)
  3. recursively process every LazyFeature declared in the DGL sampler:
    • transfer of all node features
    • index select (feature slice) on GPU
  There is no need to transfer the input nodes, seed nodes, or subgraph structure (MFGs) any more.
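
If that is right, this path behaves roughly like the following sketch (illustrative names only: feat_cpu is the original CPU copy of the node features, dataloader_gpu is a DataLoader doing GPU sampling whose outputs are already on the GPU):

# GPU sampling + prefetch: conceptually, the whole feature tensor is copied to the GPU
# once, and every batch is then sliced directly on the GPU.
feat_gpu = feat_cpu.to(device)                     # one full-tensor transfer
for input_nodes, output_nodes, blocks in dataloader_gpu:
    x = feat_gpu[input_nodes]                      # index select runs as a GPU kernel
    # input_nodes / output_nodes / blocks are already on the GPU, nothing else to copy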

GPU sampling without feature prefetch (sketch below):

  1. transfer of the train IDs and graph structure
  2. do GPU sampling to get the MFGs (subgraphs)
  3. feature slice on CPU and transfer, deferred until the feature data are first used
  There is no need to transfer the input nodes, seed nodes, or subgraph structure (MFGs) any more.
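
And without prefetch, following the summary above (same illustrative names, and again just a sketch):

# GPU sampling without prefetch: the features stay on the CPU side, so each batch is
# sliced on the CPU and only the slice is copied over.
for input_nodes, output_nodes, blocks in dataloader_gpu:
    x = feat_cpu[input_nodes.cpu()].to(device)     # slice on CPU, then transfer the slice
    # the MFG structure is already on the GPU, so no extra structure transfer is needed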

And use_alternate_streams should be used together with pin_prefetcher for CPU sampling.


From the above summary, I have some questions, if you could answer them:

What does "LazyFeature declared in the DGL sampler" mean?

What does "index select (feature slice) on CPU" mean?

What does "sliced features" mean?

What is meant by "input nodes"?

Thanks!

When you use feature prefetch in the DGL sampler, those prefetched features are declared as LazyFeature objects.

I mean that the feature slice is done on the CPU. If GPU sampling and feature prefetch are enabled, DGL seems to transfer the features to the GPU first, so the slicing can be done on the GPU by the indexSelectLargeIndex kernel.

Suppose the graph has 100000 nodes, the feature size is 128, and the batch size is 1024. Then the full node feature tensor is 100000×128, while the sliced features for one batch are only 1024×128.
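
As a toy check of those shapes (float32 assumed; the index tensor here is random, just for illustration):

import torch

feat = torch.randn(100_000, 128)                   # full node features, ~48.8 MiB in float32
input_nodes = torch.randint(0, 100_000, (1024,))   # one batch of input node ids
sliced = feat[input_nodes]                         # the "sliced features"
print(sliced.shape)                                # torch.Size([1024, 128]), ~0.5 MiB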

Input nodes = src nodes, i.e. the source nodes of the first MFG, whose features are needed as the input of the model.

Also, I am not sure that everything I said above (my summary and my explanations) is totally correct…


Thank you, I believe it is correct. I had been trying to understand feature prefetch but was unable to until now. Thanks!

Hello, thank you for your summary. I have a question regarding GPU sampling in DGL. DGL mentions that for large graphs it uses a UVA (unified virtual addressing) approach. Did you observe whether DGL achieves UVA solely by calling .to(GPU) on variables in Python, or whether it also involves some parts of the C++ code?

Sorry, I did not pay attention to DGL's UVA mode when I did my experiments. Maybe you can open a new topic about that.
