When does the CPU-GPU feature copy occur?

After reading Training GNN with Neighbor Sampling for Node Classification and 6.8 Feature Prefetching, I am still confused about when features are moved from CPU memory to GPU memory when no prefetching is specified in the sampler.

6.8 Feature Prefetching suggests:

Accessing subgraph features will incur data fetching from the original graph immediately while prefetching ensures data to be available before getting from data loader.

Meanwhile, the L1_large_node_classification.py example used in the first link has the following line:

What does “here” mean? My understanding is that there is some kind of “lazy” mechanism in play here, but when does the copy actually end up happening? I am interested in measuring the latency of the CPU-GPU copy.

Also, the DataLoader docs mention the device argument determines where the MFG is generated. Does the MFG in this context include features as well, meaning that features are already placed on the specified device (seemingly going against the lazy mechanism)? I am trying to understand the relation of this argument to the aforementioned CPU-GPU copy.

Thanks in advance.

Hi @inhrz, really nice question! That note is outdated for historical reasons. Lazy fetching used to be how the dataloader worked, but now the fetching happens inside the dataloader at Line 495. We will remove the note after double-checking. To measure the transfer latency, see Line 338 and add your timing code around it.
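As a concrete illustration of the timing advice above, here is a minimal sketch in plain PyTorch (the tensor and helper names are made up for illustration, not DGL internals). The key point is synchronizing before reading the clock, since the copy may be asynchronous:

```python
import time
import torch

def time_copy(tensor):
    """Time a host-to-device copy of `tensor`. torch.cuda.synchronize()
    is called before and after the copy so the (possibly asynchronous)
    transfer is fully contained in the measured window. Falls back to
    CPU when no GPU is present, so the sketch stays runnable."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    moved = tensor.to(device, non_blocking=True)
    if device == "cuda":
        torch.cuda.synchronize()  # wait until the copy has finished
    return moved, time.time() - start

feat = torch.randn(10000, 128)  # stand-in for a node-feature tensor
moved, elapsed = time_copy(feat)
```

The same wrapper can be placed around the `to` call inside the dataloader if you are editing DGL's source directly.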


Thanks for the reply @peizhou001 ! That makes sense about the dataloader, thanks for including those links. I have two follow up questions:

  1. Is lazy fetching for node features still used for DGLGraph? Line 247 and Line 269 suggest yes. Is the following correct?
```python
g = dgl.graph(..., device='cpu')  # graph is on CPU
g = g.to('cuda')                  # graph features not copied to GPU yet
print(g.ndata['feat'])            # graph features now copied to GPU
```
  2. Is there any way to disable prefetching in DataLoader? I think I was confused by the arguments before; I just want to check my understanding here, no specific use case. It seems like the answer is no.

Hi @inhrz, first let me correct myself: you cannot measure the transfer latency simply by wrapping a timer around Line 338 when the tensor is pinned, because the `to` call there passes `non_blocking=True`, and the copy is truly asynchronous when the tensor is pinned (it is blocking otherwise). As for your questions:

  1. Yes. Your understanding is definitely correct.
  2. To achieve this, set device in the dataloader to CPU, and then explicitly copy the data to the GPU.
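The pattern above can be sketched as follows. This is a hedged illustration using a plain torch.utils.data.DataLoader as a stand-in for the DGL DataLoader (which yields MFGs rather than raw tensors); the idea is the same: keep loading on the CPU and make the host-to-device copy an explicit, visible step:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for node features and labels; in DGL these would come from
# g.ndata, and the DGL DataLoader would yield MFGs instead of tensors.
feats = torch.randn(256, 16)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(feats, labels), batch_size=64)

device = "cuda" if torch.cuda.is_available() else "cpu"
for x, y in loader:
    # The explicit host-to-device copy that prefetching would otherwise hide.
    x, y = x.to(device), y.to(device)
    break
```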

Hi, how can I profile the data transfer time here: dgl/dataloader.py at 5598503aa097f3c368b9e0025e15e26b904e71f2 · dmlc/dgl · GitHub (by calling cuda.synchronize before time.time()?). Also, using the TensorBoard profiler, I found that dgl/dataloader.py at 5598503aa097f3c368b9e0025e15e26b904e71f2 · dmlc/dgl · GitHub is the most time-consuming line, and after reading the CUDA code behind it, it seems to be the exact function that performs the data transfer. I don't know whether my understanding is correct; maybe I have misunderstood the function to(device).

Here is my profiling result. I found that IndexSelectMultiKernelAligned is very time-consuming on both CPU and GPU, so I have two questions: what does this function do, and does any data transfer happen inside it? In that case, is the actual PCIe transfer the later step that fetches the features, i.e. `batch = recursive_apply(batch, lambda x: x.to(dataloader.device, non_blocking=True))`? I also suspect that this non-blocking operation may not show memcpy records in the profiling tool (at least that line does not seem to take much time on either the CPU or the GPU).

Hi, cuda.synchronize + time.time() is a workable way to measure the transfer time.
And to(device) is exactly the data-transfer function; it is likely to take a large share of the time, since data movement can be expensive.
One thing to note: although to() is called with non_blocking=True, the copy is only asynchronous when the source tensor is in pinned memory.
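The pinned-memory point can be sketched as follows (a minimal illustration in plain PyTorch; on a CPU-only machine there is no device copy to observe, so the sketch falls back gracefully):

```python
import torch

src = torch.randn(1_000_000)
if torch.cuda.is_available():
    pinned = src.pin_memory()                   # page-locked host memory
    dst = pinned.to("cuda", non_blocking=True)  # may return before the copy finishes
    torch.cuda.synchronize()                    # now the copy is guaranteed done
    result = dst.cpu()
else:
    result = src  # CPU-only machine: nothing to transfer
```

Without `pin_memory()`, the same `to` call is blocking, so a naive timer would accidentally measure the full copy in one case and almost nothing in the other.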

IndexSelectMultiKernelAligned is a cross-device version of IndexSelect, so data transfer does happen in this function, over the PCIe/NVLink bus.
As for the profiling tool, I am not sure whether it records that as a memcpy operation; one suggestion is to run a small unit test against the tool to find out what it captures.

I am not sure whether my understanding is correct: is there data transfer in both of these functions? From your answer it sounds like both to() and IndexSelectMultiKernelAligned move data. If so, what is the difference between the data that each of them transfers?

Right, both of them contain data transfer. to is a general method that sends the whole tensor it is called on, while IndexSelectMultiKernelAligned sends only the selected content of the tensor; see torch.index_select for reference.
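For reference, here is a small torch.index_select sketch of the "selected content" part. This runs on a single device; DGL's IndexSelectMultiKernelAligned additionally fuses the gather with the cross-device copy, which is not shown here:

```python
import torch

feat = torch.arange(12.0).reshape(4, 3)  # features of 4 nodes, 3 dims each
idx = torch.tensor([0, 2])               # IDs of the sampled nodes

# Gather only the rows for the sampled nodes; in DGL the gathered slice
# would then travel over the PCIe/NVLink bus in the same fused step.
sub = torch.index_select(feat, 0, idx)
# sub is [[0., 1., 2.], [6., 7., 8.]]
```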

OK, I think I get the idea. Just to confirm: IndexSelectMultiKernelAligned slices the features of the sampled nodes and transfers them to the GPU, while to is used to transfer the graph structure to the GPU.
Is that right?

Yes, in this scenario.