Understanding DGL CSR and COO

Hello everyone! I am trying to figure out the GPU sampling algorithm in DGL, and I have some questions:
Q1: What is the meaning of the CSRMatrix data attribute?

I cannot even understand the comment

data index array. When is null, assume it is from 0 to NNZ - 1.

In my opinion, CSR or COO is used to represent a sparse adjacency matrix, so why are there numbers other than 0 and 1? I can see that data[0] is always 12999 in Nsight Eclipse while debugging.

Q2: I think the code is trying to assign one block per row instead of one warp per row

because:

and BLOCK_SIZE=128, so TILE_SIZE=1, i.e. one block per CSR row. Am I getting this wrong?

For the first question: the nonzero entries of the matrix can potentially be non-binary, and they may get reordered. From the comment, I guess that is what data is used for.

For Q1:

The data array is used to record the permuted order after format conversion. For example, suppose a graph has the following edges:

row: [1, 1, 0, 2, 0]
col: [1, 0, 2, 2, 0]

Converting it to CSR reorders the edges, so DGL records the original order in the data array:

rowptr: [0, 2, 4, 5]
col:    [0, 2, 0, 1, 2]
data:   [4, 2, 1, 0, 3]

This makes it easier to access the edge features, since they are indexed by the original edge order (in DGL, these indices are called edge IDs).
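
Here is a small NumPy sketch of that bookkeeping (just an illustration of the idea, not DGL's actual conversion code; the arrays match the example above):

import numpy as np

# COO edges in their original order; the position in these arrays is the edge ID.
row = np.array([1, 1, 0, 2, 0])
col = np.array([1, 0, 2, 2, 0])
eid = np.arange(len(row))              # edge IDs: [0, 1, 2, 3, 4]

# Building CSR sorts the edges by source node (and, here, by destination
# within a row), which reorders them.
order = np.lexsort((col, row))         # [4, 2, 1, 0, 3]
csr_col = col[order]                   # [0, 2, 0, 1, 2]
data = eid[order]                      # [4, 2, 1, 0, 3] -> original edge IDs

# rowptr[i]:rowptr[i+1] delimits the neighbors of node i.
rowptr = np.concatenate(([0], np.cumsum(np.bincount(row, minlength=3))))   # [0, 2, 4, 5]

Edge features stay in edge-ID order, so the feature of the j-th CSR entry is looked up as edge_feat[data[j]].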

For Q2:

You are right. The current implementation assigns one block per row.
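
As a toy illustration of that assignment (plain Python/NumPy here, not the real CUDA kernel), you can picture block b doing the sampling work for row b of the CSR matrix:

import numpy as np

rng = np.random.default_rng(0)
rowptr = np.array([0, 2, 4, 5])        # CSR arrays from the example above
col = np.array([0, 2, 0, 1, 2])
fanout = 2

for block_id in range(len(rowptr) - 1):              # "one block per row"
    start, end = rowptr[block_id], rowptr[block_id + 1]
    num_picks = min(fanout, end - start)
    # In the kernel the threads of a block cooperate on this row; here a
    # single loop iteration stands in for one block's work.
    picked = rng.choice(np.arange(start, end), size=num_picks, replace=False)
    print(f"row {block_id}: sampled columns {col[picked]}")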

Thank you for answering my questions!

By the way, what is “TensorAdaptor” in the DGL dataloader? I tried to set use_alternate_streams=True in dgl.dataloading.DataLoader, and DGL showed:

DGLWarning: use_alternate_streams is turned off because TensorAdaptor is not available.

I googled the warning and also searched for “TensorAdaptor”, but found almost nothing useful. So what is “TensorAdaptor” in the DGL dataloader, and how can I make it available?
Or should I open a new topic about that?

TensorAdaptor is our internal plugin for optimizing the PyTorch backend. If you are building from source, pass BUILD_TORCH=ON to cmake and it should be enabled during the build.

Thanks! So is there any documentation or tutorial about this internal plugin?

Currently no, and we are moving towards a native PyTorch extension, so it may be removed in the future.

Thanks for your reply!

Sorry to bother you again. I have been doing some research on sampling and DGL streams. I set use_alternate_streams=True in the DGL DataLoader and use GPU/CPU sampling with feature prefetching in NeighborSampler:

sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])

# GPU sampling
graph = graph.to(device)
train_nids = train_nids.to(device)
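
The dataloader itself is constructed roughly like this (a sketch; batch_size and the other keyword arguments are placeholders, not my exact settings):

dataloader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler,
    device=device,
    batch_size=1024,              # placeholder value
    shuffle=True,
    use_alternate_streams=True,   # the option under discussion
)
for input_nodes, output_nodes, blocks in dataloader:
    ...                           # per-mini-batch training step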

Observation: DGL creates a non-default stream every epoch (i.e. every time the DGL DataLoader is initialized). I trained for 4 epochs, and the profiler shows 4 non-default streams:

Now I have some questions:
Q1: First, I found that after CPU sampling with prefetching, DGL does the index select first and then the feature transfer:


No problem there: slice the features, then transfer them.
However, after GPU sampling with prefetching, DGL does the feature transfer first and then the index select:

So why is there such a difference?
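
Just to make the two orderings concrete, here is rough pseudo-code for what I mean (this is not DGL's internal code, and I am glossing over exactly what gets copied):

# CPU sampling: slice on the host first, then copy the sliced batch.
batch_feat = feat[input_nodes]                          # index select on CPU
batch_feat = batch_feat.to(device, non_blocking=True)   # then H2D transfer

# GPU sampling: the H2D transfer comes first, then the index select kernel
# runs on the GPU.
feat_on_gpu = feat.to(device, non_blocking=True)        # transfer first
batch_feat = feat_on_gpu[input_nodes]                   # then index select on GPU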

Q2: The documentation says:

That’s true: with GPU sampling, the feature transfer (first) and the index select kernel (then) run on non-default stream 14. But I found that DGL uses pageable host memory for the feature transfer:

and it cannot be pinned, maybe because pin_prefetcher cannot be True when using GPU sampling:

As far as I know, pageable host memory causes asynchronous CUDA memcpy operations (e.g. cudaMemcpyAsync) to block and execute synchronously. So what is the point of using a stream and cudaMemcpyAsync here? I do not think it does any good, because the CUDA stream here overlaps with nothing.
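
For reference, here is a small standalone PyTorch snippet (my own example, unrelated to DGL's code) showing why this matters: non_blocking=True only gives a truly asynchronous, overlappable copy when the source tensor is pinned:

import torch

x_pageable = torch.randn(1 << 20)        # ordinary (pageable) host tensor
x_pinned = x_pageable.pin_memory()       # page-locked copy of the same data

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Pinned source: the copy is asynchronous with respect to the host and can
    # overlap with work on other streams.
    y = x_pinned.to('cuda', non_blocking=True)
    # Pageable source: non_blocking=True has no real effect; the copy is staged
    # through the driver and behaves synchronously, so nothing overlaps.
    z = x_pageable.to('cuda', non_blocking=True)
torch.cuda.synchronize()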

Could you create a new topic for this? Thanks.

Sure, here is the new topic link. Thanks in advance!