Understanding DGL CSR and COO

Hello everyone! I am trying to figure out the GPU sampling algorithm in DGL, and I have some questions:
Q1: What is the meaning of the CSRMatrix data attribute?

I cannot even understand the comment

data index array. When is null, assume it is from 0 to NNZ - 1.

In my opinion, CSR or COO is used to represent a sparse adjacency matrix, so why are there numbers other than 0 and 1? I can see that data[0] is always 12999 in Nsight Eclipse while debugging.

Q2: I think the code is trying to assign one block per row instead of one warp per row

because:

and BLOCK_SIZE=128, so TILE_SIZE=1, i.e. one block per CSR row. Am I getting this wrong?

For the first question: the nonzero entries of the matrix can potentially be non-binary, and they may get reordered. From the comment, I guess that is what data is used for.

For Q1:

The data array is used to record the permuted order after format conversion. For example, suppose a graph has the following edges:

row: [1, 1, 0, 2, 0]
col: [1, 0, 2, 2, 0]

Converting it to CSR reorders the edges, so DGL records the original order in the data array:

rowptr: [0, 2, 4, 5]
col:    [0, 2, 0, 1, 2]
data:   [4, 2, 1, 0, 3]

This makes it easier to access the edge features, since they are indexed by the original edge order (in DGL, these indices are called edge IDs).
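
Here is a small NumPy sketch of that bookkeeping (just an illustration of the idea, not DGL's actual conversion code; the arrays match the example above):

import numpy as np

# COO edges in their original order; the position in these arrays is the edge ID.
row = np.array([1, 1, 0, 2, 0])
col = np.array([1, 0, 2, 2, 0])
eid = np.arange(len(row))              # edge IDs: [0, 1, 2, 3, 4]

# Building CSR sorts the edges by source node (and, here, by destination
# within a row), which reorders them.
order = np.lexsort((col, row))         # [4, 2, 1, 0, 3]
csr_col = col[order]                   # [0, 2, 0, 1, 2]
data = eid[order]                      # [4, 2, 1, 0, 3] -> original edge IDs

# rowptr[i]:rowptr[i+1] delimits the neighbors of node i.
rowptr = np.concatenate(([0], np.cumsum(np.bincount(row, minlength=3))))   # [0, 2, 4, 5]

Edge features stay in edge-ID order, so the feature of the j-th CSR entry is looked up as edge_feat[data[j]].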

For Q2:

You are right. The current implementation assigns one block per row.
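
As a toy illustration of that assignment (plain Python/NumPy here, not the real CUDA kernel), you can picture block b doing the sampling work for row b of the CSR matrix:

import numpy as np

rng = np.random.default_rng(0)
rowptr = np.array([0, 2, 4, 5])        # CSR arrays from the example above
col = np.array([0, 2, 0, 1, 2])
fanout = 2

for block_id in range(len(rowptr) - 1):              # "one block per row"
    start, end = rowptr[block_id], rowptr[block_id + 1]
    num_picks = min(fanout, end - start)
    # In the kernel the threads of a block cooperate on this row; here a
    # single loop iteration stands in for one block's work.
    picked = rng.choice(np.arange(start, end), size=num_picks, replace=False)
    print(f"row {block_id}: sampled columns {col[picked]}")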

Thank you for answering my questions!

By the way, what is “TensorAdaptor” in the DGL dataloader? I tried to set use_alternate_streams=True in dgl.dataloading.DataLoader, and DGL showed:

DGLWarning: use_alternate_streams is turned off because TensorAdaptor is not available.

I googled the warning and also searched for “TensorAdaptor”, but found almost nothing useful. So what is “TensorAdaptor” in the DGL dataloader, and how can I make it available?
Or should I open a new topic about that?

TensorAdaptor is our internal plugin for optimizing the PyTorch backend. If you are building from source, pass BUILD_TORCH=ON to cmake and it should be enabled during the build.

Thanks! So is there any documentation or tutorial about this internal plugin?

Currently no, and we are moving towards a native PyTorch extension, so it may be removed in the future.

Thanks for your reply!

Sorry to bother you again. I have been doing some research on sampling and DGL streams. I set use_alternate_streams=True in the DGL DataLoader and use GPU/CPU sampling with feature prefetching in NeighborSampler:

sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])

# GPU sampling
graph = graph.to(device)
train_nids = train_nids.to(device)
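
The dataloader itself is constructed roughly like this (a sketch; batch_size and the other keyword arguments are placeholders, not my exact settings):

dataloader = dgl.dataloading.DataLoader(
    graph, train_nids, sampler,
    device=device,
    batch_size=1024,              # placeholder value
    shuffle=True,
    use_alternate_streams=True,   # the option under discussion
)
for input_nodes, output_nodes, blocks in dataloader:
    ...                           # per-mini-batch training step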

Observation: DGL creates a non-default stream every epoch (i.e. every time the DGL DataLoader is initialized). I trained for 4 epochs, and the profiler shows 4 non-default streams:

Now I have some questions:
Q1: First, I found that after CPU sampling with prefetching, DGL does the index select first and then the feature transfer:


No problem there: slice the features, then transfer them.
However, after GPU sampling with prefetching, DGL does the feature transfer first and then the index select:

So why is there such a difference?
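
Just to make the two orderings concrete, here is rough pseudo-code for what I mean (this is not DGL's internal code, and I am glossing over exactly what gets copied):

# CPU sampling: slice on the host first, then copy the sliced batch.
batch_feat = feat[input_nodes]                          # index select on CPU
batch_feat = batch_feat.to(device, non_blocking=True)   # then H2D transfer

# GPU sampling: the H2D transfer comes first, then the index select kernel
# runs on the GPU.
feat_on_gpu = feat.to(device, non_blocking=True)        # transfer first
batch_feat = feat_on_gpu[input_nodes]                   # then index select on GPU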

Q2: The documentation says:

That’s true: with GPU sampling, the feature transfer (first) and the index select kernel (then) run on non-default stream 14. But I found that DGL uses pageable host memory for the feature transfer:

and it cannot be pinned, maybe because pin_prefetcher cannot be True when using GPU sampling:

As far as I know, pageable host memory causes asynchronous CUDA memcpy operations (e.g. cudaMemcpyAsync) to block and execute synchronously. So what is the point of using a stream and cudaMemcpyAsync here? I do not think it does any good, because the CUDA stream here overlaps with nothing.
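
For reference, here is a small standalone PyTorch snippet (my own example, unrelated to DGL's code) showing why this matters: non_blocking=True only gives a truly asynchronous, overlappable copy when the source tensor is pinned:

import torch

x_pageable = torch.randn(1 << 20)        # ordinary (pageable) host tensor
x_pinned = x_pageable.pin_memory()       # page-locked copy of the same data

side_stream = torch.cuda.Stream()
with torch.cuda.stream(side_stream):
    # Pinned source: the copy is asynchronous with respect to the host and can
    # overlap with work on other streams.
    y = x_pinned.to('cuda', non_blocking=True)
    # Pageable source: non_blocking=True has no real effect; the copy is staged
    # through the driver and behaves synchronously, so nothing overlaps.
    z = x_pageable.to('cuda', non_blocking=True)
torch.cuda.synchronize()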

Could you create a new topic for this? Thanks.

Sure, here is the new topic link. Thanks in advance!