GPU kernels while running a GraphSAGE model

I am running a GraphSAGE model on a single GPU. I understand that the CSRMMT kernel appears during the aggregation step performed by update_all(). But can someone explain why I see 3 CSRMMT kernels per layer for the GSpMM op? Also, why is the transpose_readwrite_alignment kernel being called?
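For reference, the aggregation step I mean is roughly this (a minimal sketch with a toy graph and placeholder sizes, not my actual model):

```python
# Minimal sketch of the mean-aggregation step in a SAGE layer (toy graph, placeholder sizes).
# update_all with built-in message/reduce functions is lowered to a GSpMM op, which on the
# GPU appears as a cuSPARSE csrmm-style kernel (e.g. csrMmt_hyb_core on CUDA 10).
import dgl
import dgl.function as fn
import torch

g = dgl.rand_graph(1000, 5000)          # toy graph
g.ndata['h'] = torch.randn(1000, 16)    # node features

g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))  # SpMM: A_mean @ H
h_neigh = g.ndata['h_neigh']
```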

Hi, I have exactly the same question. Have you figured it out, or does anyone have an idea about it?

The point is, with two SAGEConv layers there should be 2 SpMM calls in each forward pass, but nvprof shows three calls to the “csrMmt_hyb_core” kernel.

For the transpose kernel: the cusparseScsrmm2 API in CUDA 10 assumes a column-major layout, while DGL’s feature tensors are row-major, so we need to transpose the output matrix. If you use CUDA 11, you will not encounter this problem because the new cusparseSpMM API with ALG2 assumes a row-major layout.
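As a quick illustration of the layout issue (plain NumPy, nothing DGL-specific):

```python
import numpy as np

# A row-major (C-order) buffer reinterpreted with column-major (Fortran-order) strides
# is the transpose of the original matrix -- which is why an extra transpose pass is
# needed when handing DGL's row-major tensors to the column-major cusparseScsrmm2 API.
a = np.arange(6, dtype=np.float32).reshape(2, 3)      # row-major 2x3
b = a.ravel(order='C').reshape((3, 2), order='F')     # same buffer, read column-major
assert np.array_equal(b, a.T)
```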

@chenxuhao as for the three csrMmt_hyb_core calls: a single cusparseScsrmm2 call can trigger several csrMmt_hyb_core kernels. I suggest you upgrade your CUDA to version 11, where the new API triggers only a single csrmm_v2_kernel, and CUDA 11’s SpMM is also faster.

By the way, I noticed that you were editing binary_reduce_sum.cu in your fork. However, that is legacy code that has been deprecated since DGL 0.5; please edit this file instead:

Hi Zihao,
Thank you so much for your reply.
I tried CUDA 11.1 and it is faster than CUDA 10.2 for SAGE.
But I am still confused about why SAGE (mean aggregator) is so much slower than GCN.

For example, when running GCN, I see 4 CusparseCsrmm2 calls per epoch (2 in FW and 2 in BW), and one epoch takes 90ms. When I run SAGE-mean, however, I see only 3 CusparseCsrmm2 calls per epoch (2 in FW and 1 in BW), but one epoch takes 400ms.
Note that the epoch time difference is mostly due to the CusparseCsrmm2 calls (i.e. cusparse::csrmm_v2_kernel).

Computation-wise, GCN and SAGE-mean differ only in the normalization scores (i.e., C_ij for each edge (i, j)). If the normalization scores are pre-computed, the CusparseCsrmm2 calls should be exactly the same, and thus the speed should be similar for GCN and SAGE-mean, right?
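To make this concrete, here is roughly what I have in mind (a sketch with made-up names; only the pre-computed edge weight differs between the two):

```python
import dgl
import dgl.function as fn
import torch

g = dgl.rand_graph(1000, 5000)
h = torch.randn(1000, 16)
deg = g.in_degrees().float().clamp(min=1)

# Pre-computed per-edge normalization (the only difference between the two models):
#   SAGE-mean: 1 / deg(dst)          GCN: 1 / sqrt(deg(src) * deg(dst))
src, dst = g.edges()
w_sage = (1.0 / deg[dst]).unsqueeze(-1)
w_gcn = (1.0 / torch.sqrt(deg[src] * deg[dst])).unsqueeze(-1)

# Either way, the aggregation is the same weighted SpMM (GSpMM) call.
g.ndata['h'] = h
g.edata['w'] = w_sage            # or w_gcn
g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.sum('m', 'h_agg'))
```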

Also, this speed difference between GCN and SAGE is not only on GPU; I see it on CPU as well.
It seems the SpMM kernel calls for GCN and SAGE differ in some way, but I checked that the call parameters are the same: e.g., for the reddit graph, m = 232965, n = 16, k = 232965, nnz = 114615892.

Am I missing something?

Best,
Xuhao

Please note that there is an optimization in our GCNConv implementation:

Basically, it determines whether to compute A X W as (A X) W or A (X W); in some cases the latter is better, and I don’t think we have such an optimization for GraphSAGE. Would you mind checking whether that is the reason?
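Roughly, the trick looks like this (a sketch, not the actual GraphConv code):

```python
import dgl.function as fn
import torch

def conv(g, x, weight):
    """Sketch of the (A X) W vs A (X W) ordering trick; both yield A @ X @ W."""
    in_feats, out_feats = weight.shape
    with g.local_scope():
        if in_feats > out_feats:
            # A (X W): shrink the feature dim first, so the SpMM runs on fewer columns.
            g.ndata['h'] = x @ weight
            g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))
            return g.ndata['h']
        else:
            # (A X) W: aggregate first, then apply the dense projection.
            g.ndata['h'] = x
            g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))
            return g.ndata['h'] @ weight
```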

This makes sense to me. What prevents you from doing this optimization for SAGE? Is it something fundamental, or just not implemented yet?

I think it’s just because GraphSAGE has multiple kinds of aggregators (mean/pool/lstm/…); for the mean aggregator, however, such an optimization is applicable. I’ll create a PR adding this feature.