@Pangurbon thanks for your attention.
Let me briefly introduce how DGL parallelizes the message function, reduce function, and apply edge function, respectively (the apply node function is trivial to parallelize).
Message function
The message function takes an EdgeBatch as input and produces a message feature on each edge.
If the message function is user-defined (instead of a builtin one), DGL broadcasts the requested source/destination node features to edges via framework-dependent scatter functions.
Note that only the features requested by the message function are materialized on edges (i.e. the broadcasting is lazily evaluated).
The resulting messages are temporary: they are dispatched to the reduce functions and consumed there.
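To make the message step concrete, here is a minimal pure-Python sketch of the idea (illustrative only, not DGL's actual API; the graph and features are made up). The source feature is "broadcast" to an edge only because the message function asks for it:

```python
# Toy graph: edges as (src, dst) pairs; a scalar node feature per node.
edges = [(0, 2), (1, 2), (0, 1)]
node_h = {0: 1.0, 1: 2.0, 2: 3.0}

def message_func(src_h):
    # User-defined message: here, simply forward the source feature.
    return src_h

# Source features are gathered onto edges only when the message function
# requests them (in DGL this scatter is lazy and framework-dependent).
messages = [message_func(node_h[src]) for src, dst in edges]
print(messages)  # [1.0, 2.0, 1.0]
```

The list of per-edge messages is the temporary result that the reduce step consumes.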
Reduce function
The reduce function describes how a destination node aggregates its incoming messages (and possibly combines them with its own node feature).
It takes a NodeBatch as input (including a field called mailbox that holds the incoming messages) and produces a node feature.
Since not all nodes have the same in-degree, parallelizing the reduce functions of all nodes is non-trivial. Currently DGL uses a degree-bucketing policy: it auto-batches the reduce functions of nodes with the same in-degree. This works well for graphs with uniform degree (e.g. subgraphs produced by some GCN sampling algorithms) or trees (e.g. TreeLSTM), where the degree variance is small.
The messages of nodes with the same in-degree are batched into tensors of shape (batch_size, degree, *) in mailbox, and custom reduce functions can operate on these message tensors directly.
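The bucketing-plus-reduce flow can be sketched in a few lines of pure Python (illustrative only, not DGL's actual API; the graph is made up). Nodes with the same in-degree are grouped so their mailboxes form a dense (batch_size, degree) batch:

```python
from collections import defaultdict

edges = [(0, 2), (1, 2), (0, 1)]     # (src, dst) pairs
node_h = {0: 1.0, 1: 2.0, 2: 3.0}

# Dispatch messages to destination mailboxes (message = source feature).
mailbox = defaultdict(list)
for src, dst in edges:
    mailbox[dst].append(node_h[src])

# Degree bucketing: group destination nodes by in-degree so each bucket
# batches into a dense (batch_size, degree) array of messages.
buckets = defaultdict(list)
for node, msgs in mailbox.items():
    buckets[len(msgs)].append(node)

def reduce_func(batched_msgs):
    # User-defined reduce: sum over the degree axis.
    return [sum(row) for row in batched_msgs]

out = {}
for degree, nodes in buckets.items():
    batched = [mailbox[n] for n in nodes]   # shape (batch_size, degree)
    for n, r in zip(nodes, reduce_func(batched)):
        out[n] = r
print(out)  # node 1 receives 1.0; node 2 receives 1.0 + 2.0 = 3.0
```

The reduce function is invoked once per bucket, so each call operates on a uniformly-shaped batch.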
For graphs with high degree variance, you can use our builtin message & reduce functions, or our upcoming feature, degree padding, which batches nodes with similar degrees and pads them to the same length with 0 or any value you like.
Apply edge function
This is similar to the message function; the only difference is that the result is stored on edges (it is not temporary), and consequently there is no reduce function.
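A quick pure-Python sketch of the difference (illustrative only, not DGL's actual API): the per-edge result is written back to edge storage rather than being consumed by a reducer.

```python
edges = [(0, 2), (1, 2), (0, 1)]
node_h = {0: 1.0, 1: 2.0, 2: 3.0}
edge_data = [{} for _ in edges]      # persistent per-edge storage

def apply_edge_func(src_h, dst_h):
    # User-defined apply-edges computation: here, src + dst feature.
    return src_h + dst_h

for i, (src, dst) in enumerate(edges):
    edge_data[i]['e'] = apply_edge_func(node_h[src], node_h[dst])
print(edge_data)  # [{'e': 4.0}, {'e': 5.0}, {'e': 3.0}]
```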
Builtin functions
Note that many combinations of message and reduce functions can be optimized. For example, a message function that multiplies a source node feature by an edge feature (u_mul_e) combined with a sum reducer is the classic SpMV problem, and the attention operation, a dot product between source and destination node features (u_dot_v), is an SDDMM operation; there is already a lot of research on optimizing these sparse matrix operations.
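The u_mul_e + sum equivalence with SpMV is easy to verify with a tiny pure-Python check (illustrative sketch with made-up data, not DGL code):

```python
edges = [(0, 2), (1, 2), (0, 1)]     # (src, dst)
w = [0.5, 2.0, 1.0]                  # per-edge weights
h = [1.0, 2.0, 3.0]                  # node features
n = 3

# Message-passing view: m_e = h[src] * w_e, then sum messages into dst.
out_mp = [0.0] * n
for (src, dst), we in zip(edges, w):
    out_mp[dst] += h[src] * we

# SpMV view: out = A^T h, where A[src][dst] = w_e.
A = [[0.0] * n for _ in range(n)]
for (src, dst), we in zip(edges, w):
    A[src][dst] = we
out_spmv = [sum(A[s][d] * h[s] for s in range(n)) for d in range(n)]

print(out_mp, out_spmv)  # both [0.0, 1.0, 4.5]
```

Because the two views compute the same result, the scheduler can route this pattern to a sparse-matrix kernel instead of materializing messages on edges.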
In most cases we do not have to materialize node features on edges (which is costly when the graph is dense, because E >> V), and we may have better parallel strategies than degree bucketing. To address these problems, DGL provides optimizations for the following message-passing routines:
- message function: x_op_y, where op is one of add/sub/mul/div/dot and x, y are one of u, v, e (source node, destination node, edge); also copy_x.
- reduce function: sum/max/min/mean/prod/none. If none, the result is stored on edges.
We provide a set of builtin functions that act as placeholders; our scheduler detects combinations of builtin functions and selects the corresponding kernels to execute them.
The design of the kernels that execute these message-passing routines partly follows Gunrock, a high-performance GPU library for graph analytics (SSSP, PageRank, etc.). However, Gunrock targets more general use cases such as traversal-based algorithms (e.g. shortest path); we focus on only one step, "Advance" (a term from the Gunrock paper), so we simplified the design and implemented a minimal version called minigun. The idea is edge parallelism: each thread is responsible for the computation of a single edge; you can refer to the logic here.
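The edge-parallel idea can be sketched in pure Python (illustrative only; minigun itself is CUDA, and the graph here is made up). Each logical thread handles exactly one edge; on a GPU the accumulation into the destination would be an atomic add:

```python
edges = [(0, 2), (1, 2), (0, 1)]
h = [1.0, 2.0, 3.0]
out = [0.0] * 3

def edge_thread(eid):
    # Per-edge work, independent across edges.
    src, dst = edges[eid]
    out[dst] += h[src]          # atomic add on a real GPU

for eid in range(len(edges)):   # conceptually launched in parallel
    edge_thread(eid)
print(out)  # [0.0, 1.0, 3.0]
```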
These are some examples of how to write graph kernels in minigun:
- spmm
- edge_softmax, used in GAT and Transformer
- SDDMM, used in Transformer and in the backward pass of SPMM
For some highly optimized operations that are already implemented in cuSPARSE, we fall back to those implementations for best performance.
Future
The builtin functions are general but only cover a subset of possible operations on graphs, and neither degree bucketing nor degree padding is the best parallel strategy on all graphs. In the future we would like to integrate deeply with TVM and compile user-defined functions to an IR so that more operations can be fused.
Overall, the design is systematic and general, which keeps it extensible and flexible for further optimization.