@Pangurbon thanks for your attention.
Let me briefly introduce how dgl parallelizes message function, reduce function, and apply edge function respectively. (apply node function is trivial).
Message function takes EdgeBatch as input, and results in a message feature on edges.
If the message function is defined by users (instead of built-in ones). For source/destination node features, dgl would broadcast node features to edges via framework dependent scatter functions.
Note that only the features requested by message functions would be materialized on edges. (i.e. the broadcasting is lazy evaluated).
The result message is temporary and would then be dispatched to reduce functions and digested.
The reduce function describes how destination node aggregates messages (or possibly interact with the node feature of destination node).
Reduce function takes a NodeBatch (including a field called
mailbox represents the incoming messages) as input, and results in a node feature.
Note that not all nodes have the same in-degrees, the parallel of all nodes’ reduce functions is not trivial. Currently DGL uses degree bucketing policy, which means auto-batching the reduce function of nodes with the same in-degrees, this is a good solution for uniform degree graph (subgraph produced by some GCN sampling algorithms) or trees (e.g. TreeLSTM) where degree variance is small.
The messages of nodes with same in-degrees are batched as tensors of shape
(batch_size, degree, *) in mailbox, in custom reduce functions, users could operate on these message tensors.
For graph with high degree variance, you can use our built-in message & reduce functions or our incoming feature: degree padding, which batches nodes with similar degrees and pad to the same length with 0 or any value you like.
Apply edge function
Similar to message function, the only difference is that the result would be stored on edges (not temporary), this being said, there is no reduce function.
Note that many combinations of message functions and reduce functions can be optimized, for example: the combination of a source node feature multiply edge feature message function(
u_mul_e), with a sum reducer is the classic SPMV problem, and the attention operation: source node feature dot product with destination node feature(
u_dot_v) is SDDMM operation, and there has already been many research on optimizing these sparse matrix operations.
In most cases we do not have to materialize node features on edges (which is costly when graph is dense, because
|E|>>|V|), and we may have better parallel strategies than degree bucketing. To address these problems, DGL provides optimization for following message-passing routines:
- message function:
- op: add/sub/mul/div/dot
- x: u, v, e (source node, destination node, edge)
- y: u, v, e
- reduce function:
none, the result is on edges.
We provide a set of builtin functions acts as placeholders, our scheduler would detect the combination of built-in functions and select corresponding kernels to execute them.
The design of kernel to exectue these message-passing routines partly follows Gunrock, a high-performant GPU library for graph analysis (SSSP, pagerank, etc.). However, Gunrock has more general use case such as traversal-based algorithms (shortest path), we only focus on one step “Advance” (a term described in gunrock paper), so we simplify the design and implement a minimal version called minigun. The idea is edge parallel, each thread is responsible for the computation of a single edge, you can refer to the logic here.
These are some examples on how to write graph kernels in minigun:
edge_softmax, used in GAT and transformer.
SDDMM used in Transformer and the backward function of SPMM.
For some highly optimized operations that have already been implemented in CuSparse, we turn to those implementations for best performance.
The builtin functions is general but only covers a subset of possible operations on graphs, and both degree bucketing/padding are not best parallel strategies on graphs. In the further we would like to deeply integrate with tvm and compile user-defined functions to some IR so that more operations could be fused.
Overall, the design is systematic and general, rendering it extensive and flexible for further optimizations.