Internal Message Passing Implementation

Hey. I’m working on a small GNN library in a different programming language, and I’ve been looking at the internals of several libraries: PyG, Microsoft GNN samples, DeepMind GraphNets, and of course DGL. The library is intended primarily as a way to learn various GNN algorithms, and for use in a couple of projects I’m doing in that language.

I have a working implementation of the scatter-gather functions used by PyG, and I’ve implemented some broadcasting similar to that used in the MSFT/DeepMind repos, but since DGL is arguably more ambitious than these, I’ve struggled to figure out exactly how DGL performs message passing. Mattias, the PyG maintainer, seemed to think that DGL does not use a system similar to his.

Would someone mind pointing me in the direction of some bit of well documented source code for the core message passing algorithm? Or post a brief explanation here of how message passing is done in DGL? I’m not looking for in-depth documentation so much as an overview that I can use to develop and learn with. The fused message passing paper provides a good overview of the API, but unfortunately doesn’t contain much info on the actual implementation.

For the API level tutorial, I suggest starting from our slides for GTC 2019.

For the implementation, we use a scheduler to generate intermediate representations (IR) for basic operations on graphs, and a separate executor to execute them.

@Pangurbon thanks for your attention.

Let me briefly introduce how DGL parallelizes the message function, reduce function, and apply-edge function (the apply-node function is trivial).

Message function

The message function takes an EdgeBatch as input and produces a message feature on edges.

If the message function is user-defined (instead of a built-in one), DGL broadcasts source/destination node features to edges via framework-dependent scatter functions.

Note that only the features requested by the message function are materialized on edges (i.e. the broadcasting is lazily evaluated).

The resulting messages are temporary; they are dispatched to the reduce function and consumed there.
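As a rough sketch of this broadcast-then-message step (my own toy code, not DGL's internals — the graph, feature values, and the numpy gather are all invented for illustration):

```python
import numpy as np

# Toy graph with 3 nodes and 4 edges, stored as (src, dst) index arrays.
src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 0])
node_feat = np.array([[1.0], [2.0], [3.0]])              # one feature per node
edge_feat = np.array([[10.0], [20.0], [30.0], [40.0]])   # one feature per edge

# Broadcast source-node features onto edges: a gather indexed by src.
# In DGL this materialization is lazy -- only the features the message
# function actually requests get copied to edges.
src_feat_on_edges = node_feat[src]        # shape (|E|, 1)

# A user-defined message function sees an EdgeBatch-like view of these
# tensors and returns one message per edge (here: u_mul_e-style).
messages = src_feat_on_edges * edge_feat  # shape (|E|, 1)
```

The same gather-by-index trick covers destination-node features (`node_feat[dst]`).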

Reduce function

The reduce function describes how a destination node aggregates its incoming messages (and possibly how they interact with the destination node's own features).

The reduce function takes a NodeBatch as input (including a field called mailbox that holds the incoming messages) and produces a node feature.

Note that since not all nodes have the same in-degree, parallelizing the reduce function across all nodes is not trivial. Currently DGL uses a degree-bucketing policy, which means auto-batching the reduce function over nodes with the same in-degree. This is a good solution for uniform-degree graphs (e.g. subgraphs produced by some GCN sampling algorithms) or trees (e.g. TreeLSTM), where degree variance is small.

The messages of nodes with the same in-degree are batched into tensors of shape (batch_size, degree, *) in mailbox; in custom reduce functions, users can operate on these message tensors directly.
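A minimal sketch of the degree-bucketing idea (again my own invented code, not DGL's scheduler): group destination nodes by in-degree, then stack each bucket's messages into a rectangular (batch_size, degree, feat) mailbox and apply the reducer once per bucket.

```python
import numpy as np
from collections import defaultdict

dst = np.array([1, 2, 2, 0])                              # destination of each edge
messages = np.array([[10.0], [20.0], [60.0], [120.0]])    # one message per edge

# Collect the incoming edge ids of each destination node.
incoming = defaultdict(list)
for eid, d in enumerate(dst):
    incoming[int(d)].append(eid)

# Bucket nodes by in-degree; stacking is rectangular within a bucket
# because all of its nodes share the same degree.
buckets = defaultdict(list)
for node, eids in incoming.items():
    buckets[len(eids)].append(node)

result = {}
for degree, nodes in buckets.items():
    # mailbox shape: (batch_size, degree, feat)
    mailbox = np.stack([messages[incoming[n]] for n in nodes])
    reduced = mailbox.sum(axis=1)     # a sum reducer, batched over the bucket
    for node, feat in zip(nodes, reduced):
        result[node] = feat
```

The reduce function runs once per distinct degree rather than once per node, which is why low degree variance is the favorable case.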

For graphs with high degree variance, you can use our built-in message & reduce functions, or our upcoming feature, degree padding, which batches nodes with similar degrees and pads their mailboxes to the same length with 0 (or any value you like).
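To illustrate the padding idea (invented code — the feature was only "upcoming" at the time of this post, so this is just the general technique, not DGL's implementation):

```python
import numpy as np

# Mailboxes of three nodes with in-degrees 2, 1, and 3.
mailboxes = [
    np.array([[1.0], [2.0]]),
    np.array([[3.0]]),
    np.array([[4.0], [5.0], [6.0]]),
]
max_deg = max(len(m) for m in mailboxes)

# Pad every mailbox with zeros up to the largest degree so all nodes
# fit in one rectangular (batch, max_deg, feat) tensor.
padded = np.stack([
    np.vstack([m, np.zeros((max_deg - len(m), m.shape[1]))])
    for m in mailboxes
])

# A sum reducer is unaffected by zero padding; other reducers need a
# different pad value (e.g. -inf for max).
reduced = padded.sum(axis=1)
```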

Apply edge function

Similar to the message function; the only difference is that the result is stored on edges (it is not temporary), and there is no reduce function.

Builtin functions

Note that many combinations of message and reduce functions can be optimized. For example, the combination of a "source node feature multiplied by edge feature" message function (u_mul_e) with a sum reducer is the classic SpMV problem, and the attention operation "source node feature dot destination node feature" (u_dot_v) is an SDDMM operation. There has already been a lot of research on optimizing these sparse matrix operations.
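A quick numeric check of the SpMV equivalence (toy graph of my own, not DGL code): treating edge weights as a sparse adjacency matrix A with A[dst, src] = w, the fused (u_mul_e, sum) result per destination node is exactly A @ x.

```python
import numpy as np

src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 0])
w = np.array([10.0, 20.0, 30.0, 40.0])   # scalar edge features
x = np.array([1.0, 2.0, 3.0])            # scalar node features
n = 3

# Message-passing view: message = x[src] * w, then sum per destination.
# np.add.at is an unbuffered scatter-add, so repeated dst indices accumulate.
out_mp = np.zeros(n)
np.add.at(out_mp, dst, x[src] * w)

# Sparse-matrix view: A[dst, src] = w, result = A @ x.
A = np.zeros((n, n))
A[dst, src] = w
out_spmv = A @ x
```

Running a recognized builtin pair as one sparse kernel avoids ever materializing `x[src] * w` on the edges.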

In most cases we do not have to materialize node features on edges (which is costly when the graph is dense, because |E| >> |V|), and we may have better parallel strategies than degree bucketing. To address these problems, DGL provides optimizations for the following message-passing routines:

  • message function:
    • x_op_y
      • op: add/sub/mul/div/dot
      • x: u, v, e (source node, destination node, edge)
      • y: u, v, e
    • copy_x
      • x: u, e
  • reduce function: sum/max/min/mean/prod/none
    • If none, the result is on edges.
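The dispatch step can be pictured as a simple pattern match over the placeholder names (a hypothetical sketch — the table contents and function names below are invented, not DGL's real scheduler):

```python
# Map recognized (message, reduce) builtin pairs to fused sparse kernels;
# the kernel names here are illustrative placeholders.
FUSED_KERNELS = {
    ("u_mul_e", "sum"): "spmv_kernel",
    ("u_dot_v", "none"): "sddmm_kernel",
}

def select_kernel(message_fn: str, reduce_fn: str) -> str:
    # Builtin placeholders carry no computation themselves; a recognized
    # pair maps to an optimized kernel, and anything else falls back to
    # the generic degree-bucketing path.
    return FUSED_KERNELS.get((message_fn, reduce_fn), "degree_bucketing_fallback")
```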

We provide a set of builtin functions that act as placeholders; our scheduler detects combinations of built-in functions and selects the corresponding kernels to execute them.

The design of the kernel that executes these message-passing routines partly follows Gunrock, a high-performance GPU library for graph analysis (SSSP, PageRank, etc.). However, Gunrock covers more general use cases such as traversal-based algorithms (shortest path); we only focus on one step, “Advance” (a term from the Gunrock paper), so we simplified the design and implemented a minimal version called minigun. The idea is edge parallelism: each thread is responsible for the computation of a single edge. You can refer to the logic here.
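The edge-parallel "Advance" pattern, written serially in Python for clarity (minigun does this on the GPU with one CUDA thread per edge — this toy loop is my own illustration, not minigun code):

```python
import numpy as np

src = np.array([0, 0, 1, 2])
dst = np.array([1, 2, 2, 0])
x = np.array([1.0, 2.0, 3.0])
out = np.zeros(3)

# Each loop iteration plays the role of one GPU thread handling one edge.
for eid in range(len(src)):
    msg = x[src[eid]]        # copy_u-style message for this edge
    # On the GPU, threads writing to the same destination node must use
    # an atomic add (or a segmented reduction) to avoid write conflicts.
    out[dst[eid]] += msg
```

Edge parallelism gives uniform work per thread regardless of degree distribution, at the cost of atomic writes on the destination side.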

These are some examples on how to write graph kernels in minigun:

  • spmm
  • edge_softmax, used in GAT and Transformer.
  • SDDMM, used in Transformer and in the backward function of SPMM.

For some highly optimized operations that are already implemented in cuSPARSE, we fall back to those implementations for best performance.

Future

The builtin functions are general but only cover a subset of possible operations on graphs, and neither degree bucketing nor degree padding is the best parallel strategy on all graphs. In the future we would like to integrate deeply with TVM and compile user-defined functions to an IR so that more operations can be fused.

Overall, the design is systematic and general, making it extensible and flexible for further optimization.


Wow thank you for the amazing write up! This should be included in the docs somewhere definitely, it’s easily the best description of GNN implementation I’ve seen!

So, as I understand it, the scattering in the message step is not done with any reduction, right? PyG might use scatter_mul to scatter messages directly into destination nodes with a multiplication reduction, whereas DGL, as I understand you, computes messages when necessary and then reduces them using the bucketing policy.

How does the backward pass work in that case, say with a simple copy_u message function and a mul reduce function (I see this combination isn't included in the optimized routines, though)? The gradient for scatter_mul is something like gather(gradient * out, index) / source.
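For what it's worth, here is a small numeric check of that gather/divide formula on a toy graph (my own sketch, not DGL's autograd): with a copy_u message and a product reducer, d out[dst[e]] / d msg[e] = out[dst[e]] / msg[e] for each edge e (assuming nonzero features and each source appearing once per destination), so grad_msg = gather(grad_out * out, dst) / msg, scattered back to the source nodes.

```python
import numpy as np

src = np.array([0, 1, 2])
dst = np.array([2, 2, 0])
x = np.array([2.0, 3.0, 5.0])

def forward(x):
    # Product reducer with copy_u messages: out[j] = prod of x[src] over
    # edges into j; nodes with no incoming messages get 0 here.
    out = np.ones(3)
    mask = np.zeros(3, dtype=bool)
    for e in range(len(src)):
        out[dst[e]] *= x[src[e]]
        mask[dst[e]] = True
    return np.where(mask, out, 0.0)

out = forward(x)
grad_out = np.ones(3)             # upstream gradient dL/d out

# Analytic backward via the gather/divide formula.
msg = x[src]
grad_msg = grad_out[dst] * out[dst] / msg
grad_x = np.zeros(3)
np.add.at(grad_x, src, grad_msg)  # scatter-sum edge grads back to sources
```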

That’s my last question, I super appreciate your post, it’s given me a lot to digest!