Apply_edges with edge features in memory efficient way

Hi all,

Question: My model has edge features, and I want to adapt the DGL Heterogeneous Graph Transformer (HGT) to use them. What is the best way to implement it?

Context: Using source(s) / target (t) nodes, the HGT calculates K(s), Q(t), V(s) and then does apply_edges using K,Q to calculate the attention ‘t’ : sub_graph.apply_edges(fn.v_dot_u('q', 'k', 't')). It later calls the code below to forward the V * attention (message) to the target node, sums all of the source messages per edge type, and then take the mean across all summed edge type messages:

G.multi_update_all({etype : (fn.u_mul_e('v', 't', 'm'), fn.sum('m', 't')) \
                            for etype in edge_dict}, cross_reducer = 'mean')


I believe that I want to do is for the edge features to impact the V calculation (and possibly K), originally done based only on node features v = v_linear(h[srctype]).view(-1, self.n_heads, self.d_k). My guess is that V and K would have to be calculated during .apply_edges, and features would be concatenate(node_features, edge_features). My problem with that is that I imagine this would use way more resources, at least memory wise, because of the concatenation. I am new to DGL, so any ideas on how to implement it efficiently will be very well appreciated (my graph has millions of edges). Thanks!

Hi @felipemello , what’s the next operation after concatenation?
GAT also has a concatenation phase because there is a linear project after concatenation, and in this case we can use “first projection then addition” to replace “first concatenation then projection” to reduce memory footprint.

Reference: dgl/ at 0576a33b7f5c8a99f12f70a3aa523dc2200e4a70 · dmlc/dgl · GitHub

In general, if you compute K and V considering edge features, then the concatenation seems indeed inevitable. Based on what your edge features are we may come up with simplifications. What kind of edge features do you have?

Thanks for your replies! @zihao , cool. I had that idea too, but I wasnt sure if this would be any better. I am gonna try it and possibly compare.

@BarclayII, my edge features are simple vectors 1xN. Right now, my main feature is actually the edge type. I am using a Biological knowledge graph, and instead of having 10 types of edges between a node pair (e. g. DRUG - GENE), I have a single edge DRUG_GENE, and a vector that contains their relation in the format [1, 0, 0, 1, 1, 0, 0] for every relation that exists.

I see. What made you choose to have a binary vector encoding the relations between two nodes, instead of having multiple edge types between them? IMO in the latter case you can directly apply HGT.

What made you choose to have a binary vector encoding the relations between two nodes, instead of having multiple edge types between them?

So far, it reduced the number of edge types from 74 to 28. Considering that with GNN we usually loop over the edge types to do computations, I imagine that it boosted speed and memory by a significant amount, and probably reduced overfitting and improved generalization, since I wont have 8 different layers for DRUG-GENE associations, for example.

But even if I didnt do it, I still plan on using other edge features in the future.

Not necessarily, actually, we have a plan to replace loop w/ customized operators [WIP][performance] Implement segment gemm with multiple cuda streams. by yzh119 · Pull Request #2715 · dmlc/dgl · GitHub.