Fastest way to implement pairwise attention score

Aceticia · October 15, 2020, 11:57pm

I’m calculating the pairwise attention score between neighbors. I’m currently using for loops to calculate it, but I don’t think this is quite effective dgl code. I wonder how I can use native functions like update_all to calculate this? The calculation is simple, given two nodes of an edge you just compute score=h_src^TAh_dst. Looking at the docs, I think I can use apply_edges to first populate the edges with scores, but what’s next? Any help is appreciated, thanks!

mufeili · October 16, 2020, 3:16am

You can follow the implementation of GAT and use edge_softmax.

Aceticia · October 16, 2020, 4:39pm

But in GAT implementation the scores from different neighbors will eventually be reduced and thus can’t distinguish between neighbors. I need to gather something like an NxN matrix that specifies the score from any pair of connected nodes, node 1 to node 2, etc.

Will using apply_edge and then looping over all edges be significantly slower than something else?

mufeili · October 16, 2020, 5:06pm

What do you mean by “the scores from different neighbors will eventually be reduced and thus can’t distinguish between neighbors”? The attention scores are computed with a_{ij}=\frac{e^{z_{ij}}}{\sum_{k\in\mathcal{N}(j)}e^{z_{kj}}} and two edges (i_1, j) and (i_2, j) will indeed have different attention scores.

If you just want to retrieve and export the attention scores to a sparse matrix, you can follow the example here.

zihao · October 19, 2020, 8:14am

apply_edges compute output edge feature given input source/destination node feature and edge features. To implement Transformer-like dot product attention score, you can use

import dgl.function as fn

g.apply_edges(fn.u_dot_v(...))

and edge_softmax function computes the attention weight normalized by destination/source nodes.

Could you please elaborate more on “distinguish between neighbors” so that I could understand your scenario?