I am following the DGL example that implements the Transformer. In my implementation, I would like to use a separate list of edge ids per attention head, per layer. For example, consider a 1-layer encoder with 2 heads: the first head learns to attend to immediate neighbours, the second head learns to attend to next-to-immediate neighbours, and so on. My current implementation is as follows:
    if per_head and len(per_head):
        for i in range(0, len(per_head)):
            g.apply_edges(src_dot_dst('k', 'q', 'score', i), per_head[i])
    else:
        g.apply_edges(src_dot_dst('k', 'q', 'score'), eids)
and then perform the multiplication only for that head inside src_dot_dst. I presume the other scores would be None or 0. I also tried another approach, in which I use torch indexing to set the scores of “non-active” heads (the ones not in per_head[head_index]) to 0.
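To make the second approach concrete, here is a minimal sketch in plain PyTorch, without DGL. It assumes the scores end up in a tensor of shape (num_edges, num_heads); the tensors `k` and `q`, the sizes, and the contents of `per_head` are made up for illustration and only stand in for what src_dot_dst would see on each edge.

```python
import torch

num_edges, num_heads, dim = 6, 2, 4

# Hypothetical per-head edge ids: head 0 uses edges 0-2, head 1 uses edges 3-5.
per_head = [torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5])]

# Stand-ins for the per-edge key and query features (assumed shapes).
k = torch.randn(num_edges, num_heads, dim, requires_grad=True)
q = torch.randn(num_edges, num_heads, dim, requires_grad=True)

# mask[e, h] is True when edge e is active for head h.
mask = torch.zeros(num_edges, num_heads, dtype=torch.bool)
for head, eids in enumerate(per_head):
    mask[eids, head] = True

# Dot-product score for every (edge, head) pair, then zero the inactive pairs.
score = (k * q).sum(dim=-1)                  # shape: (num_edges, num_heads)
masked_score = score.masked_fill(~mask, 0.0)

masked_score.sum().backward()

# Gradients split by whether the (edge, head) pair was active.
inactive_grad = k.grad[~mask]                # shape: (num_inactive_pairs, dim)
active_grad = k.grad[mask]                   # shape: (num_active_pairs, dim)
```

In this sketch the inactive (edge, head) pairs receive exactly zero gradient, while the active pairs receive the usual one. One thing I am unsure about in my own setup: if an exp or softmax is applied to the scores afterwards, a score of 0 still contributes exp(0) = 1, so filling with 0 may not behave like a real mask there.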
Will this affect PyTorch autograd, since only a subset of the edge ids is passed and the other nodes are effectively not touched in the given head? Is there a better way of doing this?
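For context, this is roughly how the per_head lists could be built. It is only a sketch under assumptions: "neighbour" is taken to mean positional distance in the sequence, the edge id is taken to be the position of the edge in the edge list, and `build_per_head_eids` is a hypothetical helper, not part of DGL.

```python
def build_per_head_eids(edges, num_heads):
    """Group edge ids by positional distance between the two endpoints.

    edges: list of (src, dst) node index pairs; the list position is the edge id.
    Head h receives the edges whose endpoints are exactly h + 1 positions apart.
    """
    per_head = [[] for _ in range(num_heads)]
    for eid, (src, dst) in enumerate(edges):
        distance = abs(src - dst)
        if 1 <= distance <= num_heads:
            per_head[distance - 1].append(eid)
    return per_head


# Fully connected graph over 4 positions, without self-loops.
edges = [(i, j) for i in range(4) for j in range(4) if i != j]
per_head = build_per_head_eids(edges, num_heads=2)
```

Here per_head[0] holds the edges between adjacent positions and per_head[1] the edges between positions two apart; edges further apart than the number of heads are not assigned to any head.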