How to use dynamic edges per attention head?


#1

I am following the example of implementing a Transformer using DGL. In my implementation, I would like to use a separate list of edge ids per attention head, per layer. E.g., consider a 1-layer encoder with 2 heads: the first head would learn to attend to the immediate neighbours, the second head to the next-nearest neighbours, and so on. My current implementation is as follows:

if per_head:
    for i in range(len(per_head)):
        g.apply_edges(src_dot_dst('k', 'q', 'score', i), per_head[i])
else:
    g.apply_edges(src_dot_dst('k', 'q', 'score'), eids)

and then perform the multiplication only for that head inside src_dot_dst; the scores of the other heads would then be None or 0, I presume. I also tried another approach in which I use torch indexing to set the scores of “non-active” heads (ones that do not belong to per_head[head_index]) to 0.

Will this affect PyTorch autograd, since it passes only a subset of the edge ids and the other edges are effectively not touched in the given head? Is there a better way of doing this?
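Roughly, the masking version I tried looks like this (a sketch with plain tensors; the shapes and the per_head contents here are made up for illustration):

```python
import torch

# per_head[h] lists the edge ids that are active in head h.
per_head = [[1], [4, 5, 6]]
num_edges, n_heads = 8, len(per_head)

scores = torch.randn(num_edges, n_heads, 1)  # one score per edge per head
mask = torch.zeros(num_edges, n_heads, 1)
for h, eids in enumerate(per_head):
    # mark the active edges of head h with 1
    mask[torch.as_tensor(eids, dtype=torch.long), h, 0] = 1.0

masked_scores = scores * mask                # non-active heads become 0
```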


#2

Hi,

Setting the other heads to zero with a mask won’t affect autograd, but setting them to zero with an in-place operator would break autograd.

The way DGL does it won’t affect autograd either: at initialization, all edge data are set to zeros, and after computation the updated entries are assigned their new values without breaking autograd.

I don’t quite understand the if statement in your code, but it looks fine to me.
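For instance, here is a small torch-only illustration of the difference (the tensors just stand in for your per-head scores):

```python
import torch

# Out-of-place masking: autograd-safe; gradients of masked rows are simply zero.
x = torch.randn(4, 2, requires_grad=True)
mask = torch.tensor([[1.0], [0.0], [1.0], [0.0]])   # keep rows 0 and 2
(x.exp() * mask).sum().backward()
print(x.grad[1])          # all zeros, but backward ran fine

# In-place zeroing of a tensor that autograd saved for backward breaks it.
x2 = torch.randn(4, 2, requires_grad=True)
y = x2.exp()              # exp() saves its output for the backward pass
y[1] = 0.0                # in-place write invalidates the saved output
try:
    y.sum().backward()
except RuntimeError as err:
    print("autograd error:", err)
```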


#3

Sorry for the confusion regarding the code. I am posting the modified src_dot_dst here:

def src_dot_dst(src_field, dst_field, out_field, head_index=None):
    """
    This function serves as a surrogate for the `src_dot_dst` built-in apply_edges function.
    """
    def func(edges):
        if head_index is None:
            # Default behaviour: dot product over all heads at once.
            return {out_field: (edges.src[src_field] * edges.dst[dst_field]).sum(-1, keepdim=True)}
        else:
            # Per-head: only the edges passed to apply_edges for this head get a
            # score, computed from that head's slice of the features. Other heads
            # of the same edges may or may not be updated.
            return {
                out_field: (edges.src[src_field][:, head_index, :] * edges.dst[dst_field][:, head_index, :]).sum(-1, keepdim=True)
            }
    return func

In the first post, the variable per_head is an array of edge id lists of length n_heads, e.g. [[1], [4, 5, 6], [2, 4], ...]. So per_head[i] returns the list of all edge ids that should be considered when calculating the attention score for that head; for all other edges, the score in that head should be zero. However, this fails with the following error:

dgl._ffi.base.DGLError: Cannot update column of scheme Scheme(shape=(1,), dtype=torch.float32) using feature of scheme Scheme(shape=(2, 1), dtype=torch.float32).

at g.apply_edges(src_dot_dst('k', 'q','score', i), per_head[i])

I suspect that this is because the score feature is updated for different edge ids per head, and len(per_head[i]) need not match across heads. If I rename score to score0, score1, etc. for each head, it runs until the output projection layer, at which point the same error crops up.

How do I use a mask to set the remaining edges to zero in each head? I’ve tried a number of different combinations, but they all failed.


#4

It seems you need a separate field for each head, right?

The error is saying that the shape of your returned tensor is wrong. It should be (num_edges_applied, edge_feature_shape), and edge_feature_shape is fixed when the field is first created; it is immutable. If you want to store a different edge_feature_shape, you need to use another field.

You can print your output’s shape to check whether it is what you expect.
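For example, you could keep the score field’s scheme fixed by always returning the full (num_edges, n_heads, 1) tensor and zeroing the heads you don’t want with an out-of-place mask. A sketch (untested against DGL itself; src_dot_dst_masked is a hypothetical name, and the EdgeBatch is faked below):

```python
import torch

def src_dot_dst_masked(src_field, dst_field, out_field, head_index=None):
    """Like src_dot_dst, but always returns a (num_edges, n_heads, 1) score,
    so the field's scheme never changes across apply_edges calls. Heads other
    than head_index are zeroed with an out-of-place mask (autograd-safe)."""
    def func(edges):
        k, q = edges.src[src_field], edges.dst[dst_field]   # (E, n_heads, d)
        score = (k * q).sum(-1, keepdim=True)               # (E, n_heads, 1)
        if head_index is not None:
            mask = torch.zeros_like(score)
            mask[:, head_index, :] = 1.0
            score = score * mask                            # zero inactive heads
        return {out_field: score}
    return func

# Quick check with a faked EdgeBatch (just attribute access, no DGL needed):
from types import SimpleNamespace
edges = SimpleNamespace(src={'k': torch.randn(5, 2, 4)},
                        dst={'q': torch.randn(5, 2, 4)})
out = src_dot_dst_masked('k', 'q', 'score', head_index=1)(edges)['score']
print(out.shape)        # torch.Size([5, 2, 1]); head 0 entries are all zero
```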

Hope this helps!


#5

Thank you. Will try this.