Self attention and feature fusion over graphs


Is it possible to perform feature fusion using a self-attention mechanism? I have data distributed over a graph, and multiple feature modalities sampled per node that I would like to optimally combine. Let’s say I have a graph G=(V,E) with N nodes. For each node, we sample M different modalities, such that each node v is initially characterized by M feature vectors in \mathbb{R}^{m_{1}}, \mathbb{R}^{m_{2}}, \dots, \mathbb{R}^{m_{M}}. For simplicity, assume m_{1} = m_{2} = \dots = m_{M} = D. I’m interested in “fusing” these features in a more intelligent way than simply concatenating them and passing them through a linear layer or MLP.
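For reference, the concatenation baseline mentioned above could look roughly like the following NumPy sketch. The sizes, the random weights, and the names `x_cat`, `W`, `b`, and `out_dim` are hypothetical placeholders chosen for illustration, not part of any particular library or model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N nodes, M modalities, D features per modality.
N, M, D, out_dim = 5, 3, 8, 4

# One (M, D) feature matrix per node, stacked into a single (N, M, D) array.
x = rng.normal(size=(N, M, D))

# Baseline fusion: flatten the M modality vectors of each node into one
# vector of length M * D, then apply a single linear layer.
x_cat = x.reshape(N, M * D)              # (N, M * D)
W = rng.normal(size=(M * D, out_dim))    # random stand-in for learned weights
b = np.zeros(out_dim)
h = x_cat @ W + b                        # (N, out_dim)

print(h.shape)
```

Note that the linear layer here assigns a fixed block of weights to each modality, so the contribution of a modality does not depend on the values of the other modalities at that node.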

Assuming each node v_{i} is characterized by a feature matrix x_{i} \in \mathbb{R}^{M \times D} (where M is the number of modalities, and D is the input dimension), a transformer approach over modalities would yield something like this

\begin{align} Q &= x_{i}W_{q} \\ K &= x_{i}W_{k} \\ V &= x_{i}W_{v} \end{align}

where W_{q,k,v} \in \mathbb{R}^{D \times p}. We then have that

\begin{align} Z = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{p}}, \text{axis}=1\right)V \end{align}

where Z \in \mathbb{R}^{M \times p} (so M \times D when p = D) and the j-th row Z_{j} is the fusion of all modalities with respect to modality j (I think? Correct me if I’m wrong). I could then sum over the M rows of Z to compute the final combination, a single vector per node. Does this sound correct? I haven’t found many papers on using self-attention for feature fusion, so any help is appreciated.
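The formulation above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `fuse_modalities`, the random weights, and the sizes are hypothetical, all N nodes are batched into one array of shape (N, M, D), and the scores are scaled by \sqrt{p}, the key dimension, which equals \sqrt{D} when p = D.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the max along the axis for numerical stability.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(x, W_q, W_k, W_v):
    """Single-head self-attention over the modality axis (hypothetical helper).

    x:   (N, M, D) array, one (M, D) feature matrix per node.
    W_*: (D, p) projection matrices shared across nodes.
    Returns the fused features (N, p) and the attention weights (N, M, M).
    """
    Q = x @ W_q                                      # (N, M, p)
    K = x @ W_k                                      # (N, M, p)
    V = x @ W_v                                      # (N, M, p)
    p = W_q.shape[1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(p)   # (N, M, M)
    A = softmax(scores, axis=-1)                     # each row sums to 1 over modalities
    Z = A @ V                                        # (N, M, p); row j attends from modality j
    return Z.sum(axis=1), A                          # sum over the M rows of Z

rng = np.random.default_rng(0)
N, M, D, p = 5, 3, 8, 4                              # assumed toy sizes
x = rng.normal(size=(N, M, D))
W_q, W_k, W_v = (rng.normal(size=(D, p)) for _ in range(3))

fused, attn = fuse_modalities(x, W_q, W_k, W_v)
print(fused.shape, attn.shape)
```

Because the attention here uses no positional or modality-specific encoding, reordering the modalities only reorders the rows of Z, so the summed output is unchanged by the order in which modalities are stacked.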


This sounds correct. However, since this is more of a general deep learning question, you might want to double-check by asking elsewhere, for example on PyTorch’s discussion forum.
