Hi

Is it possible to perform feature fusion using a self-attention mechanism? I have data distributed over a graph, and multiple feature modalities sampled per node that I would like to optimally combine. Let’s say I have a graph G = {V, E} with N nodes. For each node, we sample M different modalities, such that each node v is initially characterized by M feature vectors \in \mathbb{R}^{m_{1}}, \mathbb{R}^{m_{2}}, \ldots, \mathbb{R}^{m_{M}}. For simplicity, assume m_{1} = m_{2} = \ldots = m_{M} = D. I’m interested in “fusing” these features in a more intelligent way than simply concatenating them and passing them through a linear layer or MLP.

Assuming each node v is characterized by a feature matrix x_{v} \in \mathbb{R}^{M \times D} (where M is the number of modalities and D is the input dimension), a transformer approach over modalities would yield something like

Q = x_{v}W_{q}, \quad K = x_{v}W_{k}, \quad V = x_{v}W_{v}

where W_{q}, W_{k}, W_{v} \in \mathbb{R}^{D \times p}. We then have that

Z = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{p}}\right)V

where Z \in \mathbb{R}^{M \times p} (so \mathbb{R}^{M \times D} when p = D) and Z_{i} is the fusion of the modalities with respect to modality i (I think? Correct me if I’m wrong). I could then sum over the rows of Z to compute the final combination. Does this sound correct? I haven’t found many papers on using self-attention for feature fusion, so any help is appreciated.
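To make the idea concrete, here is a minimal NumPy sketch of what I have in mind for a single node: treat the M modality vectors as a sequence of length M, run standard scaled dot-product self-attention over them, and sum the rows of Z. All names (x, W_q, etc.) and the dimensions M, D, p are just illustrative placeholders, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

M, D, p = 3, 16, 8  # modalities, input dim, projection dim (illustrative values)

# Per-node feature matrix: one row per modality.
x = rng.standard_normal((M, D))

# Query/key/value projections, W \in R^{D x p} (random here; learned in practice).
W_q = rng.standard_normal((D, p))
W_k = rng.standard_normal((D, p))
W_v = rng.standard_normal((D, p))

Q, K, V = x @ W_q, x @ W_k, x @ W_v   # each M x p

# Scaled dot-product attention over the M modalities.
scores = Q @ K.T / np.sqrt(p)         # M x M logits
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)     # row-wise softmax: each row sums to 1

Z = A @ V             # M x p: row i fuses all modalities from modality i's "view"
fused = Z.sum(axis=0) # final per-node embedding in R^p
```

So `fused` would be the single combined representation per node; whether summing the rows of Z is the right pooling (versus mean-pooling or a learned query) is part of my question.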
