I am confused about the calculation order of GAT


#1

In the implementation of GAT, you apply the normalizer after computing the aggregated node features, as follows:


However, in the original paper, the authors apply the normalizer before computing the aggregated node features, as follows:

I wonder why you changed the order?

By the way, does the GAT model converge on the Cora dataset?
I ran the model with python train.py --dataset=cora, and the result is the following:


#2

Applying the normalizer before or after the aggregation gives the same result:

h_i'=\sum_{j\in\mathcal{N}_i}a_{ij}h_j=\sum_{j\in\mathcal{N}_i}\frac{e_{ij}}{z_i}h_j=\frac{1}{z_i}\sum_{j\in\mathcal{N}_i}e_{ij}h_j,\quad\text{where } z_i=\sum_{k\in\mathcal{N}_i}e_{ik}
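A quick numerical check of this identity (a toy sketch with random scores and features, not the repository's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

num_neighbors, feat_dim = 4, 8
e = rng.random(num_neighbors)              # unnormalized attention scores e_ij
h = rng.random((num_neighbors, feat_dim))  # neighbor features h_j
z = e.sum()                                # normalizer z_i = sum_j e_ij

# Normalize before aggregation: sum_j (e_ij / z_i) * h_j
before = ((e / z)[:, None] * h).sum(axis=0)

# Normalize after aggregation: (1 / z_i) * sum_j (e_ij * h_j)
after = (e[:, None] * h).sum(axis=0) / z

assert np.allclose(before, after)
```

The two orderings differ only in where the scalar 1/z_i is applied, and scalar multiplication distributes over the sum, so they produce identical features up to floating-point rounding.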

For the training loss, we also observe a similar phenomenon. I guess it is because of the dropout: I tried disabling all the dropout, and the loss decreases smoothly, but the model overfits pretty heavily. We measured the test accuracy over 100 runs and got an average accuracy of 83.69% with a std of 0.529%, which matches the authors' reported result. So we believe the model does converge.
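For reference, the mean/std over repeated runs can be computed like this (the accuracies below are made-up placeholders, not the actual 100-run data):

```python
import statistics

# Hypothetical per-run test accuracies; substitute the real values.
accs = [0.837, 0.843, 0.831, 0.840, 0.835]

mean = statistics.mean(accs)
std = statistics.stdev(accs)  # sample standard deviation
print(f"mean={mean:.2%}, std={std:.3%}")
```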