I am confused about the calculation order of GAT


In the implementation of GAT, you apply the normalizer after computing the aggregated node features, as follows:

However, in the original paper, the authors apply the normalizer before computing the aggregated node features, as follows:

I wonder why you changed the order.

By the way, does the GAT model converge on the Cora dataset?
I ran the model with `python train.py --dataset=cora`, and the result is the following:


Applying the normalizer before or after the aggregation gives the same result: the softmax normalizer for a destination node is a scalar that does not depend on the neighbor index, so it can be factored out of the weighted sum over neighbors.
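A quick numerical check of this equivalence (a toy NumPy sketch with random scores and features for a single destination node, not the repo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one destination node with 4 neighbors, 3-dim features.
scores = rng.normal(size=4)       # unnormalized attention logits e_ij
feats = rng.normal(size=(4, 3))   # neighbor features h_j

exp_s = np.exp(scores)

# Order 1 (paper): softmax-normalize first, then aggregate.
alpha = exp_s / exp_s.sum()
out_before = (alpha[:, None] * feats).sum(axis=0)

# Order 2 (implementation): aggregate unnormalized, then divide by the sum.
out_after = (exp_s[:, None] * feats).sum(axis=0) / exp_s.sum()

# The per-node normalizer 1/sum_k exp(e_ik) factors out of the sum.
assert np.allclose(out_before, out_after)
```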


For the training loss, we also observe a similar phenomenon. I suspect it is caused by dropout: when I disable all the dropout, the loss decreases smoothly, but the model overfits heavily. We measured the test accuracy over 100 runs and got an average accuracy of 83.69% with a std of 0.529%, which matches the authors' reported result. So we believe the model does converge.
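The noisy per-step loss is the expected behavior of dropout, since each forward pass samples a fresh random mask; disabling it makes the forward pass deterministic. A minimal NumPy sketch of inverted dropout illustrating this (an illustration of the mechanism, not the repo's implementation; p=0.6 mirrors the heavy dropout the GAT paper uses on Cora):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.6  # drop probability; the GAT paper uses 0.6 on Cora

def dropout(x, p, training, rng):
    """Inverted dropout: random mask during training, identity at eval time."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(8)

# Two training-mode calls sample different masks, so the outputs
# (and hence the per-step loss) fluctuate from step to step.
print(dropout(x, p, training=True, rng=rng))
print(dropout(x, p, training=True, rng=rng))

# With dropout disabled, the forward pass is deterministic.
assert np.array_equal(dropout(x, p, training=False, rng=rng), x)
```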