The accuracy suddenly dropped when training GAT with example code

When I run the example code train_ppi.py, my training accuracy drops dramatically during the last epochs.

As shown below:

I tried many times and the result was the same: accuracy is almost halved during the last few epochs.

My environment:
os: ubuntu20
python: 3.8.15
backend: pytorch '1.11.0+cu113'
dgl: '0.9.1post1'
gpu: NVIDIA GeForce RTX 3090
cuda: 11.7

Thanks for your help!

I ran the script a few times, and the drop did occasionally happen.

I checked the gradient norm around the point where performance drops. The reason could be that large gradients occasionally push the model to another local optimum on the loss landscape.
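If you want to check the gradient norm yourself, something like the following could be used. It is only a minimal sketch: `global_grad_norm` is a hypothetical helper, and the tiny linear model and random batch are stand-ins, not the GAT or the PPI data from train_ppi.py.

```python
import torch
from torch import nn


def global_grad_norm(parameters):
    """L2 norm of all gradients, treated as one flattened vector."""
    grads = [p.grad.detach().flatten() for p in parameters if p.grad is not None]
    if not grads:
        return 0.0
    return torch.cat(grads).norm(2).item()


# Stand-in model and batch (assumptions for illustration only).
torch.manual_seed(0)
model = nn.Linear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Call this after backward() in each iteration and log it next to the loss.
norm = global_grad_norm(model.parameters())
print(f"loss {loss.item():.4f} | grad norm {norm:.4f}")
```

A sudden spike in the logged norm right before the accuracy drop would be consistent with the explanation above.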

You may try the following solutions to address this issue:

  1. Early-stop, or run the test on the checkpoint with the best validation performance.
  2. Clip the gradients by adding torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) after backward() and before the optimizer step.
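Both suggestions can be combined in one training loop, roughly as sketched below. The model, data, and hyperparameters here are placeholders I made up to keep the snippet self-contained, and it tracks validation loss rather than the metric train_ppi.py reports, so adapt the comparison to your own validation score.

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the model and data in train_ppi.py.
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
x_train, y_train = torch.randn(64, 4), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 4), torch.randn(32, 1)

max_norm = 0.5  # same threshold as suggested above
best_val, best_state = float("inf"), None

for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    # Clip after backward() and before step(); the return value is the
    # total gradient norm measured before clipping.
    pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # Keep a copy of the weights with the best validation result so far.
    if val_loss < best_val:
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())

# Evaluate on the test set with the best checkpoint, not the last epoch.
model.load_state_dict(best_state)
```

With the best checkpoint restored, a late collapse in the final epochs no longer affects the reported test result.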