The accuracy suddenly dropped when training GAT with the example code

When running the example code train_ppi.py, I found my training accuracy dropping dramatically during the last epochs.

As shown below:

I tried many times, but the result was the same: the accuracy is almost halved during the last few epochs.

My environment:
OS: Ubuntu 20
Python: 3.8.15
Backend: PyTorch 1.11.0+cu113
DGL: 0.9.1post1
GPU: NVIDIA GeForce RTX 3090
CUDA: 11.7

Thanks for your help!

I ran the script a few times, and the phenomenon did occasionally occur.

After checking the gradient norm around the point where the performance drops, the likely cause is that the model occasionally switches to another local optimum on the loss landscape because of large gradients.
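For reference, a small helper like the following (illustrative, not part of the example script) can log the total gradient norm after `loss.backward()`, which makes a spike around the drop easy to spot:

```python
import torch

# Illustrative helper (not from train_ppi.py): returns the total L2 norm of
# all parameter gradients; call it right after loss.backward().
def total_grad_norm(parameters):
    norms = [p.grad.detach().norm(2) for p in parameters if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item() if norms else 0.0
```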

You may try the following solutions to address this issue:

  1. Early-stop, or run the test on the checkpoint with the best validation performance.
  2. Clip the gradient by adding torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) after backward (see the sketch after this list).
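A minimal, self-contained sketch of where the clipping call fits in a training step. The toy linear model and random tensors only stand in for the GAT model and a PPI batch from train_ppi.py; the fix itself is the single clip_grad_norm_ line:

```python
import torch

# Toy stand-ins for the GAT model and a PPI batch from train_ppi.py.
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
loss_fcn = torch.nn.BCEWithLogitsLoss()

feats = torch.randn(32, 16)                     # stand-in for node features
labels = torch.randint(0, 2, (32, 4)).float()   # stand-in for multi-label targets

optimizer.zero_grad()
loss = loss_fcn(model(feats), labels)
loss.backward()
# Clip the total gradient norm to 0.5 before the optimizer step so one large
# gradient cannot push the model into a different (worse) optimum.
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
optimizer.step()
```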

Thank you! This problem was indeed solved by adding gradient clipping. But one small problem still bothers me: when I ran the example code of the previous version of DGL (0.4), the accuracy-dropping problem did not happen. I wonder whether this is due to a change in the implementation of GAT or a change in the PPI dataset (LegacyPPIDataset to PPIDataset).
Again, thank you very much for your help!

In the example code of DGL 0.4, the early-stop technique is included. It is the other solution mentioned above.
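For completeness, here is a minimal sketch of that best-checkpoint style of early stopping. The toy model and the hard-coded validation scores are only placeholders for the real GAT model and the per-epoch validation scores in the example:

```python
import copy
import torch

# Toy stand-ins: a linear model and a fake list of per-epoch validation scores.
model = torch.nn.Linear(16, 4)
val_scores = [0.60, 0.72, 0.79, 0.81, 0.80, 0.78, 0.44]

best_score, best_state, bad_epochs, patience = 0.0, None, 0, 3
for epoch, score in enumerate(val_scores):
    # ... one training epoch would run here ...
    if score > best_score:
        best_score = score
        best_state = copy.deepcopy(model.state_dict())  # keep the best checkpoint
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:   # stop once validation stops improving
            break

model.load_state_dict(best_state)    # test on the best checkpoint, not the last one
print(f"best validation score: {best_score:.2f}")
```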

