Edge_softmax (difference between 0.3 and 0.2)

When I updated DGL to the latest version (0.3), I found that edge_softmax had been rewritten. However, with the same hyperparameters I used under version 0.2, the GAT predictions have become unstable.

What is the difference between Edge_softmax in 0.3 and in 0.2?

We rewrote edge softmax to use our own kernel, and the interface changed as well. Previously it returned two tensors (a score and a normalizer); now it normalizes the input score tensor directly and returns the result.
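To make the interface change concrete, here is a minimal pure-Python sketch of the two styles. It is not DGL's implementation: the function names, the list-based edge representation, and the dict-based normalizer are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def edge_softmax_two_outputs(dst, scores):
    """Hypothetical 0.2-style result: (exp scores per edge, normalizer per destination node)."""
    # Subtract the per-destination max before exponentiating, for numerical stability.
    max_per_node = defaultdict(lambda: -math.inf)
    for d, s in zip(dst, scores):
        max_per_node[d] = max(max_per_node[d], s)
    exp_scores = [math.exp(s - max_per_node[d]) for d, s in zip(dst, scores)]
    # Sum the exponentiated scores over the incoming edges of each destination node.
    normalizer = defaultdict(float)
    for d, e in zip(dst, exp_scores):
        normalizer[d] += e
    return exp_scores, normalizer

def edge_softmax_normalized(dst, scores):
    """Hypothetical 0.3-style result: the already normalized score of every edge."""
    exp_scores, normalizer = edge_softmax_two_outputs(dst, scores)
    return [e / normalizer[d] for d, e in zip(dst, exp_scores)]

# Five edges: two point to node 0, three point to node 1.
dst = [0, 0, 1, 1, 1]
scores = [1.0, 2.0, 0.5, 0.5, 0.5]
attention = edge_softmax_normalized(dst, scores)
```

With the two-output style the caller has to divide by the normalizer itself; with the normalized style the attention weights of the edges entering each node already sum to 1.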

Could you say more about the instability? We’d like to know whether there is a potential bug there.

I compared the training loss and the validation accuracy for node classification on the standard Citeseer dataset. After about 300 to 350 epochs, the difference between version 0.2 and version 0.3 becomes large, and with the same hyperparameters I cannot reach the accuracy I got with version 0.2 on either the validation or the test data.


If I comment out the backward part of EdgeSoftmax, that is, if I only use the forward part, I get results similar to 0.2. So I suspect there might be a bug in the backward part.
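One way to test a suspicion like this is a finite-difference gradient check. The sketch below is plain Python and does not call DGL; it checks the standard softmax backward formula for the incoming edges of a single node, and all function names are hypothetical. The same idea could be applied to the real kernel by comparing its gradients against numerical ones.

```python
import math

def softmax(scores):
    """Softmax over the incoming edge scores of one destination node."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_backward(out, grad_out):
    """Analytic gradient: grad_in[i] = out[i] * (grad_out[i] - sum_j out[j] * grad_out[j])."""
    dot = sum(o * g for o, g in zip(out, grad_out))
    return [o * (g - dot) for o, g in zip(out, grad_out)]

def numerical_grad(scores, grad_out, eps=1e-6):
    """Central differences of the scalar loss sum_j grad_out[j] * softmax(scores)[j]."""
    grads = []
    for i in range(len(scores)):
        plus = list(scores)
        minus = list(scores)
        plus[i] += eps
        minus[i] -= eps
        loss_plus = sum(g * o for g, o in zip(grad_out, softmax(plus)))
        loss_minus = sum(g * o for g, o in zip(grad_out, softmax(minus)))
        grads.append((loss_plus - loss_minus) / (2 * eps))
    return grads

scores = [0.3, -1.2, 2.0, 0.7]      # arbitrary edge scores
grad_out = [1.0, -0.5, 0.25, 2.0]   # arbitrary upstream gradient
analytic = softmax_backward(softmax(scores), grad_out)
numeric = numerical_grad(scores, grad_out)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the backward pass is correct, `max_error` should be tiny compared with the gradients themselves.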


Could you tell me how large the gap is (in terms of both loss and accuracy)?
In DGL 0.3 we use atomic_max/atomic_add to speed up the edge softmax module. This introduces some non-determinism, but it is not a bug.
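The reason atomic adds can introduce non-determinism is that floating-point addition is not associative, so the result depends on the order in which the additions land. A small plain-Python illustration follows; the `accumulate` helper is hypothetical and only mimics an arbitrary arrival order, it does not use atomics or DGL.

```python
def accumulate(values, order):
    """Sum values in the given order, mimicking atomic adds that arrive in an arbitrary order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

values = [0.1, 0.2, 0.3]
sum_a = accumulate(values, [0, 1, 2])  # (0.1 + 0.2) + 0.3
sum_b = accumulate(values, [1, 2, 0])  # (0.2 + 0.3) + 0.1

print(sum_a == sum_b)       # False
print(abs(sum_a - sum_b))   # a difference in the last bits only
```

The per-operation difference is on the order of rounding error, which is why run-to-run results can differ slightly even with a fixed seed.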

The accuracy gap is about 0.02. The training loss of v0.2 converges steadily to a stable value, whereas for v0.3 the training loss first decreases, then fluctuates and shows an increasing trend. (The learning rate is 0.0003 and the number of epochs is 800.)


I’ve checked the implementation of the backward module in edge_softmax and I think there is no problem with it.
Since 0.02 is not a significant performance gap, could you please train the model several times with different random seeds and see whether the problem persists?
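A multi-seed comparison like this could be organized roughly as follows. This is only a sketch: `run_seeds` and `summarize` are hypothetical helpers, and `placeholder_train` is NOT a real model. It returns a seeded pseudo-random number so that the snippet runs, and should be replaced by a function that trains GAT with the given seed and returns its test accuracy.

```python
import random
import statistics

def run_seeds(train_fn, seeds):
    """Call train_fn once per seed and collect the returned accuracies."""
    return [train_fn(seed) for seed in seeds]

def summarize(accuracies):
    """Return (mean, sample standard deviation) of a list of accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def placeholder_train(seed):
    # Placeholder only: stands in for "train GAT with this seed, return test accuracy".
    # The numbers it produces are meaningless and are not measurements.
    rng = random.Random(seed)
    return 0.70 + rng.uniform(-0.01, 0.01)

accuracies = run_seeds(placeholder_train, range(10))
mean_acc, std_acc = summarize(accuracies)
print(f"mean={mean_acc:.4f} std={std_acc:.4f}")
```

Running this once per DGL version and comparing the difference of the means with the standard deviations gives a rough sense of whether a 0.02 gap exceeds the seed-to-seed noise.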

OK, thanks. I will check it and post an update.