Edge_softmax (difference between 0.3 and 0.2)

When I updated DGL to the latest version (0.3), I found that edge_softmax had been rewritten. However, this makes the GAT predictions unstable with the same hyperparameters that worked under version 0.2.

What is the difference between the Edge_softmax in 0.3 and 0.2?

We rewrote the edge softmax to use our own kernel. The interface also changed: previously it returned two tensors (one score and one normalizer), while now it directly normalizes the input score tensor and returns the result.
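Roughly, the difference looks like the sketch below. This is only a minimal illustration on a toy graph; the import path and graph-construction calls are written against the 0.3-era API from memory and may differ slightly between releases.

```python
import torch
import dgl
# 0.3-era functional API; exact import path may differ between releases.
from dgl.nn.pytorch.softmax import edge_softmax

# A tiny toy graph built with the 0.3-era DGLGraph calls.
g = dgl.DGLGraph()
g.add_nodes(3)
g.add_edges([0, 1, 0, 2], [1, 2, 2, 1])

# One attention logit per edge, as in GAT.
logits = torch.randn(g.number_of_edges(), 1)

# 0.3: normalizes the logits per destination node and returns the
# normalized attention scores directly.
attn = edge_softmax(g, logits)

# 0.2 (roughly): the old edge softmax returned two tensors, the
# unnormalized scores and a per-node normalizer, which the user then
# divided manually, e.g.
#   scores, normalizer = edge_softmax_op(logits, g)
#   attn = scores / normalizer[dst_of_each_edge]
```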

Could you say more about the instability? We’d like to know whether there is a potential bug there.

I compared the training loss and validation accuracy for node classification (the standard Citeseer dataset). After about 300–350 epochs, the difference between version 0.2 and version 0.3 becomes large, and I cannot match the 0.2 accuracy under the same hyperparameters on either the validation or the test data.

Thanks.

If I comment out the backward part of EdgeSoftmax, that is, if I only use the forward part, I get results similar to 0.2. Therefore, I suspect there might be a bug in the backward part.
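For reference, the gradient the backward pass should reproduce is the standard per-destination-node softmax gradient. Below is a small pure-PyTorch sketch on a made-up toy graph (not DGL code) that checks the analytic formula against autograd:

```python
import torch

# Made-up toy graph: dst[i] is the destination node of edge i.
dst = torch.tensor([1, 1, 2, 2, 2])
num_nodes = 3
scores = torch.randn(5, requires_grad=True)

# Reference edge softmax written with plain PyTorch ops so that
# autograd provides a reference gradient (the per-node max subtraction
# for numerical stability is omitted for brevity).
exp = torch.exp(scores)
denom = torch.zeros(num_nodes).index_add(0, dst, exp)
attn = exp / denom[dst]

# Backpropagate an arbitrary upstream gradient through the reference.
grad_attn = torch.randn(5)
(attn * grad_attn).sum().backward()

# Analytic softmax gradient a backward kernel must reproduce:
#   grad_scores = attn * (grad_attn - sum_per_dst_node(attn * grad_attn))
accum = torch.zeros(num_nodes).index_add(0, dst, (attn * grad_attn).detach())
manual = attn.detach() * (grad_attn - accum[dst])

print(torch.allclose(scores.grad, manual, atol=1e-6))  # expected: True
```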

Thanks

Could you tell me how large the gap is (in terms of both loss and accuracy)?
In DGL 0.3 we use atomic_max/atomic_add to speed up the edge softmax module; this introduces some non-determinism, but it is not a bug.
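As a generic illustration of the effect (plain PyTorch, not the DGL kernel): floating-point addition is not associative, so a reduction whose accumulation order changes from run to run, as with GPU atomic_add, can give slightly different results for the same inputs.

```python
import torch

# Same values, two different accumulation orders: the sums differ by a
# small amount because float32 addition is not associative.
x = torch.randn(100000, dtype=torch.float32)
s1 = x.sum()
s2 = x[torch.randperm(x.numel())].sum()
print(s1.item(), s2.item(), abs(s1.item() - s2.item()))
```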

The accuracy gap is about 0.02. The training loss of v0.2 converges steadily to a stable value; however, for v0.3 the training loss first decreases, then fluctuates and shows a tendency to increase. (Learning rate = 0.0003 and the number of epochs is set to 800.)

Thanks.

I’ve checked the implementation of the backward module in edge_softmax and I think there is no problem with it.
As 0.02 is not a significant performance gap, could you please train the model multiple times with different initial random seeds and see if the problem still exists?
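For example, something along these lines, where `build_gat_model` and `train_and_eval` are hypothetical stand-ins for your own GAT setup and training loop:

```python
import numpy as np
import torch

def run_trial(seed):
    # Fix the RNGs before building and initializing the model so each
    # trial differs only in its seed.
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    model = build_gat_model()                  # hypothetical: your GAT setup
    val_acc, test_acc = train_and_eval(model)  # hypothetical: your training loop
    return val_acc, test_acc

results = [run_trial(seed) for seed in range(5)]
val_accs = [v for v, _ in results]
print("val acc mean %.4f, std %.4f" % (np.mean(val_accs), np.std(val_accs)))
```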

OK, thanks. I will check it and post an update.