Sampling in RGCN

Hi, I have a question regarding the sampling in the RGCN sample code.
It looks like at each epoch we select a positive sample set of size “batch_size” and a negative sample set of size “batch_size * negative_rate” (a rough sketch of how I read this step follows the questions). This raises two questions:

  1. In each epoch we do this process once, which means each epoch trains on only one batch. Since batch_size < train_data.size, we throw away a lot of positive samples in each epoch. (Of course, because of random sampling, later epochs will use some of the other positive samples, but a) we cannot guarantee that we ever see all of them, and b) each epoch still contains only one batch of positive samples.)

  2. Is there a reason to have an imbalanced training set? With the default parameters, each batch contains 10 times as many negative samples as positive samples.
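
For reference, this is roughly how I understand the sampling step (my own paraphrase, not the actual example code; the function and variable names are made up):

```python
import numpy as np

def sample_one_batch(train_data, num_nodes, batch_size=30000, negative_rate=10):
    # train_data: (num_edges, 3) integer array of (head, relation, tail) triples

    # 1) pick `batch_size` positive triples uniformly at random
    idx = np.random.choice(len(train_data), batch_size, replace=False)
    pos = train_data[idx]

    # 2) build `batch_size * negative_rate` negatives by corrupting
    #    the head or tail entity of each sampled positive triple
    neg = np.tile(pos, (negative_rate, 1))
    corrupt_entity = np.random.randint(num_nodes, size=len(neg))
    corrupt_head = np.random.rand(len(neg)) < 0.5
    neg[corrupt_head, 0] = corrupt_entity[corrupt_head]
    neg[~corrupt_head, 2] = corrupt_entity[~corrupt_head]

    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return np.concatenate([pos, neg]), labels
```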

Thanks in advance for the answers.

Hi @sharifza,

Both are great questions!

For question 1, sampling is a standard way to deal with large graphs; otherwise it’s not feasible to train on a GPU. It’s true that we cannot guarantee we see all of the positive samples, but we are also hoping that the model generalizes rather than just memorizes everything. Yes, each epoch contains one batch. You can actually think of it this way: there is only one long epoch, and we are doing something like mini-batch training on the graph.
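
If you really want each epoch to cover every positive triple exactly once, one option is to shuffle the training edges and walk over them in chunks. A rough sketch (hypothetical names, not the example’s actual code):

```python
import numpy as np

def epoch_batches(train_data, batch_size=30000):
    # shuffle once per epoch, then yield consecutive chunks of positive triples
    perm = np.random.permutation(len(train_data))
    for start in range(0, len(perm), batch_size):
        yield train_data[perm[start:start + batch_size]]

# for epoch in range(num_epochs):
#     for pos_batch in epoch_batches(train_data):
#         ...build negatives for pos_batch, compute loss, step the optimizer...
```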

For question 2, it’s actually very hard to decide the negative sampling rate. Graphs are usually very sparse, with density $\frac{|E|}{|V|^2}$ far below 1%. In that sense, we should have far more than 100 times as many negative edges as positive edges. However, that would make the positive signal very weak, and the model would tend to predict every possible edge as negative. I don’t know why the author chose 10 as the negative rate, but negative rates between 5 and 20 are commonly used; maybe it’s just that they work empirically.
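
To make the sparsity point concrete, here is a back-of-the-envelope calculation with FB15k-237-sized numbers (used purely as an illustration; the exact counts don’t matter):

```python
# density = |E| / |V|^2 for a typical knowledge-graph training split
num_nodes = 14541          # |V|
num_train_edges = 272115   # |E|
density = num_train_edges / num_nodes ** 2
print(f"density = {density:.6f}")                  # ~0.0013, i.e. ~0.13%

# matching the true negative/positive ratio would need a negative rate of
print(f"'true' negative rate = {1 / density:.0f}") # hundreds, not 10
```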
