Problem of RGCN batch sampling

Great library!
I have a question about link_prediction.py in the RGCN example. More specifically, in the code below:

g, node_id, edge_type, node_norm, data, labels = \
            utils.generate_sampled_graph_and_labels(
                train_data, args.graph_batch_size, args.graph_split_size,
                num_rels, adj_list, degrees, args.negative_sample)

g, node_id, edge_type, and node_norm describe the subgraph used to pass messages (used for encoding). data and labels are used to compute the loss from the embeddings produced by message passing (used for decoding and computing the loss on this batch).

Why are the edges of g also included in 'data'? That means you use seen edges to predict those same seen edges and propagate the loss. I am confused about this. Thanks very much for your help!

Hi @Lee-zix,

That’s a great question. You must be worried about the ground-truth leakage issue. You are right that the seen edges in the graph structure are also positive samples to be predicted. But the seen edges only account for half of the positive samples (see here). I guess the R-GCN paper authors wanted the model to be able to predict both the seen edges and the unseen positive edges. But yes, half of the positive samples are leaked.
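
Concretely, the sampling does roughly this (a simplified sketch, not the actual code in utils.generate_sampled_graph_and_labels; the toy values are made up):

import numpy as np

# Toy stand-ins for the real arguments (shapes only, values made up).
train_data = np.random.randint(100, size=(1000, 3))  # (subject, relation, object) triplets
graph_batch_size, graph_split_size = 300, 0.5

# Every sampled triplet becomes a positive sample to score...
edge_ids = np.random.choice(len(train_data), size=graph_batch_size, replace=False)
pos_samples = train_data[edge_ids]

# ...but only a graph_split_size fraction of them builds the
# message-passing graph, so half of the positives are "seen" edges.
split = int(graph_batch_size * graph_split_size)
keep = np.random.choice(graph_batch_size, size=split, replace=False)
graph_edges = pos_samples[keep]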

Personally, I don’t know what the best practice is. If you want to make sure there is no leakage at all, you can simply remove the seen edges and use only the remaining unseen edges as positive samples. The Star-GCN paper has a discussion (Section 3.4) of the leakage issue, which you may want to check out.
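
For instance, a minimal sketch of that filtering (remove_seen_edges is a hypothetical helper, not part of the example; both arguments are arrays of (subject, relation, object) rows):

import numpy as np

def remove_seen_edges(pos_samples, seen_edges):
    # Assign every distinct triplet a unique index, then drop the
    # positives whose index also appears among the graph's seen edges.
    total = np.concatenate((seen_edges, pos_samples))
    _, inv = np.unique(total, return_inverse=True, axis=0)
    n_seen = len(seen_edges)
    return pos_samples[~np.isin(inv[n_seen:], inv[:n_seen])]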

Thanks very much for your reply! It helps a lot!

Hi @lingfan (or @mufeili ),

First, I just want to say thanks so much for such a great package! As a DL noob, it’s really nice to have something that’s so intuitive.

I’m reviving this thread to see if you guys have any new thoughts on this topic in light of the heterograph and nodeflow APIs. I am currently trying to generate RGCN embeddings for AMiner data and have adapted the code from the new heterograph example to do link prediction. I can try to port this over, of course, but do you have any advice (or code examples) on the best way to subsample using heterographs?

Thanks for your time. Appreciate it.

B

Currently, we do not have an established approach for sampling heterographs. This will likely be part of our next release, but I’m afraid you will need to wait a little longer. Copying @BarclayII.

Thanks for the quick response. If not for heterographs, it would be great to see something like the Knowledge Graph API implemented for RGCN on big graphs.

Minor note: perhaps I’m missing something, but it seems that this link prediction example does not check that the negative samples are truly negative. You could add something like this to fix it:

_, uniq_indices = np.unique(total_samples, return_inverse=True, axis=0)
labels[size_of_batch:] = np.isin(uniq_indices[size_of_batch:], uniq_indices[:size_of_batch])

@zihao said that this will not affect training. Copying him in case you want a more detailed explanation.

@zihao, just trying to be helpful. I can see that it does not affect training (on average it is 400 out of 30,000), but that rate is also a function of your graph size, negative sampling rate, etc. Seems like a small change to fix.

EDIT: it is out of 300,000 with 10x negative sampling, haha. So yes, it doesn’t matter at all in the given example. But it’s still a nice corner case to fix.

For reference, here is the full negative_sampling function with the check added:

import numpy as np

def negative_sampling(pos_samples, num_entity, negative_rate):
    size_of_batch = len(pos_samples)
    num_to_generate = size_of_batch * negative_rate
    # Corrupt each positive triplet `negative_rate` times.
    neg_samples = np.tile(pos_samples, (negative_rate, 1))
    labels = np.zeros(size_of_batch * (negative_rate + 1), dtype=np.float32)
    labels[:size_of_batch] = 1
    # Replace either the subject (column 0) or the object (column 2)
    # with a uniformly random entity.
    values = np.random.randint(num_entity, size=num_to_generate)
    choices = np.random.uniform(size=num_to_generate)
    subj = choices > 0.5
    obj = choices <= 0.5
    neg_samples[subj, 0] = values[subj]
    neg_samples[obj, 2] = values[obj]

    # Relabel "negatives" that happen to collide with a positive triplet.
    total_samples = np.concatenate((pos_samples, neg_samples))
    _, uniq_indices = np.unique(total_samples, return_inverse=True, axis=0)
    neg_in_pos = np.isin(uniq_indices[size_of_batch:], uniq_indices[:size_of_batch])
    print("CORRECTLY LABELED", np.count_nonzero(neg_in_pos == 0),
          "MISLABELED", np.count_nonzero(neg_in_pos == 1))
    labels[size_of_batch:] = neg_in_pos
    return total_samples, labels
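
And a quick sanity check with toy numbers (values made up):

pos = np.array([[0, 0, 1], [2, 1, 3]])  # two (subject, relation, object) triplets
samples, labels = negative_sampling(pos, num_entity=4, negative_rate=3)
# samples has shape (8, 3): 2 positives followed by 6 corrupted copies.
# The first 2 labels are 1, and any corrupted triplet that happens to
# equal a positive is relabeled 1 as well.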

Maybe you can create a PR?

Sure can do.

@mufeili, I am really struggling with this RGCN subsampling example. I’m surprised that both the heterograph and the regular DGL graph examples for entity classification run out of memory on a 12 GB GPU with only a moderately sized graph (e.g., the AM example you guys provide). Is this surprising? Am I missing something fundamental?

My graph has 701296 nodes and 4926864 edges. There are only 8 relation types.

The subsampling method used in the example here does not seem to scale well, so I have been trying to figure out other approaches. Right now I have a hacky approach where I construct nodeflows and convert them into subgraphs on which to run the R-GCN. This subgraph construction (perhaps because of reindexing?) is extremely slow, and the maximum number of seed nodes I can accommodate is 500 (with degree limiting). Even if I were to compute all 1,400 subgraphs in advance and store them in memory, it would take a long time.
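
For context, here is roughly what I am doing (a sketch only; I am on the DGL 0.4 contrib sampling API, g is my full graph and edge_type my per-edge relation array, and the exact calls may be slightly off):

import dgl
import torch

# Sample 2-hop neighborhoods around seed nodes, then induce a plain
# subgraph on every node the NodeFlow touches so the R-GCN layers can
# run on it. The subgraph construction/reindexing is the slow part.
sampler = dgl.contrib.sampling.NeighborSampler(
    g, batch_size=500, expand_factor=10, num_hops=2,
    neighbor_type='in', shuffle=True)

for nf in sampler:
    nids = torch.cat([nf.layer_parent_nid(i) for i in range(nf.num_layers)])
    sub = g.subgraph(torch.unique(nids))
    sub_etype = edge_type[sub.parent_eid]  # relation type of each subgraph edge
    # ... apply the R-GCN layers to `sub` with `sub_etype` ...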

Are you surprised that I am having memory issues with a graph of this size, and if so, what am I doing wrong? If not, what is the best approach to this problem using DGL 0.4? For example, do you have any advice for adapting RelGraphConv to nodeflow?

Thanks for your help. Really appreciate it.
