Negative Sampling in hetero-RGCN for Link Prediction

like · July 18, 2020, 4:01am

Hi, very helpful library!

I’m new to DGL and am trying to use hetero-RGCN (implemented by dgl.nn.pytorch.conv.RelGraphConv) for a link prediction task.

I loaded my custom data into a DGLHeteroGraph with several node types and edge types, and I would like to predict the probability that an edge of one particular type exists. To train the model, I need an edge sampler. Is there a negative sampler API that can operate on a DGLHeteroGraph directly? Would you mind giving an example?

I noticed dgl.data.LinkPredDataLoader (from link), which seems to solve the problem. Is it a feature of DGL v0.5?

Thank you very much!

classicsong · July 20, 2020, 7:18am

You can look at the code here:

github.com/classicsong/dgl/blob/a21b507164dc2827eea938084718c97ac1ece1e2/examples/pytorch/rgcn-hetero/link_predict_mb.py#L207-L359


def sample_blocks(self, seeds):
    pseed = th.stack(seeds)
    bsize = pseed.shape[0]
    if self.num_neg is not None:
        nseed = th.randint(self.num_edges, (self.num_neg,))
    else:
        nseed = th.randint(self.num_edges, (bsize,))
    g = self.g
    etypes = self.etypes
    netypes = self.netypes
    fanouts = self.fanouts
    phead_ids = self.phead_ids
    ptail_ids = self.ptail_ids
    nhead_ids = self.nhead_ids
    ntail_ids = self.ntail_ids

    phead_type = self.phead_type
    ptail_type = self.ptail_type
    nhead_type = self.nhead_type
    ntail_type = self.ntail_type

(The excerpt is truncated here; see the linked file for the full code.)

Since you are doing negative sampling for only one edge type, you can simplify the code a lot.

wangjunji · August 11, 2021, 1:48pm

Hi, I am using your code to do hetero graph link prediction in mini-batch mode.

github.com/classicsong/dgl/blob/0bb952a5e8/examples/pytorch/rgcn/link_predict_hetero_mb.py#L211

    
      
        self.num_entities = g.number_of_nodes()
        self.num_neg = num_neg
        self.fanouts = fanouts
        self.is_train = is_train
        self.keep_pos_edges = keep_pos_edges

    def sample_blocks(self, seeds):
        pseeds = th.tensor(seeds).long()
        bsize = pseeds.shape[0]
        if self.num_neg is not None:
            nseeds = th.randint(self.num_entities, (self.num_neg * 2,))
        else:
            nseeds = th.randint(self.num_entities, (bsize * 2,))
        nseeds, reverse_idx = th.unique(nseeds, return_inverse=True)

        g = self.g
        fanouts = self.fanouts
        assert len(g.canonical_etypes) == 1
        p_subg = g.edge_subgraph({g.canonical_etypes[0]: pseeds})

        p_g = dgl.compact_graphs(p_subg)

I have some questions about your negative sampling implementation. It seems that all negative heads and tails are picked randomly from all nodes in the heterograph, regardless of the source and destination node types of the edge. Since the distribution of node types is extremely imbalanced, this strategy may produce many easy negatives, which do not help the model learn a better representation of the relation.

I have also noticed a chunk size hyperparameter for sharing negative edges among positive pairs when computing the negative score. I guess it is designed to speed up training and validation, and that it indirectly increases the number of negative samples.
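
One way to read the chunk-based sharing described above: positive edges are split into chunks, and every positive in a chunk is scored against the same small set of sampled negative tails. The sketch below illustrates that reading in plain Python with no DGL dependency. `chunked_negative_tails` and its parameters are hypothetical names, and the linked script may implement sharing differently.

```python
import random

def chunked_negative_tails(pos_edges, num_nodes, chunk_size, neg_per_chunk, rng=None):
    """Pair every positive head with negative tails shared across its chunk.

    pos_edges: list of (head_id, tail_id) positive pairs.
    num_nodes: number of candidate tail nodes (IDs 0 .. num_nodes - 1).
    chunk_size: number of positives that share one set of negatives.
    neg_per_chunk: number of negative tails sampled once per chunk.
    """
    rng = rng or random.Random()
    neg_pairs = []
    for start in range(0, len(pos_edges), chunk_size):
        chunk = pos_edges[start:start + chunk_size]
        # Sample the negatives once, then reuse them for the whole chunk.
        shared_tails = [rng.randrange(num_nodes) for _ in range(neg_per_chunk)]
        for head, _tail in chunk:
            neg_pairs.extend((head, neg_tail) for neg_tail in shared_tails)
    return neg_pairs

pos = [(0, 1), (1, 2), (2, 3), (3, 4)]
neg = chunked_negative_tails(pos, num_nodes=10, chunk_size=2,
                             neg_per_chunk=3, rng=random.Random(0))
```

With 4 positives, `chunk_size=2` and `neg_per_chunk=3`, only 6 node IDs are sampled, yet 12 negative pairs are produced, which is where the saving comes from.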

Any suggestions or example code on how to sample high-quality negative samples with type constraints without slowing down training? @classicsong

classicsong · August 13, 2021, 7:57am

That sampler is mainly designed for knowledge graphs (only one node type).
If you care more about model performance, I suggest using dgl.dataloading.EdgeDataLoader.

And there are some examples there.
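
To make the type constraint discussed in this thread concrete, here is a minimal, dependency-free sketch. It is not the EdgeDataLoader API and not the code from the linked scripts. It corrupts only the tail of each positive edge, draws the replacement from the destination node type of that edge type, and retries a bounded number of times to avoid known true edges. The function name, its parameters, and the "user"/"buys"/"item" types are hypothetical and chosen only for illustration.

```python
import random

def sample_typed_negatives(pos_edges, canonical_etype, num_nodes_by_type, k,
                           known_edges=None, rng=None, max_tries=100):
    """Corrupt tails using only nodes of the edge type's destination type.

    pos_edges: list of (head_id, tail_id) pairs for one canonical edge type.
    canonical_etype: (src_type, relation, dst_type) triple.
    num_nodes_by_type: dict mapping node type name to node count.
    known_edges: optional set of true (head, tail) pairs to avoid.
    """
    rng = rng or random.Random()
    known_edges = known_edges or set()
    _src_type, _relation, dst_type = canonical_etype
    num_dst = num_nodes_by_type[dst_type]
    negatives = []
    for head, _tail in pos_edges:
        for _sample in range(k):
            # Resample up to max_tries times to skip known positives.
            # If every try hits a known edge, the last draw is kept.
            for _attempt in range(max_tries):
                neg_tail = rng.randrange(num_dst)
                if (head, neg_tail) not in known_edges:
                    break
            negatives.append((head, neg_tail))
    return negatives

# Hypothetical graph: many "user" nodes, few "item" nodes.
num_nodes = {"user": 1000, "item": 5}
pos = [(0, 1), (7, 3)]
neg = sample_typed_negatives(pos, ("user", "buys", "item"), num_nodes, k=3,
                             known_edges=set(pos), rng=random.Random(0))
```

Because the tails come only from the destination type, the imbalance between node types no longer affects which negatives are drawn. The per-edge loop is written for clarity, and a real training pipeline would likely vectorize it.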

BarclayII · August 16, 2021, 6:53am

As this is using outdated code I’m closing this thread. Please make a new topic if you have further questions. Thanks!
