Negative sampling with Neighbor Sampling

kingmbc · July 22, 2020, 1:58am

Hello! I’m new to DGL.

Graph: a single item graph (no node features, no edge types or edge features)
  - #nodes: 37,490
  - #edges (train): 23,670,982
  - #edges (valid): 55,898
  - #edges (test): 5,930,093
Task: Get a graph embedding through link prediction, in an unsupervised way

I tried to get a graph embedding using GraphSAGE on the single large graph described above (it may look small, but it is large for my local GPU machine). Full-batch training ran out of GPU memory (OOM), so the basic idea is to use blocks for mini-batch training together with negative sampling.
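To make the mini-batch idea concrete, here is a minimal sketch of fan-out neighbor sampling in plain NumPy. It is not DGL's implementation: the function name `sample_frontiers`, the CSR layout, and the convention that the last fan-out applies to the output layer are all assumptions made for illustration.

```python
import numpy as np

def sample_frontiers(indptr, indices, seeds, fanouts, rng):
    """Sample one bipartite layer of edges per fan-out (hypothetical helper).

    indices[indptr[v]:indptr[v + 1]] holds the in-neighbors of node v.
    Layers are built from the output nodes backwards and returned in
    input-to-output order, together with the nodes whose features are needed.
    """
    layers = []
    frontier = np.unique(seeds)
    for fanout in reversed(fanouts):
        src_parts, dst_parts = [], []
        for v in frontier:
            nbrs = indices[indptr[v]:indptr[v + 1]]
            if len(nbrs) > fanout:
                # Keep at most `fanout` neighbors, chosen without replacement.
                nbrs = rng.choice(nbrs, size=fanout, replace=False)
            src_parts.append(nbrs)
            dst_parts.append(np.full(len(nbrs), v))
        src = np.concatenate(src_parts)
        dst = np.concatenate(dst_parts)
        layers.append((src, dst))
        # The next (earlier) layer must produce values for these nodes.
        frontier = np.unique(np.concatenate([frontier, src]))
    return layers[::-1], frontier

# Toy graph with 5 nodes; node 0 has in-neighbors 1, 2, 3, 4.
indptr = np.array([0, 4, 5, 7, 8, 10])
indices = np.array([1, 2, 3, 4, 0, 0, 1, 0, 0, 3])

layers, input_nodes = sample_frontiers(
    indptr, indices, seeds=np.array([0]), fanouts=[2, 2],
    rng=np.random.default_rng(0),
)
```

Only the features of `input_nodes` have to be moved to the GPU for this batch, which is what keeps memory bounded compared with full-batch training.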

Basically, I followed these two tutorials

For link prediction

https://github.com/dglai/WWW20-Hands-on-Tutorial/blob/master/_legacy/basic_apps/BasicTasks_pytorch.ipynb

For mini-batch using dgl.to_block

https://github.com/dglai/WWW20-Hands-on-Tutorial/blob/master/large_graphs/large_graphs.ipynb

This is a code snippet from my Trainer class, which applies a NeighborSampler (based on dgl.sampling.sample_neighbors):

    
    self.g_all.readonly()
    self.train_eids, self.valid_eids, self.test_eids = self.split_edges(self.g_all)
    self.g_sub_train = self.g_all.edge_subgraph(self.train_eids, preserve_nodes=True)
    self.train_nids_src, self.train_nids_dst = self.g_all.find_edges(self.train_eids)
    self.g_all_hetero = dgl.as_heterograph(self.g_all)
    self.neighbor_sampler = NeighborSampler(g=self.g_all_hetero, num_fanouts=[10, 25])
    
    def _train_epoch(self, epoch):
        self.model.train()

        BATCH_SIZE = 1000
        train_dataloader = torch.utils.data.DataLoader(np.unique(self.train_nids_src.numpy()), batch_size=BATCH_SIZE,
                                                       collate_fn=self.neighbor_sampler.sample, shuffle=True,
                                                       drop_last=False)        

        features = self.model.encoder.weight

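        for g_train_blocks in train_dataloader:
            # Global IDs of the first block's input nodes and the last block's output nodes
            input_nodes = g_train_blocks[0].srcdata[dgl.NID]
            input_features = features[input_nodes]
            output_nodes = g_train_blocks[-1].dstdata[dgl.NID]
            # emb has one row per output node of the last block, not one row per graph node
            emb = self.model(g=g_train_blocks, features=input_features)

            # pos_g / neg_g are sampled from the training graph, independently of the blocks above
            pos_g, neg_g = edge_sampler(self.g_sub_train, self.model.neg_sample_size, return_false_neg=False)
            pos_score = score_func(pos_g, emb)
            neg_score = score_func(neg_g, emb)
            train_loss = torch.mean(NCE_loss(pos_score, neg_score, self.neg_sample_size))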

            self.optimizer.zero_grad()
            train_loss.backward()
            self.optimizer.step()
            torch.cuda.empty_cache()


        val_loss, val_mrr = LPEvaluate(self.model.gconv_model, self.g_all, features, self.valid_eids,
                                       self.model.neg_sample_size)

However, I ran into a problem when using edge_sampler (built on dgl.contrib.sampling.EdgeSampler) with the block graphs.

When I tried to use a block graph with the edge sampler, I found that dgl.contrib.sampling.EdgeSampler only supports dgl.DGLGraph, not dgl.DGLHeteroGraph.
So I used the ‘g_sub_train’ graph (a dgl.DGLGraph), which has the original number of nodes, to get pos_g and neg_g. As a result, the ‘emb’ computed by the model (emb = self.model(g=g_train_blocks, features=input_features)) has a different first dimension (#output_nodes rather than #nodes). Should I then merge this ‘emb’ back into the original features (with the original dimension)?
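One possible way around the shape mismatch, sketched here under assumptions, is to sample the positive and negative edges first, use their unique endpoints as the seed nodes of the neighbor sampler, and relabel the edge endpoints to row positions in the resulting `emb`. The helper `relabel_edges` below is hypothetical and uses NumPy only; it also assumes the model returns one row per seed, in the same order as `seeds`.

```python
import numpy as np

def relabel_edges(pos_src, pos_dst, neg_src, neg_dst):
    """Map global node IDs of a batch of edges to rows of a compact embedding matrix."""
    all_ids = np.concatenate([pos_src, pos_dst, neg_src, neg_dst])
    # seeds: sorted unique global IDs; inverse[i]: position of all_ids[i] in seeds.
    seeds, inverse = np.unique(all_ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_pos, n_neg = len(pos_src), len(neg_src)
    pos_u = inverse[:n_pos]
    pos_v = inverse[n_pos:2 * n_pos]
    neg_u = inverse[2 * n_pos:2 * n_pos + n_neg]
    neg_v = inverse[2 * n_pos + n_neg:]
    return seeds, (pos_u, pos_v), (neg_u, neg_v)

# Edges given by global node IDs (toy values).
pos_src = np.array([10, 42, 10])
pos_dst = np.array([42, 7, 99])
neg_src = np.array([10, 42, 10])
neg_dst = np.array([3, 99, 7])

seeds, (pos_u, pos_v), (neg_u, neg_v) = relabel_edges(pos_src, pos_dst, neg_src, neg_dst)

# Stand-in for the model output: one row per seed, in the order of `seeds`.
emb = np.random.default_rng(0).normal(size=(len(seeds), 4))

# Dot-product scores can now index `emb` directly with the local positions.
pos_score = (emb[pos_u] * emb[pos_v]).sum(axis=1)
neg_score = (emb[neg_u] * emb[neg_v]).sum(axis=1)
```

With this layout there is no need to write `emb` back into a full-size feature matrix. Comparing `output_nodes` from the last block against `seeds` is a cheap way to confirm that the row order really matches.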

Honestly, I’m not sure whether I’m going about this task the right way, so please share any suggestions or references.

BarclayII · July 23, 2020, 5:12am

Seems that you want to do link prediction with neighbor sampling. In this case, you could refer to the GraphSAGE unsupervised learning example:

https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_unsupervised.py

import dgl
import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
import dgl.function as fn
import dgl.nn.pytorch as dglnn
import time
import argparse
from _thread import start_new_thread
from functools import wraps
from dgl.data import RedditDataset
from torch.nn.parallel import DistributedDataParallel
import tqdm
import traceback
import sklearn.linear_model as lm
import sklearn.metrics as skm

This file has been truncated.
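As a rough illustration of link prediction with negative sampling, the sketch below pairs each positive source node with `k` random destinations and applies binary cross-entropy to dot-product scores. It is not copied from the linked script: the function names, the degree^0.75 sampling weights, and the NumPy-only implementation are assumptions made for illustration. For simplicity `emb` has one row per node here; in a real mini-batch the node IDs would first need to be mapped to rows of the batch's embedding matrix.

```python
import numpy as np

def sample_negatives(pos_src, degrees, k, rng):
    """Pair each positive source with k destinations drawn with probability proportional to degree**0.75."""
    weights = degrees.astype(np.float64) ** 0.75
    probs = weights / weights.sum()
    neg_src = np.repeat(pos_src, k)
    neg_dst = rng.choice(len(degrees), size=len(neg_src), replace=True, p=probs)
    return neg_src, neg_dst

def link_prediction_loss(emb, pos_edges, neg_edges):
    """Binary cross-entropy on dot-product scores (label 1 = positive edge, 0 = negative edge)."""
    def scores(edges):
        u, v = edges
        return (emb[u] * emb[v]).sum(axis=1)

    logits = np.concatenate([scores(pos_edges), scores(neg_edges)])
    labels = np.concatenate([np.ones(len(pos_edges[0])), np.zeros(len(neg_edges[0]))])
    # Numerically stable form of BCE with logits.
    per_edge = np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))
    return per_edge.mean()

rng = np.random.default_rng(0)
degrees = np.array([4, 1, 0, 2, 3])   # toy in-degrees; node 2 is isolated
pos_src = np.array([0, 3])
pos_dst = np.array([1, 4])
k = 3                                  # negatives per positive edge

neg_src, neg_dst = sample_negatives(pos_src, degrees, k, rng)
emb = rng.normal(size=(len(degrees), 4))
loss = link_prediction_loss(emb, (pos_src, pos_dst), (neg_src, neg_dst))
```

Drawing negatives per batch avoids materializing a separate negative graph over all nodes, so the loss only ever touches the embeddings computed for the current mini-batch.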

Please feel free to follow up.

Thanks.

kingmbc · July 23, 2020, 6:52am

Thank you so much!
I’ll try this ^^