Hello! I’m a newbie in DGL.
Graph: one item graph (no node features and no edge types/features)
- #nodes: 37,490
- #edges (train): 23,670,982
- #edges (valid): 55,898
- #edges (test): 5,930,093

Task: get the graph embedding by means of link prediction in an unsupervised way.
I tried to get a graph embedding using GraphSAGE on the single big graph above (it may look small, but it is big for my local GPU machine), and full-batch training ran into a GPU OOM error. So the basic idea was to use blocks for mini-batch training, combined with negative sampling.
Basically, I followed these two tutorials:
- For link prediction
- For mini-batch training using dgl.to_block
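For reference, this is roughly the objective I mean by “link prediction in an unsupervised way”. A minimal sketch of my own — my assumption of what the NCE_loss used in the snippet below computes, not code from the tutorials:

```python
import torch.nn.functional as F

# Sketch of a negative-sampling (NCE-style) loss -- my assumption of what
# NCE_loss in the snippet below computes, not code from the tutorials.
def NCE_loss(pos_score, neg_score, neg_sample_size):
    # pos_score: (num_pos,) scores of observed edges
    # neg_score: (num_pos * neg_sample_size,) scores of corrupted edges
    pos_loss = F.logsigmoid(pos_score)
    neg_loss = F.logsigmoid(-neg_score).reshape(-1, neg_sample_size).sum(dim=1)
    return -(pos_loss + neg_loss)  # per-edge loss; the mean is taken outside
```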
This is a code snippet from my Trainer class, which applies a NeighborSampler (based on dgl.sampling.sample_neighbors):
```python
# (in Trainer.__init__)
self.g_all.readonly()
self.train_eids, self.valid_eids, self.test_eids = self.split_edges(self.g_all)
self.g_sub_train = self.g_all.edge_subgraph(self.train_eids, preserve_nodes=True)
self.train_nids_src, self.train_nids_dst = self.g_all.find_edges(self.train_eids)
self.g_all_hetero = dgl.as_heterograph(self.g_all)
self.neighbor_sampler = NeighborSampler(g=self.g_all_hetero, num_fanouts=[10, 25])

def _train_epoch(self, epoch):
    self.model.train()
    BATCH_SIZE = 1000
    train_dataloader = torch.utils.data.DataLoader(
        np.unique(self.train_nids_src.numpy()),
        batch_size=BATCH_SIZE,
        collate_fn=self.neighbor_sampler.sample,
        shuffle=True,
        drop_last=False,
    )
    features = self.model.encoder.weight
    for g_train_blocks in train_dataloader:
        input_nodes = g_train_blocks[0].srcdata[dgl.NID]    # src nodes of the first block
        input_features = features[input_nodes]
        output_nodes = g_train_blocks[-1].dstdata[dgl.NID]  # dst nodes of the last block
        emb = self.model(g=g_train_blocks, features=input_features)
        # negative sampling is done on the full-size training subgraph
        pos_g, neg_g = edge_sampler(self.g_sub_train, self.model.neg_sample_size,
                                    return_false_neg=False)
        pos_score = score_func(pos_g, emb)
        neg_score = score_func(neg_g, emb)
        train_loss = torch.mean(NCE_loss(pos_score, neg_score, self.neg_sample_size))
        self.optimizer.zero_grad()
        train_loss.backward()
        self.optimizer.step()
        torch.cuda.empty_cache()
    val_loss, val_mrr = LPEvaluate(self.model.gconv_model, self.g_all, features,
                                   self.valid_eids, self.model.neg_sample_size)
```
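Note that `emb` here only covers the batch’s output nodes. One alternative I considered was to sample negatives inside the mini-batch itself, so that every endpoint already has a row in `emb`. A minimal sketch of my own (not from the tutorials; `local_src`/`local_dst` are hypothetical tensors holding the batch’s positive edges, already remapped to row indices of `emb`):

```python
import torch

# Sketch: in-batch negative sampling (my own idea, not tutorial code).
def batch_scores(emb, local_src, local_dst, neg_sample_size):
    # positive scores: dot product of the two endpoint embeddings
    pos_score = (emb[local_src] * emb[local_dst]).sum(dim=1)
    # corrupt destinations: draw random rows of `emb` as negatives
    neg_src = local_src.repeat_interleave(neg_sample_size)
    neg_dst = torch.randint(0, emb.shape[0], (len(neg_src),))
    neg_score = (emb[neg_src] * emb[neg_dst]).sum(dim=1)
    return pos_score, neg_score
```

These scores could feed NCE_loss directly. But I first tried to stay close to the tutorial and reuse its EdgeSampler-based edge_sampler.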
However, I ran into a problem in edge_sampler (which wraps dgl.contrib.sampling.EdgeSampler) when using it together with the block graphs.
When I tried to pass a block graph to the edge sampler, I found that dgl.contrib.sampling.EdgeSampler only supports dgl.DGLGraph, not dgl.DGLHeteroGraph.
So, I used the ‘g_sub_train’ graph (a dgl.DGLGraph that keeps the original number of nodes) to get pos_g and neg_g. However, the ‘emb’ computed by the model (emb = self.model(g=g_train_blocks, features=input_features)) has a different dimension: it only has #output_nodes rows. Should I scatter this ‘emb’ back into the original feature matrix (with the original number of rows)?
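Alternatively, if I keep edge_sampler on g_sub_train, one idea I had is to go the other way: map the global node IDs of pos_g/neg_g into local rows of `emb` and drop edges whose endpoints were not computed in this batch. A minimal sketch of my own, assuming `pos_src`/`pos_dst` hold the sampled positive edges in original node IDs (e.g. recovered via pos_g.parent_nid):

```python
import torch

# Sketch: global node ID -> local row of `emb` (my own idea, under the
# assumptions above). `output_nodes` and `emb` come from the training loop.
def global_to_local(output_nodes, num_all_nodes):
    lut = torch.full((num_all_nodes,), -1, dtype=torch.long)
    lut[output_nodes] = torch.arange(len(output_nodes))
    return lut

# Usage inside the loop:
# lut = global_to_local(output_nodes, self.g_all.number_of_nodes())
# keep = (lut[pos_src] >= 0) & (lut[pos_dst] >= 0)  # edges fully inside batch
# pos_score = (emb[lut[pos_src[keep]]] * emb[lut[pos_dst[keep]]]).sum(dim=1)
```

The downside is that most sampled edges would be dropped, since only a small fraction of endpoints appear in each batch.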
Honestly, I don’t know whether I’m doing this the right way to fulfill the task, so please suggest anything or point me to a reference.