Hello! I'm new to DGL.
Graph: a single item graph (no node features, no edge types or edge features)
- #nodes: 37,490
- #edges (train): 23,670,982
- #edges (valid): 55,898
- #edges (test): 5,930,093
Task: obtain graph embeddings via link prediction, in an unsupervised way
I tried to get graph embeddings using GraphSAGE on the single large graph described above (it may look small, but it is large for my local GPU machine). With full-batch training I ran into a GPU OOM error, so the basic idea was to use blocks for mini-batch training together with negative sampling.
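A minimal sketch of the negative-sampling half of that idea, using NumPy and toy data rather than DGL. The function name `sample_negative_edges` and the choice to corrupt destination nodes uniformly at random are assumptions for illustration; this is not what edge_sampler does internally.

```python
import numpy as np

def sample_negative_edges(src, dst, num_nodes, k, rng):
    """For each positive edge (u, v), draw k corrupted edges (u, v')
    where v' is sampled uniformly from all nodes (hypothetical helper)."""
    neg_src = np.repeat(src, k)
    neg_dst = rng.integers(0, num_nodes, size=neg_src.shape[0])
    return neg_src, neg_dst

rng = np.random.default_rng(0)

# Toy positive edges on a 5-node graph (made-up data).
src = np.array([0, 1, 2, 3])
dst = np.array([1, 2, 3, 4])

# Pick a mini-batch of 2 positive edges, then 3 negatives per positive.
batch = rng.choice(len(src), size=2, replace=False)
neg_src, neg_dst = sample_negative_edges(src[batch], dst[batch],
                                         num_nodes=5, k=3, rng=rng)
```

Uniform corruption like this can occasionally produce an edge that really exists (a false negative); whether that matters depends on the graph density.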
I mainly followed these two tutorials:
- For link prediction
- For mini-batch using dgl.to_block
This is the code snippet from my Trainer class. It uses a NeighborSampler (based on dgl.sampling.sample_neighbors):
# Setup (inside the Trainer class)
self.g_all.readonly()
self.train_eids, self.valid_eids, self.test_eids = self.split_edges(self.g_all)
self.g_sub_train = self.g_all.edge_subgraph(self.train_eids, preserve_nodes=True)
self.train_nids_src, self.train_nids_dst = self.g_all.find_edges(self.train_eids)
self.g_all_hetero = dgl.as_heterograph(self.g_all)
self.neighbor_sampler = NeighborSampler(g=self.g_all_hetero, num_fanouts=[10, 25])

def _train_epoch(self, epoch):
    self.model.train()
    BATCH_SIZE = 1000
    # Seed nodes are the unique source nodes of the training edges.
    train_dataloader = torch.utils.data.DataLoader(
        np.unique(self.train_nids_src.numpy()), batch_size=BATCH_SIZE,
        collate_fn=self.neighbor_sampler.sample, shuffle=True,
        drop_last=False)
    features = self.model.encoder.weight
    for g_train_blocks in train_dataloader:
        # Source nodes of the first block are the inputs;
        # destination nodes of the last block are the outputs.
        input_nodes = g_train_blocks[0].srcdata[dgl.NID]
        input_features = features[input_nodes]
        output_nodes = g_train_blocks[-1].dstdata[dgl.NID]
        # emb has one row per output node, not per node of the full graph.
        emb = self.model(g=g_train_blocks, features=input_features)
        pos_g, neg_g = edge_sampler(self.g_sub_train, self.model.neg_sample_size, return_false_neg=False)
        pos_score = score_func(pos_g, emb)
        neg_score = score_func(neg_g, emb)
        train_loss = torch.mean(NCE_loss(pos_score, neg_score, self.neg_sample_size))
        self.optimizer.zero_grad()
        train_loss.backward()
        self.optimizer.step()
        torch.cuda.empty_cache()
    val_loss, val_mrr = LPEvaluate(self.model.gconv_model, self.g_all, features, self.valid_eids,
                                   self.model.neg_sample_size)
However, I ran into a problem when using edge_sampler (built on dgl.contrib.sampling.EdgeSampler) together with the block graphs:
- When I tried to run the edge sampler on the blocks, I found that dgl.contrib.sampling.EdgeSampler only supports dgl.DGLGraph, not dgl.HeteroGraph.
- So I used the 'g_sub_train' graph (a dgl.DGLGraph), which has the original number of nodes, to get pos_g and neg_g. As a result, the 'emb' computed by the model (emb = self.model(g=g_train_blocks, features=input_features)) has a different dimension (#output_nodes rows instead of one row per original node). Should I then integrate this 'emb' back into the original features (which have the original dimension)?
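One possible way to reconcile the two ID spaces, sketched with NumPy and toy data: rather than scattering 'emb' back into a full-size matrix, build a lookup from global node IDs to row indices of 'emb' and score only the edges whose endpoints are both among the output nodes. The helper names `global_to_local` and `score_edges`, and the dot-product score, are assumptions for illustration and are not part of DGL; the same indexing pattern should carry over to torch tensors.

```python
import numpy as np

def global_to_local(output_nodes, num_nodes):
    """Lookup table: global node ID -> row index in emb,
    or -1 if the node is not among the batch's output nodes."""
    lookup = np.full(num_nodes, -1, dtype=np.int64)
    lookup[output_nodes] = np.arange(len(output_nodes))
    return lookup

def score_edges(emb, lookup, src, dst):
    """Dot-product score for edges (given by global IDs) whose two
    endpoints both have an embedding in this batch."""
    s, d = lookup[src], lookup[dst]
    keep = (s >= 0) & (d >= 0)
    scores = (emb[s[keep]] * emb[d[keep]]).sum(axis=1)
    return scores, keep

# Toy data: a 6-node graph, a batch whose output nodes are 4, 1, 5.
num_nodes = 6
output_nodes = np.array([4, 1, 5])
emb = np.array([[1.0, 0.0],   # embedding of node 4
                [0.0, 2.0],   # embedding of node 1
                [3.0, 1.0]])  # embedding of node 5

lookup = global_to_local(output_nodes, num_nodes)

# Edges in global IDs; the last one touches node 0, which is not in the batch.
src = np.array([4, 1, 0])
dst = np.array([5, 5, 1])
scores, keep = score_edges(emb, lookup, src, dst)
```

With this layout, positive and negative edges would need to be drawn among (or filtered to) the nodes of the current batch, so that every scored endpoint actually has a row in 'emb'.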
To be honest, I am not sure whether I am going about this task the right way, so any suggestions or references would be appreciated.