I have a training dataset with shape N×F, where N denotes the number of training samples and F the number of fields. There is no structure information in the dataset, so I want to construct a complete graph (F nodes) for each training sample. The node embeddings can be obtained from a torch.nn.Embedding. This case is common: for example, if the F fields denote F modalities, then this is the same case as discussed in Self attention and feature fusion over graphs - #3. In addition, the constructed graph is complete and its structure is identical for all training samples (X_i, i=1,…,N). I also found a topic discussing this: Batch same structured graph.
Basically, I have two ways to load the dataset into DGL. The first is to construct all graphs in advance and then use a DataLoader to generate batched graphs. The second is to construct the graphs while loading each batch.
(I write the code in the PyG manner, but I think it is the same in DGL.)
More specifically, the following is the first method:
```python
from itertools import product

import torch
from torch_geometric.data import Data, Dataset

class GraphDataset(Dataset):
    def __init__(self, trainingData):
        super().__init__()
        self.trainingData = trainingData
        self.size = self.trainingData.shape[0]          # N
        self.num_features = self.trainingData.shape[1]  # F
        # Edge index of a directed complete graph (with self-loops).
        self.src_nodes, self.dst_nodes = zip(*product(range(self.num_features), repeat=2))
        # Data is from PyG; it can be replaced with a dgl.graph object in the same way.
        # However, when N is large, this will cause OOM.
        self.graphs = [Data(x=self.trainingData[idx],
                            edge_index=torch.tensor([list(self.src_nodes), list(self.dst_nodes)],
                                                    dtype=torch.long))
                       for idx in range(self.size)]

    def get(self, index):
        graph = self.graphs[index]
        y = self.trainingData[index, -1]
        return graph, y
```
The second method:
```python
class GraphDataset(Dataset):
    def __init__(self, trainingData):
        super().__init__()
        self.trainingData = trainingData
        self.size = self.trainingData.shape[0]          # N
        self.num_features = self.trainingData.shape[1]  # F

    def get(self, index):
        X = self.trainingData[index, :]
        # Rebuilding the complete-graph edge index for every sample
        # only processes batch data, but it is very time consuming.
        src_nodes, dst_nodes = zip(*product(range(self.num_features), repeat=2))
        graph = Data(x=torch.tensor(X),
                     edge_index=torch.tensor([list(src_nodes), list(dst_nodes)],
                                             dtype=torch.long))
        y = self.trainingData[index, -1]
        return graph, y
```
When N is large, the first method runs out of memory, and the second method is very slow. By slow I mean that with a batch size of 2048 and F = 24, generating the graph objects (i.e., PyG Data or dgl.graph) adds about 20 minutes of overhead. (If I just use non-graph data, one epoch takes 35 min. If I generate graph objects and use DataLoader workers with pin_memory, it takes 55 min. Without DataLoader workers, it takes 2 hours! In fact, I am very confused that generating 2048 graph objects takes so much time.)
Are there any memory/computation-friendly ways or tricks for handling batches of complete graphs? In addition, I think allocating each training sample a complete graph and storing the (training sample, graph) pairs on disk is not applicable, since we would still need to load the whole dataset into memory, which makes it the same as the first method.
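One trick that should apply in both frameworks, since every sample shares the exact same structure: build the complete-graph edge lists once and hand the *same* objects to every sample, instead of recomputing `product(...)` N times. A minimal pure-Python sketch (the `CompleteGraphTemplate` name is hypothetical; a real Dataset would wrap the shared edges in a PyG `Data` or `dgl.graph`):

```python
from itertools import product

class CompleteGraphTemplate:
    """Cache the complete-graph edge lists once and return shared
    references, so per-sample cost is O(1) instead of O(F^2)."""
    def __init__(self, num_fields):
        # Directed complete graph with self-loops: F*F edges.
        src, dst = zip(*product(range(num_fields), repeat=2))
        self.src = list(src)
        self.dst = list(dst)

    def edges(self):
        # Shared references, not copies.
        return self.src, self.dst

template = CompleteGraphTemplate(24)
src1, _ = template.edges()
src2, _ = template.edges()
# Both samples see the same underlying list object (24*24 = 576 edges).
```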
Edit: I just replaced PyG with DGL and used dgl.dataloading.GraphDataLoader:
```python
class GraphDataset(dgl.data.DGLDataset):
    def __init__(self, trainingData):
        self.trainingData = trainingData
        self.size = self.trainingData.shape[0]              # N
        self.num_features = self.trainingData.shape[1] - 1  # F (last column is the label)
        # Build the complete-graph edge lists once; reused by every sample.
        self.src_nodes, self.dst_nodes = zip(*product(range(self.num_features), repeat=2))
        super().__init__(name='complete_graph_dataset')

    def process(self):
        pass  # all work is done lazily in __getitem__

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        X = self.trainingData[index, 0:-1]
        graph = dgl.graph((torch.tensor(self.src_nodes, dtype=torch.long),
                           torch.tensor(self.dst_nodes, dtype=torch.long)))
        graph.ndata['x'] = torch.tensor(X)
        y = torch.tensor(self.trainingData[index, -1])
        return graph, y
```
It is now as fast as not using graph objects at all! That surprised me a lot; I had assumed PyG and DGL would behave the same.
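A plausible explanation for the gap is not PyG vs. DGL per se, but that this version caches the edge lists in `__init__`, while the earlier slow version rebuilt them inside every `get` call. A rough pure-Python sketch of that difference (function names are illustrative, and absolute timings will vary by machine):

```python
from itertools import product
from timeit import timeit

F = 24

def rebuild_every_call():
    # What the slow __getitem__ did: regenerate all F*F = 576 edges per sample.
    return zip(*product(range(F), repeat=2))

# What the fast version does: compute the structure once up front.
CACHED = tuple(zip(*product(range(F), repeat=2)))

def reuse_cached():
    # Per-sample cost is just returning the precomputed tuple.
    return CACHED

slow = timeit(rebuild_every_call, number=10_000)
fast = timeit(reuse_cached, number=10_000)
```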