Seeking Advice on Training a Medium-Sized Heterogeneous Graph Network

Azai-yx · November 15, 2023, 12:17pm

Recently, I’ve been training a medium-sized heterogeneous graph network. The model first updates the embeddings of all nodes in the graph (approximately 140,000) using two layers of rgcn_layer, based on the relation types in the heterogeneous graph. Next, for each training-set edge (src_nodes, edge, dst_nodes), the source and destination node indices are used to look up the corresponding updated node embeddings. A custom linear layer then predicts the relation type of each node pair, and the cross-entropy loss on these predictions is used to update the model.
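To make the setup concrete, here is a minimal sketch of the pair-classification step in plain PyTorch. It is only an illustration under assumptions: the class name `PairRelationClassifier`, the helper `training_step`, and the choice to concatenate the two embeddings before the linear layer are hypothetical, and a random tensor stands in for the output of the two RGCN layers.

```python
import torch
from torch import nn
import torch.nn.functional as F

NUM_EDGE_TYPES = 22               # real edge types mentioned in the post
NUM_CLASSES = NUM_EDGE_TYPES + 1  # plus one "no edge" class for negatives


class PairRelationClassifier(nn.Module):
    """Hypothetical head: concatenate src/dst embeddings, then one linear layer."""

    def __init__(self, embed_dim, num_classes=NUM_CLASSES):
        super().__init__()
        self.linear = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, node_emb, src_nodes, dst_nodes):
        # Index the full (num_nodes, embed_dim) table with the batch's node ids.
        pair_feat = torch.cat([node_emb[src_nodes], node_emb[dst_nodes]], dim=-1)
        return self.linear(pair_feat)


def training_step(node_emb, head, src_nodes, dst_nodes, edge_labels):
    """Return (logits, loss) for one mini-batch of node pairs."""
    logits = head(node_emb, src_nodes, dst_nodes)
    loss = F.cross_entropy(logits, edge_labels)
    return logits, loss


if __name__ == "__main__":
    torch.manual_seed(0)
    num_nodes, embed_dim, batch_size = 1_000, 64, 32  # toy sizes, not the real graph
    # Stand-in for the embeddings produced by the two RGCN layers.
    node_emb = torch.randn(num_nodes, embed_dim, requires_grad=True)
    head = PairRelationClassifier(embed_dim)
    src = torch.randint(0, num_nodes, (batch_size,))
    dst = torch.randint(0, num_nodes, (batch_size,))
    labels = torch.randint(0, NUM_CLASSES, (batch_size,))
    logits, loss = training_step(node_emb, head, src, dst, labels)
    loss.backward()  # gradients reach the full embedding table
    print(logits.shape, float(loss))
```

Note that the gradient of the batch loss flows back into the whole embedding table, so in a design like this the full-graph forward pass has to be kept for the backward pass no matter how few pairs are in the batch.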

In practice, I also generate negative sample pairs and treat “non-existent edge” as one more edge type to predict. So the graph has 22 real edge types plus one negative-sample type, and the cross-entropy loss is computed over all 23 classes together.
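One way such negative pairs might be generated is sketched below using only the standard library. The function name `sample_negative_pairs`, the label value `NO_EDGE_LABEL = 22` (assuming the real edge types are labeled 0 to 21), and the uniform rejection sampling are all assumptions for illustration, not the method actually used in this post. Edges are treated as directed.

```python
import random

NO_EDGE_LABEL = 22  # assumption: real edge types use labels 0..21


def sample_negative_pairs(num_nodes, positive_edges, num_samples, seed=0):
    """Sample distinct (src, NO_EDGE_LABEL, dst) triples that are not real edges.

    positive_edges is an iterable of (src, edge_type, dst) triples.
    """
    rng = random.Random(seed)
    existing = {(s, d) for s, _, d in positive_edges}
    # Number of ordered pairs without self-loops that are still free.
    free_pairs = num_nodes * (num_nodes - 1) - sum(1 for s, d in existing if s != d)
    if num_samples > free_pairs:
        raise ValueError("not enough non-edges to sample from")

    negatives = set()
    while len(negatives) < num_samples:
        s, d = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if s != d and (s, d) not in existing:
            negatives.add((s, d))
    return [(s, NO_EDGE_LABEL, d) for s, d in sorted(negatives)]


if __name__ == "__main__":
    positives = [(0, 3, 1), (1, 5, 2)]
    print(sample_negative_pairs(10, positives, 5))
```

Rejection sampling like this is cheap when the graph is sparse, since almost every random pair is a non-edge; for dense graphs a different strategy would be needed.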

However, something unexpected yet somewhat predictable occurred: regardless of the batch_size setting, GPU memory usage barely changes and stays at around 20 GB. My own interpretation is that the node embeddings before and after the update are already in GPU memory, so usage doesn’t change much with the number of node pairs I index.
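A rough back-of-the-envelope calculation illustrates why the batch-dependent part can be small next to the full-graph part. The embedding width (256) and the batch sizes below are assumptions chosen for illustration; only the node count and the 23 classes come from the description above, and float32 storage is assumed throughout.

```python
BYTES_PER_FLOAT32 = 4


def tensor_mib(rows, cols, bytes_per_elem=BYTES_PER_FLOAT32):
    """Size in MiB of a dense rows x cols tensor."""
    return rows * cols * bytes_per_elem / 2**20


NUM_NODES = 140_000        # from the post
NUM_CLASSES = 23           # 22 edge types + 1 negative class
EMBED_DIM = 256            # assumption: not stated in the post
BATCH_SIZES = (256, 4_096)  # assumption: example batch sizes

if __name__ == "__main__":
    # One full-graph activation, independent of batch_size.
    per_layer = tensor_mib(NUM_NODES, EMBED_DIM)
    print(f"one full-graph embedding tensor: {per_layer:.1f} MiB")

    for batch_size in BATCH_SIZES:
        # Gathered src + dst embeddings and the logits scale with batch_size.
        gathered = tensor_mib(batch_size, 2 * EMBED_DIM)
        logits = tensor_mib(batch_size, NUM_CLASSES)
        print(f"batch_size={batch_size}: {gathered + logits:.2f} MiB per batch")
```

The full-graph tensors are produced on every step regardless of batch_size, while the per-batch tensors grow linearly with it and only become comparable once the batch approaches half the number of nodes.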

As I haven’t had much experience training graph models, I would like to know whether this approach is reasonable. What potential issues could it bring? For now, the training process seems to be running smoothly…

minjie · November 23, 2023, 1:59am

Hi, regarding memory consumption: it is common for it to stay at a fixed level, because PyTorch manages GPU memory through a caching allocator, so memory that is freed is kept reserved to speed up later allocations rather than being returned to the device. Your reasoning also makes sense to me. It is possible that the memory consumed by each mini-batch is tiny compared with that of the node embeddings.
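To check which explanation applies, it can help to compare the memory held by live tensors with the memory PyTorch has reserved from the GPU. The helper below is a small sketch; the name `gpu_memory_mib` is hypothetical, while the `torch.cuda` calls it wraps are part of PyTorch. On a machine without CUDA it simply returns zeros.

```python
import torch


def gpu_memory_mib(device=None):
    """Return current allocated, reserved, and peak allocated GPU memory in MiB."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0, "peak_allocated": 0.0}
    mib = 2**20
    return {
        # Memory occupied by tensors that are currently alive.
        "allocated": torch.cuda.memory_allocated(device) / mib,
        # Memory held by PyTorch, including cached blocks that are free.
        "reserved": torch.cuda.memory_reserved(device) / mib,
        # Highest "allocated" value since the last peak reset.
        "peak_allocated": torch.cuda.max_memory_allocated(device) / mib,
    }


if __name__ == "__main__":
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    # ... run one training step here ...
    print(gpu_memory_mib())
```

If `peak_allocated` barely moves when batch_size changes, the footprint is probably dominated by the full-graph computation; if `reserved` is much larger than `allocated`, part of the reported usage is just cached memory.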

I don’t see any significant risk in what you describe. Of course, the devil is always in the details. Good luck with your modeling!
