How to Enhance Generalization in Graph Neural Networks using DGL

Hi everyone, I need guidance regarding Graph Neural Networks (GNNs). One example I’ve struggled to understand well involves working with abstract graphs in our training dataset. Specifically, I aim to predict the degree of every node; framed as graph regression, the model should then output the average degree of the graph. Training such a model to a low loss is straightforward, but the challenge lies in its ability to generalize to graphs of different sizes or distributions.

The setup I’m using involves GNN layers (various types like GIN, GCN, SAGE, etc.), adding self-loops, and incorporating dummy features at the node level (e.g., degree). However, I’ve encountered an issue where the model fails to learn relevant information; instead, it simply averages the node degrees regardless of the graph’s structure.
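Roughly, the setup looks like the following sketch (DGL with a PyTorch backend; the `GraphConv` layer, hidden size, and the `DegreeRegressor` name are just placeholders to illustrate the structure, not my exact configuration):

```python
import dgl
import torch
import torch.nn as nn
from dgl.nn import GraphConv, SumPooling   # GINConv / SAGEConv can be swapped in


class DegreeRegressor(nn.Module):
    """One GNN layer plus a graph-level readout, predicting a single scalar per graph."""

    def __init__(self, in_feats=1, hidden=16):
        super().__init__()
        self.conv = GraphConv(in_feats, hidden)  # placeholder layer; GIN/SAGE variants work similarly
        self.pool = SumPooling()                 # summation readout
        self.head = nn.Linear(hidden, 1)

    def forward(self, g, feat):
        h = torch.relu(self.conv(g, feat))
        hg = self.pool(g, h)                     # (batch_size, hidden)
        return self.head(hg).squeeze(-1)


g = dgl.add_self_loop(dgl.rand_graph(10, 30))    # self-loops added, as in my setup
x = torch.ones(g.num_nodes(), 1)                 # dummy constant node feature
model = DegreeRegressor()
pred = model(g, x)                               # one scalar prediction for the graph
```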

I initially believed that one GNN layer should suffice to capture neighborhood information. For aggregation, I’ve experimented with summation, averaging, and even max aggregation, yet none have resolved the problem. Currently, I use summation pooling layers alongside MSE loss.
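For the graph-level readout, the sum, average, and max options map onto DGL's pooling modules; a quick sketch with a random graph and placeholder embeddings:

```python
import dgl
import torch
from dgl.nn import SumPooling, AvgPooling, MaxPooling

g = dgl.add_self_loop(dgl.rand_graph(8, 20))
h = torch.ones(g.num_nodes(), 4)      # placeholder node embeddings

sum_readout = SumPooling()(g, h)      # grows with the number of nodes
avg_readout = AvgPooling()(g, h)      # invariant to graph size
max_readout = MaxPooling()(g, h)
```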

What do you think could be the solution to this problem? Is it possible to generalize beyond the distribution of the training data, or is the issue rooted in the model learning a mapping that minimizes loss without capturing the correct relationships? I seek guidance on how to address this challenge and improve my understanding of GNN capabilities. If you know of any relevant papers addressing this problem, please share them with me so I can identify where I might be going wrong.

Hi @walidgeuttala, could you elaborate a bit more on your experimental setting?

Predicting the degree of every node means node-level regression, but you mentioned that you apply graph regression as the task.

Since you provide the node degrees as features, with an average pooling layer each node can simply ignore message passing on the graph and preserve its own degree information. With summation pooling, the parameters just need to fit and memorize the graph size.

Hi @dyru, thank you for your comment. I have taken, for example, the MUTAG dataset and removed the node features, replacing them with either a single constant scalar feature per node or the node's degree. I apply k GIN blocks, each consisting of a GIN layer with sum aggregation followed by ReLU and batch normalization. On the node embeddings from the last block, I apply an average pooling layer. The label is the graph's average degree, so this is a graph regression task.

When I train the model, it gives a very low loss on the train set and on test1, which is part of the MUTAG dataset. However, I also have a synthetic dataset of Barabási-Albert, Watts-Strogatz, grid, and Erdős-Rényi graphs with sizes between 250 and 1025 nodes. These graphs have average degrees mostly between 4 and 8 (so around 6 on average), yet the MSE loss on them is around 46, i.e., an error of about 6.78. The model somehow predicts these graphs very poorly; it could not even learn to simply pass the value through the GIN layers. Any explanation for this?
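For reference, the synthetic test graphs can be generated roughly like this (a sketch only; the generator parameters shown are illustrative values picked so the average degree falls in the 4 to 8 range, not my exact settings):

```python
import dgl
import networkx as nx

def avg_degree(g):
    # Undirected NetworkX graphs become bidirected DGL graphs, so in-degree equals the undirected degree.
    return g.in_degrees().float().mean().item()

n = 500  # my graphs range from 250 to 1025 nodes
test_graphs = {
    "barabasi_albert": nx.barabasi_albert_graph(n, m=3),
    "watts_strogatz": nx.watts_strogatz_graph(n, k=6, p=0.1),
    "grid": nx.grid_2d_graph(20, 25),
    "erdos_renyi": nx.erdos_renyi_graph(n, p=6 / n),
}

for name, nxg in test_graphs.items():
    g = dgl.from_networkx(nx.convert_node_labels_to_integers(nxg))
    print(name, g.num_nodes(), round(avg_degree(g), 2))  # the regression target is this average degree
```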

In the case of a single constant feature at each node, the sum aggregation of the GIN layer yields the degree of the node. The average pooling at the graph level then gives the average degree, which makes sense: the model only needs to learn to sum the constant feature over each node's neighbors.
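This can be checked directly with DGL's message-passing primitives, independent of any learned weights (a small sanity check; the toy graph here has no self-loops, so the aggregated sum is exactly the degree):

```python
import dgl
import dgl.function as fn
import torch

g = dgl.rand_graph(6, 12)                       # toy graph without self-loops
g.ndata["h"] = torch.ones(g.num_nodes(), 1)     # the constant scalar feature

# One round of sum aggregation of the constant feature reproduces each node's in-degree.
g.update_all(fn.copy_u("h", "m"), fn.sum("m", "deg"))
assert torch.equal(g.ndata["deg"].squeeze(-1), g.in_degrees().float())

# Averaging these node values over the graph gives exactly the average degree.
print(dgl.readout_nodes(g, "deg", op="mean"))
```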

On the other hand, for the degree feature, since each node has a self-loop, it should ignore all the messages it receives from its neighbors and keep only the feature coming through the self-loop. That value should then be carried to the average pooling layer at the graph level. However, the model does not seem to learn either of these two cases.
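To see why this second case is harder, here is the same kind of check with self-loops and the degree as the input feature: a plain sum aggregation returns each node's own degree plus its neighbors' degrees, so the network would have to learn to suppress the neighbor contributions rather than simply pass the feature through. A small sketch on a path graph:

```python
import dgl
import dgl.function as fn
import torch

# A 4-node undirected path 0-1-2-3, with self-loops added as in my setup.
src = torch.tensor([0, 1, 1, 2, 2, 3])
dst = torch.tensor([1, 0, 2, 1, 3, 2])
g = dgl.add_self_loop(dgl.graph((src, dst), num_nodes=4))

deg = g.in_degrees().float().unsqueeze(-1)      # degree feature (self-loops included): [2, 3, 3, 2]
g.ndata["h"] = deg

# With sum aggregation, each node receives its own degree via the self-loop
# plus its neighbors' degrees, so the raw aggregate overcounts the target.
g.update_all(fn.copy_u("h", "m"), fn.sum("m", "agg"))
print(g.ndata["agg"].squeeze(-1))               # tensor([5., 8., 8., 5.]), not [2., 3., 3., 2.]
```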