GAT for graph classification

Hi there,

Can I use the GAT model for graph classification? The example model requires a graph g to learn the attention. Is it possible to use GAT for a graph classification task somehow?


I’ve used GAT for graph classification and it works well. As long as you have node features for the graph, GAT will generate attention for you before updating node features.

What kinds of modifications would have to be done to the GAT tutorial by DGL for batched graph classification?

I see that the Cora dataset consists of a single graph, and the model expects this graph when it is initialised:

net = GAT(g,
          in_dim=features.size()[1],
          hidden_dim=8,
          out_dim=7,
          num_heads=2)

I assume that the reference to the graph g must be removed, but I’m a bit unsure about the modifications needed, as the graph g is referred to throughout.

Hi, I’ve made a demo below.

import dgl
import dgl.function as fn
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from dgl.data import MiniGCDataset
from dgl.nn.pytorch import *
from torch.utils.data import DataLoader


class GATLayer(nn.Module):
    def __init__(self,
                 in_dim,
                 out_dim,
                 num_heads,
                 feat_drop=0.,
                 attn_drop=0.,
                 alpha=0.2,
                 agg_activation=F.elu):
        super(GATLayer, self).__init__()

        self.num_heads = num_heads
        self.feat_drop = nn.Dropout(feat_drop)
        self.fc = nn.Linear(in_dim, num_heads * out_dim, bias=False)
        self.attn_l = nn.Parameter(torch.empty(num_heads, out_dim, 1))
        self.attn_r = nn.Parameter(torch.empty(num_heads, out_dim, 1))
        # Initialize the attention parameters explicitly; torch.empty returns
        # uninitialized memory, which can otherwise yield NaN losses.
        nn.init.xavier_normal_(self.attn_l)
        nn.init.xavier_normal_(self.attn_r)
        self.attn_drop = nn.Dropout(attn_drop)
        self.activation = nn.LeakyReLU(alpha)
        self.softmax = edge_softmax

        self.agg_activation=agg_activation

    def clean_data(self):
        # Remove the intermediate node/edge fields created during forward.
        ndata_names = ['ft', 'a1', 'a2']
        edata_names = ['a_drop']
        for name in ndata_names:
            self.g.ndata.pop(name)
        for name in edata_names:
            self.g.edata.pop(name)

    def forward(self, feat, bg):
        # prepare, inputs are of shape V x F, V the number of nodes, F the dim of input features
        self.g = bg
        h = self.feat_drop(feat)
        # V x K x F', K number of heads, F' dim of transformed features
        ft = self.fc(h).reshape((h.shape[0], self.num_heads, -1))
        head_ft = ft.transpose(0, 1)                              # K x V x F'
        a1 = torch.bmm(head_ft, self.attn_l).transpose(0, 1)      # V x K x 1
        a2 = torch.bmm(head_ft, self.attn_r).transpose(0, 1)      # V x K x 1
        self.g.ndata.update({'ft' : ft, 'a1' : a1, 'a2' : a2})
        # 1. compute unnormalized attention values on the edges
        self.g.apply_edges(self.edge_attention)
        # 2. normalize the attention values with edge softmax and apply dropout
        self.edge_softmax()
        # 3. aggregate neighbor features, weighted by the normalized attention
        # (fn.src_mul_edge is the legacy name of fn.u_mul_e in newer DGL)
        self.g.update_all(fn.src_mul_edge('ft', 'a_drop', 'ft'), fn.sum('ft', 'ft'))
        ret = self.g.ndata['ft']                                  # V x K x F'
        ret = ret.flatten(1)                                      # V x (K * F')

        if self.agg_activation is not None:
            ret = self.agg_activation(ret)

        # Clean ndata and edata
        self.clean_data()

        return ret

    def edge_attention(self, edges):
        # an edge UDF to compute un-normalized attention values from src and dst
        a = self.activation(edges.src['a1'] + edges.dst['a2'])
        return {'a' : a}

    def edge_softmax(self):
        attention = self.softmax(self.g, self.g.edata.pop('a'))
        # Dropout attention scores and save them
        self.g.edata['a_drop'] = self.attn_drop(attention)

class GATClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_heads, n_classes):
        super(GATClassifier, self).__init__()

        self.layers = nn.ModuleList([
            GATLayer(in_dim, hidden_dim, num_heads),
            GATLayer(hidden_dim * num_heads, hidden_dim, num_heads)
        ])
        self.classify = nn.Linear(hidden_dim * num_heads, n_classes)

    def forward(self, bg):
        # For undirected graphs, in_degree is the same as
        # out_degree.
        h = bg.in_degrees().view(-1, 1).float()
        for i, gnn in enumerate(self.layers):
            h = gnn(h, bg)
        bg.ndata['h'] = h
        hg = dgl.mean_nodes(bg, 'h')
        return self.classify(hg)

def collate(samples):
    # The input `samples` is a list of pairs
    #  (graph, label).
    graphs, labels = map(list, zip(*samples))
    batched_graph = dgl.batch(graphs)
    return batched_graph, torch.tensor(labels)

# Create training and test sets.
trainset = MiniGCDataset(320, 10, 20)
testset = MiniGCDataset(80, 10, 20)
# Use PyTorch's DataLoader and the collate function
# defined before.
data_loader = DataLoader(trainset, batch_size=32, shuffle=True,
                         collate_fn=collate)

# Create model
model = GATClassifier(1, 16, 8, trainset.num_classes)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
model.train()

epoch_losses = []
for epoch in range(80):
    epoch_loss = 0
    for batch_idx, (bg, label) in enumerate(data_loader):
        prediction = model(bg)
        loss = loss_func(prediction, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (batch_idx + 1)
    print('Epoch {}, loss {:.4f}'.format(epoch, epoch_loss))
    epoch_losses.append(epoch_loss)

model.eval()
# Convert a list of tuples to two lists
test_X, test_Y = map(list, zip(*testset))
test_bg = dgl.batch(test_X)
test_Y = torch.tensor(test_Y).float().view(-1, 1)
probs_Y = torch.softmax(model(test_bg), 1)
sampled_Y = torch.multinomial(probs_Y, 1)
argmax_Y = torch.max(probs_Y, 1)[1].view(-1, 1)
print('Accuracy of sampled predictions on the test set: {:.4f}%'.format(
    (test_Y == sampled_Y.float()).sum().item() / len(test_Y) * 100))
print('Accuracy of argmax predictions on the test set: {:.4f}%'.format(
    (test_Y == argmax_Y.float()).sum().item() / len(test_Y) * 100))


Thank you @mufeili, you’re far too kind!

I hope to pay back in form of a very interesting dataset next month! :slightly_smiling_face:

No worries. I’ve done something similar for my other projects, so this was relatively low-hanging fruit. :grinning:


Hi @mufeili, thank you for providing the code for GAT graph classification. Rather than taking the mean of the node representations (hg = dgl.mean_nodes(bg, 'h')), I would like to perform Conv2d on them.
I would have assumed that the modification would be something like this:

class GATClassifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, num_heads, n_classes):
        super(GATClassifier, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.layers = nn.ModuleList([
            GATLayer(in_dim, hidden_dim, num_heads),
            GATLayer(hidden_dim * num_heads, hidden_dim, num_heads)])
        self.classify = nn.Linear(hidden_dim * num_heads, n_classes)

    def forward(self, bg):
        # For undirected graphs, in_degree is the same as
        # out_degree. 
        h = bg.in_degrees().view(-1, 1).float().to(device)
        for i, gnn in enumerate(self.layers):
            h = gnn(h, bg)
        bg.ndata['h'] = h
        mo = nn.Conv2d(self.hidden_dim * self.num_heads, 1, 3)
        hg = mo(bg.ndata['h'])
        return self.classify(hg)

but so far it is not working. Any hints?
Regards

Could you please provide the error message?

Hello @zihao, Thank you for your reply.

Below is my error message:
RuntimeError: Expected 4-dimensional input for 4-dimensional weight 1 192 2 2, but got 2-dimensional input of size [26447, 192] instead

for the function below:

def forward(self, bg):
    # For undirected graphs, in_degree is the same as
    # out_degree. 
    h = bg.in_degrees().view(-1, 1).float().to(device)
    for i, gnn in enumerate(self.layers):
        h = gnn(h, bg)
    bg.ndata['h'] = h
    mo = nn.Conv2d(self.hidden_dim * self.num_heads, 1, 2)
    hg = mo(bg.ndata['h'])
    return self.classify(hg)

My goal is to perform convolution on representations of each graph rather than a batch of graphs, and I believe this is where the challenge is.

Regards,
Ali

Can you identify where the error happens? Conv2d?
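
For reference, the error message does point at nn.Conv2d: it expects a 4-D input of shape (N, C, H, W), while bg.ndata['h'] is a 2-D tensor of shape (V, F), here [26447, 192]. A second issue is that the Conv2d is created inside forward, so it is re-initialized on every call and its weights are never trained; it should be built once in __init__. Below is a minimal sketch of one way to convolve per-graph representations rather than the whole batch. Note that it swaps the framing: each graph's node-feature matrix is treated as a one-channel image, with the node dimension padded to a fixed size. max_nodes is a hypothetical upper bound on nodes per graph, and conv is assumed to be an nn.Conv2d(1, c_out, kernel_size) built in the model's __init__.

import dgl
import torch
import torch.nn.functional as F

def conv_readout(bg, conv, max_nodes):
    # Apply `conv` to each graph in the batch separately.
    outs = []
    for g in dgl.unbatch(bg):
        h = g.ndata['h']                                 # (num_nodes, feat_dim)
        # pad the node dimension so every graph has the same spatial size
        h = F.pad(h, (0, 0, 0, max_nodes - h.shape[0]))  # (max_nodes, feat_dim)
        h = h.unsqueeze(0).unsqueeze(0)                  # (1, 1, max_nodes, feat_dim)
        outs.append(conv(h).flatten(1))                  # (1, conv_out_size)
    return torch.cat(outs, dim=0)                        # (batch_size, conv_out_size)

The final nn.Linear in the classifier then has to match conv's flattened output size instead of hidden_dim * num_heads.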

@mufeili Thanks for the great implementation.

I do have one question since I’m relatively new to GAT. How does the architecture handle the nodes' data/features?

I’m inputting a graph with 4 simple features on each node (stored in ndata). Where do these features come into play and affect the model?

Thanks

Hi, thanks for your example.

If the edges of the network have weights, how can I utilize these weights in GAT for graph classification? Could you give an example in code?

Also, when I use my own dataset, the loss sometimes becomes NaN. Why is that?

Hi, you may want to check our tutorial on GAT, where we explain how node features are used to compute edge weights (attention) as well as to update node representations.
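
To make this concrete with the demo above: GATClassifier initializes h from the node in-degrees, so any features stored in ndata are ignored unless you feed them in yourself. A minimal sketch, assuming your 4-dimensional features are stored under the hypothetical key bg.ndata['feat']:

class GATClassifierWithFeats(GATClassifier):
    def forward(self, bg):
        # use the stored node features instead of the in-degrees
        h = bg.ndata['feat'].float()      # (V, 4)
        for gnn in self.layers:
            h = gnn(h, bg)
        bg.ndata['h'] = h
        hg = dgl.mean_nodes(bg, 'h')      # per-graph mean of node features
        return self.classify(hg)

The in_dim passed to the constructor must then match the feature size, e.g. GATClassifierWithFeats(4, 16, 8, n_classes).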

  1. GAT employs multi-head attention in updating node representations. If you have some prior edge weights, you can treat them as additional non-learnable heads. Say you have edge_feats, a tensor of shape (E, 1) with E being the number of edges. Then you can augment the computed attention with self.g.edata['a_drop'] = torch.cat([self.g.edata['a_drop'], edge_feats], dim=1). Note that you will need to increase the input size of the following GAT layers accordingly (see also the sketch after this list for a simpler variant).
  2. GATs can be numerically unstable, and you may need to try the weight initialization methods referenced here.
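
As a concrete illustration of point 1, here is a minimal sketch of a GATLayer variant. It deliberately swaps the extra-head concatenation above for a simpler rescaling of the learned attention by the prior weights (concatenating an extra head would also require widening ft to K + 1 heads, since the message function broadcasts attention over the heads). It assumes the prior edge weights are stored under the hypothetical key bg.edata['w'] with shape (E, 1):

class GATLayerWithPrior(GATLayer):
    def edge_softmax(self):
        # normalized multi-head attention, shape (E, K, 1)
        attention = self.softmax(self.g, self.g.edata.pop('a'))
        # prior edge weights, (E, 1) -> (E, 1, 1) so they broadcast over heads
        prior = self.g.edata['w'].unsqueeze(-1)
        self.g.edata['a_drop'] = self.attn_drop(attention) * prior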

Thank you very much for your timely response and I will try your suggestions.

Hi @mufeili ,

Thanks for this example and the useful comments.
I have run it as is and the predictions are just a tensor of NaNs. Any ideas on a quick fix?

EDIT: Apologies, it was a lack of VRAM. Changing from

model = GATClassifier(1, 16, 8, trainset.num_classes)

to

model = GATClassifier(1, 4, 8, trainset.num_classes)

fixed it.

Many Thanks


Hi

Thanks to @mufeili for sharing his code snippet.

Can someone give me advice on how to choose the right size for the hidden dim and the number of heads?

I have graphs with 8 features on each node. Is it OK to set the hidden dim to 16, or should I set it to something larger than 16?

Hi,

In general, the ideal hidden size depends on your dataset and model, and it is usually a matter of hyperparameter tuning and optimization. For instance, if your dataset is small, it's usually not a good idea to use a large hidden dimensionality. Grid search and random search are always two options (see the sketch below).
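
For instance, a random search over these two hyperparameters might look like the sketch below. train_and_evaluate is a hypothetical helper wrapping the training loop above and returning validation accuracy, n_classes is your number of classes, and the candidate values are illustrative, not recommendations:

import random

best_acc, best_cfg = 0.0, None
for _ in range(10):
    # sample a candidate configuration
    cfg = {'hidden_dim': random.choice([8, 16, 32, 64]),
           'num_heads': random.choice([2, 4, 8])}
    # in_dim = 8 to match the 8 node features in this case
    model = GATClassifier(8, cfg['hidden_dim'], cfg['num_heads'], n_classes)
    acc = train_and_evaluate(model)   # hypothetical helper
    if acc > best_acc:
        best_acc, best_cfg = acc, cfg
print('best config:', best_cfg, 'accuracy:', best_acc)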

Hi @mufeili,

first of all, thank you very much for the graph classification implementation!
One question: if I include more than one hidden layer
(i.e. add another GATLayer(hidden_dim * num_heads, hidden_dim, num_heads) line),
I get NaN values… any idea why?

Thanks in advance! :smiley:

Most likely you have a gradient explosion issue.

  1. You can first check the gradient norms before and after adding a GATLayer and see if there is any difference (a minimal sketch follows this list).
  2. DGL now has built-in support for GAT layers; see GATConv, which is probably more robust.
  3. You can try some common techniques to make the training more robust, e.g. adding residual connections.
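
For point 1, a minimal sketch for inspecting gradient norms, to be called right after loss.backward() in the training loop above:

def report_grad_norms(model):
    # print the gradient norm of every parameter plus the total norm
    total = 0.0
    for name, p in model.named_parameters():
        if p.grad is not None:
            norm = p.grad.norm().item()
            total += norm ** 2
            print('{}: grad norm {:.4f}'.format(name, norm))
    print('total grad norm: {:.4f}'.format(total ** 0.5))

If the norms do blow up, calling torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) after backward() is a common remedy, in addition to the residual connections mentioned above.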