Activation function of last layer and reproducibility of results with GraphSAGE

Hi everyone!

I’m working on a node classification problem using GraphSAGE. I’m new to GNNs, so I’m following the GraphSAGE tutorials for classification tasks [1] and [2]. The code seems clear to me, but I have a couple of doubts.

Doubt 1:
According to the algorithm presented in the paper, every layer goes through an activation function. Why does the last layer in these codes not go through an activation function? So, in a classification problem, would it be correct to put a softmax or sigmoid function on that last layer, or what should the activation function of the last layer be?

Doubt 2:
For my problem, I selected a small portion of the graph with 128 nodes (14 labeled as 1 and 114 as 0) and 505 edges; each node has 17 attributes. This is the code I’m using: a 3-layer GNN with input size 17 and output size 2 (binary classification problem):

class GraphSAGE(nn.Module):
    def __init__(self,in_feats,n_hidden,n_classes,n_layers,
                 activation,dropout,aggregator_type):
        super(GraphSAGE, self).__init__()
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

        self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, aggregator_type))
        for i in range(n_layers - 1):
            self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, aggregator_type))
        self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, aggregator_type))

    def forward(self, graph, inputs):
        h = self.dropout(inputs)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h)
            if l != len(self.layers) - 1:
                h = self.activation(h)
                h = self.dropout(h)
        return h

modelG = GraphSAGE(in_feats=n_features, # 17
                   n_hidden=16,
                   n_classes=n_labels, #2
                   n_layers=3,
                   activation=F.relu,
                   dropout=0,
                   aggregator_type='mean')

opt = torch.optim.Adam(modelG.parameters())

for epoch in range(50):
    modelG.train() 

    logits = modelG(g, node_features)
    
    loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
    
    acc = evaluate(modelG, g, node_features, node_labels, valid_mask)
    
    opt.zero_grad()
    loss.backward()
    opt.step()
    
    if epoch % 5 == 0:
        print('In epoch {}, loss: {}'.format(epoch, loss),)

Every time I train the model (without changing anything), the performance changes a lot: the accuracy varies between 0.64 and 0.87. How can I guarantee the reproducibility of the results? I have tried setting the PyTorch seed with torch.manual_seed(), setting the NumPy seed, and setting the dropout to 0, but the results keep varying. Is this normal, or am I missing something?

Thanks !!!

  1. For the last layer, you can either put a softmax/sigmoid function directly after it and use a loss function like BCELoss, or leave the softmax/sigmoid out and use a loss function like BCEWithLogitsLoss, which combines a sigmoid layer and a BCELoss. I personally prefer the latter, which can be more numerically stable (see the sketch after this list).
  2. Which DGL version are you using? As of 0.5, DGL should have fixed the randomness in computation. Meanwhile, assuming some randomness exists in DGL or PyTorch, your graph is probably too small to yield a relatively stable result.
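
A minimal sketch of the two options in point 1, in plain PyTorch (the tensors raw_scores and binary_labels are hypothetical stand-ins for your last-layer output and labels, not part of the code above):

import torch
import torch.nn as nn

raw_scores = torch.randn(8)                        # unnormalized outputs of the last layer (no activation)
binary_labels = torch.randint(0, 2, (8,)).float()  # ground-truth 0/1 labels

# Option A: apply sigmoid yourself, then BCELoss on the probabilities
loss_a = nn.BCELoss()(torch.sigmoid(raw_scores), binary_labels)

# Option B: keep the raw logits and let BCEWithLogitsLoss apply the sigmoid internally
# (numerically more stable, which is why it is preferred above)
loss_b = nn.BCEWithLogitsLoss()(raw_scores, binary_labels)

# With a 2-class output and F.cross_entropy, as in the code above, the situation is the same:
# cross_entropy applies log-softmax internally, so the last layer needs no activation either.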

Thanks @mufeili !!!

  1. I will definitely try your suggestion.
  2. I’m using dgl==0.5.1, and I hadn’t considered that it might be due to the size of the graph. Thanks again; I will try it with the complete graph.

Have you fixed all random seeds like below?

import numpy as np
import random
import torch

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
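
If any remaining randomness comes from DGL itself (e.g. from its samplers), you could also seed DGL’s own random number generator alongside the others; a minimal sketch, assuming your DGL version exposes dgl.seed():

import dgl

dgl.seed(seed)  # seeds DGL's internal RNG (used by its random ops and samplers)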

If so, then it’s strange that there is a non-deterministic behavior. Can you share a code snippet for reproducing the issue?

By the size of the graph, I am referring to the number of nodes as you are dealing with node classification.


Yes, the seeds that I’m setting are:

np.random.seed(10)
random.seed(10)
torch.manual_seed(10)

I’m not setting the CUDA seed because I’m not using a GPU.

Actually, I’m not getting an error in the code, but the accuracy, sensitivity, and specificity vary by an amount that I think is high considering that the seeds are set. This is my complete code:

import dgl.nn.pytorch as dglnn
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import confusion_matrix

# (the graph g, node_features, node_labels, train_mask, valid_mask, n_features
#  and n_labels are built from my data elsewhere)

class GraphSAGE(nn.Module):
    def __init__(self,in_feats,n_hidden,n_classes,n_layers,
                 activation,dropout,aggregator_type):
        super(GraphSAGE, self).__init__()
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

        # input layer
        self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, aggregator_type))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, aggregator_type))
        # output layer
        self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, aggregator_type)) # activation None

    def forward(self, graph, inputs):
        h = self.dropout(inputs)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h)
            if l != len(self.layers) - 1:
                h = self.activation(h)
                h = self.dropout(h)
        return h

def evaluate(model, graph, features, labels, mask):
    model.eval() # will notify all your layers that you are in eval mode, that way, batchnorm or dropout layers will work in eval mode instead of training mode
    with torch.no_grad(): #impacts the autograd engine and deactivate it. It will reduce memory usage and speed up computation
        logits = model(graph, features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1) #Returns a namedtuple (values, indices) where values is the maximum value of each row of the input tensor in the given dimension dim. And indices is the index location of each maximum value found (argmax)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

modelG = GraphSAGE(in_feats=n_features, # 17
                   n_hidden=16,
                   n_classes=n_labels, #2
                   n_layers=3,
                   activation=F.relu,
                   dropout=0,
                   aggregator_type='mean')

opt = torch.optim.Adam(modelG.parameters())

for epoch in range(50):
    modelG.train() #tells your model that you are training the model. So effectively layers like dropout, batchnorm etc. which behave different on the train and test procedures know what is going on and hence can behave accordingly.
    # forward propagation by using all nodes
    logits = modelG(g, node_features)
    # compute loss
    loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
    # compute validation accuracy
    acc = evaluate(modelG, g, node_features, node_labels, valid_mask)
    # backward propagation
    opt.zero_grad()
    loss.backward()
    opt.step()
    
    if epoch % 5 == 0:
        print('In epoch {}, loss: {}'.format(epoch, loss),)

pred = torch.argmax(logits, axis=1)
print('Accuracy', (pred == node_labels).sum().item() / len(pred))

def predictedlab(model, features, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g,features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
       
    return indices

y_pred = predictedlab(modelG,  node_features, node_labels, valid_mask)
tn, fp, fn, tp = confusion_matrix(node_labels[valid_mask], y_pred).ravel()

print("Sensitivity {:.4f}".format(tp / (tp + fn)))
print("Specificity {:.4f}".format(tn / (tn + fp)))

So, when running the code above, I get the following results:

Accuracy 0.835937
Sensitivity 0.4000
Specificity 0.7619

These are good results; the issue is that when running the code multiple times, the results vary by an amount that I think is high considering that the seeds are set. For example, I also get these results:

Accuracy 0.625
Sensitivity 0.6000
Specificity 0.5714

I tried running your code on a synthetic dataset, and it does seem to yield deterministic behavior across runs.

import dgl
import dgl.nn.pytorch as dglnn
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.sparse import rand

class GraphSAGE(nn.Module):
    def __init__(self,in_feats,n_hidden,n_classes,n_layers,
                 activation,dropout,aggregator_type):
        super(GraphSAGE, self).__init__()
        self.layers = nn.ModuleList()
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

        # input layer
        self.layers.append(dglnn.SAGEConv(in_feats, n_hidden, aggregator_type))
        # hidden layers
        for i in range(n_layers - 1):
            self.layers.append(dglnn.SAGEConv(n_hidden, n_hidden, aggregator_type))
        # output layer
        self.layers.append(dglnn.SAGEConv(n_hidden, n_classes, aggregator_type)) # activation None

    def forward(self, graph, inputs):
        h = self.dropout(inputs)
        for l, layer in enumerate(self.layers):
            h = layer(graph, h)
            if l != len(self.layers) - 1:
                h = self.activation(h)
                h = self.dropout(h)
        return h

def evaluate(model, graph, features, labels, mask):
    model.eval() # will notify all your layers that you are in eval mode, that way, batchnorm or dropout layers will work in eval mode instead of training mode
    with torch.no_grad(): #impacts the autograd engine and deactivate it. It will reduce memory usage and speed up computation
        logits = model(graph, features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1) #Returns a namedtuple (values, indices) where values is the maximum value of each row of the input tensor in the given dimension dim. And indices is the index location of each maximum value found (argmax)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

seed = 10
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)

modelG = GraphSAGE(in_feats=1,
                   n_hidden=16,
                   n_classes=2,
                   n_layers=3,
                   activation=F.relu,
                   dropout=0,
                   aggregator_type='mean')

opt = torch.optim.Adam(modelG.parameters())

# 128 nodes, 14 labeled as 1, 114 as 0
num_nodes = 128
num_pos_nodes = 14
density = 0.05
adj = rand(num_nodes, num_nodes, density=density)
g = dgl.from_scipy(adj)
node_features = torch.ones(num_nodes, 1)
node_labels = torch.zeros(num_nodes).long()
pos_idx = torch.randperm(num_nodes)[:num_pos_nodes]
node_labels[pos_idx] = 1
num_train_nodes = int(128 * 0.6)
# boolean masks so that logits[train_mask] selects rows by mask rather than by index
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[:num_train_nodes] = True
valid_mask = torch.zeros(num_nodes, dtype=torch.bool)
valid_mask[num_train_nodes:] = True

for epoch in range(50):
    modelG.train() #tells your model that you are training the model. So effectively layers like dropout, batchnorm etc. which behave different on the train and test procedures know what is going on and hence can behave accordingly.
    # forward propagation by using all nodes
    logits = modelG(g, node_features)
    # compute loss
    loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
    # compute validation accuracy
    acc = evaluate(modelG, g, node_features, node_labels, valid_mask)
    # backward propagation
    opt.zero_grad()
    loss.backward()
    opt.step()
    
    if epoch % 5 == 0:
        print('In epoch {}, loss: {}'.format(epoch, loss),)

pred = torch.argmax(logits, axis=1)
print('Accuracy', (pred == node_labels).sum().item() / len(pred))

def predictedlab(model, features, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g,features)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
       
    return indices

y_pred = predictedlab(modelG,  node_features, node_labels, valid_mask)

Thanks for your help @mufeili ! I’m going to check my data.