Training with randomly generated synthetic features

Dear all,

I’ve been working on implementing node and edge features in a GCN for binary node classification, using some randomly generated synthetic features (node & edge). The training always reports an accuracy > 1.00, which cannot be right. The loss also looks wrong…

Epoch 00000 | Loss 0.6931 | Train Acc 3393.0000 | Test Acc 890.0000 | Time(s) 0.0772
Epoch 00001 | Loss 0.6931 | Train Acc 3393.0000 | Test Acc 890.0000 | Time(s) 0.1543
Epoch 00002 | Loss 0.6931 | Train Acc 3393.0000 | Test Acc 890.0000 | Time(s) 0.2339

I’m not sure if the random synthetic data I generated caused it. I usually randomize node features just to check that my architecture runs without errors, and it normally still gives me accuracy < 1.0.

So, has anybody had the same problem? Or can anybody show me a simple implementation that uses node and edge features for binary node classification?

Thanks all

For accuracy numbers greater than 1, most likely you have not divided the number of correct predictions by the number of samples. For the unchanged loss, you can print the loss at each iteration to check whether it really decreases over the first few iterations. If not, most likely you forgot to call optimizer.step(), or the computation graph is broken somewhere so backpropagation cannot reach the parameters.
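For reference, here is a minimal sketch of such an accuracy computation (with made-up predictions and labels, just to illustrate the division):

import torch

# made-up predictions and labels, only to illustrate the computation
logits = torch.randn(5, 2)                       # 5 samples, 2 classes
labels = torch.tensor([0, 1, 1, 0, 1])

_, indices = torch.max(logits, dim=1)            # predicted class per sample
correct = torch.sum(indices == labels)           # number of correct predictions
accuracy = correct.item() * 1.0 / len(labels)    # divide by the number of samples
print(accuracy)                                  # always between 0 and 1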

Thanks @mufeili,

I think I have divided the number of correct predictions by the number of samples with return correct.item() * 1.0 / len(labels).

And I do call optimizer.step() after I call the backward function.

So now I’m thinking that the computation graph is broken. If that is the case, how do I check whether it is?

Thanks so much

  1. Can you check the shape of labels? Also, how is correct computed? (See the illustration below for why the shape can matter.)
  2. After you call loss.backward(), check whether the model parameters have gradients. If not, then the gradient flow is broken somewhere.
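To illustrate the first point with a made-up example (not your actual code): if labels has shape [N, 1] while the predicted indices have shape [N], the comparison broadcasts to an [N, N] matrix, and the count of “correct” predictions can exceed the number of samples:

import torch

indices = torch.tensor([0, 1, 0])         # predicted classes, shape [3]
labels = torch.tensor([[0], [1], [1]])    # labels with shape [3, 1]
correct = torch.sum(indices == labels)    # == broadcasts to a [3, 3] comparison
print(correct.item())                     # prints 4, more than the 3 samples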

Thanks @mufeili,

  1. The size of my labels is torch.Size([5846, 1]), and I calculate correct as follows:
def evaluate(model, g, nfeats, efeats, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g, nfeats, efeats)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)
  2. I printed out g.ndata['h'] and g.ndata['h_neigh'] in each layer; they carry the attributes grad_fn=<ReluBackward0> and grad_fn=<CopyReduceBackward> respectively. My implementation is still the one in Accuracy of training beyond 1.0.

Thanks so much

  1. Your code block looks fine, but since the accuracy is wrong, something must be off elsewhere. I’d recommend printing more intermediate results to figure out what’s going on.
  2. Can you print the loss to see if it changes over epochs?

Thanks @mufeili,

I tried printing out the intermediate results, which are shown below:

logits: tensor([[0.0356],
        [0.4634],
        [0.0000],
        ...,
        [0.0084],
        [0.0000],
        [0.3191]], grad_fn=<ReluBackward0>)
logp: tensor([0., 0., 0.,  ..., 0., 0., 0.], grad_fn=<ViewBackward>) torch.Size([5707])
loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Correct: tensor(947)  | Labels: tensor([1., 0., 1.,  ..., 1., 1., 0.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
test_acc: 0.4937434827945777
Correct: tensor(993)  | Labels: tensor([0., 1., 1.,  ..., 1., 1., 1.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
train_acc: 0.5105398457583548
tst_loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Epoch 00000 | Loss 0.6931 | Train Acc 0.5105 | Test Acc 0.4937 | Time(s) 0.0491
***************************************************************************
logits: tensor([[0.1085],
        [0.2853],
        [0.0793],
        ...,
        [0.5197],
        [0.1882],
        [0.2970]], grad_fn=<ReluBackward0>)
logp: tensor([0., 0., 0.,  ..., 0., 0., 0.], grad_fn=<ViewBackward>) torch.Size([5707])
loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Correct: tensor(947)  | Labels: tensor([1., 0., 1.,  ..., 1., 1., 0.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
test_acc: 0.4937434827945777
Correct: tensor(993)  | Labels: tensor([0., 1., 1.,  ..., 1., 1., 1.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
train_acc: 0.5105398457583548
tst_loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Epoch 00001 | Loss 0.6931 | Train Acc 0.5105 | Test Acc 0.4937 | Time(s) 0.0918
***************************************************************************
logits: tensor([[0.0543],
        [0.4011],
        [0.0465],
        ...,
        [0.2958],
        [0.0811],
        [0.2602]], grad_fn=<ReluBackward0>)
logp: tensor([0., 0., 0.,  ..., 0., 0., 0.], grad_fn=<ViewBackward>) torch.Size([5707])
loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Correct: tensor(947)  | Labels: tensor([1., 0., 1.,  ..., 1., 1., 0.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
test_acc: 0.4937434827945777
Correct: tensor(993)  | Labels: tensor([0., 1., 1.,  ..., 1., 1., 1.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
train_acc: 0.5105398457583548
tst_loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Epoch 00002 | Loss 0.6931 | Train Acc 0.5105 | Test Acc 0.4937 | Time(s) 0.1351
***************************************************************************
logits: tensor([[0.0834],
        [0.3571],
        [0.0044],
        ...,
        [0.0896],
        [0.3438],
        [0.2082]], grad_fn=<ReluBackward0>)
logp: tensor([0., 0., 0.,  ..., 0., 0., 0.], grad_fn=<ViewBackward>) torch.Size([5707])
loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Correct: tensor(947)  | Labels: tensor([1., 0., 1.,  ..., 1., 1., 0.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
test_acc: 0.4937434827945777
Correct: tensor(993)  | Labels: tensor([0., 1., 1.,  ..., 1., 1., 1.])  | Indices: tensor([0, 0, 0,  ..., 0, 0, 0])
train_acc: 0.5105398457583548
tst_loss tensor(0.6931, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
Epoch 00003 | Loss 0.6931 | Train Acc 0.5105 | Test Acc 0.4937 | Time(s) 0.1768

Although the accuracy is now between 0 and 1, the loss and accuracy (both train and test) do not change at all, even when I set the number of epochs to 100. So I still think something is going wrong there, but I could not figure out what causes it.

Thanks all for helping me out.

  1. I assume the input node features are treated as given and they are not learned from scratch?
  2. You can check the gradients of the parameters and see if they are nonzero. An example is given below:
import torch

feats = torch.randn(2, 2)            # toy input features
scores = torch.randn(2, 1)           # toy regression targets
model = torch.nn.Linear(2, 1)
pred = model(feats)
loss = ((pred - scores) ** 2).sum()  # simple squared-error loss
loss.backward()
for param in model.parameters():
    print(param.grad)                # nonzero gradients mean the graph is intact

Thanks @mufeili,

  1. Yes, they (the node features) are treated as given. They have a shape of torch.Size([5925, 9]). But I’m not sure what you mean by “they are not learned from scratch”.
  2. I discovered that the gradients of the parameters are all zeros. How should I fix that, and what causes it?

Thanks so much

  1. There can be cases where node features are embeddings learned from scratch, in which case additional issues might arise.
  2. That makes sense, since your loss does not change at all. Can you disable dropout and see if the issue still exists? Also, did you do anything special for weight initialization? (A quick way to check is sketched below.)
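For example, a generic way to eyeball the initialization (assuming model is your network; this is just a sketch, not specific to your architecture):

# print the shape and mean magnitude of every parameter right after construction
for name, param in model.named_parameters():
    print(name, param.shape, param.abs().mean().item())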

Thanks @mufeili,

  1. I think my node features are not embeddings learned from scratch.
  2. I tried disabling the dropout, but the problem still persists.

I tried to simplify the model as much as possible, but I cannot find anything wrong with it.

class GNNLayer3(nn.Module):
  def __init__(self, ndim_in, edims, ndim_out):
    super(GNNLayer3, self).__init__()
    print("ndim_in:",ndim_in, "ndim_out:",ndim_out, "edims:",edims)
    self.W_msg = nn.Linear(ndim_in + edims, ndim_out) 
    self.W_apply = nn.Linear(ndim_in + ndim_out, ndim_out) 
    
  def message_func(self, edges):
    msg = torch.cat([edges.src['h'],edges.data['h']], 1)
    return {'m': F.relu(self.W_msg(msg))}  
    
  def forward(self, gc, nfeats, efeats):
    with gc.local_scope():
      gc.ndata['h'] = nfeats; gc.edata['h'] = efeats          
      gc.update_all(self.message_func, 
                   fn.sum('m', 'h_neigh'))      
      app = torch.cat([gc.ndata['h'], gc.ndata['h_neigh']], 1)
      gc.ndata['h'] = F.relu(self.W_apply(app))
      return gc.ndata['h']
# ===================================================================

class NetArchie3(nn.Module):
  def __init__(self, ndim_in, ndim_out, edim):
    super(NetArchie3, self).__init__()
    self.layers = nn.ModuleList()
    self.layers.append(GNNLayer3(ndim_in, edim, 50))
    self.layers.append(GNNLayer3(50, edim, ndim_out))

  def forward(self, g_dgl, nfeats, efeats):
    for i, layer in enumerate(self.layers):
      nfeats = layer(g_dgl, nfeats, efeats)
    return nfeats
# ===================================================================

model = NetArchie3(9, 1, 2)

def evaluate(model, g, nfeats, efeats, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g, nfeats, efeats)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

for epoch in range(10):
  model.train()
  logits = model(gc, gc.ndata['hn'], gc.edata['he'])
  logp = F.log_softmax(logits, 1)
  logpr = logp.view(gc.number_of_nodes(),)
  loss = loss_fcn(logpr[train_mask], labels[train_mask])
  optimizer.zero_grad(); loss.backward(); optimizer.step()
  test_acc = evaluate(model, gc, gc.ndata['hn'], gc.edata['he'], labels, test_mask)
  print("test_acc:",test_acc, "loss", loss.items())

Perhaps you can spot something peculiar in the design. Thanks mufeili.

Can you give a code snippet that I can run directly to reproduce the issue? My guess is that something is wrong in these lines:

logp = F.log_softmax(logits, 1)
logpr = logp.view(gc.number_of_nodes(),)
loss = loss_fcn(logpr[train_mask], labels[train_mask])

Thanks for your time mufeili,

Although I modified the suspicious lines you mentioned above, the result is still the same. Here is the full version of the code:

import random

import networkx as nx
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import dgl
import dgl.function as fn

g_nx = nx.random_regular_graph(random.randint(1,3),100)
gc = dgl.DGLGraph()
gc.from_networkx(g_nx) 

gc.ndata['hn'] = torch.randint(0, 2, (gc.number_of_nodes(),9), dtype=torch.float32)
gc.edata['he'] = torch.randint(0, 2, (gc.number_of_edges(),2), dtype=torch.float32)
labels = torch.randint(0, 2, (gc.number_of_nodes(),1), dtype=torch.float32)

TTE_possible = [[True,False,False],[False,True,False],[False,False,True]]
TTE_indices = np.random.choice(len(TTE_possible), gc.number_of_nodes(), replace=True)
TTE = [TTE_possible[i] for i in TTE_indices]
train_mask = torch.tensor([TTE[i][0] for i in range(len(TTE))])
test_mask = torch.tensor([TTE[i][1] for i in range(len(TTE))])
eval_mask = torch.tensor([TTE[i][2] for i in range(len(TTE))])

class GNNLayer3(nn.Module):
  def __init__(self, ndim_in, edims, ndim_out):
    super(GNNLayer3, self).__init__()
    #print("ndim_in:",ndim_in, "ndim_out:",ndim_out, "edims:",edims)
    self.W_msg = nn.Linear(ndim_in + edims, ndim_out) 
    self.W_apply = nn.Linear(ndim_in + ndim_out, ndim_out) 
    
  def message_func(self, edges):
    msg = torch.cat([edges.src['h'],edges.data['h']], 1)
    return {'m': F.relu(self.W_msg(msg))}  
    
  def forward(self, gc, nfeats, efeats):
    with gc.local_scope():
      gc.ndata['h'] = nfeats; gc.edata['h'] = efeats          
      gc.update_all(self.message_func, 
                   fn.sum('m', 'h_neigh'))      
      app = torch.cat([gc.ndata['h'], gc.ndata['h_neigh']], 1)
      gc.ndata['h'] = F.relu(self.W_apply(app))
      return gc.ndata['h']
# ===================================================================

class NetArchie3(nn.Module):
  def __init__(self, ndim_in, ndim_out, edim):
    super(NetArchie3, self).__init__()
    self.layers = nn.ModuleList()
    self.layers.append(GNNLayer3(ndim_in, edim, 50))
    self.layers.append(GNNLayer3(50, edim, ndim_out))

  def forward(self, g_dgl, nfeats, efeats):
    for i, layer in enumerate(self.layers):
      nfeats = layer(g_dgl, nfeats, efeats)
    return nfeats
# ===================================================================

model = NetArchie3(9, 1, 2)

def evaluate(model, g, nfeats, efeats, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g, nfeats, efeats)
        logits = logits[mask]
        labels = labels[mask]
        _, indices = torch.max(logits, dim=1)
        correct = torch.sum(indices == labels)
        return correct.item() * 1.0 / len(labels)

for epoch in range(10):
  model.train()
  logits = model(gc, gc.ndata['hn'], gc.edata['he'])
  logp = F.log_softmax(logits, 1)
  loss = loss_fcn(logp[train_mask], labels[train_mask])
  optimizer.zero_grad(); loss.backward(); optimizer.step()
  test_acc = evaluate(model, gc, gc.ndata['hn'], gc.edata['he'], labels, test_mask)
  print("test_acc:",test_acc, "loss", loss.item())

How did you define loss_fcn and optimizer?

I’m sorry, I forgot to mention them. They are defined as

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fcn = torch.nn.BCEWithLogitsLoss()

Then you simply need to replace

logp = F.log_softmax(logits, 1)
loss = loss_fcn(logp[train_mask], labels[train_mask])

with

loss = loss_fcn(logits[train_mask], labels[train_mask])

The reason is that BCEWithLogitsLoss already applies a sigmoid inside the loss computation (it combines a Sigmoid layer with BCELoss), so you should not apply log_softmax (or a sigmoid) yourself. In fact, log_softmax over dim=1 of an [N, 1] tensor is always zero, which is why your logp was all zeros and the loss was stuck at 0.6931 (= ln 2).
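Using the variables from your loop, the training step would then look roughly like this (a sketch, keeping everything else unchanged):

logits = model(gc, gc.ndata['hn'], gc.edata['he'])        # raw scores, shape [num_nodes, 1]
loss = loss_fcn(logits[train_mask], labels[train_mask])   # BCEWithLogitsLoss applies the sigmoid internally
optimizer.zero_grad(); loss.backward(); optimizer.step()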

Thanks Mufeili,

I should have realized that. But although the loss now decreases, my accuracy doesn’t change, even when I apply dropout. Could this affect the evaluate function? I just reused that function from https://docs.dgl.ai/en/0.4.x/tutorials/models/1_gnn/1_gcn.html.

Thanks mufeili

Did you check training accuracy?

Unfortunately, the training accuracy behaves similarly to the test accuracy: neither of them changes.

However, I suspect it is because of the shape of my target labels, generated with
labels = torch.randint(0, 2, (gc.number_of_nodes(),1), dtype=torch.float32), which differs from the shape of the labels in the Cora dataset. But the architecture I made already accepts the shape I defined, and reshaping the labels causes a dimension mismatch in the loss function: “ValueError: Target size (torch.Size([33])) must be the same as input size (torch.Size([33, 1]))”.

Thank you very much mufeili

The thing is that for this synthetic dataset, all labels, node features and edge features are random and do not have any semantic meaning. In such a case, it should be expected that the model won’t work at all.
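One more thing that may be worth checking (an observation from the code above; I have not run it): the evaluate function you reused from the GCN tutorial takes torch.max(logits, dim=1), and for a model that outputs a single logit per node (shape [N, 1]) the argmax over dim=1 is always 0, so the reported accuracy cannot change. For a single-output binary classifier, predictions are usually obtained by thresholding the logit instead, e.g. something like:

def evaluate(model, g, nfeats, efeats, labels, mask):
    model.eval()
    with torch.no_grad():
        logits = model(g, nfeats, efeats)      # shape [num_nodes, 1]
        logits = logits[mask]
        labels = labels[mask]
        preds = (logits > 0).float()           # sigmoid(logit) > 0.5  <=>  logit > 0
        correct = torch.sum(preds == labels)   # both have shape [num_masked, 1]
        return correct.item() * 1.0 / len(labels)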