loss.backward() causes a "shape mismatch" error when using NeighborSampler

I have run into a hard problem when using NeighborSampler to train a GNN.
When I try to backpropagate, “loss.backward()” raises the following error, which really puzzles me.


Traceback (most recent call last):
  File "t.py", line 90, in <module>
    loss.backward()
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xxx/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: shape mismatch: value tensor of shape [5, 128] cannot be broadcast to indexing result of shape [50, 128]

Below is an example of my code. I create a small graph with 50 nodes and some random edges, and I try to use NeighborSampler to train an ordinary GNN. I do binary classification on the node features.

The forward pass is normal, and the first backward is also OK, but the second backward raises the error above.
Besides, when I experiment on a relatively large graph (about 100k nodes and 1.7M edges), the first “loss.backward()” takes 3 minutes, which I think is too slow, and the second backward also raises the “shape mismatch” error. The error occurs in both CPU and GPU modes.
Is my code a correct way to use NeighborSampler to train a GNN?

Could someone help me? Thanks in advance!

DGL version: 0.4.1
PyTorch version: 1.0.0

import torch
import torch.nn as nn
import numpy as np
import random
import dgl
import dgl.function as fn
from dgl.nn.pytorch.conv import GraphConv
from dgl.contrib.sampling.sampler import NeighborSampler

# build a random graph
g = dgl.DGLGraph()
g.add_nodes(50)


num_edges = 200

for _ in range(num_edges):
    a = random.randint(0, 49)
    b = random.randint(0, 49)
    if a==b or g.has_edge_between(a,b):
        continue
    g.add_edge(a, b)
    g.add_edge(b, a)



# convert the graph to read-only, which is required by NeighborSampler
g = dgl.DGLGraph(g, readonly=True)
g.ndata['h'] = torch.randn(50, 128)


# My GNN
class ReduceLayer(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(ReduceLayer, self).__init__()
        self.fc = nn.Linear(in_feat, out_feat)

    def forward(self, nodes):       # without using edge information 
        
        h = torch.mean(nodes.mailbox['m'], dim = 1)
 
        h = torch.cat((h, nodes.data['h']), dim = 1)
        h = self.fc(h)
        return {'h': h}



class Net(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(Net, self).__init__()
        self.reduce_func = ReduceLayer(2*in_feat,out_feat)

        
    def forward(self, nf):
        nf.copy_from_parent()

        # compute by blocks
        nf.register_reduce_func(ReduceLayer, 0)
        nf.block_compute(0, message_func = fn.copy_src(src = 'h', out = 'm'), reduce_func = self.reduce_func)

        nf.copy_to_parent()






net = Net(128,128)

loss_func = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(net.parameters())


# go forward
for nf in NeighborSampler(g, batch_size = 5, expand_factor = 3, shuffle = True, num_hops = 1):
    net(nf)
    

# try to backward
for i in range(5):
    optimizer.zero_grad()
    label = [0]*5 + [1] * 5 
    label = torch.tensor(label).float()
    out = g.ndata['h'][i*10:(i+1)*10]
    out = torch.sum(out, dim = 1)
  
    loss = loss_func(out, label)
    print(loss)
    # an error will occur at the second loop.
    loss.backward()
    optimizer.step()
    print('bp ok')
    input()

RuntimeError: shape mismatch: value tensor of shape [5, 128] cannot be broadcast to indexing result of shape [50, 128]

It seems that the shape of your label and model prediction is different.

The forward pass is normal, and the first backward is also OK, but the second backward raises the error above.

This sounds like you didn’t manage to zero the gradient. Maybe you can try manually removing them and see if this works.

Besides, when I experiment on a relatively large graph (about 100k nodes and 1.7M edges), the first “loss.backward()” takes 3 minutes, which I think is too slow.

Did you try sampling? The graph is probably too large to update at once.

Thanks for your reply!

It seems that the shape of your label and model prediction is different.

Since the loss is calculated successfully, I don’t think the shapes of my labels and the model predictions are mismatched.

This sounds like you didn’t manage to zero the gradient. Maybe you can try manually removing them and see if this works.

Does “manually removing them” mean removing the gradients? I have already called “optimizer.zero_grad()”.

Did you try sampling? The graph is probably too large to update at once.

Yes, I only sample at most 5 neighbors of each node to aggregate. I still think 3 minutes for loss.backward() may be too long.

Now I think the bug may be caused by improper use of the NeighborSampler API. I tried to simplify the code and the problem to find the bug from scratch. However, I found another very puzzling bug.
Below is my code.

import torch
import torch.nn as nn
import numpy as np
import dgl
import dgl.function as fn
from dgl.contrib.sampling.sampler import NeighborSampler
import random



g1 = dgl.DGLGraph()
g1.add_nodes(6)
g1.add_edges([0,0,1,1,2,2], [3,4,3,5,4,5])
g1.add_edges([3,4,3,5,4,5], [0,0,1,1,2,2])

g1.ndata['h'] = torch.randn(g1.number_of_nodes(), 128)
g1.readonly()


sampler = dgl.contrib.sampling.NeighborSampler(
        g1,                    # the graph
        2,                     # batch size: number of seed nodes per batch
        2,                     # expand_factor: number of neighbors sampled per node
        2,                     # num_hops: number of layers to sample
        shuffle=False,         # whether to shuffle the seed nodes
    )


class Net(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(Net, self).__init__()
        self.fc = nn.Linear(in_feat,out_feat)

        
    def forward(self, nf):
        nf.copy_from_parent()
        nf.layers[-1].data['h'] = self.fc(nf.layers[-1].data['h'])
        nf.copy_to_parent()



net = Net(128,128)
optimizer = torch.optim.Adam(net.parameters())
loss_func = nn.BCEWithLogitsLoss()

for nf in sampler:
    labels = torch.tensor([1,0,1,0,1,0]).float()
    net(nf)
    out = g1.ndata['h']
    out = torch.sum(out, dim = 1)
    optimizer.zero_grad()
    loss = loss_func(out, labels)
    print(loss)
    # an error occurs in the second loop
    loss.backward()
    optimizer.step()
    input()

I just create a simple graph with 6 nodes and some edges, and pass the node features through a very simple MLP using DGL’s NodeFlow.
When I call “loss.backward()” the second time, it raises the error below, which says that I am trying to backward twice.

xxx@100:~$ python b.py
tensor(8.7265, grad_fn=<MeanBackward1>)
1
tensor(7.8914, grad_fn=<MeanBackward1>)
Traceback (most recent call last):
  File "b.py", line 58, in <module>
    loss.backward()
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

But as you can see in my code, before each loss.backward() I have called “net(nf)” to do forward propagation. I don’t know why this error occurs. Well, when I change the code to

loss.backward(retain_graph = True)

everything is OK. But when I change the network to


class Net(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(Net, self).__init__()
        self.reduce_func = ReduceLayer(2*in_feat,out_feat)

        
    def forward(self, nf):
        nf.copy_from_parent()

        # compute by blocks
        nf.register_reduce_func(ReduceLayer, 0)
        nf.block_compute(0, message_func = fn.copy_src(src = 'h', out = 'm'), reduce_func = self.reduce_func)

        nf.copy_to_parent()

it raises the error below.

Traceback (most recent call last):
  File "a.py", line 211, in <module>
    loss.backward(retain_graph = True)
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/torch/autograd/function.py", line 76, in apply
    return self._forward_cls.backward(self, *args)
  File "/data/maokelong/anaconda3/lib/python3.6/site-packages/dgl/backend/pytorch/tensor.py", line 396, in backward
    = ctx.backward_cache
TypeError: 'NoneType' object is not iterable

I am really puzzled by the problems caused by “loss.backward()” when using NodeFlow to train a GNN.

My question is a little long; thank you for your patience in reading it!

Why did you perform copy_to_parent? Do you also want to learn the embeddings? Also what kind of task are you working with?

I’ve modified the code and the code below seems to work:

import torch
import torch.nn as nn
import numpy as np
import dgl
import dgl.function as fn
from dgl.contrib.sampling.sampler import NeighborSampler
import random



g1 = dgl.DGLGraph()
g1.add_nodes(6)
g1.add_edges([0,0,1,1,2,2], [3,4,3,5,4,5])
g1.add_edges([3,4,3,5,4,5], [0,0,1,1,2,2])

g1.ndata['h'] = torch.randn(g1.number_of_nodes(), 128)
g1.readonly()


sampler = dgl.contrib.sampling.NeighborSampler(
        g1,                    # the graph
        2,                     # batch size: number of seed nodes per batch
        2,                     # expand_factor: number of neighbors sampled per node
        2,                     # num_hops: number of layers to sample
        shuffle=False,         # whether to shuffle the seed nodes
    )


class Net(nn.Module):
    def __init__(self, in_feat, out_feat):
        super(Net, self).__init__()
        self.fc = nn.Linear(in_feat,out_feat)
        
    def forward(self, nf):
        nf.copy_from_parent()
        nf.layers[-1].data['h'] = self.fc(nf.layers[-1].data['h'])
        return nf.layers[-1].data['h']

net = Net(128,128)
optimizer = torch.optim.Adam(net.parameters())
loss_func = nn.BCEWithLogitsLoss()

for nf in sampler:
    labels = torch.tensor([1,0]).float()
    out = torch.sum(net(nf), dim = 1)
    optimizer.zero_grad()
    loss = loss_func(out, labels)
    print(loss)
    loss.backward()
    optimizer.step()
    input()

Thank you for your reply!

Why did you perform copy_to_parent ? Do you also want to learn the embeddings? Also what kind of task are you working with?

Yes, I want to learn the embeddings. The initial embedding of each node is the average of several trainable word embeddings that I want to learn. Let me briefly introduce my task to show why I want to perform “copy_to_parent”.
[figure: three node types shared across two graphs, as described below]

I have 3 types of nodes and two graphs, as shown above. The two graphs share the same type-2 nodes.
I want to first run a GNN on graph1 to get representations of the type-2 nodes, and set them as the initial representations of the type-2 nodes in graph2. Then I run another GNN on graph2 to get new representations of the type-2 and type-3 nodes for a downstream task. I want to train the two GNNs together so that the learned representations of the type-2 and type-3 nodes in graph2 can benefit from the structure of both graph1 and graph2.
So I think using “copy_to_parent” makes it easy to transfer the type-2 nodes’ representations between the two graphs. I can just use

# I have set type2 nodes both as the forward nodes in two graphs so that they can match.
g2.ndata['h'][:g1.number_of_nodes()] = g1.ndata['h'][:g1.number_of_nodes()] 

Besides, I think it is easy to use “g.ndata[‘h’]” to get the tensors I need. That is why I use “copy_to_parent” to set the representations back on the parent graph.
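
To make this concrete, here is a rough sketch of the two-graph forward pass I have in mind (gnn1 and gnn2 are just placeholders for my two GNN modules, and each is assumed to write its output back into g.ndata['h'] via copy_to_parent):

# Rough, hypothetical sketch of the two-graph scheme described above.
def two_graph_forward(g1, g2, gnn1, gnn2):
    gnn1(g1)                                   # type-2 representations end up in g1.ndata['h']
    n_shared = g1.number_of_nodes()
    # hand the type-2 representations over to graph2 as its initial features
    g2.ndata['h'][:n_shared] = g1.ndata['h'][:n_shared]
    gnn2(g2)                                   # type-2 / type-3 representations for the downstream task
    return g2.ndata['h']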

Now I have changed my code to obtain the initial node embeddings from an “nn.Embedding” layer, and the code works (your code also works). I think it runs as I want now.
Thank you for your kind reply again! By the way, do you have any other advice for the task described above?
Thank you!

I think I finally understand what was going on with the previous error, “RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.”

With a learnable node representation, the node representations are part of the computation graph and require gradients. At the second iteration, the previous computation graph still exists, so you in fact need to detach these tensors from the outdated computation graph and reconstruct the parameters that require gradients at the beginning of each iteration; this is not required for the first iteration.
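
For concreteness, here is a rough, untested sketch of that idea, reusing the g1, sampler, net and loss_func from your simplified example; base_emb is a hypothetical leaf tensor holding the learnable node representations, and it is written into g1.ndata['h'] at the start of every iteration:

# Hypothetical sketch: rebuild g1.ndata['h'] from a leaf tensor at the start of
# each iteration, so backward() only traverses the current iteration's graph.
base_emb = nn.Parameter(torch.randn(g1.number_of_nodes(), 128))
optimizer = torch.optim.Adam(list(net.parameters()) + [base_emb])

for nf in sampler:
    # fresh, non-leaf copy of the learnable features; nothing from the
    # previous iteration's computation graph is referenced any more
    g1.ndata['h'] = base_emb * 1.0
    labels = torch.tensor([1, 0, 1, 0, 1, 0]).float()
    net(nf)                                  # copy_from_parent / compute / copy_to_parent
    out = torch.sum(g1.ndata['h'], dim=1)
    optimizer.zero_grad()
    loss = loss_func(out, labels)
    loss.backward()
    optimizer.step()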

Your modeling looks fine to me. It is cleaner to initialize the node representations from an embedding layer every time, to avoid the computation-graph issues I mentioned above.
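
As a rough, untested sketch of that (using the Net from the modified code above, which returns nf.layers[-1].data['h']):

# Hypothetical sketch: look the initial features up from an nn.Embedding layer
# at the start of each iteration instead of reusing the graph-attached
# g1.ndata['h'] left over from the previous iteration.
node_emb = nn.Embedding(g1.number_of_nodes(), 128)
optimizer = torch.optim.Adam(list(net.parameters()) + list(node_emb.parameters()))
node_ids = torch.arange(g1.number_of_nodes())

for nf in sampler:
    g1.ndata['h'] = node_emb(node_ids)      # fresh lookup -> fresh computation graph
    out = torch.sum(net(nf), dim=1)         # shape [batch_size]
    labels = torch.tensor([1, 0]).float()   # two seed nodes per batch
    optimizer.zero_grad()
    loss = loss_func(out, labels)
    loss.backward()
    optimizer.step()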

I understand, thank you so much!!!