Can I use a 2D PyTorch parameter as the graph in DGL?

I’m wondering whether I can use a 2D parameter tensor in PyTorch as the input graph of a DGL model. The code looks like this:

import networkx as nx
import torch
import torch.nn as nn
from dgl import DGLGraph


class Encoder(nn.Module):

    def __init__(self, input_size: int, hidden_size: int, T: int):
        """
        input_size: number of underlying factors (81)
        T: number of time steps (10)
        hidden_size: dimension of the hidden state
        """
        super(Encoder, self).__init__()
        # one graph node per input factor
        self.num_of_node = input_size

        # the adj mat
        self.W = nn.Parameter(torch.randn(input_size, input_size), requires_grad=True)

        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

        # obtain the adj matrix
        nx_graph = nx.from_numpy_matrix(self.W.detach().cpu().numpy())
        g = DGLGraph(nx_graph)
        # define a gat model
        heads = ([8] * 1) + [1]
        self.gat = GAT(,


    def forward(self, input_data):

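        # input_data: (batch_size, num_of_node, feature_dim)

        # allocate the output on the same device as the input
        input_weighted = torch.zeros(input_data.size(0), self.num_of_node, device=input_data.device)
        for i in range(input_data.shape[0]):
            # run the GAT on one sample and keep the first output column
            input_weighted[i] = self.gat(input_data[i])[:, 0]

        return input_weighted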

The output is then used to compute a loss. Although the code does not raise any error, it seems to take up a lot of GPU memory, and I’m not sure what is wrong.

Thanks in advance!

Shanchao Yang

You can if you really want, but several details need to be carefully handled.

  1. With torch.randn(input_size, input_size), each entry of the returned tensor is sampled from a standard normal distribution, so it can be any float and is not necessarily positive. If we interpret this tensor as an adjacency matrix, the graph is completely connected and asymmetric.
  2. I see that you set requires_grad on self.W. But in your current implementation this has no effect at all: the graph is built from self.W.detach(), so no gradient flows back to self.W through the graph.
  3. With your current graph initialization approach, you end up with a completely connected DGLGraph without node/edge features.
  4. Graph Attention Networks compute attention over every edge. A completely connected graph with N nodes has N^2 edges, so the computation cost can be high. This is probably why so much GPU memory is used.
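
To put point 4 in numbers, here is a rough back-of-the-envelope sketch. It assumes a dense adjacency matrix in which every entry becomes an edge (roughly N^2 edges), one float32 attention score per edge, per head and per sample, and it ignores every other activation. The function names are illustrative only and are not part of DGL; the sizes (81 nodes, 8 heads, batch size 128) are the ones mentioned in this thread.

```python
def dense_edge_count(num_nodes: int) -> int:
    """Approximate edge count of a fully connected directed graph (N^2)."""
    return num_nodes * num_nodes


def attention_score_bytes(num_nodes: int, num_heads: int, batch_size: int,
                          bytes_per_float: int = 4) -> int:
    """Rough memory for the attention scores of one GAT layer over a batch.

    Assumes one score per edge, per head, per sample. Autograd has to keep
    these alive until backward() runs.
    """
    return dense_edge_count(num_nodes) * num_heads * batch_size * bytes_per_float


edges = dense_edge_count(81)
score_bytes = attention_score_bytes(num_nodes=81, num_heads=8, batch_size=128)
print(edges)                # number of attention scores per head and sample
print(score_bytes / 2**20)  # size in MiB
```

Under these assumptions the scores of a single layer come to roughly 25.6 MiB per batch, which is noticeable but far from the 9.81 GiB in the error message, so the growth over iterations is the more suspicious part.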

Hi, thanks for your reply.

  1. Thanks for the warning. I have changed torch.randn to a uniform initialization.
  2. This is a toy example. In the real code, self.W is used to calculate the output.
  3 & 4. The current code actually runs for about ten iterations and then fails with a GPU out-of-memory error:
  result = self.forward(*input, **kwargs)
  File "/home/noone/Downloads/Compressed/da-rnn-master/", line 51, in forward
    h = self.gat_layers[l](g, h).flatten(1)
  File "/home/noone/anaconda3/envs/tf_3/lib/python3.6/site-packages/torch/nn/modules/", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/noone/anaconda3/envs/tf_3/lib/python3.6/site-packages/dgl/nn/pytorch/", line 285, in forward
    rst = self.activation(rst)
  File "/home/noone/anaconda3/envs/tf_3/lib/python3.6/site-packages/torch/nn/", line 1025, in elu
    result = torch._C._nn.elu(input, alpha)
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 10.91 GiB total capacity; 9.81 GiB already allocated; 10.75 MiB free; 20.50 KiB cached)

I’m not sure whether I should put the graph initialization in the __init__ function or in the forward function. GAT can be computationally expensive, but this problem has a batch size of 128, a graph with 80 nodes, and 10-dimensional features, and the GPU still runs out of memory after a few training iterations. I don’t think the model is that complex or large. Is it possible that PyTorch’s computation graph for the model keeps growing, or that something in DGL is not being freed? Each time the model is called, I construct a graph from the current self.W parameter tensor and use the same graph network to calculate the embeddings for a batch of different features.

Which DGL version are you using? There used to be a memory leak issue, which should be fixed now.

I installed the 2.0 dgl following the instruction conda install -c dglteam dgl-cuda10.0, which I think installs the latest one.

May I ask whether you agree that a tensor that requires gradients can be used as the input graph for DGL? Each time the forward function is called, the current self.W is used to calculate the embedding from a set of given features, and then the loss. Maybe this causes the computation graph of the whole model to keep growing, since more and more DGL graphs get involved in backpropagation.
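
On the question of whether gradients can reach self.W at all, here is a small sketch, independent of DGL, of what detach() does. It relies only on standard PyTorch autograd behavior; the shapes and variable names are arbitrary and not taken from the original code.

```python
import torch

W = torch.nn.Parameter(torch.rand(4, 4))
x = torch.ones(4, 3)

# Path 1: W is detached first (as in self.W.detach().cpu().numpy()), so the
# result is cut off from W in the autograd graph.
detached_out = W.detach() @ x
print(detached_out.requires_grad)

# Path 2: W is used directly in a differentiable op, so gradients reach it.
loss = (W @ x).sum()
loss.backward()
print(W.grad is not None)
```

Anything built from the detached copy, including a graph constructed from its NumPy values, is a constant from autograd's point of view, so it cannot by itself make the computation graph grow.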

I don’t think this is the problem. As long as you don’t keep a reference to any variable from a previous iteration, it should be fine. Could you share more of your code with us for debugging?

It’s quite strange. I have uploaded the code. Thanks for your help.


I took a glance at your code and didn’t find the problem.

Could you try the following:

  • Manually delete the DGLGraph at the end of the forward function with del gs.
  • If the memory leak persists, could you try replacing the Encoder module with a very simple one without DGL, such as a single Linear layer, to see whether the leak still happens?

I tried what you suggested.

  1. Even deleting the DGLGraph does not help, so I think something is expanding PyTorch’s gradient computation graph. I also tried another GAT implementation, and it has the same issue.

  2. Replacing the GAT model with a simple linear layer works.

So, does this mean that if I want to use a GAT model to capture the relationships between multiple time series, I can’t simply use the current GAT library? +.+


That looks weird to me. Have you checked that GPU memory consumption is stable with a simple linear layer? (You can try watch -n 3 nvidia-smi in bash.)
Maybe the increase is just too small to raise an OOM error.
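
One way to make the "is it stable?" check less subjective is to record one memory reading per iteration and test whether the readings keep climbing. The sketch below is only an illustration: grows_steadily is a hypothetical helper, the readings are made-up numbers in MiB, and in practice they could come from nvidia-smi or from torch.cuda.memory_allocated().

```python
from typing import Sequence


def grows_steadily(readings: Sequence[float], warmup: int = 2,
                   tolerance: float = 0.0) -> bool:
    """Return True if every reading after warm-up exceeds the previous one."""
    steady = list(readings[warmup:])
    if len(steady) < 2:
        return False  # not enough data to judge
    return all(b - a > tolerance for a, b in zip(steady, steady[1:]))


# Made-up readings in MiB, one per training iteration.
stable = [512.0, 900.0, 901.0, 901.0, 901.0, 901.0]
leaking = [512.0, 900.0, 1350.0, 1800.0, 2250.0, 2700.0]
print(grows_steadily(stable))
print(grows_steadily(leaking))
```

The warm-up iterations are skipped because memory normally jumps during the first few steps while the allocator caches blocks; only growth after that point hints at a leak.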

After monitoring GPU usage, I found that DGL very likely has a memory leak, since neither the linear layer nor the GAT model from this implementation has the problem. I don’t know what is wrong, because explicitly deleting the DGLGraph still doesn’t help.


I cloned your code, and training works well with the latest DGL on my machine. The predict function does seem to leak memory, but after adding with torch.no_grad(): there, the leak is gone.

Could you uninstall it and reinstall it with pip install --pre dgl-cu100? Then check the version with

import dgl
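
If importing the package is inconvenient, the installed version can also be read from package metadata. This is a minimal sketch using only the standard library (Python 3.8+); installed_version is a hypothetical helper name, and the distribution names queried are assumptions based on the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string for whichever distribution is installed,
# and None for the others.
print(installed_version("dgl-cu100"))
print(installed_version("dgl"))
```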

Many thanks. I have added the no_grad code, and there is no leak anymore.
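
For readers wondering why no_grad helps: a likely explanation (not confirmed in this thread) is that the prediction code kept model outputs that were still attached to autograd graphs, so each call retained its intermediate activations. The sketch below shows the difference with a plain linear layer; the model and shapes are placeholders, not the original code.

```python
import torch

model = torch.nn.Linear(10, 1)
batch = torch.randn(128, 10)

# Default mode: the output is attached to an autograd graph, and keeping
# such outputs around (e.g. in a list of predictions) keeps the graph alive.
tracked = model(batch)
print(tracked.requires_grad)

# Under no_grad: no graph is recorded, so nothing accumulates across calls.
with torch.no_grad():
    untracked = model(batch)
print(untracked.requires_grad)
```

During evaluation, calling .detach() or .item() on stored outputs has a similar effect, but wrapping the whole prediction loop in torch.no_grad() also avoids building the graph in the first place.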

Anyway, thanks again for your kind patience.

Also, thanks for using DGL :)

Reports like yours are important for helping us discover potential bugs. Any suggestions or comments are welcome.