CUDA out of memory during backpropagation

Hi, I have a dgl.batched_graph.BatchedDGLGraph with 130,000 nodes and 16 million edges. Each node has a 100-dimensional feature, and I only used built-in functions. Here's the problem I encountered during backpropagation:

  File "/root/try/bishe/parser/cmds/cmd.py", line 92, in train
    loss.backward()
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/torch/autograd/function.py", line 77, in apply
    return self._forward_cls.backward(self, *args)
  File "/root/Anacondas/anaconda3/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 355, in backward
    grad_rhs = grad_out.new_empty((rhs_data_nd.shape[0],) + feat_shape)
RuntimeError: CUDA out of memory. Tried to allocate 6.27 GiB (GPU 0; 7.93 GiB total capacity; 2.40 GiB already allocated; 3.62 GiB free; 2.60 GiB reserved in total by PyTorch)

How can I solve this problem? Any suggestions or tips would be appreciated.

Hi, I think the graph and the model together are still too big to fit into one GPU.
One solution is not to batch all these graphs together, but to batch only a subset of them at a time: for example, in each iteration, feed a batched graph with 13,000 nodes and 1.6 million edges into the GNN model, and use gradient accumulation to simulate a larger batch size (see the sketch below).
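A minimal sketch of this idea, assuming `graphs` is a list of individual DGLGraphs and `model`, `compute_loss`, and `optimizer` are your own (hypothetical) training objects; the feature/label field names `'feat'` and `'label'` are also placeholders:

    import dgl
    import torch

    accum_steps = 10                    # number of mini-batches to accumulate per update
    chunk = len(graphs) // accum_steps  # graphs per mini-batch (~1/10 of the nodes/edges)

    optimizer.zero_grad()
    for i in range(accum_steps):
        # Batch only a subset of the graphs instead of all 130k nodes at once.
        bg = dgl.batch(graphs[i * chunk:(i + 1) * chunk]).to('cuda')
        logits = model(bg, bg.ndata['feat'])
        loss = compute_loss(logits, bg.ndata['label'])
        # Scale the loss so the accumulated gradients match a full-batch update.
        (loss / accum_steps).backward()
    optimizer.step()

Each `backward()` call only needs activation memory for the small batched graph, while calling `optimizer.step()` once after all accumulation steps keeps the effective batch size the same as before.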

Thank you for your suggestion!