CUDA out of memory

I have a graph with 88830 nodes and 1.8M edges.
I’m following the unsupervised GraphSAGE tutorial, and the feature size is 2048.
I’m getting a CUDA out-of-memory exception.

DGLGraph(num_nodes=88830, num_edges=1865430,
         ndata_schemes={}
         edata_schemes={})
Traceback (most recent call last):
  File "graphsage.py", line 233, in <module>
    train()
  File "graphsage.py", line 213, in train
    loss = model(train_g, img_features, color_features, neg_sample_size)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 125, in forward
    pos_score = score_func(pos_g, emb)
  File "graphsage.py", line 89, in score_func
    pos_tails = emb[dst_nid]
RuntimeError: CUDA out of memory. Tried to allocate 1.14 GiB (GPU 0; 11.17 GiB total capacity; 9.14 GiB already allocated; 1018.06 MiB free; 9.15 GiB reserved in total by PyTorch)

EDIT:
Since the machine has 8 GPUs, I tried wrapping the model with model = nn.DataParallel(model).
I get the following error.

Traceback (most recent call last):
  File "graphsage.py", line 233, in <module>
    train()
  File "graphsage.py", line 213, in train
    loss = model(train_g, img_features, color_features, neg_sample_size)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
dgl._ffi.base.DGLError: Caught DGLError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 123, in forward
    emb = self.gconv_model(g, features)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "graphsage.py", line 44, in forward
    h = layer(g, h)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/nn/pytorch/conv/sageconv.py", line 113, in forward
    graph.ndata['h'] = feat
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/view.py", line 65, in __setitem__
    self._graph.set_n_repr({key : val}, self._nodes)
  File "/home/ubuntu/anaconda3/envs/pyt1.4/lib/python3.7/site-packages/dgl/graph.py", line 1790, in set_n_repr
    ' Got %d and %d instead.' % (nfeats, num_nodes))
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 11104 and 88830 instead.

I can see that the data is divided into 8 chunks (88830 / 8 ≈ 11104), but I’m not sure how to get past this error.

Can someone help me on this?

It looks like most of the GPU’s 11 GB is already taken. Are you running in a Jupyter notebook environment and perhaps still have a lot of stuff in memory that you don’t need? In that case, restarting the kernel could help. Running nvidia-smi in a terminal might also help you locate the processes that are holding the memory.

If that doesn’t help: I’m not as familiar with PyTorch, but maybe you can keep the graph in CPU context and only transfer each batch from CPU to GPU during training. In MXNet this would be done with <tensor>.as_in_context(mx.gpu()).
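
Roughly, I believe the PyTorch equivalent would look something like the sketch below (the batch size and the stand-in model are made up; the point is that the full feature matrix stays on CPU and only the current slice is copied to the GPU):

import torch

num_nodes, feat_dim, batch_size = 88830, 2048, 1024
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

features = torch.randn(num_nodes, feat_dim)         # full features stay on CPU
model = torch.nn.Linear(feat_dim, 256).to(device)   # stand-in for the real model

for start in range(0, num_nodes, batch_size):
    nid = torch.arange(start, min(start + batch_size, num_nodes))
    batch = features[nid].to(device)                 # only this slice lives on the GPU
    out = model(batch)
    # ... compute the loss and call backward() on this mini-batch ...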

But the traceback is a little confusing to me, because it looks like you’re just instantiating the graph object, yet the traceback shows the evaluation of a model. Maybe it’s not the graph itself that’s killing your memory but the features associated with it. If, for instance, you were one-hot encoding 88k nodes, that would be a huge tensor. Since things can be executed lazily, you may not see the real memory impact until the computation actually runs.
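
To put rough numbers on that (a back-of-the-envelope check using the sizes from your post, assuming fp32):

num_nodes, feat_dim = 88830, 2048

dense_feats_gb = num_nodes * feat_dim * 4 / 1e9   # ~0.73 GB for one copy of the 2048-dim features
one_hot_gb     = num_nodes * num_nodes * 4 / 1e9  # ~31.6 GB if the features were one-hot node IDs
print(dense_feats_gb, one_hot_gb)

So a single copy of your features is well under 1 GB, but full-graph training also keeps per-layer activations and gradients for all 88k nodes at once, which is likely what pushes you past 11 GB.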

Hi @navmarri,
The first problem is that the computation on this graph is too big to fit into GPU memory.
The second problem is that DGL does not support the PyTorch DataParallel API: DataParallel partitions the input tensor along its first dimension and dispatches each part to a different GPU, but for GNN applications you have to partition the graph itself. You need to launch the processes yourself, partition the graph manually, and use torch.distributed for multi-GPU training.

We have a PR on multi-GPU training of GNNs: https://github.com/dmlc/dgl/pull/1143/files, where we use a separate NeighborSampler for each process; you can refer to the code there.
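
Until that is merged, the launch pattern looks roughly like the sketch below (one process per GPU via torch.multiprocessing, gradients synchronized with DistributedDataParallel; the linear layer and random features are placeholders, not the code from the PR):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU; DDP averages gradients across processes.
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-in for a GraphSAGE encoder; in the real code each process would
    # also build its own NeighborSampler over its partition of the graph.
    model = DDP(nn.Linear(2048, 256).cuda(rank), device_ids=[rank])
    opt = torch.optim.Adam(model.parameters())

    # Toy per-process "partition": random features instead of sampled blocks.
    feats = torch.randn(11104, 2048)
    for start in range(0, feats.size(0), 1024):
        batch = feats[start:start + 1024].cuda(rank)
        loss = model(batch).pow(2).mean()
        opt.zero_grad()
        loss.backward()      # gradients are synchronized across the processes here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(8,), nprocs=8)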

Hi @navmarri,

Your problem is that the graph is too big to fit into GPU memory. The following methods could help:

  1. Keep the graph in CPU context and use the sampler component (a rough sketch of the idea follows this list); please see this example: https://github.com/dmlc/dgl/tree/master/examples/mxnet/sampling

  2. Partition the graph across GPUs and use multi-GPU training; please see this example: https://github.com/dmlc/dgl/pull/1143/files
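
A rough sketch of the idea behind option 1 (this is not DGL’s sampler API; toy random data stands in for the real graph): the full edge list and features stay on CPU, a small neighborhood is sampled per mini-batch, and only that slice is moved to the GPU.

import torch

num_nodes, feat_dim = 88830, 2048
num_edges, batch_size, fanout = 1865430, 512, 10

# Toy stand-ins for the real graph: an edge list and node features, kept on CPU.
src = torch.randint(0, num_nodes, (num_edges,))
dst = torch.randint(0, num_nodes, (num_edges,))
features = torch.randn(num_nodes, feat_dim)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One mini-batch: seed nodes plus a capped sample of their in-neighbors.
seeds = torch.randperm(num_nodes)[:batch_size]
in_edges = torch.isin(dst, seeds)                             # edges ending at a seed node
neighbors = src[in_edges]
neighbors = neighbors[torch.randperm(neighbors.numel())[:batch_size * fanout]]

block_nodes = torch.unique(torch.cat([seeds, neighbors]))
batch_feat = features[block_nodes].to(device)                 # only this slice goes to the GPU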

Note that the multi-GPU training code is still under review; you can wait for our next release for it.

Thank you!
