Hi everyone:
I'm following this tutorial and training an RGCN on a GPU: 5.3 Link Prediction — DGL 0.6.1 documentation
My graph is a batched graph formed by 300 subgraphs, with the following total nodes and edges:
Graph(num_nodes={'ent': 31167},
num_edges={('ent', 'rel_1', 'ent'): 29290, ('ent', 'rel_2', 'ent'): 142290, ('ent', 'rel_3', 'ent'): 20280})
When training the model on the full graph I get this error:
model.to(device)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 852, in to
return self._apply(convert)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 530, in _apply
module._apply(fn)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 552, in _apply
param_applied = fn(param)
File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 850, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA out of memory. Tried to allocate 3.62 GiB (GPU 0; 11.91 GiB total capacity; 7.60 GiB already allocated; 3.48 GiB free; 7.62 GiB reserved in total by PyTorch)
I think the problem is that the hidden dimension is, in this case, around 30000, since it is the number of nodes.
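A quick sanity check on that suspicion (my own arithmetic, not taken from the traceback): a single dense float32 matrix with one row and one column per node comes out at almost exactly the 3.62 GiB the allocator failed to find:

```python
# A dense float32 matrix with one row and one column per node
# (e.g. a num_nodes x num_nodes weight, or a one-hot feature matrix).
num_nodes = 31167
bytes_needed = num_nodes * num_nodes * 4  # float32 = 4 bytes per element
gib = bytes_needed / 2**30
print(f"{gib:.2f} GiB")  # 3.62 GiB, the exact amount in the error message
```

So the failed allocation is consistent with some tensor being num_nodes × num_nodes.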
When training the model on a batched graph of only 100 subgraphs instead of the full 300, I get a different error:
Epoch 1/20:
Traceback (most recent call last):
File "rgcn.py", line 232, in <module>
loss.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
The limit appears to be 80 of these subgraphs; below that I get no error.
Since this tutorial doesn't cover batching during training, how could I use it? Could batching fix this? Is there any other way to fix it?
Since my graph is a batched graph, would it help with the memory issues to use these batches and train the model on each subgraph iteratively, instead of on the big graph? Since the hidden dimension must be the number of nodes, this could be a problem, because every subgraph has a different number of nodes.
The model parameters look like this:
model = Model(embeddings_dimensions, num_nodes, num_nodes, g.etypes)
where embeddings_dimensions is 700 and num_nodes is about 30000 for the full graph. Could this be the problem?
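Here is rough parameter-size arithmetic, under the assumption that Model mirrors the tutorial's Model(in_features, hidden_features, out_features, rel_names) with one weight matrix per relation type per layer. The numbers (700, 30000, 3 relations) come from my graph; the layer structure is my assumption:

```python
in_feats, hidden, out_feats, num_rels = 700, 30000, 30000, 3
bytes_per_float = 4  # float32

# One weight matrix per relation type in each layer.
layer1_bytes = in_feats * hidden * bytes_per_float * num_rels    # 700 x 30000 each
layer2_bytes = hidden * out_feats * bytes_per_float * num_rels   # 30000 x 30000 each

print(f"layer 1 weights: {layer1_bytes / 2**30:.2f} GiB")  # ~0.23 GiB
print(f"layer 2 weights: {layer2_bytes / 2**30:.2f} GiB")  # ~10.06 GiB
```

If that assumption is right, the second layer's weights alone would nearly fill the 11.91 GiB card before any activations or gradients are allocated, which would explain running out of memory.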
I installed the CUDA build of DGL with: pip install dgl-cu101 -f https://data.dgl.ai/wheels/repo.html
Could installing it from source help with this?
Thank you all.