RuntimeError in link prediction example of RGCN

vamships · February 20, 2020, 3:59pm

Hi,

I have been trying out some of the R-GCN examples and ran into an issue with link prediction. Specifically, I followed the example scripts here and was able to complete all the entity classification examples. However, when running the link prediction example on the FB15k-237 dataset, I got the following error during the evaluation phase:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1088460000 bytes. Error code 12 (Cannot allocate memory)

I am running on a GPU machine with CUDA 10.1. Any help resolving this issue would be greatly appreciated. A similar issue was reported in the forum a couple of months ago.

Below is the complete log of output from the run:

Namespace(dataset='FB15k-237', dropout=0.2, edge_sampler='uniform', eval_batch_size=500, evaluate_every=500, gpu=0, grad_norm=1.0, graph_batch_size=30000, graph_split_size=0.5, lr=0.01, n_bases=100, n_epochs=6000, n_hidden=500, n_layers=2, negative_sample=10, regularization=0.01)
# entities: 14541
# relations: 237
# edges: 272115
Test graph:
start training...
Epoch 0100 | Loss 0.2027 | Best MRR 0.0000 | Forward 0.2048s | Backward 0.4030s
Epoch 0200 | Loss 0.1262 | Best MRR 0.0000 | Forward 0.2030s | Backward 0.4027s
Epoch 0300 | Loss 0.1040 | Best MRR 0.0000 | Forward 0.2026s | Backward 0.4020s
Epoch 0400 | Loss 0.0927 | Best MRR 0.0000 | Forward 0.2038s | Backward 0.4028s
Epoch 0500 | Loss 0.0859 | Best MRR 0.0000 | Forward 0.2037s | Backward 0.4041s
start eval
Traceback (most recent call last):
  File "link_predict.py", line 348, in <module>
    main(args)
  File "link_predict.py", line 239, in main
    embed = model(test_graph, test_node_id, test_rel, test_norm)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "link_predict.py", line 92, in forward
    return self.rgcn.forward(g, h, r, norm)
  File "/home/vamship/sample-graph-analysis/sample_graph_analysis/rgcn_sample/model.py", line 57, in forward
    h = layer(g, h, r, norm)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 180, in forward
    g.update_all(self.message_func, fn.sum(msg='msg', out='h'))
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/graph.py", line 2747, in update_all
    Runtime.run(prog)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/runtime/runtime.py", line 11, in run
    exe.run()
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/runtime/ir/executor.py", line 204, in run
    udf_ret = fn_data(src_data, edge_data, dst_data)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/runtime/scheduler.py", line 949, in _mfunc_wrapper
    return mfunc(ebatch)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/nn/pytorch/conv/relgraphconv.py", line 144, in bdd_message_func
    node = edges.src['h'].view(-1, 1, self.submat_in)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/utils.py", line 285, in __getitem__
    return self._fn(key)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/frame.py", line 655, in <lambda>
    return utils.LazyDict(lambda key: self._frame[key][rows], keys=self.keys())
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/frame.py", line 97, in __getitem__
    return F.gather_row(self.data, user_idx)
  File "/home/vamship/.conda/envs/sample-graph-analysis/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 152, in gather_row
    return th.index_select(data, 0, row_index)
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1088460000 bytes. Error code 12 (Cannot allocate memory)

classicsong · February 27, 2020, 2:46am

The evaluation consumes more than 40 GB of memory. Does your machine have enough memory?

vamships · February 27, 2020, 3:01pm

No, I am using a Tesla V100, which has 16G RAM. I was able to run the RGCN code from the authors on the same machine, but not the DGL version.

I am curious to know your reasoning for the 40GB memory requirement.

classicsong · February 28, 2020, 12:34am

The 40+ GB is CPU memory. Let me find out why it needs so much CPU memory.

classicsong · March 2, 2020, 3:12am

The evaluation runs on the CPU over the full graph rather than in mini-batches.
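
As a rough sanity check, the size of the failed allocation in the traceback can be reproduced from the numbers in the log. The sketch below assumes float32 features and a test graph that stores every edge in both directions; neither is stated in the thread, but under those assumptions the arithmetic matches the reported byte count exactly, which suggests the failing `index_select` was gathering one hidden vector per edge of the full graph.

```python
# Figures copied from the log in the first post.
num_edges = 272115            # "# edges" line
n_hidden = 500                # n_hidden in the Namespace line
reported_bytes = 1088460000   # size quoted in the RuntimeError

# Assumptions (not stated in the thread): features are float32 (4 bytes each)
# and the test graph holds each edge in both directions.
bytes_per_float32 = 4
edges_in_test_graph = 2 * num_edges

# One hidden vector gathered per edge of the full graph.
estimated_bytes = edges_in_test_graph * n_hidden * bytes_per_float32
print(estimated_bytes, estimated_bytes == reported_bytes)  # 1088460000 True
```

This accounts for only one roughly 1 GB buffer. Running on the full graph needs several such per-edge tensors at once, so total usage is higher, though this check alone does not confirm the 40+ GB figure.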

hackerchenzhuo · March 20, 2020, 9:51am

Hi,
How can I run the evaluation in mini-batches? I have the same question.
Alternatively, can I use the GPU for evaluation?

mufeili · March 20, 2020, 5:46pm

This post seems to be a duplicate of #564.
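
On the mini-batch question above, one general way to cap peak memory is to process the edges in chunks and accumulate the results, so that only one chunk of per-edge messages exists at a time. The sketch below is plain PyTorch, not the DGL example's code or API: `chunked_sum_aggregate` is a hypothetical helper, and it uses a single shared weight matrix rather than R-GCN's per-relation block-diagonal weights.

```python
import torch


def chunked_sum_aggregate(h, src, dst, weight, chunk_size):
    """Sum transformed source features into destination nodes, chunk by chunk.

    Hypothetical helper for illustration only; it uses one shared weight
    matrix instead of per-relation weights.
    """
    out = torch.zeros(h.shape[0], weight.shape[1], dtype=h.dtype)
    with torch.no_grad():  # evaluation only, so no autograd buffers are kept
        for start in range(0, src.shape[0], chunk_size):
            s = src[start:start + chunk_size]
            d = dst[start:start + chunk_size]
            # At most chunk_size message rows are materialized here.
            msg = h.index_select(0, s) @ weight
            # Accumulate each message into its destination node's row.
            out.index_add_(0, d, msg)
    return out


# Toy graph: 6 nodes, 20 random edges, 4 input and 3 output features.
torch.manual_seed(0)
h = torch.randn(6, 4)
src = torch.randint(0, 6, (20,))
dst = torch.randint(0, 6, (20,))
weight = torch.randn(4, 3)

full = chunked_sum_aggregate(h, src, dst, weight, chunk_size=20)   # one pass
chunked = chunked_sum_aggregate(h, src, dst, weight, chunk_size=7)  # three passes
print(torch.allclose(full, chunked, atol=1e-5))
```

The chunked result matches the single-pass result up to floating-point rounding, while the largest temporary tensor shrinks from one row per edge to one row per edge in a chunk. The same idea would apply if the tensors lived on a GPU, provided each chunk fits in device memory.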