Hi,
I’m trying to run GNN training code using DGL, and it appears that there is a memory leak in sparse.py.
I used tracemalloc to capture the top differences in memory allocation across 11 training iterations, and only the line pointing at sparse.py shows a continuous increase in allocated memory blocks. The per-iteration output is below.
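For context, this is roughly how the numbers were collected; `train_one_iteration` is just a placeholder for our actual training step, not the real code:

```python
import tracemalloc

def train_one_iteration():
    # Placeholder for our real DGL training step (forward, backward, optimizer).
    pass

tracemalloc.start()
prev = tracemalloc.take_snapshot()

for it in range(1, 12):
    train_one_iteration()
    snap = tracemalloc.take_snapshot()
    print(f"Iteration {it}")
    # Print the top allocation differences versus the previous snapshot.
    for stat in snap.compare_to(prev, 'lineno')[:5]:
        print(stat)
    prev = snap
```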
Iteration 1
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=3086 B (+3086 B), count=21 (+21), average=147 B
Iteration 2
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=5214 B (+2968 B), count=35 (+19), average=149 B
Iteration 3
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=7342 B (+2968 B), count=49 (+19), average=150 B
Iteration 4
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=9470 B (+2968 B), count=63 (+19), average=150 B
Iteration 5
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=11.3 KiB (+2968 B), count=77 (+19), average=151 B
Iteration 6
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=13.4 KiB (+2968 B), count=91 (+19), average=151 B
Iteration 7
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=15.5 KiB (+2968 B), count=105 (+19), average=151 B
Iteration 8
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=17.6 KiB (+2968 B), count=119 (+19), average=151 B
Iteration 9
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=19.6 KiB (+2968 B), count=133 (+19), average=151 B
Iteration 10
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=21.7 KiB (+2968 B), count=147 (+19), average=151 B
Iteration 11
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=23.8 KiB (+2968 B), count=161 (+19), average=151 B
Over these 11 iterations, the allocation attributed to sparse.py:307 has grown from about 3 KB to nearly 24 KB, increasing at a steady rate of roughly 3 KB per iteration. As a result, the training run crashes after roughly 10,000 iterations with an out-of-memory error. The main primitive we use is update_all(); we also tried multi_update_all() and observed the same behavior. A simplified sketch of the kind of call involved is shown below.
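For reference, this is only a minimal sketch of the kind of update_all() call our training loop makes each iteration; the graph, features, and reduce function here are placeholders, not our actual model:

```python
import dgl
import dgl.function as fn
import torch

# Hypothetical stand-in for our real setup: the actual graph, features, and
# model are much larger, but the message-passing call has the same shape.
g = dgl.rand_graph(1000, 5000)
g.ndata['h'] = torch.randn(g.num_nodes(), 16)

for it in range(10000):
    # Copy source-node features as messages and sum them on destination nodes.
    g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))
    # loss computation, backward pass, and optimizer step omitted here
```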
Thanks for your help.