Memory leak in sparse.py?

Hi,

I’m trying to run GNN training code using DGL, and it appears there is a memory leak in sparse.py.
I used tracemalloc to capture the top differences in Python memory allocations over 11 training iterations, and observed that only the line pointing into sparse.py shows a continuous increase in allocated memory blocks.
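For reference, the numbers below were captured with a loop roughly like the following sketch (train_one_iteration() is a placeholder for our actual training step, and "top 5" is just how many growth sites we print):

import tracemalloc

tracemalloc.start()
prev = tracemalloc.take_snapshot()

for it in range(11):
    train_one_iteration()  # placeholder: one forward/backward/optimizer step
    curr = tracemalloc.take_snapshot()
    print(f"Iteration {it + 1}")
    # compare_to() groups allocations by source line and reports deltas
    for stat in curr.compare_to(prev, 'lineno')[:5]:
        print(stat)
    prev = curr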

Iteration 1
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=3086 B (+3086 B), count=21 (+21), average=147 B

Iteration 2
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=5214 B (+2968 B), count=35 (+19), average=149 B

Iteration 3
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=7342 B (+2968 B), count=49 (+19), average=150 B

Iteration 4
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=9470 B (+2968 B), count=63 (+19), average=150 B

Iteration 5
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=11.3 KiB (+2968 B), count=77 (+19), average=151 B

Iteration 6
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=13.4 KiB (+2968 B), count=91 (+19), average=151 B

Iteration 7
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=15.5 KiB (+2968 B), count=105 (+19), average=151 B

Iteration 8
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=17.6 KiB (+2968 B), count=119 (+19), average=151 B

Iteration 9
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=19.6 KiB (+2968 B), count=133 (+19), average=151 B

Iteration 10
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=21.7 KiB (+2968 B), count=147 (+19), average=151 B

Iteration 11
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=23.8 KiB (+2968 B), count=161 (+19), average=151 B

Over these 11 iterations, the sparse.py allocation grows from about 3 KB to nearly 24 KB. Because of this, the training run crashes after roughly 10,000 iterations with an out-of-memory error. The main primitive being used is update_all(); we also tried multi_update_all() and the same thing happens.
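For context, the call pattern is essentially the standard one below; the toy graph, feature names, and reduce function here are placeholders, not our actual model:

import dgl
import dgl.function as fn
import torch

g = dgl.rand_graph(100, 500)           # toy graph: 100 nodes, 500 edges
g.ndata['h'] = torch.randn(100, 16)    # random node features

# Copy each source node's 'h' onto its edges as message 'm',
# then sum incoming messages into a new node feature 'h_neigh'.
g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h_neigh'))
print(g.ndata['h_neigh'].shape)        # torch.Size([100, 16])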

Thanks for your help.

Is it possible for you to provide the source code to reproduce the bug?

Would you mind reporting your PyTorch version?
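If you’re unsure, this prints it:

import torch
print(torch.__version__)  # e.g. '1.9.0'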

The PyTorch version I’m using is 1.9.

I’m not sure whether DGL is compatible with PyTorch 1.9, which is still in its beta stage. How about downgrading PyTorch to v1.8?
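For example, pip install torch==1.8.1 should do it in a pip-managed environment (the exact patch version is just a suggestion; pick whichever 1.8.x build matches your CUDA setup).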

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.