Hi,
I’m trying to run GNN training code using DGL, and it appears that there is a memory leak in sparse.py.
I used tracemalloc to capture the top differences in memory allocation across 11 training iterations, and only the line pointing at sparse.py shows a continuous increase in allocated memory blocks. The per-iteration output is below.
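For context, this is roughly how the numbers were collected; `train_one_iteration` is just a placeholder for our actual training step, not the real code:

```python
import tracemalloc

def train_one_iteration():
    # Placeholder for our real DGL training step (forward, backward, optimizer).
    pass

tracemalloc.start()
prev = tracemalloc.take_snapshot()

for it in range(1, 12):
    train_one_iteration()
    snap = tracemalloc.take_snapshot()
    print(f"Iteration {it}")
    # Print the top allocation differences versus the previous snapshot.
    for stat in snap.compare_to(prev, 'lineno')[:5]:
        print(stat)
    prev = snap
```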
Iteration 1
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=3086 B (+3086 B), count=21 (+21), average=147 B
Iteration 2
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=5214 B (+2968 B), count=35 (+19), average=149 B
Iteration 3
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=7342 B (+2968 B), count=49 (+19), average=150 B
Iteration 4
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=9470 B (+2968 B), count=63 (+19), average=150 B
Iteration 5
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=11.3 KiB (+2968 B), count=77 (+19), average=151 B
Iteration 6
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=13.4 KiB (+2968 B), count=91 (+19), average=151 B
Iteration 7
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=15.5 KiB (+2968 B), count=105 (+19), average=151 B
Iteration 8
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=17.6 KiB (+2968 B), count=119 (+19), average=151 B
Iteration 9
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=19.6 KiB (+2968 B), count=133 (+19), average=151 B
Iteration 10
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=21.7 KiB (+2968 B), count=147 (+19), average=151 B
Iteration 11
…/lib/python3.8/site-packages/dgl-0.7-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py:307: size=23.8 KiB (+2968 B), count=161 (+19), average=151 B
Over these 11 iterations, the allocation attributed to sparse.py:307 has grown from about 3 KB to nearly 24 KB, increasing at a steady rate of roughly 3 KB per iteration. As a result, the training run crashes after roughly 10,000 iterations with an out-of-memory error. The main primitive we use is update_all(); we also tried multi_update_all() and observed the same behavior. A simplified sketch of the kind of call involved is shown below.
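For reference, this is only a minimal sketch of the kind of update_all() call our training loop makes each iteration; the graph, features, and reduce function here are placeholders, not our actual model:

```python
import dgl
import dgl.function as fn
import torch

# Hypothetical stand-in for our real setup: the actual graph, features, and
# model are much larger, but the message-passing call has the same shape.
g = dgl.rand_graph(1000, 5000)
g.ndata['h'] = torch.randn(g.num_nodes(), 16)

for it in range(10000):
    # Copy source-node features as messages and sum them on destination nodes.
    g.update_all(fn.copy_u('h', 'm'), fn.sum('m', 'h'))
    # loss computation, backward pass, and optimizer step omitted here
```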
Thanks for your help.