Following this doc, I am using edge_split to create the training dataset for distributed training.
I found that the split_even method called by edge_split builds the index tensor by calling
eles = F.nonzero_1d(F.tensor(elements))
Our graph has 13 billion edges for training, so the variable elements is a boolean list of length 13 billion. F.tensor consumes 13 billion * 1 byte ≈ 12 GB, and F.nonzero_1d consumes up to 13 billion * 8 bytes ≈ 96 GB (since nearly all entries are True), for a total of about 108 GB. This memory cost cannot be decreased by adding more training workers, because each worker materializes the full index array.
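To illustrate the kind of optimization I have in mind: since each worker ultimately needs only its own even slice of the nonzero indices, the full 8-byte index array never has to be materialized. Below is a minimal NumPy sketch of that idea (nonzero_slice, rank, world_size, and chunk_size are hypothetical names I chose; this is not DGL's API or its current split_even implementation):

```python
import numpy as np

def nonzero_slice(mask, rank, world_size, chunk_size=100_000_000):
    """Return only this rank's even slice of the nonzero indices of a
    boolean mask, scanning chunk by chunk so peak extra memory is one
    chunk of indices plus the output slice. Illustrative sketch only."""
    total = int(np.count_nonzero(mask))        # one pass, no index array
    per = total // world_size
    start = rank * per
    stop = total if rank == world_size - 1 else start + per

    out = np.empty(stop - start, dtype=np.int64)
    seen = 0      # nonzero entries seen in earlier chunks
    filled = 0    # entries already written into `out`
    for off in range(0, len(mask), chunk_size):
        idx = np.flatnonzero(mask[off:off + chunk_size])
        n = idx.size
        # Intersect [seen, seen + n) with this rank's range [start, stop).
        lo = max(start - seen, 0)
        hi = min(stop - seen, n)
        if lo < hi:
            out[filled:filled + hi - lo] = idx[lo:hi] + off
            filled += hi - lo
        seen += n
        if seen >= stop:
            break
    return out
```

With this scheme each worker's peak usage is roughly one chunk of indices plus its own output slice, instead of the full 96 GB index array; the 12 GB boolean mask could similarly be kept as a memory-mapped array.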
Does DGL have any plans to optimize this part? If not, may I try to optimize it and contribute the change?