DiffPool seems to be very memory inefficient? I run out of memory even with small batch sizes

I am trying to use DGL's DiffPool for graph classification on my own dataset. I am using this implementation and only replacing train.py with my own code:

The problem is I can't even finish one epoch without running out of VRAM, even though I have 6 GB of VRAM. Even setting the batch size to 1 or 2 didn't help. This is the error:

/dgl/diffpool/model/dgl_layers/gnn.py", line 131, in forward
    current_lp_loss = torch.norm(adj.to_dense() -
RuntimeError: CUDA out of memory. Tried to allocate 1.54 GiB (GPU 0; 5.93 GiB total capacity; 3.55 GiB already allocated; 1.10 GiB free; 3.83 GiB reserved in total by PyTorch)

Is this normal? Why is it running out of memory when the batch size is as small as 1 or 2?

I don't think there is a problem with my code, because I have used essentially the same code (with only minor differences) with other architectures such as GCN, and I never ran out of VRAM on this dataset even with batch_size > 64.

I am also using the default parameters:

                pool_ratio=0.15,
                num_pool=1,
                cuda=0,
                lr=1e-3,
                clip=2.0,
                batch_size=2,
                epoch=100,
                train_ratio=0.7,
                test_ratio=0.1,
                n_worker=1,
                gc_per_block=3,
                dropout=0.0,
                method='diffpool',
                bn=True,
                bias=True,
                save_dir="./model_param",
                load_epoch=-1,
                data_mode='default'

Hi,

This is expected, because DiffPool involves a lot of dense computation. The adjacency matrix after pooling is a dense matrix. If you batch 30 graphs with about 300 nodes per graph, the batched adjacency matrix is about 10000 x 10000 and dense, so it consumes a lot of memory. Other GNN models mostly use sparse computation, which is highly optimized in DGL.
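
Just to make the scale concrete, here is a rough back-of-the-envelope sketch (the 30-graph / 300-node numbers are only illustrative; the actual shapes depend on your dataset, batch size and pool_ratio):

    # Rough, illustrative estimate of how big a dense batched adjacency matrix gets.
    num_graphs = 30          # hypothetical graphs per batch
    nodes_per_graph = 300    # hypothetical average nodes per graph
    n = num_graphs * nodes_per_graph          # the batched graph has n = 9000 nodes
    bytes_per_float32 = 4

    dense_adj_gib = n * n * bytes_per_float32 / 1024**3
    print(f"one dense {n} x {n} float32 matrix: {dense_adj_gib:.2f} GiB")  # ~0.30 GiB

    # The link-prediction loss in gnn.py (torch.norm(adj.to_dense() - ...)) has to
    # materialize adj.to_dense(), the tensor it is compared against, and their
    # difference, and autograd keeps intermediates for the backward pass, so several
    # buffers of this size can be alive at once and quickly exhaust 6 GB of VRAM.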

But I can't use it even with a batch size of 1. There seems to be a problem, because I ran out of memory after 120-130 batches, not at the start, which suggests the memory is not getting cleaned up; I would expect the VRAM to be freed after each batch. I am using the same training code that I have tried with other GNN models, so I doubt the problem is in my code.

Could you raise an issue on DGL's GitHub repo? It's possible that the bug is inside DGL. Also, did you make any modifications to the original code?
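
One thing worth double-checking in your train.py before filing the issue: GPU memory that keeps growing batch after batch is often caused by holding on to tensors that are still attached to the autograd graph, for example accumulating the loss tensor itself instead of loss.item(). This is only a hypothetical sketch of the pattern, since I haven't seen your training loop:

    import torch
    import torch.nn as nn

    # Minimal, hypothetical setup just to illustrate the pattern (not the DiffPool model).
    model = nn.Linear(8, 2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    batches = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(100)]

    total_loss = 0.0
    for features, labels in batches:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

        # total_loss += loss        # keeps every batch's autograd graph alive, so memory grows
        total_loss += loss.item()   # .item() converts to a Python float, so the graph can be freed

If your loop already avoids this and memory still climbs across batches, then an issue on GitHub with a script that reproduces it would be the best next step.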