Trying to train GraphSAGE on a huge graph with multiple GPUs - getting stuck

I am trying to train the unsupervised GraphSAGE example on my own graph:
70 million nodes
5.2 billion edges (2.6 billion undirected edges, each stored in both directions).

If I start with 2 GPUs and a batch size of 25K, training seems to start but is so slow that it would never finish.
So I tried increasing to 8 GPUs, but then training doesn't even begin: it gets stuck enumerating the EdgeDataLoader.
Could this have something to do with CPU memory?

I'd appreciate any guidance.
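For a rough sense of scale, the arithmetic below estimates the size of the raw edge list alone. It assumes 8-byte IDs (5.2 billion edge IDs do not fit in a signed 32-bit integer) and a plain source/destination array layout. The layout, the helper name `edge_list_gb`, and the "one copy per process" scenario are illustrative assumptions, not a description of how DGL actually stores or shares the graph.

```python
# Back-of-envelope size of the edge list for the graph described above.
NUM_EDGES = 5_200_000_000   # both directions of 2.6 billion undirected edges
BYTES_PER_ID = 8            # assumption: 64-bit integer IDs


def edge_list_gb(num_edges, bytes_per_id=BYTES_PER_ID, copies=1):
    """Size in GB (10^9 bytes) of `copies` source/destination array pairs."""
    return num_edges * 2 * bytes_per_id * copies / 1e9


# The edge count exceeds the signed 32-bit range, so 32-bit edge IDs are not an option.
assert NUM_EDGES > 2**31 - 1

one_copy = edge_list_gb(NUM_EDGES)
print(f"one copy of the edge list: {one_copy:.1f} GB")

# Hypothetical scenario: every training process holds its own copy.
for num_procs in (1, 2, 8):
    total = edge_list_gb(NUM_EDGES, copies=num_procs)
    print(f"{num_procs} process(es): {total:.1f} GB")
```

Any additional index structures or node features would come on top of this figure, so it is only a lower bound under the stated assumptions.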


Could you try removing the exclude= and reverse_eids= arguments from the EdgeDataLoader and see if your code runs? Excluding the minibatch edges from the sampled neighbors is particularly time-consuming.
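One way to try this without editing the loader call twice is to build the keyword arguments in a small helper and toggle the exclusion with a flag. This is a minimal sketch: `edge_loader_kwargs` is a hypothetical helper, the `shuffle`, `drop_last` and `num_workers` values are placeholders, and the short list stands in for the real reverse-edge-ID tensor. Only `exclude` and `reverse_eids` come from the suggestion above.

```python
def edge_loader_kwargs(batch_size, reverse_eids=None, exclude_reverse=True):
    """Build keyword arguments intended for dgl.dataloading.EdgeDataLoader.

    With exclude_reverse=False the exclude/reverse_eids pair is left out,
    which is the experiment suggested above.
    """
    kwargs = {
        "batch_size": batch_size,
        "shuffle": True,        # placeholder DataLoader settings
        "drop_last": False,
        "num_workers": 0,
    }
    if exclude_reverse:
        if reverse_eids is None:
            raise ValueError("reverse_eids is required when exclude_reverse=True")
        kwargs["exclude"] = "reverse_id"
        kwargs["reverse_eids"] = reverse_eids
    return kwargs


# Toy reverse mapping for 4 directed edges; the real one would be a tensor.
with_exclusion = edge_loader_kwargs(25_000, reverse_eids=[1, 0, 3, 2])
without_exclusion = edge_loader_kwargs(25_000, exclude_reverse=False)
print(sorted(without_exclusion))

# Assumed usage, with g, train_eids and sampler defined elsewhere:
# loader = dgl.dataloading.EdgeDataLoader(g, train_eids, sampler, **without_exclusion)
```

Timing the first few minibatches with each variant should show whether the exclusion step is what dominates.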

It didn’t help.
Training with more than 2 GPUs still causes the process to hang.
I think it’s a CPU memory issue: with a maximum of 3 GPUs, CPU memory is already maxed out.

I see. What is your hardware configuration? (Memory size, # GPUs, # CPUs, etc.)

80 CPUs - 1 TB memory
8 GPUs 32GB each

I tried running examples/pytorch/graphsage/ with a random graph of 700K nodes and 26M edges. I did not observe CPU memory consumption going up noticeably when increasing the number of GPUs from 1 to 4 (it stayed around 20GB). Did you observe CPU memory consumption growing linearly with the number of GPUs?

Also, what is the DGL version?
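To answer the memory question with numbers, each training process could log its own peak resident memory. The sketch below uses only Python's standard `resource` module (Unix only); the helper names `peak_rss_gb` and `log_peak_rss` are hypothetical, and where to call them in the training script is an assumption.

```python
import os
import resource
import sys


def peak_rss_gb():
    """Peak resident set size of the current process, in GB (10^9 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1e9


def log_peak_rss(tag):
    """Print and return a one-line memory report for this process."""
    msg = f"[pid {os.getpid()}] {tag}: peak RSS {peak_rss_gb():.2f} GB"
    print(msg)
    return msg


# Example call sites: once after loading the graph, once after the first minibatch.
log_peak_rss("after startup")
```

Resident set size includes pages shared between processes, so summing the per-process values can overstate the machine-wide total.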

Yes, it goes up linearly with each GPU. I am using version 0.5.x.

Maybe it has to do with the graph size? Maybe a bigger random graph with an average degree of ~400 would behave the same…

I tried a bigger machine with a bigger graph and I can confirm that CPU memory usage does indeed scale up with the number of GPUs:
1 GPU - 636GB
2 GPUs - 748GB
4 GPUs - 959GB

It seems that some redundant computation is taking place…
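The three measurements above can be summarized with a least-squares line to separate a fixed baseline from the per-GPU increment. This is only a sketch over three data points from that one test setup. The helper `fit_line` is hypothetical, and the extrapolation to 8 GPUs assumes the trend stays linear, which has not been verified.

```python
# (number of GPUs, CPU memory in GB) as reported above
measurements = [(1, 636), (2, 748), (4, 959)]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(measurements)
print(f"baseline: {intercept:.1f} GB, per GPU: {slope:.1f} GB")

# Extrapolation, assuming the linear trend continues.
predicted_8 = intercept + slope * 8
print(f"predicted for 8 GPUs: {predicted_8:.1f} GB")
```

The fit gives roughly 107 GB per additional GPU on top of a baseline of about 530 GB, which would put 8 GPUs near 1.4 TB on that setup if the trend held.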