When I use multiple GPUs for training, can the graph change dynamically?

I want the graph to be able to change dynamically (e.g., adding some nodes or edges) when I use DDP for multi-GPU training, but I cannot find a way. Is there any way to do this?

Hi,

Currently dgl.DGLGraph supports dynamically adding nodes and edges.
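For example, a minimal sketch on CPU:

```python
import dgl
import torch

# Build a small graph: 3 nodes, edges 0->1 and 1->2.
g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))

# Grow it in place.
g.add_nodes(2)                                            # now 5 nodes
g.add_edges(torch.tensor([2, 3]), torch.tensor([3, 4]))   # now 4 edges

print(g.num_nodes(), g.num_edges())  # 5 4
```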

Could you explain in a bit more detail what you would like to do with DDP (and ideally point me to a paper)?

Thanks.

Hi,

I want to train on a large graph and use DDP with multiple GPUs to accelerate training. But the graph obtained in each DDP subprocess is an independent copy of the original graph (in the main process), shared only via copy-on-write. If I change the graph dynamically in the DDP subprocesses, I expect memory will have to hold the original graph and the changed graphs at the same time, and I am worried that there will not be enough memory.
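A minimal sketch of what I mean (the worker function here is just for illustration; with spawn the graph is pickled to each child rather than forked, but either way each subprocess ends up with an independent copy):

```python
import torch
import torch.multiprocessing as mp
import dgl

def worker(rank, g):
    # Each spawned process holds its own copy of g, so mutating it
    # here does not touch the parent's graph; it duplicates memory.
    g.add_edges(torch.tensor([0]), torch.tensor([2]))
    print(f"rank {rank}: {g.num_edges()} edges")  # 3

if __name__ == "__main__":
    g = dgl.graph((torch.tensor([0, 1]), torch.tensor([1, 2])))
    mp.spawn(worker, args=(g,), nprocs=2, join=True)
    print(f"parent: {g.num_edges()} edges")  # still 2
```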

So I am looking for a way to keep the original graph and the graphs in each DDP subprocess sharing memory as the graph changes dynamically.

Thank you!

As far as I'm aware, mutable graphs currently cannot reside in shared memory (@zhengda1936 please confirm).

If the graph is shared between processes, what happens if two subprocesses add edges at the same time? It sounds like you would have write conflicts.

I think only one of the subprocesses could be designated to change the graph.

Sounds like you need some locking mechanism between the per-GPU workers, which could be complicated.
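One lock-free pattern that might work instead (just a sketch; apply_updates is a hypothetical helper, and it assumes a CPU graph with a gloo process group already initialized): let rank 0 decide the new edges and broadcast them, so every replica applies the identical mutation to its own copy.

```python
import torch
import torch.distributed as dist

def apply_updates(g, rank, new_src=None, new_dst=None):
    # Hypothetical helper: rank 0 proposes new edges; all ranks apply
    # the same mutation so the per-process copies stay consistent.
    n = torch.tensor([new_src.numel() if rank == 0 else 0])
    dist.broadcast(n, src=0)  # tell everyone how many edges are coming
    if rank != 0:
        new_src = torch.empty(n.item(), dtype=torch.int64)
        new_dst = torch.empty(n.item(), dtype=torch.int64)
    dist.broadcast(new_src, src=0)
    dist.broadcast(new_dst, src=0)
    g.add_edges(new_src, new_dst)  # no locks: every rank writes locally
```

This keeps the replicas consistent without locking, though note it does not solve the memory problem: each process still holds its own copy of the graph.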

I believe your problem can be solved in an easier way. Could you please share your setting (e.g., which task you are solving, what your graph looks like, how new edges come in, etc.)? A paper reference would be even better.

Also, for large-graph training we recommend minibatch-based approaches, which can significantly reduce the time and memory needed to train your model. You can refer to this tutorial or this code for details.
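A minimal sketch of what that looks like (using the NeighborSampler/DataLoader API from recent DGL releases; older versions call these MultiLayerNeighborSampler and NodeDataLoader, and rand_graph here is just a stand-in for your graph):

```python
import dgl
import torch

g = dgl.rand_graph(10000, 200000)   # toy stand-in for your large graph
train_nids = torch.arange(1000)     # nodes you train on

sampler = dgl.dataloading.NeighborSampler([10, 10])  # 2 layers, fanout 10
loader = dgl.dataloading.DataLoader(
    g, train_nids, sampler,
    batch_size=1024, shuffle=True, drop_last=False, num_workers=0)

for input_nodes, output_nodes, blocks in loader:
    # `blocks` are the sampled message-flow graphs for one minibatch;
    # feed them to your GNN instead of the full graph.
    pass
```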