Distributed or Parallel Training Docs/Guide

Hi! I’m working on a couple of new projects using DGL that I have training questions about. In early experimentation I’ll be training on my home desktop, so distributed training isn’t an issue immediately, but it will be important in the future. Here are my use cases:

Project 1:
Many graphs, each with roughly 100-2,000 nodes and 2-5k edges max; few features, but both edge and node features are desired.

Project 2:
Flexible sizing: the graph as a whole will have 100k+ nodes and roughly 250k edges, with mostly 0-1,000+ features per node. The graph can (and definitely will) be split down into very small subgraphs (100-1,000 nodes) if necessary, but the larger the better.

My issue is training after initial testing. I have access to a good-sized HPC cluster running SLURM; we schedule jobs in Singularity containers, and the GPU nodes have V100s for the most part. Is there a good path for training around that setup? It would be quite nice to train across multiple nodes when they’re available, but they are accessed through separate containers. I understand this may not be an issue to be solved directly with DGL; I’m certainly still a novice, so if DGL doesn’t or won’t handle this kind of thing, I’d love a pointer in the right direction.

I’ve looked into Ray and RLlib, for the first project in particular, but I’m not entirely sure how Ray handles workers with different lifetimes, waiting for scheduled slots, etc. I will ask them for more help if that’s a good idea, but if DGL can work around this easily, that would save a lot of effort. PyTorch also has some distributed tools if that’s the place to look, but I know DGL sits on top of other libraries, so the integration might not be so simple.

Thanks in advance, I know this kind of question is a little general and I appreciate any help!

Hi @Pangurbon, these are interesting scenarios and thank you for sharing.

Project 1 is a typical many-graph case, so we recommend data parallelism: each node computes on different graphs (batches), and the parameters are synchronized after each SGD iteration. DGL works with torch.distributed (see the code in the transformer example), and we use it for multi-GPU training on a single machine. For multiple machines I’m not familiar with how this is set up, but I guess it is a good starting point.
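
To make the data-parallel idea concrete, here is a minimal sketch of single-machine multi-GPU training with torch.distributed and DistributedDataParallel over batched DGL graphs. The model, the random toy graphs, and the hyperparameters are placeholders I made up for illustration (this is not the transformer example itself); only the DDP wiring is the point.

```python
# Minimal sketch: single-machine multi-GPU data parallelism with DGL + torch.distributed.
# Launch with e.g.: torchrun --nproc_per_node=4 train_ddp.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import dgl
from dgl.nn import GraphConv


class SmallGraphClassifier(torch.nn.Module):
    """Two-layer GCN with mean-pooling readout for graph classification."""

    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden)
        self.conv2 = GraphConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, n_classes)

    def forward(self, g, feat):
        h = torch.relu(self.conv1(g, feat))
        h = torch.relu(self.conv2(g, h))
        with g.local_scope():
            g.ndata['h'] = h
            hg = dgl.mean_nodes(g, 'h')  # one vector per graph in the batch
        return self.readout(hg)


def make_toy_batch(batch_size, device):
    # Stand-in for a real (sharded) dataloader: random ~200-node graphs with 8-dim features.
    graphs, labels = [], []
    for _ in range(batch_size):
        g = dgl.add_self_loop(dgl.rand_graph(200, 800))
        g.ndata['feat'] = torch.randn(g.num_nodes(), 8)
        graphs.append(g)
        labels.append(torch.randint(0, 2, (1,)))
    return dgl.batch(graphs).to(device), torch.cat(labels).to(device)


def main():
    dist.init_process_group('nccl')  # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK
    local_rank = int(os.environ['LOCAL_RANK'])
    device = torch.device('cuda', local_rank)

    model = SmallGraphClassifier(in_feats=8, hidden=64, n_classes=2).to(device)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(100):
        batched_graph, labels = make_toy_batch(32, device)
        logits = model(batched_graph, batched_graph.ndata['feat'])
        loss = torch.nn.functional.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        opt.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    main()
```

In a real run, each rank should see a different shard of the graph dataset (e.g. via a distributed sampler) instead of the random toy batches above.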

Another option is MXNet, which provides many deployment solutions for different scheduling systems. @zhengda1936 may have more experience with this.

Project 2 is interesting. I did a brief calculation: for 100k+ nodes with 1K features, the total memory consumption for the node features is 100M floats * 4 B = 400 MB if the features are single-precision. Suppose you have ten layers (including activations) that need to backprop gradients; then the maximum memory consumption is ~4 GB, which is fine compared with a 16 GB V100 card. If you don’t want to split the graph down into very small subgraphs, then loading the full graph onto the GPU looks like a good solution to me. If you still want to split it, then I think the solution to Project 1 should apply.
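
For illustration, here is a rough full-graph training sketch under those assumptions. The random stand-in graph (100k nodes / 250k edges, 1K-dim features), the 10-class node-classification task, and the two-layer GCN are all made up for the example, not taken from your setup.

```python
# Rough sketch of full-graph training on a single GPU.
# Memory for the input features alone: 100k nodes * 1k feats * 4 bytes ≈ 400 MB.
import torch
import dgl
from dgl.nn import GraphConv


class GCN(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden)
        self.conv2 = GraphConv(hidden, n_classes)

    def forward(self, g, x):
        h = torch.relu(self.conv1(g, x))
        return self.conv2(g, h)


# Stand-in graph: ~100k nodes, ~250k edges, 1K-dim node features.
g = dgl.add_self_loop(dgl.rand_graph(100_000, 250_000))
g.ndata['feat'] = torch.randn(g.num_nodes(), 1000)
g.ndata['label'] = torch.randint(0, 10, (g.num_nodes(),))
g = g.to('cuda')  # the whole graph and its features live on the GPU

model = GCN(in_feats=1000, hidden=256, n_classes=10).to('cuda')
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(50):
    logits = model(g, g.ndata['feat'])  # one forward pass over the entire graph
    loss = torch.nn.functional.cross_entropy(logits, g.ndata['label'])
    opt.zero_grad()
    loss.backward()
    opt.step()
```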

Hope this helps!