Hi! I’m working on a couple of new projects using DGL that I have training questions about. In early experimentation I’ll be training on my home desktop, so distributed training isn’t an issue immediately, but it will be important in the future. Here are my use cases:
Project 1:
Many graphs, each with roughly 100-2,000 nodes and 2-5k edges max; few features, but both edge and node features are desired (rough sketch of what I mean below).
Project 2:
Flexible sizing. The graph as a whole will have 100k+ nodes and ~250k edges, with 0-1,000+ features on nodes for the most part. The graph can (and definitely will) be split down into very small subgraphs (100-1,000 nodes) if necessary, but the larger the better.
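To make Project 1 a bit more concrete, here is roughly what I picture for a single graph plus batching. This is just a sketch; the sizes and feature names are made up:

```python
import dgl
import torch

# Hypothetical small graph for Project 1: ~500 nodes, ~2k edges,
# a handful of node features and a single edge feature.
num_nodes, num_edges = 500, 2000
src = torch.randint(0, num_nodes, (num_edges,))
dst = torch.randint(0, num_nodes, (num_edges,))
g = dgl.graph((src, dst), num_nodes=num_nodes)
g.ndata["feat"] = torch.randn(num_nodes, 8)   # few node features
g.edata["feat"] = torch.randn(num_edges, 1)   # one edge feature

# Many such graphs would get batched into one for minibatch training.
batched = dgl.batch([g, g, g])
print(batched.num_nodes(), batched.num_edges())
```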
My issue is training after initial testing. I have access to a good-sized HPC cluster running SLURM; we schedule jobs in Singularity containers, and the GPU nodes have V100s for the most part. Is there a good path for training around that setup? It would be quite nice to train across multiple nodes when they’re available, but they are accessed through separate containers. I understand this may not be an issue to be solved directly with DGL; I’m certainly still a novice, so if DGL doesn’t or won’t handle this kind of thing, I’d love a pointer in the right direction.
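In case it helps, here is roughly how I imagined a per-task training script picking up its rank and world size from SLURM to initialize torch.distributed. This is only a sketch of my assumptions; I don’t know if it’s the right approach with Singularity containers, and exporting MASTER_ADDR/MASTER_PORT in the job script is just my guess at how the rendezvous would work:

```python
import os
import torch
import torch.distributed as dist

def init_from_slurm():
    # SLURM_PROCID / SLURM_NTASKS are set by srun for each task;
    # MASTER_ADDR / MASTER_PORT would have to be exported in the job script.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
    # Each container/task pins itself to one local GPU (a V100 on our nodes).
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))
    torch.cuda.set_device(local_rank)
    return rank, world_size, local_rank
```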
I’ve looked into Ray and RLlib, for the first project in particular, but I’m not entirely sure how Ray handles workers with different lifetimes, waiting for scheduled slots, etc. I will ask over there for more help if that’s a good idea, but if DGL can work around this easily, that would save a lot of effort. PyTorch also has some distributed tools if that’s the place to look, but I know DGL sits on top of other libraries, so the integration might not be so simple.
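For the PyTorch route, my (possibly naive) understanding is that a DGL model is just an nn.Module, so it could be wrapped in DistributedDataParallel like anything else. The tiny GCN below is only a placeholder to show what I mean:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
import dgl.nn as dglnn

class TinyGCN(nn.Module):
    # Placeholder two-layer GCN; layer sizes are made up.
    def __init__(self, in_feats, hidden, out_feats):
        super().__init__()
        self.conv1 = dglnn.GraphConv(in_feats, hidden)
        self.conv2 = dglnn.GraphConv(hidden, out_feats)

    def forward(self, g, x):
        h = torch.relu(self.conv1(g, x))
        return self.conv2(g, h)

# Assumes the process group has already been initialized
# (e.g. via the SLURM sketch above) before wrapping in DDP.
device = torch.device("cuda")
model = TinyGCN(8, 64, 2).to(device)
model = DDP(model, device_ids=[torch.cuda.current_device()])
```

Does something like this make sense with DGL, or is there a more idiomatic distributed path I should be looking at?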
Thanks in advance, I know this kind of question is a little general and I appreciate any help!