The standalone mode can only work with the graph data with one partition

I’m trying to perform distributed training on an instance with 8 GPUs. I have a graph with 8.5M nodes and 1.2B edges. I created 8 partitions:

# Partition into 8 parts; this writes partitions/graph_partition.json
# plus one directory per partition.
dgl.distributed.partition_graph(graph, 'graph_partition', 8, 'partitions/')

When I try to load the partitioned graph into memory, I get the following error:

>>> g = dgl.distributed.DistGraph('graph_partition', part_config='partitions/graph_partition.json')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/dgl/distributed/dist_graph.py", line 390, in __init__
    'The standalone mode can only work with the graph data with one partition'
AssertionError: The standalone mode can only work with the graph data with one partition

Is distributed training supported on a single instance with multiple GPUs?

Please follow https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_multi_gpu.py when you are running on a single machine with multiple GPUs. Standalone mode is mainly intended for debugging.
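
In outline, the pattern in that example is one worker process per GPU, each training on its own slice of the training nodes with DistributedDataParallel, while the full graph stays in CPU memory. Below is a minimal sketch of that pattern using DGL 0.5-era APIs; the built-in Citeseer dataset stands in for your own graph, and the model size, fanouts, batch size, learning rate, and rendezvous port are all illustrative.

# Minimal single-machine multi-GPU sketch (DGL 0.5-era APIs).
# Citeseer is a stand-in for your own graph; hyperparameters are placeholders.
import math
import torch
import torch.nn as nn
import torch.multiprocessing as mp
import dgl
import dgl.nn as dglnn

class SAGE(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.layers = nn.ModuleList([
            dglnn.SAGEConv(in_feats, n_hidden, 'mean'),
            dglnn.SAGEConv(n_hidden, n_classes, 'mean'),
        ])

    def forward(self, blocks, x):
        h = x
        for layer, block in zip(self.layers, blocks):
            h = layer(block, h)
        return h

def run(rank, world_size, g, train_nids):
    torch.distributed.init_process_group(
        'nccl', init_method='tcp://127.0.0.1:12345',  # port is illustrative
        world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    # Each process trains on its own slice of the training node IDs.
    chunk = math.ceil(len(train_nids) / world_size)
    local_nids = train_nids.split(chunk)[rank]
    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, local_nids, sampler,
        batch_size=1024, shuffle=True, drop_last=False, num_workers=0)
    n_classes = int(g.ndata['label'].max()) + 1
    model = SAGE(g.ndata['feat'].shape[1], 128, n_classes).to(rank)
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[rank], output_device=rank)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for input_nodes, output_nodes, blocks in dataloader:
        # The graph and features stay on the CPU; only each mini-batch
        # is copied to the GPU.
        blocks = [b.to(rank) for b in blocks]
        x = g.ndata['feat'][input_nodes].to(rank)
        y = g.ndata['label'][output_nodes].to(rank)
        loss = loss_fn(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == '__main__':
    data = dgl.data.CiteseerGraphDataset()  # stand-in for your own graph
    g = data[0]
    train_nids = g.ndata['train_mask'].nonzero(as_tuple=True)[0]
    n_gpus = torch.cuda.device_count()
    # Fork-style processes (the Linux default), as in the linked example,
    # so the CPU graph is not pickled once per worker.
    procs = [mp.Process(target=run, args=(r, n_gpus, g, train_nids))
             for r in range(n_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()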

That example doesn’t use DistDataLoader. I’m getting out-of-memory errors when training with dgl.dataloading.EdgeDataLoader on a graph with 8M nodes and 1.2B edges; my machine has 480 GB of RAM.
I’m wondering if I can leverage graph partitioning to train in a distributed fashion, but it looks like that isn’t supported on a single machine with multiple GPUs. Is that the case?
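
For context, my loader is set up roughly like this (a minimal sketch; the fanouts, batch size, and uniform negative sampler are illustrative, and a small random graph stands in for the real one):

import torch
import dgl

g = dgl.rand_graph(1000, 20000)  # stand-in for the 8M-node / 1.2B-edge graph
train_eids = torch.arange(g.number_of_edges())
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_eids, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=1024, shuffle=True, drop_last=False, num_workers=0)
for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    pass  # forward/backward on the sampled blocks goes here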

Partitioning won’t help in this case, since you always need to load the whole graph into memory anyway. Which code are you following for multi-GPU training?
Are you hitting OOM in CPU memory or GPU memory?
