The standalone mode can only work with the graph data with one partition

I’m trying to perform distributed training on an instance with 8 GPUs. I have a graph with 8.5M nodes and 1.2B edges. I created 8 partitions:

# Partition into 8 parts; this writes partitions/graph_partition.json
# plus one directory per partition.
dgl.distributed.partition_graph(graph, 'graph_partition', 8, 'partitions/')

When I try to load the partitioned graph into memory, I get the following error:

>>> g = dgl.distributed.DistGraph('graph_partition', part_config='partitions/graph_partition.json')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/dgl/distributed/dist_graph.py", line 390, in __init__
    'The standalone mode can only work with the graph data with one partition'
AssertionError: The standalone mode can only work with the graph data with one partition

Is distributed training supported on a single instance with multiple GPUs?

Please follow https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_sampling_multi_gpu.py when you are running on a single machine with multiple GPUs. Standalone mode is mainly intended for debugging.
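
In outline, the pattern in that example is one worker process per GPU, each training on its own slice of the training nodes with DistributedDataParallel, while the full graph stays in CPU memory. Below is a minimal sketch of that pattern using DGL 0.5-era APIs; the built-in Citeseer dataset stands in for your own graph, and the model size, fanouts, batch size, learning rate, and rendezvous port are all illustrative.

# Minimal single-machine multi-GPU sketch (DGL 0.5-era APIs).
# Citeseer is a stand-in for your own graph; hyperparameters are placeholders.
import math
import torch
import torch.nn as nn
import torch.multiprocessing as mp
import dgl
import dgl.nn as dglnn

class SAGE(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes):
        super().__init__()
        self.layers = nn.ModuleList([
            dglnn.SAGEConv(in_feats, n_hidden, 'mean'),
            dglnn.SAGEConv(n_hidden, n_classes, 'mean'),
        ])

    def forward(self, blocks, x):
        h = x
        for layer, block in zip(self.layers, blocks):
            h = layer(block, h)
        return h

def run(rank, world_size, g, train_nids):
    torch.distributed.init_process_group(
        'nccl', init_method='tcp://127.0.0.1:12345',  # port is illustrative
        world_size=world_size, rank=rank)
    torch.cuda.set_device(rank)
    # Each process trains on its own slice of the training node IDs.
    chunk = math.ceil(len(train_nids) / world_size)
    local_nids = train_nids.split(chunk)[rank]
    sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
    dataloader = dgl.dataloading.NodeDataLoader(
        g, local_nids, sampler,
        batch_size=1024, shuffle=True, drop_last=False, num_workers=0)
    n_classes = int(g.ndata['label'].max()) + 1
    model = SAGE(g.ndata['feat'].shape[1], 128, n_classes).to(rank)
    model = nn.parallel.DistributedDataParallel(
        model, device_ids=[rank], output_device=rank)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for input_nodes, output_nodes, blocks in dataloader:
        # The graph and features stay on the CPU; only each mini-batch
        # is copied to the GPU.
        blocks = [b.to(rank) for b in blocks]
        x = g.ndata['feat'][input_nodes].to(rank)
        y = g.ndata['label'][output_nodes].to(rank)
        loss = loss_fn(model(blocks, x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

if __name__ == '__main__':
    data = dgl.data.CiteseerGraphDataset()  # stand-in for your own graph
    g = data[0]
    train_nids = g.ndata['train_mask'].nonzero(as_tuple=True)[0]
    n_gpus = torch.cuda.device_count()
    # Fork-style processes (the Linux default), as in the linked example,
    # so the CPU graph is not pickled once per worker.
    procs = [mp.Process(target=run, args=(r, n_gpus, g, train_nids))
             for r in range(n_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()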

That example doesn’t use DistDataLoader. I’m getting out-of-memory errors when training with dgl.dataloading.EdgeDataLoader on a graph with 8M nodes and 1.2B edges; my machine has 480 GB of RAM.
I’m wondering if I can leverage graph partitioning to train in a distributed fashion, but it looks like that isn’t supported on a single machine with multiple GPUs. Is that the case?
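
For context, my loader is set up roughly like this (a minimal sketch; the fanouts, batch size, and uniform negative sampler are illustrative, and a small random graph stands in for the real one):

import torch
import dgl

g = dgl.rand_graph(1000, 20000)  # stand-in for the 8M-node / 1.2B-edge graph
train_eids = torch.arange(g.number_of_edges())
sampler = dgl.dataloading.MultiLayerNeighborSampler([10, 25])
dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_eids, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=1024, shuffle=True, drop_last=False, num_workers=0)
for input_nodes, pos_graph, neg_graph, blocks in dataloader:
    pass  # forward/backward on the sampled blocks goes here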

Partitioning won’t help in this case, since you always need to load the whole graph into memory anyway. Which code are you following for multi-GPU training?
Are you hitting OOM in CPU memory or GPU memory?
