Context: I’m working with heterogeneous graphs that are too large to fit on a single machine, so DistDGL is an extremely attractive framework that I’m exploring!
I see that DGL has support for performing graph partitioning in a distributed manner, via ParMETIS, which is great. However, in the docs I see this:
DGL provides a script named convert_partition.py, located in the tools directory, to convert the data in the partition files into :class:dgl.DGLGraph objects and save them into files. Note: convert_partition.py runs in a single machine. In the future, we will extend it to convert graph data in parallel across multiple machines.
Does this imply that the graph needs to be small enough to fit on a single machine? In other words, is convert_partition.py a “memory bottleneck” for very-large-graph processing?
convert_partition.py loads one partition at a time when constructing a DGLGraph for each partition. Here we assume that a single partition fits in memory. This assumption aligns with distributed training, in which each machine loads one partition into memory.
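The answer above can be sketched with a toy calculation (plain Python, no DGL required; the partition sizes are made up for illustration): because the tool holds only one partition in memory at a time, its peak memory footprint is bounded by the largest single partition, not by the whole graph.

```python
# Hypothetical per-partition sizes in GB; the full graph is their sum.
partition_sizes_gb = [12, 15, 9, 14]

def peak_memory_one_at_a_time(sizes):
    """Simulate loading each partition, processing it, then freeing it
    before the next one is loaded (the convert_partition.py pattern)."""
    peak = 0
    for size in sizes:
        in_memory = size              # load this partition
        peak = max(peak, in_memory)   # track the high-water mark
        # ... construct a DGLGraph for the partition and save it ...
        in_memory = 0                 # partition is freed before the next

    return peak

total = sum(partition_sizes_gb)
peak = peak_memory_one_at_a_time(partition_sizes_gb)
print(f"whole graph: {total} GB, peak resident: {peak} GB")
```

So a 50 GB graph split into four partitions only needs a machine with roughly 15 GB free (plus overhead) for the conversion step, matching the per-machine requirement of distributed training.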
A follow-up question: does the machine that partitions the graph (e.g. via dgl.distributed.partition_graph()) have to be able to load the entire graph into memory in order to do the partitioning? In other words, is partition_graph() itself a “memory bottleneck” for extremely large graphs?