Distributed partitioning v0.9.1 release for distrbuted GNN training

Hello,
I am following the 0.9.1 release highlight to perform distributed partitioning of large graphs, specifically ogb datasets mag240 and papers100M and I have several questions:

  1. The chunk_graph.py process is single node and currently the documentation mentions that “DGL currently requires the number of data chunks and the number of partitions to be the same.”
    It seems to me that having access to distributed system requiring that the chunking is done on a single host doesn’t do good use of the resources. Are there plans to change that and also remove the number of chunks condition? Currently if interested to evaluate scale out performance (ie 2-4-8-16-32 hosts) means having to chunk and store the data 5 times.

  2. I have managed to generate the data chunks for both MAG420 and papers100M. The next step in the example is running /workspace/dgl/tools/partition_algo/random_partition.py
    This step is also single node. The memory requirements are reduced compared to pytorch distributed graphsage example where the full graph needed to fit in RAM, but again, having access to the distributed environment this seems to not do a good use of resources
    My understanding is that this random_partition.py could be replaced with parmetis to benefit from the distributed environment but I couldn’t find an easy to follow example using the generated chunked data. If there is one, could you please point me to it?

  3. Finally my last question, it seems that data_dispatching.py benefits from parallel data processing and all hosts will write their partition into a shared ‘nfs’ folder. Why is nfs needed? Isn’t it the goal of distributed training that each node only needs access to their part and anything else would be received from other nodes?

Thanks!!!

  1. In latest DGL(master branch), DGL supports any number of chunks that is larger or equal than number of partitions. As for the chunked data generation, they could be generated by multiple host and this depends on users though chunk_graph.py is running on single machine. chunk_graph.py is just an example for tutorial, it’s not expected to help chunk graph via utilizing multiple machines. User should be responsible for the data chunk/generation process.
  2. random_partition.py is quite simple: assign partition IDs to each node randomly. So it’s ok to do on a single machine. As for ParMETIS, it’s still in progress and not fully ready. I think it will be ready in next release.
  3. I think it’s ok that each machine saves partitioned graph on its own disk, just under same work_dir. NFS is not really necessary.