Hello,
I am following the 0.9.1 release highlight to perform distributed partitioning of large graphs, specifically the OGB datasets MAG240 and papers100M, and I have several questions:
- The `chunk_graph.py` process is single node, and the documentation currently mentions that "DGL currently requires the number of data chunks and the number of partitions to be the same." It seems to me that having access to a distributed system but requiring the chunking to be done on a single host doesn't make good use of the resources. Are there plans to change that, and also to remove the chunks-equal-partitions condition? Currently, evaluating scale-out performance (i.e. 2, 4, 8, 16, 32 hosts) means having to chunk and store the data 5 times (roughly the loop in the first sketch after this list).
- I have managed to generate the data chunks for both MAG240 and papers100M. The next step in the example is running `/workspace/dgl/tools/partition_algo/random_partition.py`. This step is also single node. The memory requirements are reduced compared to the PyTorch distributed GraphSAGE example, where the full graph needed to fit in RAM, but again, with a distributed environment available this doesn't seem to make good use of the resources (my understanding of what this step does is in the second sketch below). My understanding is that this `random_partition.py` could be replaced with ParMETIS to benefit from the distributed environment, but I couldn't find an easy-to-follow example that uses the generated chunked data. If there is one, could you please point me to it?
- Finally, my last question: it seems that `data_dispatching.py` benefits from parallel data processing, and all hosts write their partition into a shared NFS folder. Why is NFS needed? Isn't the goal of distributed training that each node only needs access to its own part, and anything else would be received from the other nodes (see the last sketch below)?
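
To make the first point concrete, this is roughly the workflow I end up running today. `run_chunking` is a hypothetical placeholder for the actual `chunk_graph.py` invocation from the tutorial (its real arguments are omitted here):

```python
# Hypothetical sketch of re-chunking for every cluster size I want to evaluate.
# run_chunking() stands in for the actual chunk_graph.py invocation from the tutorial.
def run_chunking(num_chunks: int, out_dir: str) -> None:
    ...  # chunk the graph into num_chunks chunks and write them to out_dir

for num_hosts in [2, 4, 8, 16, 32]:
    # Because #chunks must equal #partitions, each cluster size needs its own copy.
    run_chunking(num_chunks=num_hosts, out_dir=f"/data/papers100M_chunked_{num_hosts}")
```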
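
On the second point, my understanding is that random partitioning conceptually does something like the following (a sketch only; the actual script and its output format may differ), which is why it only needs node counts rather than the full graph in memory:

```python
import numpy as np

# Conceptual sketch: randomly assign every node of each node type to one of
# num_parts partitions and write the assignment out, one partition id per line.
num_parts = 4
num_nodes_per_type = {"paper": 111_059_956}  # papers100M node count, used only for illustration

for ntype, num_nodes in num_nodes_per_type.items():
    assignment = np.random.randint(0, num_parts, size=num_nodes, dtype=np.int64)
    np.savetxt(f"{ntype}.txt", assignment, fmt="%d")
```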
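
And on the last point, my expectation was that after dispatching, each trainer would only ever read its own partition from local disk, along the lines of this sketch (the path and part id are made up; I am assuming `dgl.distributed.load_partition` is the relevant entry point):

```python
import dgl

# Sketch: each machine loads only its own partition from local storage.
# "/local/papers100M.json" and part_id=0 are made-up values for illustration.
part_config = "/local/papers100M.json"   # partition metadata JSON for this job
part_data = dgl.distributed.load_partition(part_config, part_id=0)
local_graph = part_data[0]               # graph structure of this machine's partition
```

If that is how training consumes the partitions, I don't see why a shared NFS folder is required at dispatch time rather than each host keeping its own partition locally.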
Thanks!!!