Questions about efficiency in distributed computing

BearBiscuit05 · November 18, 2022, 1:08am

Hello, thank you very much for seeing this question. Recently, when using DGL’s distribution, I encountered some places that I don’t understand. I hope someone can help me answer it.

For the partitions of the metis and random algorithms, I saw such outputs when I ran them. According to my understanding, it should indicate that the vertices and samples are completely in this partition. Then I don’t quite understand why the random algorithm that seems to have a better hit rate takes longer to calculate.

#metis
part 0, train: 38358 (local: 34303), val: 5958 (local: 5958), test: 13926 (local: 13926)
part 1, train: 38358 (local: 36497), val: 5958 (local: 5314), test: 13926 (local: 12176)
…
#random
part 0, train: 38358 (local: 38358), val: 5958 (local: 5844), test: 13926 (local: 13920)
part 2, train: 38358 (local: 38152), val: 5958 (local: 5958), test: 13926 (local: 13750)
…

After running the metis algorithm partition and graphsage multiple times, I found that the hit rate of each vertex seems to be different every time. Of course, the time efficiency is better when the hit rate is high. But I’m curious what causes the difference in the partition hit ratio. In my understanding, the metis algorithm and the training set and verification set are all determined, so shouldn’t the result of the partition be the same every time?

3.Sampling node, I always think that the process of sampling and feature extraction occurs in dataloader = DistDataLoader(args), but according to the running situation, it seems that feature extraction occurs in the process of enumerate(dataloader)? I probably don’t know much about the internal implementation of this.

Rhett-Ying · November 24, 2022, 2:48am

how do get the conclusion that the random algorithm that seems to have a better hit rate takes longer to calculate? according to the log you pasted? Are you using dgl.distributed.partition_graph()?
METIS partition is not exactly determined. Randomness exist as I know.
dataloder = DistDataLoader() is just instantiating dataloader, sampling is processed when we iterating the dataloader. see more details in dgl/dist_dataloader.py at 0cb5f0fdc00e1e6a040084c96ec9f16c532bed65 · dmlc/dgl · GitHub

BearBiscuit05 · November 24, 2022, 3:28am

Thanks for the reply, 1 is just a question of mine, because I’m not sure about the specific meaning of this output. I guess local:xxx means that after the graph is partitioned, the training vertices are exactly located in the data read by the local machine.
So it seems that, under the random algorithm, there are more vertices in the right machine at the beginning. The specific details are in the example\pytorch\graphsage\dist\train_dist.py main().

Rhett-Ying · November 24, 2022, 5:26am

right machine is not correct or confusing. larger number of local does not mean better performance during distributed train. METIS partitions according to dst nodes and this will benefit sampling stage (in this stage, we sample src nodes of seeds) while RANDOM does not take this into consideration.

BearBiscuit05 · November 24, 2022, 5:55am

Thanks, I understand.

system · December 24, 2022, 5:55am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.