Parameters for best performance in Distributed GraphSAGE

Hello.

I am experimenting with DistDGL’s GraphSAGE implementation to collect some performance numbers, and I want to be fair to the system by making sure the runtime parameters are configured correctly.

Given H physical machines which each have T cores, what is the best way to set the following parameters?

num_trainers
num_samplers
num_servers
num_workers

I’ve been running num_trainers=1, num_samplers=4, num_servers=1, and num_workers=4 in the following command, but I get the feeling this isn’t the best configuration:

python3 ../../../../tools/launch.py \
--workspace `pwd` \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 1 ..... --num_workers 4"

Any advice would be appreciated.


In addition, I want to run the system without minibatching and without sampling (the latter is to get GraphSAGE as close to GCN as possible). For the former, I have set the batch_size parameter to a very large number so that there is only one minibatch (i.e., effectively full-batch training). For the latter, the NeighborSampler seems to be an integral part of the code. Is there a simple way to disable all sampling so that aggregation works with the entire neighborhood of each vertex?
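Concretely, here is a rough sketch of what I have in mind, following the structure of train_dist.py (the fanout of -1 for “all neighbors” is a guess on my part and may not be accepted by the distributed sampler; if not, any fanout larger than the maximum in-degree with replace=False should amount to the same thing):

import dgl
import torch as th

# Same setup steps as train_dist.py; environment variables come from launch.py.
dgl.distributed.initialize('ip_config.txt')
th.distributed.init_process_group(backend='gloo')
g = dgl.distributed.DistGraph('reddit')
train_nid = dgl.distributed.node_split(g.ndata['train_mask'])

class FullNeighborSampler:
    """One block per layer; fanout -1 (or >= max in-degree) is meant to keep every neighbor."""
    def __init__(self, g, num_layers):
        self.g = g
        self.fanouts = [-1] * num_layers  # -1 is my guess for "all neighbors"

    def sample_blocks(self, seeds):
        seeds = th.LongTensor(seeds)
        blocks = []
        for fanout in self.fanouts:
            # With replace=False, any fanout >= a node's in-degree returns all of
            # its neighbors, so this is effectively no sampling.
            frontier = dgl.distributed.sample_neighbors(
                self.g, seeds, fanout, replace=False)
            block = dgl.to_block(frontier, seeds)
            seeds = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return blocks

# "No minibatching": make the single batch cover the whole training set.
dataloader = dgl.distributed.DistDataLoader(
    dataset=train_nid.numpy(),
    batch_size=len(train_nid),   # one batch = all training nodes
    collate_fn=FullNeighborSampler(g, num_layers=2).sample_blocks,
    shuffle=False,
    drop_last=False)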

Thank you.

num_workers should equal num_samplers; we usually just set them to 1.
Increasing num_servers can bring more network bandwidth; I usually set it to 4.
We use OpenMP in many places, so do not set num_trainers too large; e.g., 2~4 is enough.
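For example, with those settings the launch command would look something like this (the numbers are only an illustration; pick num_trainers based on your core count and keep the training script’s --num_workers equal to --num_samplers):

python3 ../../../../tools/launch.py \
--workspace `pwd` \
--num_trainers 4 \
--num_samplers 1 \
--num_servers 4 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 4 ..... --num_workers 1"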

Cool, thank you.

Is there a way to run the system with no minibatching and no sampling?

Sorry, for now we only support mini-batch training for large graphs.
