Hello.
I am experimenting with DistDGL’s GraphSAGE implementation to get some performance numbers, and I want to be fair to the system by ensuring the runtime parameters are configured correctly.
Given H physical machines which each have T cores, what is the best way to set the following parameters?
num_trainers
num_samplers
num_servers
num_workers
I’ve been running 1 trainer, 4 samplers, 1 server, and 4 workers via the following command, but I suspect this isn’t the best way to run it:
python3 ../../../../tools/launch.py \
--workspace `pwd` \
--num_trainers 1 \
--num_samplers 4 \
--num_servers 1 \
--ip_config ip_config.txt \
--part_config data/reddit.json \
"python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 1 ..... --num_workers 4"
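For context, here is how I currently interpret these knobs on H machines with T cores each — this understanding may well be wrong, which is partly why I’m asking. The configuration sketch below is a guess, not something I’ve verified:

```shell
# My working assumptions (please correct me if any of these are wrong):
#   num_trainers - trainer processes launched per machine
#   num_samplers - sampler processes spawned per trainer
#   num_servers  - server processes per machine
#   num_workers  - flag to train_dist.py; I've been keeping it equal to num_samplers
#
# Guess: with 1 trainer and 1 server per machine, give the remaining
# cores to samplers, i.e. num_samplers ~= T - 2, and mirror that in
# --num_workers. Unverified.
python3 ../../../../tools/launch.py \
  --workspace `pwd` \
  --num_trainers 1 \
  --num_samplers $((T - 2)) \
  --num_servers 1 \
  --ip_config ip_config.txt \
  --part_config data/reddit.json \
  "python3 train_dist.py --graph_name reddit --ip_config ip_config.txt --num_servers 1 ..... --num_workers $((T - 2))"
```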
Any advice would be appreciated.
In addition, I want to run the system without minibatching and without sampling (the latter is to bring GraphSAGE as close to GCN as possible). For the former, I have set the batch_size parameter to a number larger than the training set, so there is only one minibatch per epoch (i.e. effectively no batching). For the latter, the NeighborSampler seems to be an integral part of the code. Is there a simple way to disable all sampling so that aggregation uses the entire neighborhood of each vertex?
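On the batching point, here is the quick sanity check I used for the “one minibatch” reasoning. It’s plain Python with a hypothetical num_minibatches helper (not part of DistDGL), and the Reddit train-set size is from memory, so it may be slightly off:

```python
import math

def num_minibatches(num_train_nodes, batch_size):
    """Number of batches a dataloader yields per epoch (no drop_last). Hypothetical helper."""
    return math.ceil(num_train_nodes / batch_size)

# Reddit has roughly 153k training nodes, so any batch_size at or above
# that collapses training to a single "minibatch" per epoch.
print(num_minibatches(153431, 10**9))  # -> 1
print(num_minibatches(153431, 1024))   # -> 150
```

On the sampling point: I’ve seen mention that DGL’s neighbor samplers treat a fanout of -1 as “take all neighbors” (and that a MultiLayerFullNeighborSampler exists), but I’m not sure whether either works with the distributed sampler used in train_dist.py — hence the question.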
Thank you.