Timeout when launching distributed training

I’ve followed the distributed training tutorial (dgl/README.md at master · dmlc/dgl · GitHub). The standalone mode works well, so I moved on to distributed training. The setup is two machines, and I run the launch.py command from one of them. I’ve made sure ssh works between the two nodes (in both directions).
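
For reference, the ip_config.txt for this two-machine setup looks roughly like this (the private IPs below are placeholders; one machine per line):

    172.31.0.1
    172.31.0.2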

Then I got a timeout error like this (despite some messages indicating success):

(dgl) [centos@n231-013-071 experimental]$ python ~/github/dgl/tools/launch.py --workspace ~/github/dgl/examples/pytorch/graphsage/experimental/ --num_trainers 1 --num_samplers 2 --num_servers 2 --part_config data/ogb-product.json --ip_config ip_config.txt "python train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 8
Using backend: pytorch
Using backend: pytorch
Namespace(batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, part_config=None, standalone=False)
Namespace(batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, part_config=None, standalone=False)
> Machine (1) client (4) connect to server successfuly!
> Machine (0) client (1) connect to server successfuly!
Traceback (most recent call last):
  File "train_dist.py", line 309, in <module>
    main(args)
  File "train_dist.py", line 256, in main
    th.distributed.init_process_group(backend='gloo')
  File "/home/centos/anaconda3/envs/dgl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 446, in init_process_group
    timeout=timeout)
  File "/home/centos/anaconda3/envs/dgl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 521, in _new_process_group_helper
    timeout=timeout)
RuntimeError: [/tmp/pip-req-build-mvu0v2f8/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [fe80::f816:3eff:fe09:d41d]:23433: Connection timed out

^C2021-04-19 05:45:03,779 INFO Stop launcher

Most likely I missed something in my setup. I’m looking into the code to understand it. Any suggestion on which direction I should look?

Thanks a lot!

This error is from PyTorch distributed. I’m wondering if you have opened the ports that PyTorch distributed uses to communicate between the machines.
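
For context, the call that fails in your traceback uses the default env:// rendezvous, so every trainer process has to be able to open TCP connections to the master address/port that the launcher exports. A minimal sketch of what it relies on, with placeholder values for the address, port, rank and world size (the launcher normally sets these for you):

    # Sketch of what init_process_group(backend='gloo') relies on. The env
    # variables are normally exported by the launcher; the values here are
    # placeholders for a two-machine setup.
    import os
    import torch.distributed as dist

    os.environ.setdefault("MASTER_ADDR", "172.31.0.1")  # placeholder: machine 0's IP
    os.environ.setdefault("MASTER_PORT", "1234")        # placeholder: rendezvous port
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "2")

    # With the default env:// rendezvous, this opens TCP connections between the
    # ranks (and blocks until all WORLD_SIZE ranks have joined), so the relevant
    # ports must be reachable across machines.
    dist.init_process_group(backend="gloo")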

I have little experience with PyTorch distributed training. For ports in DGL, I’m aware that 30050/30051 are used by DGL, so I assume you’re referring to ports other than those. It shouldn’t be caused by a zombie process, because that would give an “address in use” error instead.
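
As a quick sanity check on my side, a stale process holding a port fails at bind time with “address in use”, while a firewall problem shows up as a connect timeout. A minimal local check, using 30050 only as an example port:

    # Binding fails with "Address already in use" if a stale process still holds
    # the port; a firewall problem would show up as a connect timeout instead.
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", 30050))
        print("port 30050 is free on this machine")
    except OSError as e:
        print("port 30050 is already in use:", e)
    finally:
        s.close()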

So I assume “open the port” means explicitly opening some ports before starting anything. I skimmed through the PyTorch documentation (Distributed communication package - torch.distributed — PyTorch 1.8.1 documentation) but couldn’t find instructions about this.

Would you please point me to a reference I can read to learn how to do that? Thanks a lot.

For PyTorch’s distributed training, you need to specify the master port. DGL’s launch script uses port 1234 for PyTorch’s distributed training, so you need to check whether this port is accessible.
Please check out how DGL specifies the port for PyTorch’s distributed training: dgl/launch.py at master · dmlc/dgl · GitHub
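
A quick way to verify that from the other machine is a plain TCP connect to the master node on that port. A small sketch (the IP below is a placeholder for machine 0’s address); a “connection refused” means the port is reachable but nothing is listening yet, while a timeout usually points to a firewall:

    # Run on machine 1 to check whether machine 0 is reachable on the port used
    # for PyTorch distributed; 172.31.0.1 is a placeholder for machine 0's IP.
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect(("172.31.0.1", 1234))
        print("port 1234 is reachable")
    except OSError as e:
        print("port 1234 is not reachable:", e)
    finally:
        s.close()

The same check can be repeated in both directions for DGL’s own ports (30050/30051) that you mentioned above.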