Timeout when launching distributed training

I’ve followed the distributed training tutorial (dgl/README.md at master · dmlc/dgl · GitHub). Standalone mode works well, so I moved on to distributed training. The setup is two machines, with the launch.py command started from one of them. I’ve made sure ssh works between the two nodes in both directions.
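
For context, ip_config.txt just lists the machines in the cluster, one per line. A minimal sketch for a two-machine setup like mine would look like this (the private IPs below are placeholders, not my real addresses):

172.16.0.11
172.16.0.12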

Then I got a timeout error like this (despite some messages indicating success):

(dgl) [centos@n231-013-071 experimental]$ python ~/github/dgl/tools/launch.py --workspace ~/github/dgl/examples/pytorch/graphsage/experimental/ --num_trainers 1 --num_samplers 2 --num_servers 2 --part_config data/ogb-product.json --ip_config ip_config.txt "python train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 8
Using backend: pytorch
Using backend: pytorch
Namespace(batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, part_config=None, standalone=False)
Namespace(batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, part_config=None, standalone=False)
> Machine (1) client (4) connect to server successfuly!
> Machine (0) client (1) connect to server successfuly!
Traceback (most recent call last):
File "train_dist.py", line 309, in <module>
main(args)
File "train_dist.py", line 256, in main
th.distributed.init_process_group(backend='gloo')
File "/home/centos/anaconda3/envs/dgl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 446, in init_process_group
timeout=timeout)
File "/home/centos/anaconda3/envs/dgl/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 521, in _new_process_group_helper
timeout=timeout)
> RuntimeError: [/tmp/pip-req-build-mvu0v2f8/third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [fe80::f816:3eff:fe09:d41d]:23433: Connection timed out

^C2021-04-19 05:45:03,779 INFO Stop launcher

Most likely I missed something in my setup. I’m looking into the code to understand it. Any suggestions on which direction I should look into?

Thanks a lot!

This error is from PyTorch distributed. I’m wondering if you have opened the port that PyTorch distributed uses to communicate between the machines.

I have little experience with PyTorch distributed training. For ports in DGL, I am aware that 30050/30051 are used by DGL, so I assume you are referring to some ports other than those. It shouldn’t be caused by a zombie process, because that would produce an “address already in use” error instead.

So I assume “open the port” means explicitly opening some ports before starting anything. I skimmed through the PyTorch documentation (Distributed communication package - torch.distributed — PyTorch 1.8.1 documentation) but couldn’t find instructions about this.

Would you please point me to a reference to read and learn how to do that? Thanks a lot.

For PyTorch’s distributed training, you need to specify the master port. DGL’s launch script uses port 1234 for PyTorch’s distributed training. You need to check whether this port is accessible.
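
If it helps, one quick way to test whether a given port on the master machine is reachable from the other machine is a plain TCP connect from Python (a minimal sketch; the IP and port are placeholders, and something has to be listening on that port on the master for the connect to succeed):

import socket

# Placeholder values: replace with the master node's IP and the port to test.
MASTER_IP = "172.16.0.11"
PORT = 1234

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)  # fail fast instead of hanging like the training job did
try:
    s.connect((MASTER_IP, PORT))
    print("port reachable")
except OSError as e:
    print("cannot reach port:", e)
finally:
    s.close()
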
Please check out how DGL specifies the port for PyTorch distributed: dgl/launch.py at master · dmlc/dgl · GitHub
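
For background on why the master port matters: th.distributed.init_process_group(backend='gloo') with no init_method defaults to env:// initialization, which reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment, and the launch script is what sets those for each trainer process. A simplified sketch of what each trainer effectively does (placeholder values, not the exact code in launch.py):

import os
import torch as th

# env:// initialization reads these four variables; the DGL launcher is expected
# to set them, with MASTER_PORT being the 1234 mentioned above (placeholders shown).
os.environ["MASTER_ADDR"] = "172.16.0.11"   # IP of the rank-0 machine
os.environ["MASTER_PORT"] = "1234"
os.environ["WORLD_SIZE"] = "2"              # total number of trainer processes
os.environ["RANK"] = "0"                    # this trainer's rank

th.distributed.init_process_group(backend="gloo")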

Single-machine training works perfectly fine, but distributed training always runs into issues.

I have a two-machine cluster, and ssh between them works fine. The screenshot above probably explains my confusion better.

Two machines, each with num_trainers=2, each trainer with 2 samplers, and each machine with one server (num_servers=1). There should be 2 * 2 * 2 = 8 clients in total, right? But I only see 4 clients connected, and after that the whole thing gets stuck.

With my setup (two-machine cluster, num_trainers=2, num_samplers=2, num_servers=1), is the output shown in the screenshot expected at all?

Thanks a lot.

This is weird. Do you still have this issue?


The timeout problem is gone now after trying out several things. I am not 100% sure what the root cause was, but switching from Python 3.7 to Python 3.8 played a big part in it. :frowning:

Thanks!

This is weird. Our original code was tested on Python 3.6, and we later tested it on Python 3.8. I think we didn’t try 3.7.
