Hi,
I followed the steps in the README (including the passwordless SSH login setup) and am trying to run distributed training on two virtual machines, each with one NVIDIA Tesla P100 GPU. However, a "Connection timed out" error keeps coming up. Here is the command I ran and its output:
python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_trainers 2 --num_samplers 0 --num_servers 1 --part_config data/ogb-product.json --ip_config ip_config.txt --keep_alive --server_name long_live "python3 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1"
Servers will keep alive even clients exit...
The number of OMP threads per trainer is set to 4
Monitor file for alive servers already exist: /tmp/dgl_dist_monitor_long_live.
Use running server long_live.
cleanupu process runs
ssh: connect to host 172.31.2.66 port 22: Connection timed out
ssh: connect to host 172.31.1.191 port 22: Connection timed out
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/home/luyc/workspace/dgl/tools/launch.py", line 109, in run
File "/home/luyc/workspace/dgl/tools/launch.py", line 109, in run
subprocess.check_call(ssh_cmd, shell=True)
subprocess.check_call(ssh_cmd, shell=True)
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 172.31.2.66 'cd /home/luyc/workspace/dgl/examples/pytorch/graphsage/dist/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=data/ogb-product.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=1 ; python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=172.31.2.66 --master_port=1234 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1)'' returned non-zero exit status 255.
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 172.31.1.191 'cd /home/luyc/workspace/dgl/examples/pytorch/graphsage/dist/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=data/ogb-product.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=1 ; python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=172.31.2.66 --master_port=1234 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1)'' returned non-zero exit status 255.
OS: ubuntu 20.04
Python: v3.8.10
Running ping <nfs-server-ip> on the client machine shows that it can reach the server machine, yet the SSH connections above still time out. I'm wondering how I can address the error. Any help is appreciated. Thank you!
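Since ping only tests ICMP while the failing step is an SSH connection on TCP port 22, a check along the following lines might help narrow things down. This is only a minimal sketch, not part of DGL: the helper names parse_ip_config, can_connect and check_hosts are hypothetical, and it assumes each non-empty line of ip_config.txt starts with an IP address.

```python
import socket


def parse_ip_config(text):
    """Return the first whitespace-separated field of each non-empty line.

    Assumption: every non-empty line of ip_config.txt begins with an IP address.
    """
    return [line.split()[0] for line in text.splitlines() if line.strip()]


def can_connect(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def check_hosts(hosts, port=22, timeout=5.0):
    """Map each host to whether its TCP port (22, the SSH default) is reachable."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Run from the machine that executes launch.py, something like check_hosts(parse_ip_config(open("ip_config.txt").read())) would report, per host, whether port 22 accepts TCP connections. A False result for a host that answers ping would suggest that a firewall or security-group rule is blocking port 22, or that the SSH server is not listening on that address, rather than a general network problem.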