I am trying to run distributed training with DGL using the PyTorch backend, but training hangs and eventually times out. Below is the trace:
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.74 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=0; module load miniconda; conda activate gnn; python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.75 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=1; module load miniconda; conda activate gnn; python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.77 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=2; module load miniconda; conda activate gnn; python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.78 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=3; module load miniconda; conda activate gnn; python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
('172.20.8.74', 1234)
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.74 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda; conda activate gnn; python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=0 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
('172.20.8.75', 1234)
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.75 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda; conda activate gnn; python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=1 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
('172.20.8.77', 1234)
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.77 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda; conda activate gnn; python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=2 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
('172.20.8.78', 1234)
ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.78 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda; conda activate gnn; python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=3 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'
Warning: Permanently added '172.20.8.78' (ECDSA) to the list of known hosts.
cleanupu process runs
Lmod is automatically replacing "intel/17" with "gcc/7.3".
(the line above is printed once per launched process; 8 identical copies in total)
[05:10:31] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.74:30050].
Warning! Interface: em2
IP address not available for interface.
[05:10:31] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.78:30050].
Warning! Interface: em2
IP address not available for interface.
[05:10:32] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.77:30050].
Warning! Interface: em2
IP address not available for interface.
[05:10:33] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.75:30050].
Warning! Interface: em2
IP address not available for interface.
[05:10:33] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.75:43344].
Client [29424] waits on 172.20.8.75:43344
[05:10:34] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.77:50650].
Client [31818] waits on 172.20.8.77:50650
[05:10:35] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.78:47348].
Client [10965] waits on 172.20.8.78:47348
Client [4918] waits on 172.20.8.74:46805
[05:10:35] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.74:46805].
Machine (1) group (0) client (1) connect to server successfuly!
Machine (0) group (0) client (0) connect to server successfuly!
Machine (2) group (0) client (2) connect to server successfuly!
Machine (3) group (0) client (3) connect to server successfuly!
Traceback (most recent call last):
  File "/home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py", line 87, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/ksharma2/.conda/envs/gnn/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
Client[0] in group[0] is exiting...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete
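For context, run.py follows the standard DGL distributed training pattern; line 87 from the traceback is the DistributedDataParallel (DDP) constructor, which is where it hangs inside _verify_model_across_ranks. A rough sketch of the relevant part (the model and sizes here are placeholders, not my exact code):

import dgl
import torch

# DGL's RPC layer comes up first, then the PyTorch process group that
# DDP uses for its collectives.
dgl.distributed.initialize('RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt')
torch.distributed.init_process_group(backend='gloo')

g = dgl.distributed.DistGraph('twibot-dataset')  # partitioned graph from DGL_CONF_PATH
model = torch.nn.Linear(128, 2)                  # placeholder for the actual GNN

# Line 87 in the traceback: the DDP constructor verifies/broadcasts model
# parameters across ranks over Gloo, and that send is what times out.
model = torch.nn.parallel.DistributedDataParallel(model)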
This is the command I run from the command line:
python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/launch.py \
--workspace /home/ksharma2/ \
--num_trainers 1 \
--ssh_username=ksharma2 \
--num_samplers 0 \
--num_servers 1 \
--part_config netsec/4part_data/twibot-dataset.json \
--extra_envs GLOO_SOCKET_IFNAME=ib0 \
--ip_config RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt \
"module load miniconda;\
conda activate gnn;\
python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py"
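For reference, DGL expects ip_config.txt to list one machine per line, optionally followed by a port; when the port is omitted DGL falls back to 30050, which matches the TPReceiver lines in the trace above. Mine lists the four hosts from the trace:

172.20.8.74
172.20.8.75
172.20.8.77
172.20.8.78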
I am using DGL version 0.8.0 and torch 1.10.2+cu102. My OS is Red Hat Enterprise Linux Server release 7.9 (Maipo). The ibstat output is below:
CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.4512
        Hardware version: 0
        Node GUID: 0x7cfe9003002691c0
        System image GUID: 0x7cfe9003002691c0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 137
                LMC: 0
                SM lid: 11
                Capability mask: 0x2651e848
                Port GUID: 0x7cfe9003002691c0
                Link layer: InfiniBand
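Note that all four DGL clients connect to their servers fine, so DGL's RPC layer works; the timeout comes from Gloo, which DDP uses underneath. A minimal Gloo-only test like the sketch below (the script name, port, and env values are my assumptions) should reproduce the hang if the ib0 interface binding is the problem, independent of DGL:

# gloo_test.py: minimal Gloo smoke test, no DGL involved.
# Run once per node, e.g. for rank 0:
#   GLOO_SOCKET_IFNAME=ib0 MASTER_ADDR=172.20.8.74 MASTER_PORT=29500 \
#   WORLD_SIZE=4 RANK=0 python3 gloo_test.py
import torch
import torch.distributed as dist

# env:// rendezvous: reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE from the environment
dist.init_process_group(backend='gloo')
t = torch.ones(1)
dist.all_reduce(t)  # if Gloo cannot reach peers over ib0, this hangs just like DDP does
print('rank', dist.get_rank(), 'ok, sum =', t.item())
dist.destroy_process_group()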