A connection problem with DistDGL in a Docker container

Hi everyone.
I’m trying to run the example from “dgl/examples/pytorch/graphsage/dist at master · dmlc/dgl · GitHub” in a Docker container, but the DistDGL servers fail to start with a bind error (full log below). I’m not sure why this is happening. If anyone knows the cause or a solution, I would appreciate your help.

The Docker container’s port 2222 is being forwarded to the host’s port 22.

The following command was executed in the Docker container.

python3 launch.py --workspace /dgl --num_trainers 2 --num_samplers 0 --num_servers 1 --part_config dataset/reddit_partition/reddit.json --ip_config ip_config.txt --ssh_port=2222 --ssh_username=root "source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5"

Here is the error message.

The number of OMP threads per trainer is set to 5

cleanupu process runs

[13:05:07] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[13:05:07] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[13:05:07] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on <master node address>:2222 , error: Cannot assign requested address
Args: Namespace(backend='gloo', batch_size=512, batch_size_eval=100000, dropout=0.3, epochs=50, eval_every=5, fanout=[15, 10, 5], graph_name='dataset/reddit_partition', hidden_channels=256, ip_config='ip_config.txt', local_rank=None, log_every=20, lr=0.01, net_type='socket', num_gpus=-1, num_layers=3, part_config='dataset/reddit_partition/reddit.json', standalone=False)

4e36cbad068d Initializing DistDGL
load reddit
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
Server is waiting for connections on [<master node address>:2222]...
Traceback (most recent call last):
  File "reddit_sage_dist.py", line 420, in <module>
    main(args)
  File "reddit_sage_dist.py", line 329, in main
    dgl.distributed.initialize(ip_config=args.ip_config, net_type=args.net_type)
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_graph.py", line 477, in start
    net_type=self.net_type,
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc_server.py", line 102, in start_server
    ip_addr, port, num_clients, blocking=net_type == "socket"
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 195, in wait_for_senders
    _CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [13:05:07] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to <master node address>:2222
Stack trace:
  [bt] (0) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7fbc1d4dba85]
  [bt] (1) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7fbc1d9f145c]
  [bt] (2) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(+0x8a6708) [0x7fbc1d9fb708]
  [bt] (3) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fbc1d869a78]
  [bt] (4) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x16ae7) [0x7fbc4b1fdae7]
  [bt] (5) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x17099) [0x7fbc4b1fe099]
  [bt] (6) python3(_PyObject_FastCallKeywords+0x15c) [0x56285485516c]
  [bt] (7) python3(_PyEval_EvalFrameDefault+0x4715) [0x56285489d6b5]
  [bt] (8) python3(_PyEval_EvalCodeWithName+0x255) [0x5628547eee85]


Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<master node address> 'cd /dgl; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=0; source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' returned non-zero exit status 1.

[13:05:13] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[13:05:13] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[13:05:13] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on <slave node address>:2222 , error: Cannot assign requested address
Args: Namespace(backend='gloo', batch_size=512, batch_size_eval=100000, dropout=0.3, epochs=50, eval_every=5, fanout=[15, 10, 5], graph_name='dataset/reddit_partition', hidden_channels=256, ip_config='ip_config.txt', local_rank=None, log_every=20, lr=0.01, net_type='socket', num_gpus=-1, num_layers=3, part_config='dataset/reddit_partition/reddit.json', standalone=False)

32eabcebef9c Initializing DistDGL
load reddit
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 1 for part 1
Server is waiting for connections on [<slave node address>:2222]...
Traceback (most recent call last):
  File "reddit_sage_dist.py", line 420, in <module>
    main(args)
  File "reddit_sage_dist.py", line 329, in main
    dgl.distributed.initialize(ip_config=args.ip_config, net_type=args.net_type)
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_graph.py", line 477, in start
    net_type=self.net_type,
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc_server.py", line 102, in start_server
    ip_addr, port, num_clients, blocking=net_type == "socket"
  File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 195, in wait_for_senders
    _CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [13:05:13] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to <slave node address>:2222
Stack trace:
  [bt] (0) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7fbb9d4dba85]
  [bt] (1) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7fbb9d9f145c]
  [bt] (2) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(+0x8a6708) [0x7fbb9d9fb708]
  [bt] (3) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fbb9d869a78]
  [bt] (4) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x16ae7) [0x7fbbc83f2ae7]
  [bt] (5) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x17099) [0x7fbbc83f3099]
  [bt] (6) python3(_PyObject_FastCallKeywords+0x15c) [0x5561eb2fd16c]
  [bt] (7) python3(_PyEval_EvalFrameDefault+0x4715) [0x5561eb3456b5]
  [bt] (8) python3(_PyEval_EvalCodeWithName+0x255) [0x5561eb296e85]


Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<slave node address> 'cd /dgl; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=1; source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' returned non-zero exit status 1.

### This is the message that appears when I send a Ctrl+C signal
^C2023-06-07 13:05:23,086 INFO Stop launcher
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<slave node address> 'cd /dgl; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=5 DGL_GROUP_ID=0 ; source .venv/bin/activate && python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<master node address> --master_port=1234 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<master node address> 'cd /dgl; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=5 DGL_GROUP_ID=0 ; source .venv/bin/activate && python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<master node address> --master_port=1234 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' died with <Signals.SIGINT: 2>.
kill process 90084 on <master node address>:2222
Terminated
root@4e36cbad068d:/dgl# kill process 90103 on 163.239.23.145:2222
kill process 90104 on <master node address>:2222
kill process 90105 on <master node address>:2222
kill process 25159 on <slave node address>:2222
kill process 25160 on <slave node address>:2222
kill process 25161 on <slave node address>:2222
cleanup process exits
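
A note on the message itself: "Cannot assign requested address" is the standard Linux text for errno EADDRNOTAVAIL, which bind() returns when the requested IP is not configured on any interface visible to the process. Inside a container on a bridge network, the host's IP is typically not such an address. The sketch below is one way to check this from inside the container; the helper name `can_bind` is made up for illustration and is not part of DGL, and the IP to test is whatever ip_config.txt lists for that node.

```python
import errno
import socket
from typing import Tuple


def can_bind(ip: str, port: int = 0) -> Tuple[bool, str]:
    """Try to bind a TCP socket to ip:port and report what happened.

    port=0 lets the OS pick a free port, so only the address is tested.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # Same condition as "Cannot assign requested address".
                return False, "address is not assigned to a local interface"
            if exc.errno == errno.EADDRINUSE:
                return False, "port is already in use"
            return False, "bind failed: {}".format(exc)
        return True, "ok"
```

If `can_bind("<master node address>")` returns False with the first reason when run inside the master container, the address in ip_config.txt is probably not one the container owns. Possible directions, both untested here, are listing the container's own IP in ip_config.txt or starting the container with host networking (`docker run --network host`). The log also shows the server binding to port 2222, the same value passed as `--ssh_port`, so the port in ip_config.txt may be worth double-checking as well.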

Here is the current state of the network ports.

### Master node 
$ netstat -lntp
(No info could be read for "-p": geteuid()=1001 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:40993           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:60139           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:30050           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:42243         0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:5939          0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:36227           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:2222            0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:2049            0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:37375           0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::44611                :::*                    LISTEN      -                   
tcp6       0      0 :::44505                :::*                    LISTEN      -                   
tcp6       0      0 :::30050                :::*                    LISTEN      -                   
tcp6       0      0 :::1234                 :::*                    LISTEN      -                   
tcp6       0      0 :::111                  :::*                    LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -                   
tcp6       0      0 ::1:631                 :::*                    LISTEN      -                   
tcp6       0      0 :::2222                 :::*                    LISTEN      -                   
tcp6       0      0 :::2049                 :::*                    LISTEN      -                   
tcp6       0      0 :::56883                :::*                    LISTEN      -                   
tcp6       0      0 :::39719                :::*                    LISTEN      -   

### Slave node 
netstat -lntp
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:30050           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN      -                   
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:1234            0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -                   
tcp        0      0 0.0.0.0:2222            0.0.0.0:*               LISTEN      -                   
tcp6       0      0 :::30050                :::*                    LISTEN      -                   
tcp6       0      0 ::1:631                 :::*                    LISTEN      -                   
tcp6       0      0 :::1234                 :::*                    LISTEN      -                   
tcp6       0      0 :::111                  :::*                    LISTEN      -                   
tcp6       0      0 :::22                   :::*                    LISTEN      -                   
tcp6       0      0 :::2222                 :::*                    LISTEN      -   

Could you check whether SSH from the main Docker container to the non-main Docker containers succeeds on port 2222, using the IPs defined in ip_config.txt?
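
As a first step before trying SSH by hand, a plain TCP reachability check can be scripted. The following is a minimal sketch, not a DGL utility: the function names are hypothetical, and it assumes each non-empty line of ip_config.txt starts with an IP address, optionally followed by a port. It only confirms that something accepts connections on the port; it does not verify SSH authentication.

```python
import socket
from typing import Dict, List


def read_ips(ip_config_text: str) -> List[str]:
    """Return the first whitespace-separated field of each non-empty line."""
    return [line.split()[0]
            for line in ip_config_text.splitlines()
            if line.strip()]


def port_open(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port can be established."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable addresses.
        return False


def check_hosts(ip_config_path: str, port: int) -> Dict[str, bool]:
    """Map every IP listed in the config file to its reachability on port."""
    with open(ip_config_path) as f:
        ips = read_ips(f.read())
    return {ip: port_open(ip, port) for ip in ips}
```

Running `check_hosts("ip_config.txt", 2222)` from inside the main container should report True for every node. Any False entry suggests the port forwarding or the address for that node needs a closer look before the launcher can work.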
