Hi everyone.
I’m trying to run the example from “dgl/examples/pytorch/graphsage/dist at master · dmlc/dgl · GitHub” in a Docker container, but the graph servers fail to start because they cannot bind to their address (full log below). I’m not sure what causes this. If anyone knows the cause or a fix, I would appreciate your help.
The Docker container’s port 2222 is being forwarded to the host’s port 22.
The following is a command executed in the Docker container.
python3 launch.py --workspace /dgl --num_trainers 2 --num_samplers 0 --num_servers 1 --part_config dataset/reddit_partition/reddit.json --ip_config ip_config.txt --ssh_port=2222 --ssh_username=root "source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5"
Here is the error message.
The number of OMP threads per trainer is set to 5
cleanupu process runs
[13:05:07] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[13:05:07] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[13:05:07] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on <master node address>:2222 , error: Cannot assign requested address
Args: Namespace(backend='gloo', batch_size=512, batch_size_eval=100000, dropout=0.3, epochs=50, eval_every=5, fanout=[15, 10, 5], graph_name='dataset/reddit_partition', hidden_channels=256, ip_config='ip_config.txt', local_rank=None, log_every=20, lr=0.01, net_type='socket', num_gpus=-1, num_layers=3, part_config='dataset/reddit_partition/reddit.json', standalone=False)
4e36cbad068d Initializing DistDGL
load reddit
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
Server is waiting for connections on [<master node address>:2222]...
Traceback (most recent call last):
File "reddit_sage_dist.py", line 420, in <module>
main(args)
File "reddit_sage_dist.py", line 329, in main
dgl.distributed.initialize(ip_config=args.ip_config, net_type=args.net_type)
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
serv.start()
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_graph.py", line 477, in start
net_type=self.net_type,
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc_server.py", line 102, in start_server
ip_addr, port, num_clients, blocking=net_type == "socket"
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 195, in wait_for_senders
_CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [13:05:07] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to <master node address>:2222
Stack trace:
[bt] (0) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7fbc1d4dba85]
[bt] (1) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7fbc1d9f145c]
[bt] (2) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(+0x8a6708) [0x7fbc1d9fb708]
[bt] (3) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fbc1d869a78]
[bt] (4) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x16ae7) [0x7fbc4b1fdae7]
[bt] (5) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x17099) [0x7fbc4b1fe099]
[bt] (6) python3(_PyObject_FastCallKeywords+0x15c) [0x56285485516c]
[bt] (7) python3(_PyEval_EvalFrameDefault+0x4715) [0x56285489d6b5]
[bt] (8) python3(_PyEval_EvalCodeWithName+0x255) [0x5628547eee85]
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<master node address> 'cd /dgl; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=0; source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' returned non-zero exit status 1.
[13:05:13] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[13:05:13] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[13:05:13] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on <slave node address>:2222 , error: Cannot assign requested address
Args: Namespace(backend='gloo', batch_size=512, batch_size_eval=100000, dropout=0.3, epochs=50, eval_every=5, fanout=[15, 10, 5], graph_name='dataset/reddit_partition', hidden_channels=256, ip_config='ip_config.txt', local_rank=None, log_every=20, lr=0.01, net_type='socket', num_gpus=-1, num_layers=3, part_config='dataset/reddit_partition/reddit.json', standalone=False)
32eabcebef9c Initializing DistDGL
load reddit
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 1 for part 1
Server is waiting for connections on [<slave node address>:2222]...
Traceback (most recent call last):
File "reddit_sage_dist.py", line 420, in <module>
main(args)
File "reddit_sage_dist.py", line 329, in main
dgl.distributed.initialize(ip_config=args.ip_config, net_type=args.net_type)
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
serv.start()
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/dist_graph.py", line 477, in start
net_type=self.net_type,
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc_server.py", line 102, in start_server
ip_addr, port, num_clients, blocking=net_type == "socket"
File "/opt/conda/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 195, in wait_for_senders
_CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [13:05:13] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to <slave node address>:2222
Stack trace:
[bt] (0) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7fbb9d4dba85]
[bt] (1) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7fbb9d9f145c]
[bt] (2) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(+0x8a6708) [0x7fbb9d9fb708]
[bt] (3) /opt/conda/lib/python3.7/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fbb9d869a78]
[bt] (4) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x16ae7) [0x7fbbc83f2ae7]
[bt] (5) /opt/conda/lib/python3.7/site-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x17099) [0x7fbbc83f3099]
[bt] (6) python3(_PyObject_FastCallKeywords+0x15c) [0x5561eb2fd16c]
[bt] (7) python3(_PyEval_EvalFrameDefault+0x4715) [0x5561eb3456b5]
[bt] (8) python3(_PyEval_EvalCodeWithName+0x255) [0x5561eb296e85]
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<slave node address> 'cd /dgl; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0 DGL_SERVER_ID=1; source .venv/bin/activate && python3 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' returned non-zero exit status 1.
### This is the message that appears when I send a Ctrl+C signal
^C2023-06-07 13:05:23,086 INFO Stop launcher
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<slave node address> 'cd /dgl; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=5 DGL_GROUP_ID=0 ; source .venv/bin/activate && python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=<master node address> --master_port=1234 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' died with <Signals.SIGINT: 2>.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 2222 root@<master node address> 'cd /dgl; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=dataset/reddit_partition/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=5 DGL_GROUP_ID=0 ; source .venv/bin/activate && python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=<master node address> --master_port=1234 reddit_sage_dist.py --graph_name dataset/reddit_partition --ip_config ip_config.txt --part_config dataset/reddit_partition/reddit.json --num_gpus -1 --backend gloo --epochs 50 --fanout 15 10 5)'' died with <Signals.SIGINT: 2>.
kill process 90084 on <master node address>:2222
Terminated
root@4e36cbad068d:/dgl# kill process 90103 on 163.239.23.145:2222
kill process 90104 on <master node address>:2222
kill process 90105 on <master node address>:2222
kill process 25159 on <slave node address>:2222
kill process 25160 on <slave node address>:2222
kill process 25161 on <slave node address>:2222
cleanup process exits
Here is the current state of the network ports.
### Master node
$ netstat -lntp
(No info could be read for "-p": geteuid()=1001 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:40993 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:60139 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:30050 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:42243 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:1234 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:5939 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:36227 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:2222 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:2049 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:37375 0.0.0.0:* LISTEN -
tcp6 0 0 :::44611 :::* LISTEN -
tcp6 0 0 :::44505 :::* LISTEN -
tcp6 0 0 :::30050 :::* LISTEN -
tcp6 0 0 :::1234 :::* LISTEN -
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 ::1:631 :::* LISTEN -
tcp6 0 0 :::2222 :::* LISTEN -
tcp6 0 0 :::2049 :::* LISTEN -
tcp6 0 0 :::56883 :::* LISTEN -
tcp6 0 0 :::39719 :::* LISTEN -
### Slave node
$ netstat -lntp
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:30050 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:631 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:1234 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:2222 0.0.0.0:* LISTEN -
tcp6 0 0 :::30050 :::* LISTEN -
tcp6 0 0 ::1:631 :::* LISTEN -
tcp6 0 0 :::1234 :::* LISTEN -
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 :::2222 :::* LISTEN -
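
In case it helps with the diagnosis: as far as I understand, "Cannot assign requested address" is the message for errno EADDRNOTAVAIL, which bind() returns when the requested IP is not assigned to any interface of the machine (here, the container) doing the bind. Below is a minimal sketch of how that could be checked in isolation, independent of DGL. The helper name `try_bind` is my own, and `192.0.2.1` is only a placeholder from the documentation address range; the real check would use the addresses listed in ip_config.txt, run from inside each container.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind a TCP socket to (host, port).

    Returns (True, None) on success, or (False, error_name) on failure,
    where error_name is the symbolic errno (e.g. 'EADDRNOTAVAIL').
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # Map the numeric errno to its symbolic name when one exists.
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
    return True, None


if __name__ == "__main__":
    # Port 0 lets the OS pick any free port, so a failure here points at
    # the address itself rather than at a port that is already in use.
    # 192.0.2.1 is a placeholder; replace it with an entry from ip_config.txt.
    for host in ("127.0.0.1", "0.0.0.0", "192.0.2.1"):
        print(host, try_bind(host, 0))
```

If the address from ip_config.txt fails this check inside the container while `0.0.0.0` succeeds, that would suggest the container does not own that IP (for example, because it is the host's address and the container sits on a bridge network). That is a guess on my part, not something I have confirmed.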