Which DGL version are you using? And what shared memory have you configured on your machines? Please check with df -h /dev/shm.
I’m using version 1.1.1+cu117.
df -h /dev/shm shows:
Filesystem Size Used Avail Use% Mounted on
tmpfs 158G 28M 158G 1% /dev/shm
Do I need to configure shared memory? Like I said, I’m using sshfs to map the workspace from the xxx.xxx.10.17 machine to the xxx.xxx.9.50 machine.
Which machine is the df -h /dev/shm output from? Or do these 2 machines share this file system as well?
Sorry about the confusion.
xxx.xxx.10.17:
Filesystem Size Used Avail Use% Mounted on
tmpfs 158G 28M 158G 1% /dev/shm
xxx.xxx.9.50:
Filesystem Size Used Avail Use% Mounted on
tmpfs 32G 285M 32G 1% /dev/shm
After I launch the training script (the node_classification.py example), xxx.xxx.10.17’s used memory increased to 101M and xxx.xxx.9.50’s used memory increased to 357M.
The shm looks good to me. The call stack you shared indicates the segfault happened when creating DGL graph formats. Could you try prepending DGL_GRAPH_FORMAT=coo to the command, i.e. before node_classification.py xxx?
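For example, a minimal sketch of what the prepended variable could look like (the script path and <args> are placeholders, not the exact command used here):

# Hypothetical invocation: force DGL to build only the COO graph format.
DGL_GRAPH_FORMAT=coo python3 node_classification.py <args>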
Can’t tell since I have no access to the computation resources currently.
There is something that I want to double-check: when running distributed training on 2 machines, is there 1 server and 1 client on each machine, or 1 server and 1 client on one machine and only 1 client on the other machine?
--num_trainers 1 --num_samplers 0 --num_servers 1 indicates 1 server and 1 client on each machine. Each client will connect to all servers on all machines.
Wow… I thought in general there is only 1 global server and there are 1 or more clients.
Does this mean that, in my case, DGL is using xxx.xxx.10.17:<PORT_1> to start a server and xxx.xxx.9.50:<PORT_2> to start a client, and doing similar things on the xxx.xxx.9.50 machine?
No. Simply put, DGL has server processes (primary and backup), trainer processes, and sampler processes, all running on every machine. When you run the launch script, DGL first launches the server processes specified by --num_servers to load the graph partitions. There is one primary server; the rest are backup servers (optional). Once the servers are launched successfully, DGL launches the trainer and sampler processes specified by --num_trainers and --num_samplers. Generally, the number of trainer processes is the same as the number of GPUs in the machine (for training on GPUs). You can choose not to use sampler processes at all, in which case sampling is done in the trainer processes.
In summary, every machine in the distributed training runs one primary server process that loads its graph partition, optionally some backup server processes, one or more trainer processes, and zero or more sampler processes.
DGL is using xxx.xxx.10.17:<PORT_1> (default is 30050) to start a server. Clients use <PORT_1> to connect to the servers running on the same machine and on every other machine. The same happens on every other machine, so there will be a server running on xxx.xxx.9.50:<PORT_1> as well.
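For reference, a rough sketch of the corresponding launch.py invocation (paths and the wrapped training command are placeholders based on this thread, not an exact recipe):

python3 tools/launch.py \
    --workspace /path/to/workspace \
    --num_trainers 1 \
    --num_samplers 0 \
    --num_servers 1 \
    --part_config dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json \
    --ip_config ip_config.txt \
    "python3 node_classification.py <args>"

launch.py then starts the server process(es) on every machine listed in ip_config.txt first, and the trainer/sampler processes afterwards.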
I’m not sure what’s going on, but right now I’m able to make some progress with torchrun (from PyTorch 2.0.1), and the server’s launch script didn’t hang until it reached:
Server is waiting for connections on [xxx.xxx.10.17:30050]
Please allow me to show the operations.
Server’s Script
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 0 \
--max-restarts 0 \
--role server \
--node-rank 0 \
--master-addr xxx.xxx.10.17 \
--master-port 29500 \
--local-addr xxx.xxx.10.17 \
$ws/train_dist.py \
--role server \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
Client’s Script
addr=(
"xxx.xxx.10.17"
"xxx.xxx.9.50"
)
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 0 \
--max-restarts 0 \
--role client \
--node-rank 1 \
--master-addr xxx.xxx.10.17 \
--master-port $port \
--local-addr ${addr[$1]} \
$ws/train_dist.py \
--role client \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
Launching
Machine xxx.xxx.10.17
./launch_server.sh
./launch_client.sh 0
Machine xxx.xxx.9.50
./launch_client.sh 1
I might have done something wrong, but I don’t know where. Thanks in advance.
Edit:
I think I should launch a server on xxx.xxx.9.50 as well. I’ll get back later.
No luck though.
Is the distributed topology of DGL the same as that in PyTorch? It seems that the torchrun example (website and code) runs 1 server and 1 client on 2 different machines.
Once again, your help is much much appreciated. Thanks!
Usually it’s better to launch via launch.py and just replace torch.distributed.launch with torchrun in it. As you can see in launch.py, launching by yourself is very error-prone, as several envs are required to be set during launch. Please configure export TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO as well.
One major issue in your launch is that you need to launch a server on both machines, then launch clients on both machines. @pubu has introduced the details of servers and clients.
In order to figure out the root cause of the hang in your case, we could also try launching manually. You can obtain the launch commands for the server and client by checking the process details with ps when you launch via launch.py. You will find a command like the one below; DGL_ROLE indicates the process type: server or client.
export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PAT ...
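For instance, one way to inspect the spawned processes is with standard Linux tooling (the PID is whatever ps reports for train_dist.py):

ps -ef | grep train_dist.py                        # full command line of each server/client process
tr '\0' '\n' < /proc/<PID>/environ | grep DGL_     # the DGL_* env vars launch.py set for that process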
I had some difficulties replacing torch.distributed.launch with torchrun in launch.py, but I managed to set all the required envs based on launch.py. What I’ve achieved is that I can launch 2 servers and 2 clients with my custom launch script, but there are errors shown in the logs. Please allow me to paste my launch scripts and logs below. (I modified them a little bit to hide personal information.)
P.S. Both machines have the same hostname “user-Super-Server”.
Launch Scripts
Ran ./launch_server.sh 0 and ./launch_client.sh 0 on the xxx.xxx.10.17 machine, and ./launch_server.sh 1 and ./launch_client.sh 1 on the xxx.xxx.9.50 machine.
launch_server.sh
#!/usr/bin/env bash
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
addr=(
"xxx.xxx.10.17"
"xxx.xxx.9.50"
)
port=29500
ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"
set -x
# NOTE: For PyTorch 2.0+
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 9 \
--rdzv-endpoint ${addr[$1]}:$port \
--max-restarts 0 \
--role server \
--node-rank 0 \
$ws/train_dist.py \
--role server \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
launch_client.sh
#!/usr/bin/env bash
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
addr=(
"xxx.xxx.9.50"
"xxx.xxx.10.17"
)
port=29500
ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"
set -x
# NOTE: For PyTorch 2.0+
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 9 \
--rdzv-endpoint ${addr[$1]}:$port \
--max-restarts 0 \
--role client \
--node-rank 1 \
$ws/train_dist.py \
--role client \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
Logs
server-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:36826.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:36826.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55144.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55152.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:52164.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:52164.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[03:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287727 closing signal SIGINT
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f5d98bbf740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195 in wait_for_senders
File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101 in start_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471 in start
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287302 got signal: 2
server-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:35692.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:35692.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47256.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47262.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:47888.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:47888.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[15:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[15:54:31] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.10.17:30050 , error: Cannot assign requested address
Traceback (most recent call last):
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179, in <module>
main()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138, in main
dgl.distributed.initialize(
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278, in initialize
serv.start()
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471, in start
start_server(
File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101, in start_server
rpc.wait_for_senders(
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195, in wait_for_senders
_CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [15:54:31] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to xxx.xxx.10.17:30050
Stack trace:
[bt] (0) /home/myid/ws/py_ws/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f1529d87235]
[bt] (1) /home/myid/ws/py_ws/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7f152a29d91c]
[bt] (2) /home/myid/ws/py_ws/dgl/libdgl.so(+0x8a7bc8) [0x7f152a2a7bc8]
[bt] (3) /home/myid/ws/py_ws/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f152a115f88]
[bt] (4) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7f15286155e3]
[bt] (5) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7f1528615c0b]
[bt] (6) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyObject_MakeTpCall+0x26b) [0x560494c86a6b]
[bt] (7) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyEval_EvalFrameDefault+0x4eb6) [0x560494c823e6]
[bt] (8) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560494c8d99c]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3749488) of binary: /home/myid/programs/mambaforge/envs/p3/bin/python3.10
./launch_server.sh: line 76: 3749226 Killed
client-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47256.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47262.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
Warning! Interface: eno2
IP address not available for interface.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno2
IP address not available for interface.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287464 closing signal SIGINT
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f2d279b3740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>corrupted double-linked list
Fatal Python error: Aborted
Thread 0x00007f2d279b3740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
LIBXSMM_VERSION: main-1.17-3659 (25693771)
LIBXSMM_TARGET: hsw [AMD EPYC 7502 32-Core Processor]
Registry and code: 13 MB
Command: /home/myid/programs/mambaforge/envs/p3/bin/python3.10 -u /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
Uptime: 44.711290 s
(interleaved extension-module dump:) scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast (total: 150)
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287399 got signal: 2
client-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55144.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55152.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
Warning! Interface: eno1
IP address not available for interface.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno1
IP address not available for interface.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3749920 closing signal SIGINT
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f4492966740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3749852 got signal: 2
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
As I said, you need to specify several env variables when launching the client/server, like below:
export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=xxx.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo DGL_SERVER_ID=0 train_dist.py xxx
You are correct. I have configured the environment as you suggested, but I had mistakenly assigned incorrect values to DGL_SERVER_ID. Although I have made some progress, I encountered an error on the server with IP address xxx.xxx.10.17. The logs provided below are much shorter than the previous ones and correspond to the outputs of the same launch scripts I mentioned in my last reply. The error is related to the socket connection, but I must have some misunderstanding about the IP configuration for DGL, so I haven’t succeeded yet.
ip_config.txt
xxx.xxx.10.17 30050
xxx.xxx.9.50 30050
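For reference, a sketch of the per-machine server environment consistent with this ip_config.txt, assuming the variable names from the export line quoted above and that DGL_SERVER_ID follows the order of ip_config.txt (values in angle brackets are placeholders; the exact values are best copied from the ps output of a launch.py run):

# On xxx.xxx.10.17 (first line of ip_config.txt):
export DGL_ROLE=server DGL_SERVER_ID=0 DGL_NUM_SERVER=1 DGL_NUM_CLIENT=<total clients> \
       DGL_NUM_SAMPLER=<samplers> DGL_CONF_PATH=<part_config .json> DGL_IP_CONFIG=ip_config.txt \
       DGL_GRAPH_FORMAT=csc,coo OMP_NUM_THREADS=1
# On xxx.xxx.9.50 (second line of ip_config.txt): identical, except
export DGL_SERVER_ID=1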
server-1017.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:47596.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:47596.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:56172.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:56182.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:52164.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:52164.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
==============================
Initializing DGL...
==============================
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[09:30:17] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[09:30:17] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[09:30:17] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.10.17:30050 , error: Address already in use
Traceback (most recent call last):
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 182, in <module>
main(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 136, in main
dgl.distributed.initialize(
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278, in initialize
serv.start()
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471, in start
start_server(
File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101, in start_server
rpc.wait_for_senders(
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195, in wait_for_senders
_CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [09:30:17] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to xxx.xxx.10.17:30050
Stack trace:
[bt] (0) /home/myid/ws/py_ws/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f33a7187235]
[bt] (1) /home/myid/ws/py_ws/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7f33a769d91c]
[bt] (2) /home/myid/ws/py_ws/dgl/libdgl.so(+0x8a7bc8) [0x7f33a76a7bc8]
[bt] (3) /home/myid/ws/py_ws/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f33a7515f88]
[bt] (4) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7f33a5a155e3]
[bt] (5) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7f33a5a15c0b]
[bt] (6) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyObject_MakeTpCall+0x26b) [0x55e60d23da6b]
[bt] (7) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyEval_EvalFrameDefault+0x4eb6) [0x55e60d2393e6]
[bt] (8) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x55e60d24499c]
server-950.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:37454.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:37454.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:45798.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:45938.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:45806.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:45938.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
==============================
Initializing DGL...
==============================
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 1 for part 1
[21:30:09] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[21:30:09] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.9.50:30050]...
client-1017.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:45798.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:45806.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
==============================
Initializing DGL...
==============================
[09:30:07] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[09:30:07] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
client-950.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:56172.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:56182.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
==============================
Initializing DGL...
==============================
Warning! Interface: eno1
IP address not available for interface.
[21:30:17] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[21:30:17] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Please make sure no other process occupies this port.
Please make sure the commands for launching the server/client are consistent with the ones generated by launch.py.
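For example, a quick way to check whether something is already listening on the DGL RPC port (standard Linux tooling; 30050 is the port from ip_config.txt):

ss -ltnp | grep 30050      # or: lsof -i :30050
# If a stale server process from a previous run still holds the port, stop it before relaunching.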
The training can be launched via DGL’s launch script now, but it hangs with my custom script. I think I’ll stick with DGL’s script for now.
Guys, I genuinely appreciate your dedicated efforts in helping me with this issue.
Are you aware of what kind of changes you made that make the DGL launch work now?
I’m not sure, actually… I’ve been tweaking ip_config.txt and my script while referring to DGL’s launch.py and launch.sh. Then DGL’s script just suddenly worked.
My script, based on torchrun, manages to make connections between 2 servers and 2 clients, but nothing happens afterwards. The torch.distributed.init_process_group call seems to never return.
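A possible sanity check for this kind of hang (hypothetical commands, assuming the master address and port from the scripts above): verify that each machine can reach the rendezvous endpoint, since init_process_group blocks until all ranks have joined.

nc -zv xxx.xxx.10.17 29500   # run from both machines; should report the port as open
# Also double-check that both ranks use the same --master-addr/--master-port and --nnodes,
# and distinct --node-rank values.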