I am having some difficulty replacing torch.distributed.launch with torchrun in launch.py, but I managed to set all the required environment variables based on launch.py. What I have achieved so far: I can launch 2 servers and 2 clients with my custom launch scripts, but errors show up in the logs. Please allow me to paste my launch scripts and logs below. (I modified them slightly to hide personal information.)
P.S. Both machines have the same hostname “user-Super-Server”.
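For reference, these are roughly the per-worker variables I expect the static rendezvous to provide (the standard torchrun / torch.distributed.launch names); the values below are only an illustration for the rank-0 server process on the first machine, not a verbatim copy of my setup.

# Illustration only: what the static rendezvous should end up providing to the
# rank-0 server process on xxx.xxx.10.17 in a 2-node, 1-proc-per-node job.
export MASTER_ADDR=xxx.xxx.10.17
export MASTER_PORT=29500
export WORLD_SIZE=2
export RANK=0
export LOCAL_RANK=0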
Launch Scripts
I ran ./launch_server.sh 0 and ./launch_client.sh 0 on the xxx.xxx.10.17 machine, and ./launch_server.sh 1 and ./launch_client.sh 1 on the xxx.xxx.9.50 machine.
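In other words, the exact invocations were:

# On xxx.xxx.10.17
./launch_server.sh 0
./launch_client.sh 0

# On xxx.xxx.9.50
./launch_server.sh 1
./launch_client.sh 1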
launch_server.sh
#!/usr/bin/env bash
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
addr=(
"xxx.xxx.10.17"
"xxx.xxx.9.50"
)
port=29500
ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"
set -x
# NOTE: For PyTorch 2.0+
torchrun \
    --nnodes 2 \
    --nproc-per-node 1 \
    --rdzv-backend static \
    --rdzv-id 9 \
    --rdzv-endpoint ${addr[$1]}:$port \
    --max-restarts 0 \
    --role server \
    --node-rank 0 \
    $ws/train_dist.py \
    --role server \
    --ip_config $ws/script/ip_config.txt \
    --part_config $ws/dataset/partitioned/$name/$name.json \
    --num_client 1 \
    --num_server 1 \
    --num_sampler 1
launch_client.sh
#!/usr/bin/env bash
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
addr=(
"xxx.xxx.9.50"
"xxx.xxx.10.17"
)
port=29500
ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"
set -x
# NOTE: For PyTorch 2.0+
torchrun \
    --nnodes 2 \
    --nproc-per-node 1 \
    --rdzv-backend static \
    --rdzv-id 9 \
    --rdzv-endpoint ${addr[$1]}:$port \
    --max-restarts 0 \
    --role client \
    --node-rank 1 \
    $ws/train_dist.py \
    --role client \
    --ip_config $ws/script/ip_config.txt \
    --part_config $ws/dataset/partitioned/$name/$name.json \
    --num_client 1 \
    --num_server 1 \
    --num_sampler 1
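For context, both scripts also point at script/ip_config.txt, which I have not pasted here. It uses DGL's usual format of one "ip port" line per server machine; roughly like this sketch (30050 is the port the servers report in the logs below):

# script/ip_config.txt (sketch)
xxx.xxx.10.17 30050
xxx.xxx.9.50 30050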
Logs
server-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:36826.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:36826.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55144.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55152.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:52164.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:52164.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[03:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287727 closing signal SIGINT
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f5d98bbf740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195 in wait_for_senders
File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101 in start_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471 in start
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287302 got signal: 2
server-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:35692.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:35692.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47256.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47262.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:47888.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:47888.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[15:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[15:54:31] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.10.17:30050 , error: Cannot assign requested address
Traceback (most recent call last):
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179, in <module>
main()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138, in main
dgl.distributed.initialize(
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278, in initialize
serv.start()
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471, in start
start_server(
File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101, in start_server
rpc.wait_for_senders(
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195, in wait_for_senders
_CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [15:54:31] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to xxx.xxx.10.17:30050
Stack trace:
[bt] (0) /home/myid/ws/py_ws/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f1529d87235]
[bt] (1) /home/myid/ws/py_ws/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7f152a29d91c]
[bt] (2) /home/myid/ws/py_ws/dgl/libdgl.so(+0x8a7bc8) [0x7f152a2a7bc8]
[bt] (3) /home/myid/ws/py_ws/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f152a115f88]
[bt] (4) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7f15286155e3]
[bt] (5) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7f1528615c0b]
[bt] (6) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyObject_MakeTpCall+0x26b) [0x560494c86a6b]
[bt] (7) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyEval_EvalFrameDefault+0x4eb6) [0x560494c823e6]
[bt] (8) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560494c8d99c]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3749488) of binary: /home/myid/programs/mambaforge/envs/p3/bin/python3.10
./launch_server.sh: line 76: 3749226 Killed
client-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47256.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47262.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
Warning! Interface: eno2
IP address not available for interface.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno2
IP address not available for interface.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287464 closing signal SIGINT
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f2d279b3740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>corrupted double-linked list
Fatal Python error: Aborted
Thread 0x00007f2d279b3740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
..., scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.special.cython_special, scipy.stats._stats, scipy.stats.beta_ufunc, scipy.stats._boost.beta_ufunc, scipy.stats.binom_ufunc, scipy.stats._boost.binom_ufunc, scipy.stats.nbinom_ufunc, scipy.stats._boost.nbinom_ufunc, scipy.stats.hypergeom_ufunc, scipy.stats._boost.hypergeom_ufunc, scipy.stats.ncf_ufunc, scipy.stats._boost.ncf_ufunc, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast (total: 150)
LIBXSMM_VERSION: main-1.17-3659 (25693771)
LIBXSMM_TARGET: hsw [AMD EPYC 7502 32-Core Processor]
Registry and code: 13 MB
Command: /home/myid/programs/mambaforge/envs/p3/bin/python3.10 -u /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
Uptime: 44.711290 s
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287399 got signal: 2
client-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55144.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55152.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
====================
now client is connected
====================
Initializing DGL...
Warning! Interface: eno1
IP address not available for interface.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno1
IP address not available for interface.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3749920 closing signal SIGINT
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault
Current thread 0x00007f4492966740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>
Traceback (most recent call last):
File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3749852 got signal: 2
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown