Problem with Setting Up IP Config

Which DGL version are you using? And how much shared memory is configured on your machines? df -h /dev/shm

I’m using version 1.1.1+cu117.

df -h /dev/shm shows:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           158G   28M  158G   1% /dev/shm

Do I need to configure shared memory? Like I said, I’m using sshfs to map the workspace from the xxx.xxx.10.17 machine to the xxx.xxx.9.50 machine.

Which machine is the output of df -h /dev/shm from? Or do these 2 machines share this file system as well?

Sorry about the confusion.

xxx.xxx.10.17:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           158G   28M  158G   1% /dev/shm

xxx.xxx.9.50:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            32G  285M   32G   1% /dev/shm

After I launched the training script (the node_classification.py example), xxx.xxx.10.17’s used shared memory increased to 101M and xxx.xxx.9.50’s increased to 357M.

The shm looks good to me. The call stack you shared indicates the segfault happened when creating DGL graph formats. Could you try prepending DGL_GRAPH_FORMAT=coo before node_classification.py xxx?
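
That is, a minimal sketch like the line below, where the script arguments are whatever you normally pass (as I understand it, DGL_GRAPH_FORMAT=coo asks DGL to create only the COO format for the partitioned graph):

DGL_GRAPH_FORMAT=coo python3 node_classification.py <your usual arguments>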

I can’t tell right now since I currently have no access to the computation resources.

There is something that I want to double-check: when running distributed training on 2 machines, is there 1 server and 1 client on each machine, or 1 server and 1 client on one machine and only 1 client on the other?

--num_trainers 1 --num_samplers 0 --num_servers 1 indicates 1 server and 1 client on each machine. Each client will connect to all the servers on all machines.

Wow… I thought that, in general, there is only 1 global server and 1 or more clients.

Does this mean that, in my case, DGL is using xxx.xxx.10.17:<PORT_1> to start a server and xxx.xxx.9.50:<PORT_2> to start a client, and doing similar things on the xxx.xxx.9.50 machine?

No. Simply put, DGL has server processes (primary and backup), trainer processes, and sampler processes, all running on each and every machine. When you run the launch script, DGL first launches the server processes specified by --num_servers to load the graph partitions. There is one primary server; the rest are optional backup servers. Once the servers are launched successfully, DGL launches the trainer and sampler processes specified by --num_trainers and --num_samplers. Generally, the number of trainer processes is the same as the number of GPUs in the machine (for training on GPUs). You can choose not to use sampler processes at all, in which case sampling is done in the trainer processes.

In summary, every machine in the distributed training runs one primary server process that loads its graph partition, optional backup server processes, one or more trainer processes, and zero or more sampler processes.
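
As a concrete sketch, this is roughly how those counts appear in a tools/launch.py invocation (the workspace, partition paths, and the quoted training command are placeholders for your own setup): with --num_servers 1 and --num_samplers 0, every machine listed in ip_config.txt runs one server process plus one trainer process that does its own sampling.

python3 tools/launch.py \
  --workspace    ~/workspace \
  --num_trainers 1 \
  --num_samplers 0 \
  --num_servers  1 \
  --part_config  data/ogbn-arxiv.json \
  --ip_config    ip_config.txt \
  "python3 node_classification.py <script arguments>"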


DGL is using xxx.xxx.10.17:<PORT_1> (default is 30050) to start a server. Clients use <PORT_1> to connect to the servers running on their own machine and on every other machine. The same happens on every other machine, so there will be a server running on xxx.xxx.9.50:<PORT_1> as well.

@Rhett-Ying

I’m not sure what’s going on, but right now I’m able to make some progress with torchrun (from PyTorch 2.0.1): the server’s launch script no longer hangs until it reaches

Server is waiting for connections on [xxx.xxx.10.17:30050]

Please allow me to show my setup.

Server’s Script

torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        0 \
  --max-restarts   0 \
  --role           server \
  --node-rank      0 \
  --master-addr    xxx.xxx.10.17 \
  --master-port    29500 \
  --local-addr     xxx.xxx.10.17 \
  $ws/train_dist.py \
  --role server \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1

Client’s Script

addr=(
  "xxx.xxx.10.17"
  "xxx.xxx.9.50"
)

torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        0 \
  --max-restarts   0 \
  --role           client \
  --node-rank      1 \
  --master-addr    xxx.xxx.10.17 \
  --master-port    $port \
  --local-addr     ${addr[$1]} \
  $ws/train_dist.py \
  --role client \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1

Launching

Machine xxx.xxx.10.17

./launch_server.sh
./launch_client.sh 0

Machine xxx.xxx.9.50

./launch_client.sh 1

I might have done something wrong, but I don’t know where.

Thanks in advance.

Edit:
I think I should launch a server on xxx.xxx.9.50 as well. I’ll get back later. :stuck_out_tongue:

No luck though.

Is the distributed topology of DGL the same as that in PyTorch? It seems that the torchrun example (website and code) runs 1 server and 1 client on 2 different machines.

Once again, your help is much much appreciated. Thanks!

Usually it’s better to launch via launch.py and just replace torch.distributed.launch with torchrun there. As you can see in launch.py, launching by yourself is very error-prone, as several envs are required to be set during launch. Please also configure export TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO.
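
If it helps, one way to find the place to change is a sketch like the following (the exact text of the matched line is an assumption about your launch.py version, so check it before editing):

grep -n "torch.distributed.launch" tools/launch.py
# then swap the launcher in the matched command string, e.g. (assumed wording):
sed -i 's/python3 -m torch.distributed.launch/torchrun/' tools/launch.py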

One major issue in your launch is that you need to launch a server on both machines, then launch clients on both machines. @pubu has explained the details of servers and clients above.

In order to figure out the root cause of the hang in your case, we could also try launching manually. You can obtain the launch command for the server and clients by checking process details with ps when you launch via launch.py. You will find commands like the one below. DGL_ROLE indicates the process type: server or client.

export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PAT ...
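
A small sketch of how to recover those commands from a run started by launch.py (the training script name and <PID> are placeholders):

# full command line of every DGL-related process on this machine
ps -ef | grep train_dist.py
# DGL_* environment of one of those processes
tr '\0' '\n' < /proc/<PID>/environ | grep '^DGL_'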

I had some difficulties replacing torch.distributed.launch with torchrun in launch.py, but I managed to set all the required envs based on launch.py. What I’ve achieved is that I can launch 2 servers and 2 clients with my custom launch script, but there are errors in the logs. Please allow me to paste my launch scripts and logs below. (I modified them a little to hide personal information.)

P.S. Both machines have the same hostname “user-Super-Server”.

Launch Scripts

I ran ./launch_server.sh 0 and ./launch_client.sh 0 on the xxx.xxx.10.17 machine, and ./launch_server.sh 1 and ./launch_client.sh 1 on the xxx.xxx.9.50 machine.

launch_server.sh
#!/usr/bin/env bash


export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO


addr=(
  "xxx.xxx.10.17"
  "xxx.xxx.9.50"
)

port=29500

ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"

set -x

# NOTE: For PyTorch 2.0+
torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        9 \
  --rdzv-endpoint  ${addr[$1]}:$port \
  --max-restarts   0 \
  --role           server \
  --node-rank      0 \
  $ws/train_dist.py \
  --role server \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1
launch_client.sh
#!/usr/bin/env bash


export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO


addr=(
  "xxx.xxx.9.50"
  "xxx.xxx.10.17"
)

port=29500

ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"

set -x

# NOTE: For PyTorch 2.0+
torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        9 \
  --rdzv-endpoint  ${addr[$1]}:$port \
  --max-restarts   0 \
  --role           client \
  --node-rank      1 \
  $ws/train_dist.py \
  --role client \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1

Logs

server-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:36826.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:36826.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55144.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:55152.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:52164.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:52164.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


====================
now client is connected
====================


Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[03:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287727 closing signal SIGINT
[03:55:17] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault

Current thread 0x00007f5d98bbf740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195 in wait_for_senders
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101 in start_server
  File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471 in start
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278 in initialize
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>

Traceback (most recent call last):
  File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287302 got signal: 2
server-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role server --node-rank 0 /home/myid/ws/py_ws/p3-demo/train_dist.py --role server --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:35692.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:35692.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47256.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:47262.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:47888.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:47888.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


====================
now client is connected
====================


Initializing DGL...
load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[15:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[15:54:31] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.10.17:30050 , error: Cannot assign requested address
Traceback (most recent call last):
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179, in <module>
    main()
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138, in main
    dgl.distributed.initialize(
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471, in start
    start_server(
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101, in start_server
    rpc.wait_for_senders(
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195, in wait_for_senders
    _CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [15:54:31] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to xxx.xxx.10.17:30050
Stack trace:
  [bt] (0) /home/myid/ws/py_ws/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f1529d87235]
  [bt] (1) /home/myid/ws/py_ws/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7f152a29d91c]
  [bt] (2) /home/myid/ws/py_ws/dgl/libdgl.so(+0x8a7bc8) [0x7f152a2a7bc8]
  [bt] (3) /home/myid/ws/py_ws/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f152a115f88]
  [bt] (4) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7f15286155e3]
  [bt] (5) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7f1528615c0b]
  [bt] (6) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyObject_MakeTpCall+0x26b) [0x560494c86a6b]
  [bt] (7) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyEval_EvalFrameDefault+0x4eb6) [0x560494c823e6]
  [bt] (8) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x560494c8d99c]


ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3749488) of binary: /home/myid/programs/mambaforge/envs/p3/bin/python3.10
./launch_server.sh: line 76: 3749226 Killed
client-1017.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.9.50:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47256.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:47262.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


====================
now client is connected
====================


Initializing DGL...
Warning! Interface: eno2
IP address not available for interface.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:29] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno2
IP address not available for interface.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[03:54:31] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[[03:55:13] 03:55:13/opt/dgl/src/rpc/rpc.cc] :/opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting390
:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3287464 closing signal SIGINT
[03:55:13] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault

Current thread 0x00007f2d279b3740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>corrupted double-linked list

Fatal Python error: Aborted

Thread 0x00007f2d279b3740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>

, scipy.optimize._lsap
LIBXSMM_VERSION: main-1.17-3659 (25693771), scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode
LIBXSMM_TARGET: hsw [AMD EPYC 7502 32-Core Processor]
, scipy.integrate._dopRegistry and code: 13 MB
, scipy.integrate._lsodaCommand: /home/myid, nscipy.special.cython_specialg/programs/mambaforge/envs, /scipy.stats._statsp3/bin/python3.10 -u /home/myid, hscipy.stats.beta_ufuncang/ws/py_ws/p3-demo/train_dist.py --role, scipy.stats._boost.beta_ufunc client --ip_config /home/, liscipy.stats.binom_ufuncjihang/ws/py_ws/p3-demo/s, cscipy.stats._boost.binom_ufuncript/ip_config.txt --part_config,  /scipy.stats.nbinom_ufunchome/myid/ws/py_ws/p3, -dscipy.stats._boost.nbinom_ufuncemo/dataset/partitioned, /scipy.stats.hypergeom_ufuncogbn-arxiv/ogbn-arxiv, .jscipy.stats._boost.hypergeom_ufuncson --num_client 1 --num_serv, er scipy.stats.ncf_ufunc1 --num_sampler 1 ,
scipy.stats._boost.ncf_ufuncUptime: 44.711290 s
, scipy.stats.ncx2_ufunc, scipy.stats._boost.ncx2_ufunc, scipy.stats.nct_ufunc, scipy.stats._boost.nct_ufunc, scipy.stats.skewnorm_ufunc, scipy.stats._boost.skewnorm_ufunc, scipy.stats.invgauss_ufunc, scipy.stats._boost.invgauss_ufunc, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._statlib, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, sklearn.utils._isfinite, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.utils._logistic_sigmoid, sklearn.utils.sparsefuncs_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_fast (total: 150)
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Traceback (most recent call last):
  File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3287399 got signal: 2
client-950.log
+ torchrun --nnodes 2 --nproc-per-node 1 --rdzv-backend static --rdzv-id 9 --rdzv-endpoint xxx.xxx.10.17:29500 --max-restarts 0 --role client --node-rank 1 /home/myid/ws/py_ws/p3-demo/train_dist.py --role client --ip_config /home/myid/ws/py_ws/p3-demo/script/ip_config.txt --part_config /home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json --num_client 1 --num_server 1 --num_sampler 1

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55144.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:55152.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


====================
now client is connected
====================


Initializing DGL...
Warning! Interface: eno1
IP address not available for interface.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:47] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
Warning! Interface: eno1
IP address not available for interface.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[15:54:49] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[[15:55:15] /opt/dgl/src/rpc/rpc.cc:15:55:15390] :
User pressed Ctrl+C, Exiting
/opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
WARNING:torch.distributed.elastic.agent.server.api:Received 2 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3749920 closing signal SIGINT
[15:55:15] /opt/dgl/src/rpc/rpc.cc:390:
User pressed Ctrl+C, Exiting
Fatal Python error: Segmentation fault

Current thread 0x00007f4492966740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 230 in connect_receiver_finalize
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_client.py", line 213 in connect_to_server
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 310 in initialize
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 138 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 179 in <module>

Traceback (most recent call last):
  File "/home/myid/programs/mambaforge/envs/p3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3749852 got signal: 2
/home/myid/programs/mambaforge/envs/p3/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 10 leaked semaphore objects to clean up at shutdown

As I said, you need to specify several env variables when launching the client/server, like below:

export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PATH=xxx.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc,coo  DGL_SERVER_ID=0 train_dist.py xxx
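
For this 2-machine setup, a rough sketch of the server side could look like the following (the values are assumptions for 1 trainer and 0 samplers per machine; DGL_SERVER_ID must match the machine’s line number in ip_config.txt, and DGL_NUM_CLIENT is the total number of client processes across all machines):

# on xxx.xxx.10.17 (first line of ip_config.txt)
export DGL_ROLE=server DGL_SERVER_ID=0 DGL_NUM_SERVER=1 DGL_NUM_SAMPLER=0 \
       DGL_NUM_CLIENT=2 OMP_NUM_THREADS=1 \
       DGL_CONF_PATH=$ws/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json \
       DGL_IP_CONFIG=$ws/script/ip_config.txt DGL_GRAPH_FORMAT=csc,coo
python3 $ws/train_dist.py <script arguments>
# on xxx.xxx.9.50 (second line of ip_config.txt): same, but with DGL_SERVER_ID=1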

You are correct. I had configured the environment variables as you suggested, but I mistakenly assigned incorrect values to DGL_SERVER_ID. Although I have made some progress, I now encounter an error on the server with IP address xxx.xxx.10.17. The logs below are much shorter than the previous ones and correspond to the outputs of the same launch scripts from my last reply. The error is related to the socket connection; I must still have some misunderstanding about the IP configuration for DGL, since I haven’t succeeded yet.

ip_config.txt

xxx.xxx.10.17 30050
xxx.xxx.9.50 30050

server-1017.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:47596.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:47596.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:56172.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.9.50]:56182.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:52164.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:52164.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


==============================
Initializing DGL...
==============================


load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[09:30:17] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[09:30:17] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.10.17:30050]...
[09:30:17] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.10.17:30050 , error: Address already in use
Traceback (most recent call last):
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 182, in <module>
    main(args)
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/myid/ws/py_ws/p3-demo/train_dist.py", line 136, in main
    dgl.distributed.initialize(
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 471, in start
    start_server(
  File "/home/myid/ws/py_ws/dgl/distributed/rpc_server.py", line 101, in start_server
    rpc.wait_for_senders(
  File "/home/myid/ws/py_ws/dgl/distributed/rpc.py", line 195, in wait_for_senders
    _CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [09:30:17] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to xxx.xxx.10.17:30050
Stack trace:
  [bt] (0) /home/myid/ws/py_ws/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7f33a7187235]
  [bt] (1) /home/myid/ws/py_ws/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7f33a769d91c]
  [bt] (2) /home/myid/ws/py_ws/dgl/libdgl.so(+0x8a7bc8) [0x7f33a76a7bc8]
  [bt] (3) /home/myid/ws/py_ws/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f33a7515f88]
  [bt] (4) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7f33a5a155e3]
  [bt] (5) /home/myid/ws/py_ws/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7f33a5a15c0b]
  [bt] (6) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyObject_MakeTpCall+0x26b) [0x55e60d23da6b]
  [bt] (7) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyEval_EvalFrameDefault+0x4eb6) [0x55e60d2393e6]
  [bt] (8) /home/myid/programs/mambaforge/envs/p3/bin/python3.10(_PyFunction_Vectorcall+0x6c) [0x55e60d24499c]
server-950.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:442] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:492] [c10d - debug] The server socket is attempting to listen on [::]:29500.
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:37454.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:37454.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:45798.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [user-Super-Server]:29500.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [user-Super-Server]:45938.
[I socket.cpp:295] [c10d - debug] The server socket on [::]:29500 has accepted a connection from [::ffff:xxx.xxx.10.17]:45806.
[I socket.cpp:787] [c10d] The client socket has connected to [user-Super-Server]:29500 on [user-Super-Server]:45938.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


==============================
Initializing DGL...
==============================


load ogbn-arxiv
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 1 for part 1
[21:30:09] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[21:30:09] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [xxx.xxx.9.50:30050]...
client-1017.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:45798.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.9.50, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.9.50]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.9.50]:29500 on [user-Super-Server]:45806.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


==============================
Initializing DGL...
==============================


[09:30:07] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[09:30:07] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
client-950.log
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:56172.
[I socket.cpp:624] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (xxx.xxx.10.17, 29500).
[I socket.cpp:699] [c10d - trace] The client socket is attempting to connect to [::ffff:xxx.xxx.10.17]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:xxx.xxx.10.17]:29500 on [user-Super-Server]:56182.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.


==============================
Initializing DGL...
==============================


Warning! Interface: eno1 
IP address not available for interface.
[21:30:17] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[21:30:17] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.

Please make sure no other processes occupy this port.
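
A quick sketch of how to check that on each machine (either command should work on a typical Linux box):

ss -ltnp | grep 30050    # listening sockets on port 30050 and their owning processes
# or
lsof -i :30050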

Please also make sure the commands for launching the server/client are consistent with the ones generated by launch.py.

@Rhett-Ying @pubu

The training can be launched via DGL’s launch script now, but it hangs with my custom script. I think I’ll stick with DGL’s script for now.

Guys, I genuinely appreciate your dedicated efforts in helping me with this issue.

Are you aware of what kind of changes you made that make the DGL launch work now?

I’m not sure, actually… I’ve been tweaking ip_config.txt and my script while referring to DGL’s launch.py and launch.sh. Then DGL’s script suddenly just worked.

My script, based on torchrun, manages to establish connections between the 2 servers and 2 clients, but nothing happens afterwards. The torch.distributed.init_process_group call seems to never return.
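
One thing that might be worth checking, given the “IP address not available for interface” warnings for eno1/eno2 in the logs above: on multi-NIC machines the gloo/nccl backends can pick the wrong network interface, which can make init_process_group hang. A minimal sketch, assuming the interface that carries the xxx.xxx.x.x addresses is eno1 (check with ip addr):

export GLOO_SOCKET_IFNAME=eno1
export NCCL_SOCKET_IFNAME=eno1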