Did you set GLOO_SOCKET_IFNAME?
Yes. PyTorch’s example didn’t work either until I set GLOO_SOCKET_IFNAME to proper values. I also changed the backend of the example from nccl to gloo.
Which dataset are you using for training? How long does it hang?
So this is probably the root cause of the hang. You should specify eno1 when submitting the job to ip_1 (including server and client) and eno2 for the other machine.
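For reference, a minimal sketch of what I mean, assuming eno1 is the interface bound to ip_1 and eno2 is the interface on the other machine (the interface names are assumptions, please check yours with ip addr):
# on the ip_1 machine, before launching the server and client processes
export GLOO_SOCKET_IFNAME=eno1
# on the other machine
export GLOO_SOCKET_IFNAME=eno2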
I’m using ogbn-arxiv, which is partitioned into 2 parts before training. It hangs seemingly forever. Nothing happens after the log says:
[17:33:09] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[17:33:09] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
But the warning (IP address not available for interface) still persists, and I think I’ve set up the environment variables correctly (as shown by the ssh command constructed by launch.py).
It’s not working.
And I just noticed that there is a fatal error, which I might have missed earlier, right after the launch:
The number of OMP threads per trainer is set to 32
/home/myid/ws/py_ws/p3-demo/example/launch.py:153: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
thread.setDaemon(True)
cleanupu process runs
Fatal Python error: Segmentation fault
Current thread 0x00007fed02adf740 (most recent call first):
File "/home/myid/ws/py_ws/dgl/heterograph_index.py", line 1151 in formats
File "/home/myid/ws/py_ws/dgl/heterograph.py", line 6176 in formats
File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 398 in __init__
File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 268 in initialize
File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 357 in main
File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 483 in <module>
Which DGL version are you using? What shared memory have you configured for your machines? Please check df -h /dev/shm.
I’m using version 1.1.1+cu117.
df -h /dev/shm shows:
Filesystem Size Used Avail Use% Mounted on
tmpfs 158G 28M 158G 1% /dev/shm
Do I need to configure shared memory? Like I said, I’m using sshfs to map the workspace from the xxx.xxx.10.17 machine to the xxx.xxx.9.50 machine.
Which machine is the output of df -h /dev/shm from? Or do these 2 machines share this file system as well?
Sorry about the confusion.
xxx.xxx.10.17:
Filesystem Size Used Avail Use% Mounted on
tmpfs 158G 28M 158G 1% /dev/shm
xxx.xxx.9.50:
Filesystem Size Used Avail Use% Mounted on
tmpfs 32G 285M 32G 1% /dev/shm
After I launch the training script (the node_classification.py example), xxx.xxx.10.17’s used memory increased to 101M and xxx.xxx.9.50’s used memory increased to 357M.
shm looks good to me. The call stack you shared indicates the segfault happened when creating DGL graph formats. Could you try prepending DGL_GRAPH_FORMAT=coo before node_classification.py xxx?
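For example, a minimal sketch, keeping your usual arguments in place of xxx (if you launch through launch.py, prepend the variable inside the quoted training command instead so the remote processes inherit it):
# force the COO format to check whether format conversion is what crashes
DGL_GRAPH_FORMAT=coo python3 node_classification.py xxx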
I can’t tell yet, since I currently have no access to the computation resources.
There is something that I want to double check: when running a distributed training on 2 machines, is there 1 server and 1 client on each machine, or 1 server and 1 client on one machine and 1 client on another machine?
--num_trainers 1 --num_samplers 0 --num_servers 1 indicates 1 server and 1 client on each machine. Each client will connect to all servers on all machines.
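For reference, a hedged sketch of what the corresponding launch.py invocation looks like (the workspace, partition config, and training command are placeholders):
python3 launch.py \
--workspace /path/to/workspace \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json \
--ip_config ip_config.txt \
"python3 node_classification.py xxx"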
Wow… I thought in general there is only 1 global server and there are 1 or more clients.
Does this mean that, in my case, DGL is using xxx.xxx.10.17:<PORT_1> to start a server and xxx.xxx.9.50:<PORT_2> to start a client, and doing similar things on the xxx.xxx.9.50 machine?
No. Simply put, DGL has server processes (primary and backup), trainer processes, and sampler processes, all of which run on every machine. When you run the launch script, DGL first launches the server processes specified by --num_servers to load the graph partitions. There is one primary server; the rest are backup servers (optional). When the servers are launched successfully, DGL launches the trainer and sampler processes specified by --num_trainers and --num_samplers. Generally, the number of trainer processes is the same as the number of GPUs in the machine (for training on GPUs). You can choose not to use sampler processes at all, in which case sampling is done in the trainer processes.
In summary, every machine in the distributed training runs one primary server process that loads its graph partition, optionally some backup server processes, one or more trainer processes, and zero or more sampler processes.
DGL is using xxx.xxx.10.17:<PORT_1> (the default is 30050) to start a server. Clients use <PORT_1> to connect to the servers running on the same machine and on every other machine. The same happens on every other machine, so there will be a server running on xxx.xxx.9.50:<PORT_1> as well.
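For reference, the ip_config.txt for your two machines would look roughly like this (one machine per line; the port column is optional and defaults to 30050):
xxx.xxx.10.17 30050
xxx.xxx.9.50 30050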
I’m not sure what’s going on, but right now I’m able to make some progress with torchrun (from PyTorch 2.0.1), and the server’s launch script didn’t hang until it reached:
Server is waiting for connections on [xxx.xxx.10.17:30050]
Please allow me to show the exact operations.
Server’s Script
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 0 \
--max-restarts 0 \
--role server \
--node-rank 0 \
--master-addr xxx.xxx.10.17 \
--master-port 29500 \
--local-addr 10.30.10.17 \
$ws/train_dist.py \
--role server \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
Client’s Script
addr=(
"xxx.xxx.10.17"
"xxx.xxx.9.50"
)
torchrun \
--nnodes 2 \
--nproc-per-node 1 \
--rdzv-backend static \
--rdzv-id 0 \
--max-restarts 0 \
--role client \
--node-rank 1 \
--master-addr xxx.xxx.10.17 \
--master-port $port \
--local-addr ${addr[$1]} \
$ws/train_dist.py \
--role client \
--ip_config $ws/script/ip_config.txt \
--part_config $ws/dataset/partitioned/$name/$name.json \
--num_client 1 \
--num_server 1 \
--num_sampler 1
Launching
Machine xxx.xxx.10.17
./launch_server.sh
./launch_client.sh 0
Machine xxx.xxx.9.50
./launch_client.sh 1
I might have done something wrong, but I don’t know where.
Thanks in advance.
Edit:
I think I should launch a server on xxx.xxx.9.50 as well. I’ll get back later.
No luck though.
Is the distributed topology of DGL the same as that of PyTorch? It seems that the torchrun example (website and code) runs with 1 server and 1 client on 2 different machines.
Once again, your help is much much appreciated. Thanks!
Usually it’s better to launch via launch.py and replace torch.distributed.launch with torchrun. As you can see in launch.py, launching by yourself is very error-prone, as several environment variables need to be set during launch. Please configure export TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO as well.
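For example, a hedged sketch (the launch.py arguments and the training command are placeholders; prepending the variables to the quoted command is just one way to make sure the remote processes inherit them):
# on the machine you launch from
export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO
# also prepend them to the training command passed to launch.py
python3 launch.py <your usual launch.py arguments> \
"TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO python3 train_dist.py xxx"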
One major issue in your launch is that you need to launch servers on both machines, then launch clients on both machines. @pubu has explained the details of servers and clients above.
In order to figure out the root cause of the hang in your case, we could also try launching manually. You can obtain the launch commands for the server and client by checking the process details with ps when you launch via launch.py. You will find a command like the one below; DGL_ROLE indicates the process type: server or client.
export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PAT ...