Problem with Setting Up IP Config

Did you set GLOO_SOCKET_IFNAME?

Yes. PyTorch’s example didn’t work either until I set GLOO_SOCKET_IFNAME to the proper values. I also changed the backend of the example from nccl to gloo.

Which dataset are you using for training? How long does it hang?

So this is probably the root cause of the hang.

You should specify eno1 when submitting the job to ip_1 (including server and client) and eno2 for the other machine.
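For example, something along these lines (assuming eno1/eno2 are the interface names that ip addr reports on each machine):

# on the ip_1 machine (both server and client processes)
export GLOO_SOCKET_IFNAME=eno1

# on the other machine
export GLOO_SOCKET_IFNAME=eno2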

I’m using ogbn-arxiv, which is partitioned into 2 parts before training. It hangs seemingly forever; nothing happens after the log says

[17:33:09] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[17:33:09] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.

But the warning (IP address not available for interface) still persists, and I think I’ve set up the environment variables correctly (as shown by the ssh command constructed by launch.py).

It’s not working.

And I just noticed that there is a fatal error, which I might have missed earlier, right after the launch:

The number of OMP threads per trainer is set to 32
/home/myid/ws/py_ws/p3-demo/example/launch.py:153: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
cleanupu process runs
Fatal Python error: Segmentation fault

Current thread 0x00007fed02adf740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/heterograph_index.py", line 1151 in formats
  File "/home/myid/ws/py_ws/dgl/heterograph.py", line 6176 in formats
  File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 398 in __init__
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 268 in initialize
  File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 357 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 483 in <module>

Which DGL version are you using? And what shared memory is configured on your machines? Please share the output of df -h /dev/shm.

I’m using version 1.1.1+cu117.

df -h /dev/shm shows:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           158G   28M  158G   1% /dev/shm

Do I need to configure shared memory? Like I said, I’m using sshfs to map the workspace from the xxx.xxx.10.17 machine to the xxx.xxx.9.50 machine.

Which machine is the output of df -h /dev/shm from? Or do these 2 machines share this file system as well?

Sorry about the confusion.

xxx.xxx.10.17:

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           158G   28M  158G   1% /dev/shm

xxx.xxx.9.50

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            32G  285M   32G   1% /dev/shm

After I launch the training script (the node_classification.py example), xxx.xxx.10.17’s /dev/shm usage increases to 101M and xxx.xxx.9.50’s increases to 357M.

The shm looks good to me. The call stack you shared indicates the seg fault happened when creating DGL graph formats. Could you try to prepend DGL_GRAPH_FORMAT=coo before node_classification.py xxx?
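For example, roughly like this in the command you run (the trailing arguments are just placeholders for whatever you currently pass):

DGL_GRAPH_FORMAT=coo python3 node_classification.py <your existing args>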

I can’t tell yet, since I don’t have access to the computation resources at the moment.

There is something I want to double check: when running distributed training on 2 machines, is there 1 server and 1 client on each machine, or 1 server and 1 client on one machine and just 1 client on the other?

--num_trainers 1 --num_samplers 0 --num_servers 1 indicates 1 server and 1 client on each machine. Each client will connect to all servers on all machines.
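For reference, a typical launch with these flags looks roughly like the sketch below (the workspace, paths, and the quoted training command are placeholders for your own setup):

python3 tools/launch.py \
  --workspace /path/to/workspace \
  --num_trainers 1 \
  --num_samplers 0 \
  --num_servers 1 \
  --part_config dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json \
  --ip_config ip_config.txt \
  "python3 node_classification.py <your existing args>"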

Wow… I thought in general there is only 1 global server and there are 1 or more clients.

Does this mean that, in my case, DGL is using xxx.xxx.10.17:<PORT_1> to start a server and xxx.xxx.9.50:<PORT_2> to start a client, and doing similar things on the xxx.xxx.9.50 machine?

No. Simply put, DGL has server processes (primary and backup), trainer processes, and sampler processes, all running on every machine. When you run the launch script, DGL first launches the server processes specified by --num_servers to load the graph partitions. There is one primary server; the rest are backup servers (optional). Once the servers are launched successfully, DGL launches the trainer and sampler processes specified by --num_trainers and --num_samplers. Generally, the number of trainer processes is the same as the number of GPUs in the machine (for training on GPUs). You can choose not to use sampler processes at all, in which case sampling is done in the trainer processes.

In summary, every machine in the distributed training runs one primary server process that loads its graph partition, optionally backup server processes, one or more trainer processes, and zero or more sampler processes.


DGL is using xxx.xxx.10.17:<PORT_1> (default is 30050) to start a server. Clients use <PORT_1> to connect to the server running on the same machine and every other machine. The same happens on every other machine. So, there will be a server running on xxx.xxx.9.50:<PORT_1> as well.
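For reference, the ip_config.txt for these two machines would look something like this (the port after each IP is optional; 30050 is the default):

xxx.xxx.10.17 30050
xxx.xxx.9.50 30050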

@Rhett-Ying

I’m not sure what’s going on, but right now I’m able to make some progress with torchrun (from PyTorch 2.0.1): the server’s launch script runs without hanging until it reaches

Server is waiting for connections on [xxx.xxx.10.17:30050]

Please allow me to show the exact commands.

Server’s Script

torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        0 \
  --max-restarts   0 \
  --role           server \
  --node-rank      0 \
  --master-addr    xxx.xxx.10.17 \
  --master-port    29500 \
  --local-addr     10.30.10.17 \
  $ws/train_dist.py \
  --role server \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1

Client’s Script

addr=(
  "xxx.xxx.10.17"
  "xxx.xxx.9.50"
)

torchrun \
  --nnodes         2 \
  --nproc-per-node 1 \
  --rdzv-backend   static \
  --rdzv-id        0 \
  --max-restarts   0 \
  --role           client \
  --node-rank      1 \
  --master-addr    xxx.xxx.10.17 \
  --master-port    $port \
  --local-addr     ${addr[$1]} \
  $ws/train_dist.py \
  --role client \
  --ip_config   $ws/script/ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --num_client  1 \
  --num_server  1 \
  --num_sampler 1

Launching

Machine xxx.xxx.10.17

./launch_server.sh
./launch_client.sh 0

Machine xxx.xxx.9.50

./launch_client.sh 1

I might have done something wrong, but I don’t know where.

Thanks in advance.

Edit:
I think I should launch a server on xxx.xxx.9.50 as well. I’ll get back later. :stuck_out_tongue:

No luck though.

Is the distributed topology of DGL the same as that of PyTorch? It seems that the torchrun example (website and code) runs 1 server and 1 client on 2 different machines.

Once again, your help is much much appreciated. Thanks!

Usually it’s better to launch via launch.py and replace torch.distributed.launch with torchrun inside it. As you can see in launch.py, launching by yourself is very error-prone, as several environment variables need to be set during launch. Please configure export TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO as well.
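For example, in the shell that runs the launch (a minimal sketch):

export TORCH_DISTRIBUTED_DEBUG=DETAIL
export TORCH_CPP_LOG_LEVEL=INFO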

One major issue in your launch is that you need to launch a server on both machines, then launch clients on both machines. @pubu has explained the details of servers and clients above.

In order to figure out the root cause of the hang in your case, we could also try to launch manually. You can obtain the launch commands for the server and client by checking the process details with ps when you launch via launch.py. You will find commands like the one below; DGL_ROLE indicates the process type: server or client.

export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=16 DGL_CONF_PAT ...
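A quick way to capture those full commands is to check the process list on each machine while launch.py is running, for example (the grep pattern is just a guess at your script name; adjust it as needed):

# run on each machine after launching via launch.py
ps auxww | grep -E 'DGL_ROLE|node_classification'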