Having a Problem with Setting Up IP Config

Allow me to state where I am now:

Setup

  • Two machines (Ubuntu 22.04) in the same LAN
  • Two IP addresses in the ip_config.txt file, e.g. 1.1.1.0 for the server and 1.1.1.1 for the client (tested with and without specifying ports; an example of the file follows this list)
  • Tried to launch the training with DGL’s launch.py script or with my script based on torchrun
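For illustration, the two forms of ip_config.txt I tested look like this (placeholder addresses as above; 30050 is the port that appears in the logs below):

1.1.1.0 30050
1.1.1.1 30050

and, without ports:

1.1.1.0
1.1.1.1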

Results of Using DGL’s Launch Script

The same exception is raised whether or not ports are specified:

/opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on 1.1.1.0:30050 , error: Address already in use

Results of Using My Launch Script

  1. If the port specified in ip_config.txt (e.g. 30050) is the same as the one in torch.distributed.init_process_group(backend="gloo", init_method="tcp://1.1.1.1:30050"), the server complains that the address is already in use.
  2. If the ports specified in ip_config.txt are different from the one passed to ...init_process_group, there are no errors, but the processes just hang (a minimal sketch of this port layout follows this list). The logs are shown below.
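A minimal sketch of case 2 from my script (the address is the placeholder from my report above, and 29500 is a hypothetical rendezvous port, deliberately different from the 30050 that DGL's RPC server binds from ip_config.txt):

import torch.distributed as dist

# DGL's RPC server binds the port from ip_config.txt (30050 here), so the
# torch.distributed rendezvous uses a different port to avoid the bind conflict.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://1.1.1.1:29500",  # placeholder address, hypothetical port
    rank=0,          # 1 on the other machine
    world_size=2,
)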

Server’s Log

Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[08:34:26] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:34:26] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [1.1.1.0:30050]...

Client’s Log

Warning! Interface: eno1
IP address not available for interface.
[20:34:26] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[20:34:26] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Warning! Interface: eno1
IP address not available for interface.
[20:34:28] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[20:34:28] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.

Any help is much appreciated.

Here are some tips for debugging.

  1. Make sure you’re able to SSH from the machine where you run launch.py to all machines listed in ip_config.txt.
  2. Make sure a shared workspace is ready, such as the NFS setup shown in our example: https://github.com/dmlc/dgl/tree/master/examples/distributed/graphsage. Or are you just running this example?
  3. Before running launch.py, please kill any related processes on all machines that could be left over from a previous run (a small helper sketch follows this list).
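For tips 1 and 3, a hypothetical helper could look like this (a sketch only; adjust the host list and process pattern to your setup):

import subprocess

# Hosts are the first column of each line in ip_config.txt.
with open("ip_config.txt") as f:
    hosts = [line.split()[0] for line in f if line.strip()]

for host in hosts:
    # Tip 1: verify passwordless SSH works (BatchMode fails instead of prompting).
    subprocess.run(["ssh", "-o", "BatchMode=yes", host, "true"], check=True)
    # Tip 3: kill leftover trainer/server processes from a previous run.
    subprocess.run(["ssh", host, "pkill -f node_classification.py || true"])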

Thanks for your response. Here is what I’ve done so far:

  1. The two IP addresses of the two machines are in ip_config.txt.
  2. Set up a shared workspace using sshfs. (Is that okay?)
  3. Ran launch.py, but it always complains “node_classification.py: error: unrecognized arguments: --local-rank=0”, even though I never pass --local-rank=0 to launch.py, and the string “local-rank” does not even exist in my workspace (rg -F "local-rank" returns no results).

Edit:

I modified node_classification.py to use parse_known_args instead of parse_args to avoid the error, but the training does not seem to be running. The log (excluding the deprecation warnings) is shown below:

[14:39:10] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[14:39:10] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Arguments: Namespace(graph_name='ogbn-arxiv', ip_config='ip_config.txt', part_config='/home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json', n_classes=0, backend='gloo', num_gpus=0, num_epochs=30, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=1000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False)
user-Super-Server: Initializing DistDGL.
Warning! Interface: eno2
IP address not available for interface.
Warning! Interface: veth27427c1
IP address not available for interface.
Warning! Interface: vetheace20b
IP address not available for interface.
[02:39:11] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[02:39:11] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Arguments: Namespace(graph_name='ogbn-arxiv', ip_config='ip_config.txt', part_config='/home/myid/ws/py_ws/p3-demo/dataset/partitioned/ogbn-arxiv/ogbn-arxiv.json', n_classes=0, backend='gloo', num_gpus=0, num_epochs=30, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=1000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False)
user-Super-Server: Initializing DistDGL.
Warning! Interface: eno1
IP address not available for interface.
[14:39:11] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[14:39:11] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
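For reference, the parse_known_args workaround mentioned in my edit above is roughly this (a sketch with the script’s other arguments elided):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--graph_name")  # ...the remaining arguments are elided here
# Ignore unrecognized arguments such as --local-rank=0 instead of raising an error.
args, unknown = parser.parse_known_args()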

The “IP address not available for interface” warning line is suspicious.

This is a known issue. Please update the line below, changing local_rank to local-rank:

parser.add_argument(
    "--local_rank", type=int, help="get rank of the process"
)
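One way to apply that change while keeping args.local_rank available to the rest of the script (a sketch, not necessarily the exact upstream patch):

parser.add_argument(
    "--local-rank", "--local_rank", dest="local_rank",
    type=int, help="get rank of the process"
)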

Yeah… There are several interfaces on both machines, and the ones whose IPs are specified in ip_config.txt are eno1 (on the server) and eno2 (on the client). But I’m not sure how to resolve this. I noticed that PyTorch uses GLOO_SOCKET_IFNAME, but DGL does not seem to rely on a similar environment variable.

What do you mean by server and client here? Please share all the IPs of both machines.

I have the following in my ip_config.txt:

xxx.xxx.10.17 30050
xxx.xxx.9.50 30050

The first IP belongs to the machine where I run launch.py, and its interface (shown by ifconfig) is eno1. The interface for the other IP address is eno2.

My bash script to invoke launch.py is as follows:

ws="/home/myid/ws/py_ws/p3-demo"
name="ogbn-arxiv"

python launch.py \
  --extra_envs "PATH=/home/myid/programs/mambaforge/etc/profile.d/conda.sh:$PATH" \
  --ssh_username myid \
  --workspace    $ws \
  --num_trainers 1 \
  --num_samplers 0 \
  --num_servers  1 \
  --part_config  dataset/partitioned/$name/$name.json \
  --ip_config    ip_config.txt \
  "python $ws/example/node_classification.py \
  --graph_name $name \
  --ip_config ip_config.txt \
  --part_config $ws/dataset/partitioned/$name/$name.json \
  --n_classes 40 \
  --num_epochs 30 \
  --batch_size 1000 \
  --local_rank 0"

The other IP is supposed to belong to the other machine, not to another interface of the same machine.

Yes, xxx.xxx.9.50 belongs to another physical machine. Sorry about the confusion.

Please try with this environment variable (GLOO_SOCKET_IFNAME), as DGL uses torch.distributed in the background.

I’m not sure if I’m doing it right. Since the training script is launched from only one machine, I think I have to add GLOO_SOCKET_IFNAME in launch.py rather than in node_classification.py. Specifically, I added eno1 in the construct_dgl_server_env_vars function and eno2 in construct_dgl_client_env_vars. The program still hangs after this modification.

By the way, I just successfully ran PyTorch’s multi-node example (website and code).

Did you set GLOO_SOCKET_IFNAME?

Yes. PyTorch’s example didn’t work either until I set GLOO_SOCKET_IFNAME to the proper values. I also changed the example’s backend from nccl to gloo.

Which dataset are you using for training? How long does it hang?

So this is probably the root cause of the hang.

You should specify eno1 when submitting the job to the first IP (for both the server and client processes on that machine), and eno2 for the other machine.
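For example, a sketch of how the variable could be set per machine (assuming it is set before DGL / torch.distributed initialize; the interface names come from your ifconfig output):

import os

# On the machine whose ip_config.txt address lives on eno1:
os.environ["GLOO_SOCKET_IFNAME"] = "eno1"
# On the other machine, set it to "eno2" instead.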

I’m using ogbn-arxiv, which is partitioned into 2 parts before training. It hangs seemingly forever. Nothing happens after the log says:

[17:33:09] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[17:33:09] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.

But the warning (IP address not available for interface) still persists, and I think I’ve set up the environment variable correctly (it shows up in the ssh command constructed by launch.py).

It’s not working.

And I just noticed that there is a fatal error, which I might have missed earlier, right after the launch:

The number of OMP threads per trainer is set to 32
/home/myid/ws/py_ws/p3-demo/example/launch.py:153: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
cleanupu process runs
Fatal Python error: Segmentation fault

Current thread 0x00007fed02adf740 (most recent call first):
  File "/home/myid/ws/py_ws/dgl/heterograph_index.py", line 1151 in formats
  File "/home/myid/ws/py_ws/dgl/heterograph.py", line 6176 in formats
  File "/home/myid/ws/py_ws/dgl/distributed/dist_graph.py", line 398 in __init__
  File "/home/myid/ws/py_ws/dgl/distributed/dist_context.py", line 268 in initialize
  File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 357 in main
  File "/home/myid/programs/mambaforge/envs/p3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper
  File "/home/myid/ws/py_ws/p3-demo/example/node_classification.py", line 483 in <module>