Allow me to state where I am now:
Setup
- Two machines (Ubuntu 22.04) in the same LAN
- Two IP addresses in the ip_config.txt file, e.g. 1.1.1.0 for server and 1.1.1.1 for client (tested with and without specifying ports)
- Tried to launch the training both with DGL’s launch.py script and with my own script based on torchrun
Results of Using DGL’s Launch Script
The following exception is raised whether or not ports are specified:
/opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on 1.1.1.0:30050 , error: Address already in use
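One way to narrow down the "Address already in use" error would be to check, on the server machine and before launching, whether something already holds the port. The snippet below is only a sketch using Python's standard socket module. `can_bind` is a hypothetical helper name, not part of DGL, and it does not reproduce how DGL itself binds its sockets.

```python
import errno
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to (host, port).

    Hypothetical diagnostic helper; not part of DGL or PyTorch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # EADDRINUSE: another socket already holds this address.
            # EADDRNOTAVAIL: the IP is not assigned to a local interface.
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            print(f"bind {host}:{port} failed: {name}")
            return False
    return True
```

Calling `can_bind("1.1.1.0", 30050)` on the server before starting anything would show whether the port is free at that moment. If it is free beforehand but the launch still fails, a likely explanation is that two processes within the same launch are trying to bind the same address.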
Results of Using My Launch Script
- If the port specified in ip_config.txt (e.g. 30500) is the same as the one in torch.distributed.init_process_group(backend="gloo", init_method="tcp://1.1.1.1:30050"), the server complains that the address is already in use.
- If the port specified in ip_config.txt differs from the one in the init_process_group call, there are no errors, but the run just hangs. Logs are shown below.
Server’s Log
Start to create specified graph formats which may take non-trivial time.
Finished creating specified graph formats.
start graph service on server 0 for part 0
[08:34:26] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[08:34:26] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Server is waiting for connections on [1.1.1.0:30050]...
Client’s Log
Warning! Interface: eno1
IP address not available for interface.
[20:34:26] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[20:34:26] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
Warning! Interface: eno1
IP address not available for interface.
[20:34:28] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[20:34:28] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
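Since the server log stops at "waiting for connections" and the client log shows no error, a basic reachability test from the client machine might help separate a network problem from a DGL configuration problem. The following is a minimal sketch using only the standard library. `can_connect` is a hypothetical helper name, and the 3-second timeout is an arbitrary choice.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical diagnostic helper; not part of DGL or PyTorch.
    """
    try:
        # create_connection raises OSError on refusal, unreachable host,
        # or timeout (socket.timeout is a subclass of OSError).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect("1.1.1.0", 30050)` on the client while the server is waiting would indicate whether the port is reachable at all. A `False` result would point toward a firewall or interface issue rather than the training script.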
Any help is much appreciated.