All right. I suppose it’s a network configuration issue.
It seems the reason I could make it run was that I launched both servers on the same machine with different ports, i.e. I put one IP address (xxx.xxx.10.17) with two different ports into ip_config.txt.
Then I ran into this error today, so I modified ip_config.txt to list two different IP addresses (xxx.xxx.10.17 and xxx.xxx.9.50).
Then I got this error:
[08:38:41] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on xxx.xxx.9.50:30051 , error: Cannot assign requested address
I’m not sure if it is a network configuration issue on my side.
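The bind error can be checked outside DGL. "Cannot assign requested address" (errno `EADDRNOTAVAIL`) is what the OS reports when a process tries to bind to an IP that is not assigned to any network interface on the machine it runs on, for example a server for xxx.xxx.9.50 being started on the xxx.xxx.10.17 host. Below is a minimal sketch. It assumes ip_config.txt holds one IP per line, optionally followed by a port. The helper names `read_ip_config`, `can_bind_locally` and `check_ip_config` are hypothetical and are not part of DGL.

```python
import errno
import socket
from pathlib import Path


def read_ip_config(path):
    """Return the IP column of an ip_config.txt-style file.

    Assumption: each non-empty line starts with an IP address; anything after
    the first whitespace-separated token (e.g. a port) is ignored.
    """
    ips = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            ips.append(line.split()[0])
    return ips


def can_bind_locally(ip):
    """Return True if this machine can bind a TCP socket to `ip`.

    Port 0 lets the OS pick any free port, so only the address itself is
    tested. EADDRNOTAVAIL ("Cannot assign requested address") means the IP
    is not assigned to any local interface; other errors are re-raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, 0))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRNOTAVAIL:
            return False
        raise
    finally:
        sock.close()


def check_ip_config(path):
    """Map each IP listed in `path` to whether it is bindable on this machine."""
    return {ip: can_bind_locally(ip) for ip in read_ip_config(path)}
```

Running `check_ip_config("ip_config.txt")` on each machine should report `True` only for that machine's own address. An address that reports `False` on the machine that is supposed to own it would point to a network configuration problem on that host.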
I’d recommend using the DGL launch script with torchrun, and not specifying any ports or duplicate IPs in the ip_config.txt file. Were you able to make the launch script work with torchrun? If not, I can send you the launch script that I got working with torchrun.
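As a sketch of that suggestion, an ip_config.txt for the two machines in this thread could look like the following. This assumes the layout of one machine IP per line with no ports. The masked addresses are the ones from the original post and should be replaced with the real ones. Each line should be an address owned by a different machine, with no IP repeated.

```text
xxx.xxx.10.17
xxx.xxx.9.50
```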
It would be great if you could send me a copy of your launch script! I’ll message you my email address. Thanks.