Which ports need to be opened in firewall for DGL Distributed?

DGL distributed training doesn’t run when the firewall is enabled on the hosts, even though I have opened port 30050 on all of them. I also tried specifying a port in the ip_config.txt file and opening that port in the firewall, along with port 30050, on all hosts, but no joy:

192.168.1.100 10020
192.168.1.101 10020
...

Should the port in the ip_config.txt file be enclosed in brackets like this?

192.168.1.100 [10020]
192.168.1.101 [10020]
...

The training works fine when I disable the firewall on all hosts (without specifying a port in ip_config.txt), but it doesn’t work with the firewall enabled.

Which ports need to be opened in the firewall for distributed training?

When the firewall is enabled, does ssh 192.168.1.100 work? And what’s the error log?

Yes, ssh works; I have set up passwordless ssh.
Where can I get the error log?

Does DGL use only port 30050, or other ports too?

If any errors or exceptions are thrown, you should see them in the terminal where you launched the training job.

Port 30050 is used for the servers, and free ports are picked for the clients.

So I think you need to disable the firewall. Otherwise, if the firewall blocks ports by default, the clients cannot obtain a free port that the servers can reach.
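As a rough illustration of why such ports are hard to open in advance, here is a generic sketch of how a process commonly asks the OS for a free port, using only Python's standard socket module. This is not DGL's actual code, and pick_free_port is a hypothetical helper name. If the clients pick their ports in a similar way, the port number differs from run to run, so a single firewall rule cannot cover it.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port that the OS reports as currently unused.

    Binding to port 0 tells the OS to choose any free port,
    so the result is not predictable ahead of time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))            # port 0 = "let the OS choose"
        return sock.getsockname()[1]    # the port the OS actually assigned


port = pick_free_port()
print(f"OS-assigned free port: {port}")
```

The printed port will usually change between runs, which matches the behaviour described above.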

Could you share the launch command and the log too?
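To narrow down which connections the firewall blocks, a quick reachability check run from each host may also help. The sketch below uses only Python's standard library; port_is_reachable and check_hosts are hypothetical helper names, and the check only succeeds if something is actually listening on the target port at that moment (for example while the servers are running).

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_hosts(hosts, ports, timeout: float = 2.0) -> dict:
    """Map each (host, port) pair to whether it accepted a connection."""
    return {
        (host, port): port_is_reachable(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

For the setup in this thread, one could call check_hosts(["192.168.1.100", "192.168.1.101"], [30050, 10020]) from each machine and compare the results with the firewall on and off.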
