Hi there,
I ran distributed DGL (dmlc/dgl on GitHub) on a Slurm-managed cluster today, but the job failed with the error message “dgl._ffi.base.DGLError: [13:16:40] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to 172.24.192.105:30050”. I requested two nodes in the Slurm script and used the Ethernet interface for this run. Stdout and stderr are in the "stderr" gist on GitHub.
I later tried the InfiniBand interface instead, but got a bus error. Stdout and stderr for that run are in the "stderr.ib0" gist on GitHub.
For context, DGL sets up its server and client processes at the beginning of the job and binds them to ports for socket communication. The error occurs at this stage.
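To separate a DGL problem from a node-level one, the same kind of bind can be tried outside DGL. Below is a minimal sketch using Python's standard socket module, not DGL's own C++ socket communicator. The helper name can_bind is hypothetical and only for illustration, and I am assuming plain IPv4 TCP sockets.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port) on this machine.

    Hypothetical diagnostic helper; it is not part of DGL.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # bind() raises OSError if the address is not assigned to a local
            # interface, or if the port is already held by another socket.
            sock.bind((host, port))
        except OSError:
            return False
        return True
```

Running can_bind("172.24.192.105", 30050) on each allocated node (for example through srun) should show whether the address belongs to that node and whether the port is free there. A False result on a node that does not own that IP would suggest the server process was started on the wrong host rather than a firewall issue.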
To run DistDGL on a Slurm cluster, I modified /tool/launch.py to create new processes with srun instead of ssh. The modified script is at https://github.com/K-Wu/IGB-Datasets/blob/main/benchmark/slurm_launcher.py.
I have checked with the supercomputing center, and they report that no restrictions on IP communication are in place. Any ideas or suggestions on how to fix this? Thank you.
Best Regards,
Kun