Hello.
I’ve been trying to run DistDGL with the symmetric ogbn-papers100M graph on a 4-host GCP CPU cluster, and it gets stuck at the loading step. The log output is below:
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.9 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.8 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
client_loop: send disconnect: Broken pipe
Connection to 10.0.0.8 closed by remote host.
Connection to 10.0.0.9 closed by remote host.
client_loop: send disconnect: Broken pipe
[17:07:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.10 closed by remote host.
Exactly 10 minutes after the sockets are created, a timeout seems to kick in and each process tries to connect to another machine (10.0.0.8:30050 in the log). The connections are ultimately closed, as seen in the log.
At this point, I’m not sure whether progress is still being made or a deadlock has occurred. The state of the processes suggests that some progress is being made: the Python processes on each of the 4 hosts still seem to be alive, and each holds 50 GB of memory, presumably from the partitioned graph. The last two lines of stdout are:
Client [5790] waits on 10.0.0.10:56973
Machine (1) group (0) client (0) connect to server successfuly!
I have 2 questions:
- Is it normal for a connection to be terminated 10 minutes after the initial socket creation?
- If not, is there something I can do to prevent the connection from terminating? Maybe it’s deadlocked because the connections are closed before the partitions have finished loading.
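To help tell a dead server apart from a slow one, a plain TCP reachability check against the receiver address from the log could be run from each of the other hosts. This is only a minimal sketch: `can_connect` is a hypothetical helper, not part of DGL, and a successful connect only shows that something is listening on the port, not that the DistDGL server has finished loading its partition.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for debugging; it does not use or imitate DGL's RPC layer.
    """
    try:
        # create_connection raises OSError (including timeouts and refusals) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, using the address from the "Trying to connect receiver" log lines:
#   can_connect("10.0.0.8", 30050)
```

If this returns False while the server process on that host is still alive, the server has probably not opened its port yet, which would be consistent with it still loading the partition.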
Thank you.