Client_loop: send disconnect: broken pipe after 10 minutes in DistDGL [Resolved?]

Hello.

I’ve been trying to run DistDGL with the symmetric ogbn-papers100M graph on a 4-host GCP CPU cluster, and it gets stuck at the loading step. The error trace is below:

[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.9 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.8 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
client_loop: send disconnect: Broken pipe
Connection to 10.0.0.8 closed by remote host.
Connection to 10.0.0.9 closed by remote host.
client_loop: send disconnect: Broken pipe
[17:07:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.10 closed by remote host.

After exactly 10 minutes, it seems like a timeout kicks in and it tries to connect to another machine, and the connection ultimately gets closed off as seen in the log.
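
The 10-minute spacing can be checked directly from the timestamps in the trace. Here is a minimal Python sketch that pulls the leading [HH:MM:SS] stamp from each line and prints the gaps between consecutive distinct stamps. The helper name timestamp_gaps is made up for illustration, and it assumes all lines fall on the same day (no midnight wrap).

```python
import re
from datetime import datetime, timedelta

# Matches a leading "[HH:MM:SS]" stamp like the ones in the DGL log above.
LOG_TIME = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]")


def timestamp_gaps(lines):
    """Return the gaps between consecutive distinct timestamps.

    Lines without a leading timestamp (e.g. the ssh messages) are skipped.
    Assumes all timestamps fall on the same day.
    """
    stamps = []
    for line in lines:
        match = LOG_TIME.match(line)
        if match is None:
            continue
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        if not stamps or stamp != stamps[-1]:
            stamps.append(stamp)
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# A few lines copied from the trace above.
log = [
    "[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.",
    "[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050",
    "Connection to 10.0.0.9 closed by remote host.",
    "[17:07:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050",
]

gaps = timestamp_gaps(log)
print([gap.total_seconds() for gap in gaps])  # [600.0, 600.0]
```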

At this point, I’m not sure if progress is still being made or if a deadlock has occurred. Checking the state of the processes suggests that some progress is being made: the python processes on each of the 4 hosts still seem to be alive, and they each hold 50 GB of memory, presumably from the partitioned graph. The last two lines in stdout are the following.

Client [5790] waits on 10.0.0.10:56973
Machine (1) group (0) client (0) connect to server successfuly!

I have 2 questions:

  1. Is it normal for a connection to be terminated 10 minutes after the initial socket creation?
  2. If not, is there something I can do to prevent the connection from terminating? Maybe it’s in a deadlocked state because the connections are deleted before the read of the partitions is done.

Thank you.

The solution (from what I can tell so far): the cluster was timing out the SSH connection during the read phase, so I set the SSH keepalive option ServerAliveInterval to less than 10 minutes.
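
For anyone who wants to try the same thing, here is a rough Python sketch of an ssh command line with keepalives turned on. ServerAliveInterval and ServerAliveCountMax are standard OpenSSH client options. Everything else is an assumption for illustration: the helper name ssh_command, the 60-second interval, the script name train_dist.py, and passing the options on the command line at all (the post does not say where the option was set).

```python
import shlex


def ssh_command(host, remote_cmd, alive_interval=60, alive_count_max=3, port=22):
    """Build an ssh argument list that sends keepalive probes.

    alive_interval is in seconds and should be well below the idle timeout
    (10 minutes in this thread). The defaults here are illustrative only.
    """
    if alive_interval <= 0:
        raise ValueError("alive_interval must be a positive number of seconds")
    return [
        "ssh",
        "-o", f"ServerAliveInterval={alive_interval}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
        "-p", str(port),
        host,
        remote_cmd,
    ]


# Hypothetical remote command; replace with the real launch command.
cmd = ssh_command("10.0.0.8", "python3 train_dist.py")
print(shlex.join(cmd))
```

The list can be passed as-is to subprocess.run. The same options can also go in ~/.ssh/config under the relevant Host entry, so that every ssh session to those machines picks them up.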

The "Trying to connect receiver" message is printed every 600 s (10 min); see details here.

This issue is probably caused by the long time the servers take to load the partitioned graphs. Please try increasing DGL_DIST_MAX_TRY_TIMES. See details here:
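
As a rough sketch of what raising DGL_DIST_MAX_TRY_TIMES could look like from Python: the snippet below builds an environment with the variable set, to hand to whatever process starts the DGL client. The helper name env_with_max_tries and the value 4096 are made up for illustration, and a suitable value depends on how long the servers take to load their partitions.

```python
import os


def env_with_max_tries(max_try_times, base_env=None):
    """Return a copy of the environment with DGL_DIST_MAX_TRY_TIMES set.

    base_env defaults to the current process environment. The input mapping
    is not modified.
    """
    if max_try_times <= 0:
        raise ValueError("max_try_times must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["DGL_DIST_MAX_TRY_TIMES"] = str(max_try_times)
    return env


# 4096 is an arbitrary illustrative value, not a recommendation.
env = env_with_max_tries(4096, base_env={"PATH": "/usr/bin"})
print(env["DGL_DIST_MAX_TRY_TIMES"])  # 4096
```

The resulting dict can be passed as env= to subprocess.run. On a multi-host run the variable has to reach the remote processes too, and how to do that depends on the launcher being used.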
