Client_loop: send disconnect: broken pipe after 10 minutes in DistDGL [Resolved?]

l-hoang · January 25, 2023, 5:47pm

Hello.

I’ve been trying to run DistDGL with symmetric ogbn-papers100M on a 4 host GCP CPU cluster, and things get stuck at the loading step. The error trace is below:

[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:140: Sender with NetType~socket is created.
[16:47:50] /opt/dgl/src/rpc/rpc.cc:159: Receiver with NetType~socket is created.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.9 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.8 closed by remote host.
[16:57:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
client_loop: send disconnect: Broken pipe
Connection to 10.0.0.8 closed by remote host.
Connection to 10.0.0.9 closed by remote host.
client_loop: send disconnect: Broken pipe
[17:07:50] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.0.0.8:30050
Connection to 10.0.0.10 closed by remote host.

After exactly 10 minutes, it seems like a timeout kicks in and it tries to connect to another machine, and the connection ultimately gets closed off as seen in the log.

At this point, I’m not sure if progress is still being made or if a deadlock has occurred. Checking the state of the processes suggests that some progress is being made (python processes on each of the 4 hosts still seem to be alive, and they all hold 50GB of memory presumably from the partitioned graph). The last two lines in stdout are the following.

Client [5790] waits on 10.0.0.10:56973
Machine (1) group (0) client (0) connect to server successfuly!

I have 2 questions:

1. Is it normal for a connection to be terminated 10 minutes after the initial socket creation?
2. If not, is there something I can do to prevent the connection from terminating? Maybe it’s in a deadlocked state because the connections are closed before the partitions have finished loading.

Thank you.

l-hoang · January 25, 2023, 6:36pm

The solution (from what I can tell so far): the cluster was timing out the connection during the read phase, so I set the SSH keepalive interval, ServerAliveInterval, to a value below 10 minutes.
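
For reference, a keepalive like the one described above can be passed to the OpenSSH client on the command line. The sketch below only builds such a command in Python. The helper name `build_ssh_command`, the 60-second interval, and the remote command are illustrative assumptions; the thread does not show how the SSH sessions were actually launched. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options.

```python
import shlex


def build_ssh_command(host, remote_cmd, alive_interval=60, alive_count_max=3):
    """Return an ssh argv list that sends a keepalive every `alive_interval` seconds.

    Hypothetical helper for illustration; not part of DGL.
    """
    return [
        "ssh",
        "-o", f"ServerAliveInterval={alive_interval}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
        host,
        remote_cmd,
    ]


if __name__ == "__main__":
    # Host taken from the log above; the remote command is a placeholder.
    cmd = build_ssh_command("10.0.0.8", "echo connected")
    print(shlex.join(cmd))
```

The same options can also be set per host in the client's SSH configuration file instead of on the command line.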

Rhett-Ying · January 28, 2023, 11:20am

This log message is printed every 600 s (10 min); see the details here:

github.com/dmlc/dgl/blob/4b5fa83bcd36645dbe652e33902b0969650c10fc/src/rpc/network/socket_communicator.cc#L74-L85

      
        
            while (bo == false && try_count < max_try_times) {
              if (client_socket->Connect(ip, port)) {
                bo = true;
              } else {
                if (try_count % 200 == 0 && try_count != 0) {
                  // every 600 seconds show this message
                  LOG(INFO) << "Trying to connect receiver: " << ip << ":" << port;
                }
                try_count++;
                std::this_thread::sleep_for(std::chrono::seconds(3));
              }
            }
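
The 10-minute spacing between the "Trying to connect receiver" lines follows from this loop: a message is logged once per 200 failed attempts, and each failed attempt is followed by a 3-second sleep. A minimal Python sketch of that arithmetic is shown below. The function name is made up for illustration, the constants come from the C++ excerpt above, and the time spent inside `Connect` itself is ignored.

```python
def seconds_between_log_messages(attempts_per_message=200, sleep_seconds=3):
    """Approximate time between two log messages in the retry loop."""
    return attempts_per_message * sleep_seconds


interval = seconds_between_log_messages()
print(interval, "seconds =", interval // 60, "minutes")  # 600 seconds = 10 minutes
```

This matches the timestamps in the original log, which jump from 16:47:50 to 16:57:50 to 17:07:50.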

This issue is probably caused by the long time the servers take to load the partitioned graphs. Please try increasing DGL_DIST_MAX_TRY_TIMES; see the details here:

github.com/dmlc/dgl/blob/4b5fa83bcd36645dbe652e33902b0969650c10fc/python/dgl/distributed/rpc_client.py#L187

      
        
            max_machine_id = server_info[0]
            rpc.set_num_server_per_machine(group_count[0])
            num_machines = max_machine_id + 1
            rpc.set_num_machines(num_machines)
            machine_id = get_local_machine_id(server_namebook)
            rpc.set_machine_id(machine_id)
            rpc.set_group_id(group_id)
            rpc.create_sender(max_queue_size, net_type)
            rpc.create_receiver(max_queue_size, net_type)
            # Get connected with all server nodes
            max_try_times = int(os.environ.get("DGL_DIST_MAX_TRY_TIMES", 1024))
            for server_id, addr in server_namebook.items():
                server_ip = addr[1]
                server_port = addr[2]
                try_times = 0
                while not rpc.connect_receiver(server_ip, server_port, server_id):
                    try_times += 1
                    if try_times % 200 == 0:
                        print(
                            "Client is trying to connect server receiver: {}:{}".format(
                                server_ip, server_port
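
A minimal sketch of raising the retry limit, assuming the variable is set in the environment of each client process before it tries to connect to the servers. The value 4096 is a hypothetical choice, and the 3-second retry interval is an assumption borrowed from the C++ loop earlier in the thread, since the Python excerpt is cut off before any sleep call is visible.

```python
import os

# Hypothetical value; pick one that covers how long your partitions take to load.
os.environ["DGL_DIST_MAX_TRY_TIMES"] = "4096"

# Same lookup as in the rpc_client.py excerpt (the default there is 1024).
max_try_times = int(os.environ.get("DGL_DIST_MAX_TRY_TIMES", 1024))

# Assumption: one retry roughly every 3 seconds, as in the C++ loop.
ASSUMED_RETRY_INTERVAL_S = 3
budget_s = max_try_times * ASSUMED_RETRY_INTERVAL_S
print(f"{max_try_times} attempts -> about {budget_s / 60:.0f} minutes of retries")
```

Under the same assumption, the default of 1024 attempts corresponds to roughly 51 minutes of retries, so the limit only needs raising if loading the partitions takes longer than that.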
