Questions on parameters num_servers and num_clients

Hi, Folks,

I am trying to understand these two concepts by reading the code, but I have a few questions. Could someone give me a few hints?

  1. The docstring here (dgl/dist_context.py at master · dmlc/dgl · GitHub) says that both num_servers and num_client are deprecated (already?). If so, which alternative configuration should one use to get the behaviors described?

  2. Question on num_servers [assuming it is not de facto deprecated]. If there are multiple (say 10) server instances on one machine, and they all refer to the same graph partition, is each of them supposed to have its own server_id, as well as its own ip:port in the config file? (I am confused partly because there are also 'backup' servers, and I couldn't tell whether they are created transparently or need to be specified explicitly with an ip and port.)

  3. I am confused by the difference between 'DGL_NUM_CLIENTS' and 'DGL_NUM_SAMPLER'. Can you explain a bit? (Another example is inside the connect_to_server() method, where the client also registers RPC services, which blurs the roles between them: dgl/rpc_client.py at master · dmlc/dgl · GitHub. Does that mean the 'client' in fact acts as both client and server at the same time?)

Thanks a lot.

dgl.distributed.initialize sets up the number of servers and the number of workers with num_servers and num_workers on the client side. However, this is actually redundant with the parameters in the launch script. Therefore, we decided to deprecate the parameters in the initialize function.

If there are multiple servers running at the same time, each of them will have its own server ID and port, but you don't need to specify the port for each server in the config file. If I remember correctly, the system will pick the right port for the backup servers.
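To make the idea of "one config entry per machine, one ID and port per server" concrete, here is a minimal sketch. It is not DGL code: assign_server_endpoints is a hypothetical helper, and the rule that extra servers take consecutive ports after the machine's base port is an assumption for illustration only. The reply above only says that the system picks the ports.

```python
from typing import List, Tuple


def assign_server_endpoints(
    machines: List[Tuple[str, int]], num_servers: int
) -> List[Tuple[int, str, int]]:
    """Return (server_id, ip, port) for every server process.

    `machines` holds one (ip, base_port) pair per machine, i.e. what the
    user would write in the config file. ASSUMPTION: additional servers on
    a machine use base_port + 1, base_port + 2, ... This is illustrative
    and may not match the rule DGL actually applies.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")

    endpoints = []
    server_id = 0
    for ip, base_port in machines:
        for local_index in range(num_servers):
            # Every server gets a globally unique ID and its own port.
            endpoints.append((server_id, ip, base_port + local_index))
            server_id += 1
    return endpoints


if __name__ == "__main__":
    for endpoint in assign_server_endpoints(
        [("10.0.0.1", 30050), ("10.0.0.2", 30050)], num_servers=2
    ):
        print(endpoint)
```

The point of the sketch is that the user lists each machine only once, while the per-server IDs and ports are derived from that single entry.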

DGL_NUM_CLIENTS is the number of clients on a machine, which includes both the sampler processes and the trainers, and DGL_NUM_SAMPLER is the number of sampler processes on a machine.
Each process has only a single role. The client needs to register RPC services so that the callback functions can be invoked. It doesn't mean that the client acts as a server as well.


One follow-up question. I am a bit confused by the relationship between num_client, num_trainer and num_sampler.

Several places in the code suggest num_client and num_trainer should be the same, is this true? For example code here: dgl/dist_graph.py at master · dmlc/dgl · GitHub

Actually, the number of clients and the number of trainers are different: num_client = num_trainer + num_sampler.
That's why we decided to remove this argument from dgl.distributed.initialize, so that users don't need to specify it.
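As a small worked example of the relation stated in this reply, the sketch below computes the client count for one machine and for a cluster. The function names are hypothetical, and it follows the formula exactly as given here (clients = trainers + samplers, with both counted per machine); how a particular DGL version counts sampler processes is not verified by this snippet.

```python
def clients_per_machine(num_trainers: int, num_samplers: int) -> int:
    """Clients on one machine, per the relation in the reply:
    num_client = num_trainer + num_sampler (both counted per machine)."""
    if num_trainers < 0 or num_samplers < 0:
        raise ValueError("process counts must be non-negative")
    return num_trainers + num_samplers


def total_clients(num_machines: int, num_trainers: int, num_samplers: int) -> int:
    """Clients across the cluster, assuming every machine runs the same
    number of trainers and samplers."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    return num_machines * clients_per_machine(num_trainers, num_samplers)


if __name__ == "__main__":
    # 4 trainers and 2 sampler processes on each of 3 machines.
    print(clients_per_machine(4, 2))
    print(total_clients(3, 4, 2))
```

With no sampler processes the client count equals the trainer count, which may be why some code paths appear to treat the two as the same.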
