Client barrier incomplete during exit_client

Hi,

I am using DGL with an AutoML framework (specifically RayTune). I was able to integrate the two frameworks except for one problem: the distributed clients don’t release CPU resources after the tuning process is complete. After debugging, I found that the ClientBarrierRequest() never completes when the clients call exit_client() before exiting (most likely because the clients block in the recv_response call). Note that these clients are independent tuning trials and do not require synchronization. Furthermore, the clients run for different amounts of time, since each trial uses a separate set of hyperparameters (e.g. different batch sizes).
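To illustrate the kind of hang I suspect, here is a minimal, generic sketch using Python's `threading.Barrier`. It is only an analogy and does not use DGL's actual barrier code; the helper name `exit_with_barrier` and the timeout value are made up for the example. Two parties are registered, but only one ever reaches the barrier, so the wait can only end through the timeout.

```python
import threading

def exit_with_barrier(barrier, timeout_s):
    """Wait at the barrier; return True if all parties arrived, False on timeout."""
    try:
        barrier.wait(timeout=timeout_s)
        return True
    except threading.BrokenBarrierError:
        # Raised when the wait times out (or the barrier is otherwise broken).
        return False

# Two trials are registered, but only one of them reaches the exit barrier.
barrier = threading.Barrier(parties=2)
results = []

worker = threading.Thread(
    target=lambda: results.append(exit_with_barrier(barrier, timeout_s=0.5))
)
worker.start()
worker.join()
```

Without the timeout, the single waiting thread would block indefinitely, which is similar to what I see with the clients stuck in recv_response.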

Any solution for cleaning up these stale processes would be very much appreciated. Thanks a lot.

Just to make sure: it seems that you are using RayTune together with distributed training?

Hi, I am actually not using distributed training. Each client runs on one CPU core, independently of the other clients, and evaluates one set of hyperparameters. However, I am using a graphserver to connect to these clients.

Hi,

This is something we are planning for the next release: making a server independent of its clients. Currently servers and clients are bound together, which means you cannot use one server group for multiple groups of clients.
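In the meantime, one generic stopgap is to put a deadline on each trial from outside, so that a process stuck in its shutdown path is killed rather than left holding a CPU core. The sketch below is not a DGL or RayTune feature. It assumes each trial can be launched as its own OS process, and `run_with_deadline` and the stand-in command are hypothetical names used only for illustration.

```python
import subprocess
import sys

def run_with_deadline(cmd, timeout_s):
    """Run cmd; kill it if it is still alive after timeout_s seconds.

    Returns (returncode, was_killed).
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the child so it does not linger as a zombie
        return proc.returncode, True

# Hypothetical stand-in for a trial whose shutdown never returns.
hung_trial = [sys.executable, "-c", "import time; time.sleep(60)"]
returncode, was_killed = run_with_deadline(hung_trial, timeout_s=1.0)
```

Killing a client this way skips any clean shutdown, so it only makes sense when the trial's results have already been reported.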

Thanks for the quick response. We were wondering whether this feature is going to be available soon on DGL master, or whether we can contribute code to accelerate the process.

cc @zhengda1936 for visibility

@vipgupta I don’t think it’ll be available very soon. It would be great if you could contribute this to DGL.