Client barrier incomplete during exit_client

vipgupta · September 13, 2021, 9:54pm

Hi,

I am using DGL with an AutoML framework (specifically RayTune). I was able to integrate the two frameworks except for one problem: the distributed clients don’t release CPU resources after the tuning process is complete. After debugging, I found that the ClientBarrierRequest() never completes when the clients call exit_client() before exiting (most likely because the clients block in recv_response). Note that these clients are independent tuning trials and do not require synchronization. Further, the clients run for different amounts of time, since each trial uses a separate set of hyperparameters (e.g. different batch sizes).

Any suggestion for cleaning up these stale processes would be very much appreciated. Thanks a lot.
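
To illustrate the failure mode I suspect, here is a minimal sketch that uses only Python's standard library. It is not DGL's RPC code: `run_clients` and the thread-based "clients" are hypothetical stand-ins, and `threading.Barrier` only plays the role of the barrier that exit_client() waits on. A barrier sized for N participants releases nobody until all N arrive, so if one trial never reaches it, the others wait forever unless a timeout is set.

```python
import threading


def run_clients(expected, arriving, timeout=0.2):
    """Start `arriving` client threads against a barrier sized for `expected`.

    Returns one outcome string per client that was started.
    """
    barrier = threading.Barrier(expected)
    outcomes = []
    lock = threading.Lock()

    def client():
        try:
            # Blocks until `expected` threads have called wait().
            # The timeout is what keeps this demo from hanging.
            barrier.wait(timeout=timeout)
            result = "passed"
        except threading.BrokenBarrierError:
            # Raised in the thread whose wait timed out, and in every
            # other thread still waiting on the now-broken barrier.
            result = "gave up"
        with lock:
            outcomes.append(result)

    threads = [threading.Thread(target=client) for _ in range(arriving)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes


if __name__ == "__main__":
    print(run_clients(expected=3, arriving=3))  # every client gets through
    print(run_clients(expected=3, arriving=2))  # one client is missing
```

Without the timeout, the second call would block indefinitely, which matches what I see with the trial processes that stay alive after tuning finishes.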

BarclayII · September 14, 2021, 11:12am

Just to make sure: it seems that you are using RayTune together with distributed training?

vipgupta · September 14, 2021, 5:34pm

Hi, I am actually not using distributed training. Each client runs on one CPU core, independently of the other clients, and evaluates one set of hyperparameters. However, I am using a graph server that these clients connect to.

VoVAllen · September 15, 2021, 6:10am

Hi,

This is something we are planning for the next release: making the server independent of the clients. Currently servers and clients are bound together, which means you cannot use one server group for multiple groups of clients.

vipgupta · September 16, 2021, 9:57pm

Thanks for the quick response. We were wondering whether this feature is going to be available soon on DGL master, or whether we could contribute code to accelerate the process?

VoVAllen · September 22, 2021, 10:23am

Copying @zhengda1936 for visibility.

zhengda1936 · September 22, 2021, 3:13pm

@vipgupta I don’t think it’ll be available very soon. It’ll be great if you can contribute this to DGL.

system · October 22, 2021, 3:13pm

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.