I am using DGL with an AutoML framework (specifically RayTune). I was able to integrate the two frameworks except for one problem: the distributed clients don’t release CPU resources after the tuning process is complete. After debugging, I found that the ClientBarrierRequest() never completes when the clients call exit_client() before exiting (most likely because of the blocking recv_response call made by the clients). Note that these clients are independent tuning trials and do not require synchronization. Furthermore, the clients run for different amounts of time, since each trial uses a separate set of hyperparameters (e.g. different batch sizes).
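To illustrate the behavior I suspect, here is a minimal sketch that uses Python's threading.Barrier as a stand-in. It is not DGL's RPC code: run_trial and simulate are hypothetical names, and the idea that the barrier expects more participants than actually arrive is only my assumption about what happens. A timeout is added so the example terminates; without it, the waiting threads would block indefinitely, which would look like the stale processes described above.

```python
import threading
import time


def run_trial(barrier, duration, timeout, results, name):
    """Hypothetical stand-in for one tuning trial (not DGL code)."""
    time.sleep(duration)  # trials finish at different times
    try:
        # Stand-in for the blocking barrier on exit; blocks until
        # every expected participant has arrived or the timeout hits.
        barrier.wait(timeout=timeout)
        results[name] = "released"
    except threading.BrokenBarrierError:
        results[name] = "timed out"


def simulate(num_expected, durations, timeout=0.5):
    """Run one thread per duration against a barrier of num_expected parties."""
    barrier = threading.Barrier(num_expected)
    results = {}
    threads = [
        threading.Thread(
            target=run_trial,
            args=(barrier, d, timeout, results, f"trial{i}"),
        )
        for i, d in enumerate(durations)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # All expected participants arrive: everyone is released.
    print(simulate(num_expected=2, durations=[0.0, 0.1]))
    # One expected participant never arrives: the others stay stuck
    # until the timeout breaks the barrier.
    print(simulate(num_expected=3, durations=[0.0, 0.1]))
```

In the second call the two trials only get unstuck because of the timeout, which is what I am missing on the DGL side.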
Any solution for cleaning up these stale processes would be very much appreciated. Thanks a lot.