My understanding is that the DGL distributed module is currently designed so that the number of clients connecting to the graph server is defined up front, and it has to stay at this fixed number throughout training. Is there a plan to remove this restriction?
I'm asking because we are thinking of linking DGL to an AutoTune system, which may want to start an arbitrary number N of training processes, with up to M running concurrently at any given time. That would require the RPC server to let clients connect and disconnect throughout its life cycle.
If there is no active development on this, would you please share some thoughts on what would need to be done, along with any pointers, how much effort it would take, any blockers, etc.?
Much appreciated.