Restriction on fixed number of RPC clients

HuangLED · July 28, 2021, 5:13am

My understanding is that the DGL distributed module is currently designed so that we predefine the number of clients that connect to the graph server, and that number has to stay fixed throughout training. Is there a plan to remove this restriction?

I am asking because we are thinking of linking DGL to an AutoTune system, which may want to start an arbitrary number N of training processes, with up to M running concurrently at any given time. That would require the RPC server to let clients connect and disconnect throughout its life cycle.
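To make the requirement concrete, here is a minimal sketch of the workload pattern, not of DGL itself. `ToyGraphServer`, `run_trial`, and `run_trials` are hypothetical names invented for illustration; the "server" is just an in-process object that tracks which clients are attached, and the thread pool stands in for the AutoTune scheduler that caps concurrency at M.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyGraphServer:
    """Hypothetical stand-in for a graph server; NOT DGL's RPC server."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = set()      # clients currently attached
        self.peak_active = 0      # highest number attached at once
        self.total_served = 0     # clients that have attached and left

    def connect(self, client_id):
        with self._lock:
            self._active.add(client_id)
            self.peak_active = max(self.peak_active, len(self._active))

    def disconnect(self, client_id):
        with self._lock:
            self._active.discard(client_id)
            self.total_served += 1


def run_trial(server, client_id):
    """One tuning trial: attach, do placeholder work, always detach."""
    server.connect(client_id)
    try:
        return client_id * client_id  # placeholder for a training run
    finally:
        server.disconnect(client_id)


def run_trials(n_trials, max_concurrent):
    """Run N trials in total with at most M attached at the same time."""
    server = ToyGraphServer()
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(lambda i: run_trial(server, i), range(n_trials)))
    return server, results
```

Calling `run_trials(10, 3)` serves ten clients over the server's lifetime while never having more than three attached at once, which is the behavior a fixed, predefined client count cannot express.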

If there is no active development on this, would you please share some thoughts and pointers on what needs to be done, how much effort it would take, any blockers, etc.?

Much appreciated.

HuangLED · July 29, 2021, 6:32pm

@zhengda1936 Friendly ping.

zhengda1936 · August 2, 2021, 5:45pm

Sorry for the late reply. Currently, we don’t have plans to remove this restriction, even though we would also like to see it removed. It’ll be great if you can help with this.

HuangLED · August 3, 2021, 7:01pm

Absolutely happy to help. I am now investigating how much effort it requires.

A few things I’d like thoughts/suggestions on:
1) What was the main consideration when this restriction was introduced? I’d like to understand a bit of the history, in case removing it causes collateral damage to the overall system design.
2) The connection restriction currently implies that training starts only after ALL clients are connected (guarded by a global barrier). Does this barrier have any system-wide implications?
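For reference, the barrier behavior in question 2 can be illustrated with a small sketch built on Python's standard `threading.Barrier`. This is only an analogy under my own assumptions, not how DGL implements its barrier; `start_clients` and the event labels are hypothetical names.

```python
import threading


def start_clients(num_expected, num_started, timeout=0.5):
    """Start `num_started` clients against a barrier sized for `num_expected`.

    Returns the ordered list of (event, rank) tuples that were recorded.
    """
    barrier = threading.Barrier(num_expected)
    events = []
    lock = threading.Lock()

    def client(rank):
        with lock:
            events.append(("connected", rank))
        try:
            # Nobody proceeds until `num_expected` clients have arrived.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # Raised when the wait times out or the barrier is already broken.
            with lock:
                events.append(("stalled", rank))
            return
        with lock:
            events.append(("training", rank))

    threads = [threading.Thread(target=client, args=(r,)) for r in range(num_started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

With `start_clients(3, 3)` every "connected" event precedes every "training" event. With `start_clients(3, 2)` no client ever trains, which is the situation a late-joining or early-leaving AutoTune trial would run into if the client count stays fixed.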
