How to setup sampler client role correctly?

HuangLED · August 16, 2021, 1:02am

I have DGL working perfectly fine in a distributed setting using default num_worker=0 (which does sampler without a pool my understanding). Now I am extending it to using multiple samplers for higher sampling throughput.

In the server process, I did this:

start_server():
os.environ[“DGL_DIST_MODE”] = “distributed”
os.environ[“DGL_ROLE”] = “server”
os.environ[“DGL_SERVER_ID”] = str(self._rank)
g = DistGraphServer( … , disable_shared_mem=False,)
g.start()

In the training/client process, I did this:

def sagemain(ip_config_file, local_partition_file, args, rank):
os.environ[“DGL_DIST_MODE”] = “distributed”
os.environ[“DGL_ROLE”] = “client”
os.environ[‘DGL_NUM_SAMPLER’] = “3”
dgl.distributed.initialize(ip_config_file, num_worker=3)
pb, _, _, _ = load_partition_book(local_partition_file, rank)
g = DistGraph(args.graph_name, gpb=pb)

model training starts from here

I pretty much followed what is done in launch.py. Then we run into error like this, complaining about key missing for ‘default’ role:

What needs to be done for this ‘default’ role? Searched doc and code but didn’t find anything helpful so far. In general, how to set up multiple-sampler properly for distributed environment? Any suggestions?

Thanks a lot!

VoVAllen · August 16, 2021, 5:37am

Hi,

What’s your dgl version? Ideally the sampler role will be initiated at dgl/dist_context.py at 8bc91414e94f3f6bc9ce3e191a519a3973649487 · dmlc/dgl · GitHub

HuangLED · August 16, 2021, 5:55am

I am using master head.

Although the location you pointed sets “sampler” role, there is this place in role.py (dgl/role.py at master · dmlc/dgl · GitHub) that uses a hard coded “default” then look for related information. I believe that is the reason why this error and crash.

If it is required, where should I setup such a “default” role properly?

Thanks!

VoVAllen · August 16, 2021, 6:48am

github.com

dmlc/dgl/blob/8bc91414e94f3f6bc9ce3e191a519a3973649487/python/dgl/distributed/dist_context.py#L262

    
      
                  if num_workers > 0 and not is_standalone:
                      SAMPLER_POOL = CustomPool(num_workers, (ip_config, num_servers, max_queue_size,
                                                              net_type, 'sampler', num_worker_threads))
                  else:
                      SAMPLER_POOL = None
                  NUM_SAMPLER_WORKERS = num_workers
                  if not is_standalone:
                      assert num_servers is not None and num_servers > 0, \
                          'The number of servers per machine must be specified with a positive number.'
                      connect_to_server(ip_config, num_servers, max_queue_size, net_type)
                  init_role('default')
                  init_kvstore(ip_config, num_servers, 'default')
          
          

          
def finalize_client():
              """Release resources of this client."""
              if os.environ.get('DGL_DIST_MODE', 'standalone') != 'standalone':
                  rpc.finalize_sender()
                  rpc.finalize_receiver()
              global INITIALIZED
              INITIALIZED = False

It should be set to default if DGL_ROLE is not server

HuangLED · August 16, 2021, 4:48pm

a bit confused here. Then What is the difference between ‘client’ and ‘default’ ?

I thought the role should be ‘client’ because the code in launch script say so: dgl/launch.py at master · dmlc/dgl · GitHub

VoVAllen · August 17, 2021, 2:13am

I agree this is confusing. May fix this in the future.

system · September 16, 2021, 2:13am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.