How do I set up the sampler client role correctly?

I have DGL working fine in a distributed setting with the default num_worker=0 (which, as I understand it, samples without a worker pool). Now I am extending it to use multiple samplers for higher sampling throughput.

In the server process, I did this:

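import os
from dgl.distributed import DistGraphServer

def start_server(self):
    # Mark this process as a distributed DGL server before starting it.
    os.environ["DGL_DIST_MODE"] = "distributed"
    os.environ["DGL_ROLE"] = "server"
    os.environ["DGL_SERVER_ID"] = str(self._rank)
    g = DistGraphServer(..., disable_shared_mem=False)
    g.start()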

In the training/client process, I did this:

import os
import dgl
from dgl.distributed import DistGraph, load_partition_book

def sagemain(ip_config_file, local_partition_file, args, rank):
    # Mark this process as a client and request three sampler processes.
    os.environ["DGL_DIST_MODE"] = "distributed"
    os.environ["DGL_ROLE"] = "client"
    os.environ["DGL_NUM_SAMPLER"] = "3"
    dgl.distributed.initialize(ip_config_file, num_worker=3)

    pb, _, _, _ = load_partition_book(local_partition_file, rank)
    g = DistGraph(args.graph_name, gpb=pb)

    # model training starts from here

I pretty much followed what is done in launch.py. We then ran into an error complaining about a missing key for the 'default' role.

What needs to be done for this 'default' role? I searched the docs and the code but haven't found anything helpful so far. More generally, how do I set up multiple samplers properly in a distributed environment? Any suggestions?

Thanks a lot!

Hi,

Which DGL version are you using? Ideally, the sampler role is initialized in dgl/dist_context.py (at commit 8bc91414e94f3f6bc9ce3e191a519a3973649487 of dmlc/dgl on GitHub).

I am using the head of master.

Although the location you pointed to sets the "sampler" role, there is a place in role.py (dgl/role.py at master in dmlc/dgl on GitHub) that uses a hard-coded "default" and then looks up the related information. I believe that is the cause of this error and the crash.
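
The suspected failure mode can be illustrated with a toy sketch. This is not DGL's actual role.py; `register_roles` and `trainer_ranks` are made-up names, and the registry is modeled as a plain dict keyed by role name. It only shows how a hard-coded "default" lookup would fail with a missing-key error if the trainers had registered under a different name such as "client".

```python
# Toy illustration only: this is NOT DGL's role.py, just a dict-based
# model of a role registry that is queried with a hard-coded key.

def register_roles(process_roles):
    """Group process ranks by the role name each process reported."""
    registry = {}
    for rank, role in enumerate(process_roles):
        registry.setdefault(role, []).append(rank)
    return registry

def trainer_ranks(registry):
    """Look up trainers under the hard-coded name 'default'."""
    # Raises KeyError if no process registered as 'default'.
    return registry["default"]

# Trainers registered as 'client': the hard-coded lookup fails.
client_registry = register_roles(["client", "client", "sampler"])
try:
    trainer_ranks(client_registry)
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Trainers registered as 'default': the lookup succeeds.
default_registry = register_roles(["default", "default", "sampler"])
found = trainer_ranks(default_registry)
```

Whether DGL's registry is actually populated this way is an assumption; the sketch only mirrors the hypothesis stated in this post.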

If it is required, where should I set up such a "default" role properly?

Thanks!

The role should be set to "default" whenever DGL_ROLE is not "server".
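
Read literally, that rule can be sketched as follows. This is only a model of the sentence above, not code taken from DGL; `resolve_role` is a hypothetical helper, and the assumption is that any DGL_ROLE value other than "server" (including "client" or an unset variable) ends up as "default".

```python
def resolve_role(env):
    """Return the role name implied by the rule stated in this reply.

    Hypothetical helper: 'server' stays 'server'; anything else,
    including 'client' or a missing DGL_ROLE, maps to 'default'.
    """
    if env.get("DGL_ROLE") == "server":
        return "server"
    return "default"

server_role = resolve_role({"DGL_ROLE": "server"})
client_role = resolve_role({"DGL_ROLE": "client"})
unset_role = resolve_role({})
```

Under this reading, a process launched with DGL_ROLE set to "client" would be registered as "default", not under a separate "client" role.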

I'm a bit confused here. What is the difference between 'client' and 'default', then?

I thought the role should be 'client' because the code in the launch script says so: dgl/launch.py at master in dmlc/dgl on GitHub.

I agree this is confusing. We may fix this in the future.
