A question w.r.t. loading partitioned graphs during distributed tasks

Hi, I have a huge graph, so I want to use distributed machine learning. Since I have a machine with two GPUs, I open two IPython windows and run the following code:
Rank 0:

import os
from dgl import distributed as dgldist

os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'server'
os.environ['DGL_SERVER_ID'] = "0"
os.environ['DGL_IP_CONFIG'] = 'ip_config.txt'
os.environ['DGL_NUM_SERVER'] = "1"
os.environ['DGL_NUM_CLIENT'] = "1"
os.environ['DGL_CONF_PATH'] = './data/graph.json'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)

Rank 1:

import os
from dgl import distributed as dgldist

os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'sampler'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)

which produce the following output:

load graph
start graph service on server 0 for part 0
Wait connections ...
1 clients connected!

and

Machine (0) client (0) connect to server successfuly!

where ip_config.txt contains:

127.0.0.1 4341

and I ran

import dgl
from dgl import distributed as dgldist

g = dgl.load_graphs('graphs.bin')[0][0]
dgldist.partition_graph(g, 'graph', num_parts=4, out_path='data')

to generate the graph partitions.
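(As a quick sanity check, shown here only as a sketch: partition_graph() writes its config file to data/graph.json, and each partition can be loaded back individually with load_partition(). The exact shape of the returned tuple varies across DGL versions.)

from dgl import distributed as dgldist

# Sketch: load partition 0 back from the config written by partition_graph().
# load_partition() returns a tuple whose exact length varies by DGL version;
# the local graph is the first element and the partition book the fourth.
parts = dgldist.load_partition('data/graph.json', part_id=0)
local_g = parts[0]  # DGLGraph holding partition 0's local structure
gpb = parts[3]      # GraphPartitionBook mapping local to global IDs
print(local_g)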
However, when I run

g = dgldist.DistGraph('data')

it raises:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-dd067b5fd252> in <module>
----> 1 g = dgldist.DistGraph('data')
~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in __init__(self, graph_name, gpb, part_config)
    475             rpc.set_num_client(1)
    476         else:
--> 477             self._init()
    478             # Tell the backup servers to load the graph structure from shared memory.
    479             for server_id in range(self._client.num_servers):

~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in _init(self)
    531         if self._gpb is None:
    532             self._gpb = self._gpb_input
--> 533         self._client.map_shared_data(self._gpb)
    534 
    535     def __getstate__(self):

~/anaconda3/lib/python3.8/site-packages/dgl/distributed/kvstore.py in map_shared_data(self, partition_book)
   1126         """
   1127         # Get all partition policies
-> 1128         for ntype in partition_book.ntypes:
   1129             policy = NodePartitionPolicy(partition_book, ntype)
   1130             self._all_possible_part_policy[policy.policy_str] = policy

AttributeError: 'NoneType' object has no attribute 'ntypes'

In the documentation,

part_config (str, optional) – The path of partition configuration file generated by dgl.distributed.partition.partition_graph(). It’s used in the standalone mode.

Since I have set up the distributed environment, this parameter should not be required. However, in the source code, self._g = _get_graph_from_shared_mem(self.graph_name) returns None. I tried setting the parameter anyway, and the same problem happened again. Could you give me an example of how to run distributed experiments?

  1. For now, the number of machines (specified in ip_config.txt) should be the same as the number of graph partitions. In your case, you partitioned the graph into 4 parts, so 4 lines should be specified in ip_config.txt.
  2. I prefer to launch distributed training, including servers and clients, via launch.py instead of launching them by hand. Please refer to dgl/examples/pytorch/graphsage/experimental at master · dmlc/dgl · GitHub as an example; a minimal sketch of such a worker script follows below.
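
For reference, here is a minimal sketch of the kind of worker script launch.py would start on each machine, assuming ip_config.txt has been extended to four ip port lines as in point 1. The script name train_dist.py is illustrative; the graph name 'graph' and config path 'data/graph.json' come from the partitioning step above, and the overall structure follows the graphsage example linked in point 2.

# train_dist.py -- a minimal sketch of a script started by launch.py.
# launch.py exports DGL_DIST_MODE, DGL_ROLE, DGL_SERVER_ID, etc. for each
# process, so none of the manual os.environ setup from above is needed.
from dgl import distributed as dgldist

dgldist.initialize('ip_config.txt')

# Note: DistGraph takes the graph *name* passed to partition_graph()
# ('graph' here), not the output directory 'data'.
g = dgldist.DistGraph('graph', part_config='data/graph.json')
print('loaded DistGraph with', g.number_of_nodes(), 'nodes')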
