A question w.r.t. loading partitioned graphs during distributed tasks

Hi, I have a huge graph, so I want to use distributed machine learning. Since I have a machine with two GPUs, I open two IPython windows and run the following code:
Rank 0:

import os
from dgl import distributed as dgldist

os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'server'
os.environ['DGL_SERVER_ID'] = "0"
os.environ['DGL_IP_CONFIG'] = 'ip_config.txt'
os.environ['DGL_NUM_SERVER'] = "1"
os.environ['DGL_NUM_CLIENT'] = "1"
os.environ['DGL_CONF_PATH'] = './data/graph.json'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)

Rank 1:

import os
from dgl import distributed as dgldist

os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'sampler'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)

which produce the following output:

load graph
start graph service on server 0 for part 0
Wait connections ...
1 clients connected!

and

Machine (0) client (0) connect to server successfuly!

where ip_config.txt contains:

127.0.0.1 4341

and I ran

import dgl
from dgl import distributed as dgldist

g = dgl.load_graphs('graphs.bin')[0][0]
dgldist.partition_graph(g, 'graph', num_parts=4, out_path='data')

to generate the graph partitions.
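(As a quick sanity check, shown here only as a sketch: partition_graph() writes its config file to data/graph.json, and each partition can be loaded back individually with load_partition(). The exact shape of the returned tuple varies across DGL versions.)

from dgl import distributed as dgldist

# Sketch: load partition 0 back from the config written by partition_graph().
# load_partition() returns a tuple whose exact length varies by DGL version;
# the local graph is the first element and the partition book the fourth.
parts = dgldist.load_partition('data/graph.json', part_id=0)
local_g = parts[0]  # DGLGraph holding partition 0's local structure
gpb = parts[3]      # GraphPartitionBook mapping local to global IDs
print(local_g)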
However, when I run

g = dgldist.DistGraph('data')

it raises:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-35-dd067b5fd252> in <module>
----> 1 g = dgldist.DistGraph('data')
~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in __init__(self, graph_name, gpb, part_config)
    475             rpc.set_num_client(1)
    476         else:
--> 477             self._init()
    478             # Tell the backup servers to load the graph structure from shared memory.
    479             for server_id in range(self._client.num_servers):

~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in _init(self)
    531         if self._gpb is None:
    532             self._gpb = self._gpb_input
--> 533         self._client.map_shared_data(self._gpb)
    534 
    535     def __getstate__(self):

~/anaconda3/lib/python3.8/site-packages/dgl/distributed/kvstore.py in map_shared_data(self, partition_book)
   1126         """
   1127         # Get all partition policies
-> 1128         for ntype in partition_book.ntypes:
   1129             policy = NodePartitionPolicy(partition_book, ntype)
   1130             self._all_possible_part_policy[policy.policy_str] = policy

AttributeError: 'NoneType' object has no attribute 'ntypes'

In the documentation,

part_config (str, optional) – The path of partition configuration file generated by dgl.distributed.partition.partition_graph(). It’s used in the standalone mode.

Since I have set up the distributed environment, this parameter should not be required. However, in the source code, self._g = _get_graph_from_shared_mem(self.graph_name) returns None. I tried setting the parameter anyway, and the same problem happened again. Could you give me an example of how to run distributed experiments?

  1. For now, the number of machines (specified in ip_config.txt) should be the same as the number of graph partitions. In your case, you partitioned the graph into 4 parts, so 4 lines should be specified in ip_config.txt.
  2. I prefer to launch distributed training, including servers and clients, via launch.py instead of launching them by hand. Please refer to dgl/examples/pytorch/graphsage/experimental at master · dmlc/dgl · GitHub as an example; a minimal sketch of such a worker script follows below.
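
For reference, here is a minimal sketch of the kind of worker script launch.py would start on each machine, assuming ip_config.txt has been extended to four ip port lines as in point 1. The script name train_dist.py is illustrative; the graph name 'graph' and config path 'data/graph.json' come from the partitioning step above, and the overall structure follows the graphsage example linked in point 2.

# train_dist.py -- a minimal sketch of a script started by launch.py.
# launch.py exports DGL_DIST_MODE, DGL_ROLE, DGL_SERVER_ID, etc. for each
# process, so none of the manual os.environ setup from above is needed.
from dgl import distributed as dgldist

dgldist.initialize('ip_config.txt')

# Note: DistGraph takes the graph *name* passed to partition_graph()
# ('graph' here), not the output directory 'data'.
g = dgldist.DistGraph('graph', part_config='data/graph.json')
print('loaded DistGraph with', g.number_of_nodes(), 'nodes')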
