Hi, I have a huge graph, so I want to use distributed machine learning. Since my machine has two GPUs, I open two IPython
windows and run the following code:
Rank0:
import os
from dgl import distributed as dgldist
os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'server'
os.environ['DGL_SERVER_ID'] = "0"
os.environ['DGL_IP_CONFIG'] = 'ip_config.txt'
os.environ['DGL_NUM_SERVER'] = "1"
os.environ['DGL_NUM_CLIENT'] = "1"
os.environ['DGL_CONF_PATH'] = './data/graph.json'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)
Rank1:
import os
from dgl import distributed as dgldist
os.environ['DGL_DIST_MODE'] = 'distributed'
os.environ['DGL_ROLE'] = 'sampler'
dgldist.initialize(ip_config='ip_config.txt', num_servers=1, num_workers=0)
with the following returns:
load graph
start graph service on server 0 for part 0
Wait connections ...
1 clients connected!
and
Machine (0) client (0) connect to server successfuly!
where ip_config.txt contains:
127.0.0.1 4341
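(For reference, each line of ip_config.txt is an "IP port" pair, one per machine, as above. A tiny pure-Python sketch of the parsing I assume happens — parse_ip_config here is my own illustrative helper, not a DGL API:)

```python
def parse_ip_config(text):
    """Parse an ip_config.txt body: one 'IP port' pair per machine.

    My own illustrative helper, not DGL's parser.
    """
    machines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        ip, port = line.split()
        machines.append((ip, int(port)))
    return machines

# My single-machine config from above:
print(parse_ip_config("127.0.0.1 4341"))  # [('127.0.0.1', 4341)]
```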
and I ran
import dgl
from dgl import distributed as dgldist
g = dgl.load_graphs('graphs.bin')[0][0]
dgldist.partition_graph(g, 'graph', num_parts=4, out_path='data')
to generate the graph partitions.
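For what it's worth, the 'data/graph.json' file that partition_graph writes is plain JSON, so it can be sanity-checked against the number of servers I plan to launch. A small sketch (the field names 'graph_name' and 'num_parts' are what I see in my generated file and may differ across DGL versions):

```python
import json

def check_part_config(path, expected_parts):
    """Load a partition config and compare its part count with the
    number of servers. Field names ('graph_name', 'num_parts') are
    assumptions from my generated graph.json."""
    with open(path) as f:
        conf = json.load(f)
    return conf['graph_name'], conf['num_parts'], conf['num_parts'] == expected_parts

# e.g. check_part_config('data/graph.json', 1) would compare my
# num_parts=4 partitioning against the single server I start.
```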
However, when I run
g = dgldist.DistGraph('data')
it returns:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-35-dd067b5fd252> in <module>
----> 1 g = dgldist.DistGraph('data')
~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in __init__(self, graph_name, gpb, part_config)
475 rpc.set_num_client(1)
476 else:
--> 477 self._init()
478 # Tell the backup servers to load the graph structure from shared memory.
479 for server_id in range(self._client.num_servers):
~/anaconda3/lib/python3.8/site-packages/dgl/distributed/dist_graph.py in _init(self)
531 if self._gpb is None:
532 self._gpb = self._gpb_input
--> 533 self._client.map_shared_data(self._gpb)
534
535 def __getstate__(self):
~/anaconda3/lib/python3.8/site-packages/dgl/distributed/kvstore.py in map_shared_data(self, partition_book)
1126 """
1127 # Get all partition policies
-> 1128 for ntype in partition_book.ntypes:
1129 policy = NodePartitionPolicy(partition_book, ntype)
1130 self._all_possible_part_policy[policy.policy_str] = policy
AttributeError: 'NoneType' object has no attribute 'ntypes'
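For clarity, the traceback reduces to the partition book being None by the time map_shared_data iterates it (kvstore.py line 1128 in the trace above). A minimal pure-Python analogue of the failure — FakeClient is my own stand-in, not DGL code:

```python
class FakeClient:
    """Stand-in for the kvstore client; only mimics the failing call."""
    def map_shared_data(self, partition_book):
        # Mirrors the loop in the traceback: iterating
        # partition_book.ntypes with partition_book=None raises.
        for ntype in partition_book.ntypes:
            pass

def reproduce(partition_book):
    try:
        FakeClient().map_shared_data(partition_book)
        return None
    except AttributeError as exc:
        return str(exc)

print(reproduce(None))  # "'NoneType' object has no attribute 'ntypes'"
```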
In the documentation:
part_config (str, optional) – The path of partition configuration file generated by
dgl.distributed.partition.partition_graph(). It's used in the standalone mode.
Since I have set up the distributed environment, this parameter should not be required. However, in the source code, self._g = _get_graph_from_shared_mem(self.graph_name)
returns None. I also tried setting part_config explicitly, but the same error occurred. Could you give me an example of how to run distributed experiments?