I am trying out the distributed GraphSAGE example under pytorch/graphsage/experimental, with a few tweaks of my own. For testing purposes, I start two graph servers (one partition each) and two trainer processes on a single machine, so four processes in total.
I am running into the following error:
    res = req.process_request(server_state)
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/dgl-0.7-py3.7-linux-x86_64.egg/dgl/distributed/kvstore.py", line 64, in process_request
    raise RuntimeError("KVServer cannot find partition policy with name: %s" % self.name)
RuntimeError: KVServer cannot find partition policy with name: node:_N:h
I believe the reason is that the inference() method hard-codes 'h' as the tensor name. See dgl/examples/pytorch/graphsage/experimental/train_dist.py on the master branch of dmlc/dgl on GitHub.
So I printed out all the partition policies. They look like this:
(pid=360651) {'node:_N:train_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:feat': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:test_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:val_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:features': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:year': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>,
'node:_N:labels': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>}
This does explain why 'node:_N:h' cannot be found: no partition policy in the graph partition is related to 'h'. But is this expected? Why is 'h' hard-coded here? Is this by design, or is it a bug? Also, when I tried with just one graph partition, the issue did not occur. My guess is that a single partition does not require a partition policy.
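To make the mismatch concrete, here is a small standalone sketch in plain Python (it does not call DGL). It takes the policy names from the printed dictionary and checks whether the name from the error message is among them. The helper `split_policy_name` is hypothetical, and the three-part `kind:type:tensor` layout is only my assumption based on how the keys look, not something I confirmed in the DGL source.

```python
# Policy names copied from the dictionary printed above.
policy_names = [
    "node:_N:train_mask",
    "node:_N:feat",
    "node:_N:test_mask",
    "node:_N:val_mask",
    "node:_N:features",
    "node:_N:year",
    "node:_N:labels",
]

# Name reported in the RuntimeError.
requested_name = "node:_N:h"


def split_policy_name(name):
    """Hypothetical helper: split 'node:_N:feat' into its three parts.

    Assumes a 'kind:type:tensor' layout, inferred from the printed keys.
    """
    kind, type_name, tensor_name = name.split(":", 2)
    return kind, type_name, tensor_name


# Tensor names that do have a partition policy on the server.
registered_tensors = {split_policy_name(n)[2] for n in policy_names}

# Tensor name the server was asked about.
requested_tensor = split_policy_name(requested_name)[2]

# True when the requested tensor has no matching policy.
is_missing = requested_tensor not in registered_tensors

print("registered tensors:", sorted(registered_tensors))
print("requested tensor:", requested_tensor)
print("missing:", is_missing)
```

This only confirms what the error says: every registered name corresponds to a tensor that came with the partitioned graph data, and 'h' is not one of them. It does not show how or when a policy for 'h' is supposed to be registered.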