Partition Policy "node:_N:h" not found in dist graphsage model

I am trying out the distributed GraphSAGE example under pytorch/graphsage/experimental (with a few tweaks of my own). For testing purposes, I am starting two graph servers, one partition each, plus two trainer processes, all from a single machine: four processes in total.

One issue I am running into is this:

    res = req.process_request(server_state)
  File "/home/centos/anaconda3/envs/dev/lib/python3.7/site-packages/dgl-0.7-py3.7-linux-x86_64.egg/dgl/distributed/kvstore.py", line 64, in process_request
    raise RuntimeError("KVServer cannot find partition policy with name: %s" % self.name)
RuntimeError: KVServer cannot find partition policy with name: node:_N:h

I believe the reason is that the inference() method hard-codes 'h' as the tensor name. See the code pointer here: dgl/examples/pytorch/graphsage/experimental/train_dist.py at master · dmlc/dgl · GitHub

So I printed out all the partition policies, they look like this

(pid=360651) {'node:_N:train_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:feat': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:test_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:val_mask': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:features': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:year': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>, 
'node:_N:labels': <dgl.distributed.graph_partition_book.PartitionPolicy object at 0x7f483f4abed0>}

This does explain why "node:_N:h" cannot be found: there is no partition policy related to 'h' in the graph partition. But is this expected? Why is 'h' hard-coded here; is this by design or a bug? (Also, when I tried with just one graph partition, there was no such issue. My guess is that a single partition does not require a partition policy.)
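For anyone else hitting this, here is a minimal pure-Python sketch (deliberately simplified, not the real dgl.distributed.kvstore code) of why the lookup fails: policy keys follow the "node:&lt;ntype&gt;:&lt;tensor_name&gt;" convention and are registered from the data present at partition time, so a tensor name like 'h' that is created later at inference time has no matching key.

```python
# Hypothetical simplification of a KVServer-style policy lookup.
# Keys follow the "node:<ntype>:<tensor_name>" naming convention;
# only tensors that existed when the graph was partitioned are registered.

class FakeKVServer:
    def __init__(self, part_policies):
        # maps names like "node:_N:feat" to partition-policy objects
        self.part_policies = part_policies

    def lookup(self, name):
        if name not in self.part_policies:
            raise RuntimeError(
                "KVServer cannot find partition policy with name: %s" % name)
        return self.part_policies[name]

server = FakeKVServer({"node:_N:feat": object(), "node:_N:labels": object()})
server.lookup("node:_N:feat")   # fine: registered at partition time
try:
    server.lookup("node:_N:h")  # 'h' is created at inference time, never registered
except RuntimeError as e:
    print(e)
```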

This is weird. The partition policy name isn't just "node:_N"; we have a partition policy for each node type and edge type.
The error message is quite misleading.
I'm not sure why you get this error. Do you get it by just running our distributed GraphSAGE code?

I created a version mostly based on GraphSAGE's train_dist.py. :slight_smile:

I know my setup is not typical, but the goal was to help me understand how the distributed modules interact with each other before moving to a realistic multi-machine setting. Here is my setup:

  • Two graph store servers, each holding one partition of the graph (one process each).
  • Two trainer processes.
  • Everything happens on one single machine.

In total there are 4 processes on one single machine. Everything else (i.e. the model-related implementation) pretty much follows train_dist.py. Conceptually, I assumed this minimized version should work. Maybe I was wrong? Is there any reason we shouldn't do this?

Besides the missing "node:_N:h" error, I also ran into many issues where kvstore.fast_pull() fails a check (e.g. rpc/rpc.cc:422: Check failed: l_id < local_data_shape[0]). The training seems OK, but the issues always come from the evaluation part. It looks like train_dist.py has a specialized inference implementation, which causes the trouble. I also tried hard-coding a bypass of fast_pull, and then everything worked better (but not every time).

Our implementation wasn't designed for distributed training with two graph partitions on a single machine, because we don't see any benefit in doing so. On the same machine, all trainers can access the graph directly, so why would they need to access two physical graph partitions? If you really want two partitions, with each trainer working on one of them, you can put partition IDs on the nodes so that trainers know which nodes belong to them. We actually have this implementation, but the code isn't available yet.
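To illustrate the "partition IDs on nodes" idea, here is a hypothetical sketch (plain Python, not DGL API; the names num_nodes, part_id, and local_nodes are all made up for illustration): each node carries the ID of the partition it belongs to, and a trainer selects only its own nodes.

```python
# Hypothetical sketch: tag each node with a partition ID so a trainer
# (one trainer per partition) can pick out the nodes that belong to it.
num_nodes = 10
num_parts = 2

# e.g. a contiguous-range assignment, as a simple partitioner might emit
part_id = [0 if n < num_nodes // 2 else 1 for n in range(num_nodes)]

def local_nodes(trainer_rank):
    """Nodes this trainer should train on, given one trainer per partition."""
    return [n for n in range(num_nodes) if part_id[n] == trainer_rank]

print(local_nodes(0))  # → [0, 1, 2, 3, 4]
print(local_nodes(1))  # → [5, 6, 7, 8, 9]
```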

My guess is that the missing "node:_N:h" error is caused by conflicting shared-memory names. We use shared memory for communication between processes on the same machine, and all tensors and graph data are given names following certain rules. Since you run two graph partitions on the same machine, the servers try to use the same names for shared memory.
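The failure mode being described can be reproduced in miniature with the standard library (the segment name "dgl_demo_part_feat" is illustrative; the real names DGL derives from graph/tensor names may differ):

```python
# Sketch of the name-collision failure mode: two server processes on one
# machine creating shared-memory segments under the same name. The second
# create with the same name fails.
from multiprocessing import shared_memory

a = shared_memory.SharedMemory(name="dgl_demo_part_feat", create=True, size=64)
try:
    # A second server on the same host tries to create the same segment.
    shared_memory.SharedMemory(name="dgl_demo_part_feat", create=True, size=64)
except FileExistsError as e:
    print("collision:", e)
finally:
    a.close()
    a.unlink()  # remove the segment so reruns start clean
```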

I hope this explanation makes sense to you. I think you should try having one graph partition for each machine.


Sure. This makes perfect sense. :grinning: