Difference between fast_pull and non-fast-pull?

Hey, Folks,

Trying to understand how fast_pull works, and when it is applicable.

It was introduced in this PR ([KVStore] Add fast-pull for kvstore by aksnzhy · Pull Request #1647 · dmlc/dgl · GitHub), but I couldn’t find much of a high-level description there. Can someone give some hints on why and when fast_pull is needed?

My understanding is that the only difference between fast_pull and the non-fast pull is that the fast version is pure C++ code and therefore faster (is that wrong?). However, I ran into a situation where a check inside fast_pull fails; if I bypass fast_pull and use the regular pull routine, everything works.

Thanks!

You are right. fast_pull is the C++ version of pull. What problem did you run into?

The error looks like this:

   File "/home/centos/byted/bytegnn/poc/sage.py", line 315, in main
    train_nid = dgl.distributed.node_split(g.ndata["train_mask"], pb, force_even=True)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_graph.py", line 1059, in node_split
    return _split_even(partition_book, rank, nodes)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_graph.py", line 983, in _split_even
    eles = F.nonzero_1d(elements[0:len(elements)])
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_tensor.py", line 167, in __getitem__
    return self.kvstore.pull(name=self._name, id_tensor=idx)
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/kvstore.py", line 1192, in pull
    return rpc.fast_pull(name, id_tensor, part_id, KVSTORE_PULL,
  File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/rpc.py", line 975, in fast_pull
    res_tensor = _CAPI_DGLRPCFastPull(name,
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:43:06] /opt/dgl/src/rpc/rpc.cc:422: Check failed: l_id < local_data_shape[0] (18446744073709468015 vs. 83601) :
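
A note on the numbers in the failed check: the huge left-hand value is what a negative 64-bit integer looks like when printed as unsigned. The short sketch below only does that arithmetic on the two values from the error message; the helper name `as_signed_int64` is made up for illustration and is not part of DGL.

```python
# Reinterpret the unsigned value printed by the failed check
# as a signed 64-bit integer (two's complement).
reported = 18446744073709468015   # left-hand side in the error message
local_rows = 83601                # local_data_shape[0] in the error message

def as_signed_int64(value: int) -> int:
    """Interpret an unsigned 64-bit value as a signed 64-bit value."""
    return value - (1 << 64) if value >= (1 << 63) else value

signed = as_signed_int64(reported)
print(signed)                  # -83601
print(signed == -local_rows)   # True
```

So the local ID that reached the check was negative, and its magnitude equals the local data size. That hints at an ID being converted with the wrong partition's offset, though the traceback alone does not prove it.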

I am trying out the current distributed setup in a non-typical configuration, with the goal of better understanding how the distributed modules work with each other. The configuration is: two graph servers (one partition each) and two PyTorch trainer processes, for a total of four processes on a single local machine. That said, the error occurs even before training starts. My guess is that it is caused by having two graph servers/partitions on a single machine.

As mentioned above, if I bypass fast_pull and always use the slow pull, everything works. Given that the two are supposed to be equivalent, does this mean there is a bug in the fast_pull implementation?
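
To make it concrete how a check like `l_id < local_data_shape[0]` can trip, here is a toy model of converting global IDs to local row indices under range partitioning. It is a sketch, not DGL's implementation: `PART_STARTS`, `owner_of`, and `to_local` are hypothetical names, and the partition sizes are made up.

```python
from bisect import bisect_right

# Hypothetical range partitioning: partition p owns global IDs in
# [PART_STARTS[p], PART_STARTS[p + 1]).
PART_STARTS = [0, 4, 10]

def owner_of(global_id: int) -> int:
    """Return the partition that owns a global ID."""
    return bisect_right(PART_STARTS, global_id) - 1

def to_local(global_id: int, part_id: int) -> int:
    """Convert a global ID to a row index in part_id's local data,
    with a bounds check analogous to the one in the error message."""
    local_id = global_id - PART_STARTS[part_id]
    local_rows = PART_STARTS[part_id + 1] - PART_STARTS[part_id]
    if not 0 <= local_id < local_rows:
        raise IndexError(f"Check failed: {local_id} < {local_rows}")
    return local_id

print(to_local(5, owner_of(5)))   # 1: handled by the owning partition

try:
    to_local(1, 1)                # global ID 1 belongs to partition 0
except IndexError as err:
    print(err)                    # Check failed: -3 < 6
```

In this toy model the lookup only fails when an ID is resolved against a partition that does not own it. Whether something like that happens when two servers share one machine is a guess on my part.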

Did you use different network layers? It seems the message is delivered out of order.

Thank you! Allen.

This issue seems to have the same root cause as this problem (Partition Policy "node:_N:h" not found in dist graphsage model). In that post, Da said that, by design, we are currently not supposed to create two graph servers on a single machine.