The error is as follows:
File "/home/centos/byted/bytegnn/poc/sage.py", line 315, in main
train_nid = dgl.distributed.node_split(g.ndata["train_mask"], pb, force_even=True)
File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_graph.py", line 1059, in node_split
return _split_even(partition_book, rank, nodes)
File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_graph.py", line 983, in _split_even
eles = F.nonzero_1d(elements[0:len(elements)])
File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/dist_tensor.py", line 167, in __getitem__
return self.kvstore.pull(name=self._name, id_tensor=idx)
File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/kvstore.py", line 1192, in pull
return rpc.fast_pull(name, id_tensor, part_id, KVSTORE_PULL,
File "/home/centos/anaconda3/envs/py38/lib/python3.8/site-packages/dgl/distributed/rpc.py", line 975, in fast_pull
res_tensor = _CAPI_DGLRPCFastPull(name,
File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:43:06] /opt/dgl/src/rpc/rpc.cc:422: Check failed: l_id < local_data_shape[0] (18446744073709468015 vs. 83601) :
I am trying out the current distributed setup in a non-typical configuration, with the goal of better understanding how the distributed modules work with each other. The configuration is: two graph servers (one partition each) and two PyTorch trainer processes, for a total of 4 processes on a single local machine. The error occurs even before training starts. My guess is that it is caused by having two graph servers/partitions on a single machine.
As mentioned above, if I bypass fast-pull and always use slow-pull, everything works. Given that the two are supposed to be equivalent, does this mean there is a bug in the fast_pull implementation?
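One more observation about the numbers in the check failure. The huge l_id looks like a negative value printed as an unsigned 64-bit integer. The snippet below is only a small arithmetic sketch: the helper name as_signed_64 is my own and not part of DGL, and the two constants are copied from the error message. It reinterprets the reported value as a signed 64-bit integer and compares it to local_data_shape[0].

```python
# Values copied from the "Check failed" message in the traceback above.
reported_l_id = 18446744073709468015   # printed as an unsigned 64-bit value
local_data_rows = 83601                # local_data_shape[0]


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as two's-complement signed.

    Hypothetical helper for illustration only; it is not a DGL function.
    """
    return value - (1 << 64) if value >= (1 << 63) else value


signed_l_id = as_signed_64(reported_l_id)
print(signed_l_id)                       # -83601
print(signed_l_id == -local_data_rows)   # True
```

So the local ID is exactly -83601, the negative of the local partition's row count. That would be consistent with (though it does not prove) the fast path subtracting a partition offset from an ID that actually belongs to the other partition on the same machine.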