Check failed: rid < mat.num_rows in EdgeDataLoader

Hi there!

I am using the EdgeDataLoader to train an RGCN. When I get to validating the model, I also use an EdgeDataLoader to provide me with

input_nodes, positive_graph, negative_graph, blocks = dataloader

But here I get an error saying that the check failed for rid < mat.num_rows. Here is the full traceback:

  File "/.../LinkPredictHetero.py", line 344, in inference
    input_nodes, positive_graph, negative_graph, blocks = dataloader
  File "/.../lib/python3.7/site-packages/dgl/dataloading/pytorch/__init__.py", line 161, in __next__
    input_nodes, pair_graph, neg_pair_graph, blocks = next(self.iter_)
  File ".../lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/.../lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/.../lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File ".../lib/python3.7/site-packages/dgl/dataloading/pytorch/__init__.py", line 133, in collate
    input_nodes, pair_graph, neg_pair_graph, blocks = super().collate(items)
  File "/.../lib/python3.7/site-packages/dgl/dataloading/dataloader.py", line 678, in collate
    return self._collate_with_negative_sampling(items)
  File "/.../lib/python3.7/site-packages/dgl/dataloading/dataloader.py", line 634, in _collate_with_negative_sampling
    self.g_sampling, seed_nodes, exclude_eids=exclude_eids)
  File "/.../lib/python3.7/site-packages/dgl/dataloading/dataloader.py", line 216, in sample_blocks
    frontier = self.sample_frontier(block_id, g, seed_nodes)
  File ".../lib/python3.7/site-packages/dgl/dataloading/neighbor.py", line 73, in sample_frontier
    frontier = sampling.sample_neighbors(g, seed_nodes, fanout, replace=self.replace)
  File "/.../lib/python3.7/site-packages/dgl/sampling/neighbor.py", line 154, in sample_neighbors
    edge_dir, prob_arrays, replace)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 232, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [16:42:34] /tmp/dgl_src/src/array/cpu/./rowwise_pick.h:89: Check failed: rid < mat.num_rows (36 vs. 29) : 
Stack trace:
  [bt] (0) 1   libdgl.dylib                        0x00000001204dc12f dmlc::LogMessageFatal::~LogMessageFatal() + 111
  [bt] (1) 2   libdgl.dylib                        0x000000012051c028 dgl::aten::COOMatrix dgl::aten::impl::CSRRowWisePick<long long>(dgl::aten::CSRMatrix, dgl::runtime::NDArray, long long, bool, std::__1::function<void (long long, long long, long long, long long const*, long long const*, long long*)>) + 1032
  [bt] (2) 3   libdgl.dylib                        0x000000012051db01 dgl::aten::COOMatrix dgl::aten::impl::CSRRowWiseSamplingUniform<(DLDeviceType)1, long long>(dgl::aten::CSRMatrix, dgl::runtime::NDArray, long long, bool) + 241
  [bt] (3) 4   libdgl.dylib                        0x00000001204c4eea dgl::aten::CSRRowWiseSampling(dgl::aten::CSRMatrix, dgl::runtime::NDArray, long long, dgl::runtime::NDArray, bool) + 1978
  [bt] (4) 5   libdgl.dylib                        0x0000000120e4b3b4 dgl::sampling::SampleNeighbors(std::__1::shared_ptr<dgl::BaseHeteroGraph>, std::__1::vector<dgl::runtime::NDArray, std::__1::allocator<dgl::runtime::NDArray> > const&, std::__1::vector<long long, std::__1::allocator<long long> > const&, dgl::EdgeDir, std::__1::vector<dgl::runtime::NDArray, std::__1::allocator<dgl::runtime::NDArray> > const&, bool) + 2212
  [bt] (5) 6   libdgl.dylib                        0x0000000120e503b1 std::__1::__function::__func<dgl::sampling::$_0, std::__1::allocator<dgl::sampling::$_0>, void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)>::operator()(dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) + 1025
  [bt] (6) 7   libdgl.dylib                        0x0000000120d784d8 DGLFuncCall + 72
  [bt] (7) 8   core.cpython-37m-darwin.so          0x00000001215381a5 __pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) + 965
  [bt] (8) 9   core.cpython-37m-darwin.so          0x000000012153c3f4 __pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) + 52

Could you tell me what this means? And do you have an idea how to solve it?

Do the entries in your seed_nodes have values that exceed or equal the number of nodes in your graph?

@BarclayII Since I am using the EdgeDataLoader, I do not have a parameter seed_nodes. Do you mean the eids (Tensor or dict[etype, Tensor] – the edge set in graph g to compute outputs) instead?

This might actually be the issue though. I think I have an error in how I constructed this value.
What I did was:

eids = {
        canonical_etype: torch.arange(training_graph.num_edges(canonical_etype[1]), dtype=torch.int64).to(device)
        for canonical_etype in training_graph.canonical_etypes
    } 

fanout = 4
n_layers = 3

sampler = dgl.dataloading.MultiLayerNeighborSampler([fanout] * n_layers)

neg_sampler = dgl.dataloading.negative_sampler.Uniform(1)

train_loader = dgl.dataloading.EdgeDataLoader(
        g=g,
        eids=eids, 
        block_sampler=sampler,
        batch_size=batch_size,
        g_sampling=training_graph,
        negative_sampler=neg_sampler,
        shuffle=True,
    )

Meaning that for each edge type I have a tensor of edge IDs ranging from 0 to the number of edges of that type in the training graph.
The numbers of edges for each edge type are

>>> [training_graph.number_of_edges(canonical_etype[1]) for canonical_etype in training_graph.canonical_etypes]

[22, 42, 22, 92, 92, 5561, 5561, 656, 656, 2263, 2263, 42, 4028, 4028]

and the numbers of nodes in the training graph and in the entire graph are:

>>> training_graph.num_nodes()
406
>>> g.num_nodes()
446

meaning that there are edge IDs higher than the number of nodes.

Could you tell me what the value of eids should be and if this is wrong?

One crucial piece of information might also be that I have inverse edges in my graph. They are listed under the canonical_etypes, but not handled differently. In the examples for the EdgeDataLoader, I saw this line:

reverse_eids = torch.cat([torch.arange(E, 2 * E), torch.arange(0, E)])

giving an example of how to create the eids for the reverse edges in case there is only one edge type with only one source and destination node type. How can I create this for multiple edge types if some of my edge types have the same source and destination node type?

No, having edge IDs higher than the number of nodes is fine, because the number of edges can be larger than the number of nodes.

That line only creates a mapping from each edge ID to its reverse edge ID, so it shouldn't cause any problem.
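
For a heterogeneous graph, the usual way to handle reverse edges (if I recall the EdgeDataLoader options correctly) is not a single reverse_eids tensor but the exclude='reverse_types' mode together with a dictionary mapping each edge type to its reverse. A rough sketch, reusing the loader arguments from your post above; the edge type names here are placeholders, so substitute the reverse pairs from your own graph:

# Hypothetical edge type names; map every edge type to its reverse edge type.
reverse_etypes = {
    'cites': 'cited-by',
    'cited-by': 'cites',
}

train_loader = dgl.dataloading.EdgeDataLoader(
    g=g,
    eids=eids,
    block_sampler=sampler,
    exclude='reverse_types',        # also exclude the reverse of every seed edge from the sampled blocks
    reverse_etypes=reverse_etypes,
    g_sampling=training_graph,
    negative_sampler=neg_sampler,
    batch_size=batch_size,
    shuffle=True,
)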

It seems that your training graph is different from g. It's OK to have them different, but does training_graph have the same number of nodes for each type as g? There will be problems if not, because the seed edges are picked from g while the neighbors are picked from training_graph, so the seed nodes picked from g may not exist in training_graph.
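
A quick sanity check for this, using the variable names from this thread:

# Seed edges come from g but neighbors are sampled from training_graph,
# so both graphs need the same node IDs for every node type.
for ntype in g.ntypes:
    print(ntype, g.num_nodes(ntype), training_graph.num_nodes(ntype))
    assert g.num_nodes(ntype) == training_graph.num_nodes(ntype), ntype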

@BarclayII

This is exactly the case. The training_graph has fewer nodes for each node type than the entire graph g. In this case, should I set the parameter g = training_graph and just not use g_sampling?

Yes.
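
For reference, the adjusted loader would then look roughly like this (same objects as in your code above, just with the seed edges and the neighbors both taken from training_graph and g_sampling dropped):

loader = dgl.dataloading.EdgeDataLoader(
    g=training_graph,            # seed edges and neighbors now come from the same graph
    eids=eids,
    block_sampler=sampler,
    negative_sampler=neg_sampler,
    batch_size=batch_size,
    shuffle=True,
)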



It worked! Thanks for the help, @BarclayII!