Error in `remove_self_loop`, "Check failed: high >= low: high must be bigger than low"

Hi! I have launched the node classification baseline on a graph with ~8M nodes and ~2.5B edges. Training runs on a remote cluster in an online manner (the graph is assembled through a series of SQL-like queries), so I cannot monitor the graph structure directly. The graph is built from an edge-index table read from the database. As a precaution, I filter out edges with at least one endpoint not present in the node set (defined in another table), so every node referenced in the edge index is present in the graph.
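The filtering step is essentially equivalent to the following sketch (the array names here are placeholders for the real table columns):

import numpy as np

def filter_edges(src: np.ndarray, dst: np.ndarray, node_ids: np.ndarray):
    # Keep only the edges whose both endpoints appear in the node table.
    valid = np.isin(src, node_ids) & np.isin(dst, node_ids)
    return src[valid], dst[valid]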

Back to the problem: I build the graph with the dgl.graph function, using the following code:

import dgl
import numpy as np
import torch

# FEATURES_DATA_NAME, LABELS_DATA_NAME, the *_MASK_DATA_NAME constants and
# NODE_ID_DATA_NAME are string keys defined elsewhere in the codebase.

def _construct_dgl_graph(
    adjacency_matrix_rows_cols: dict,  # dict with "row_coords" (indices of all source nodes) and "col_coords" (indices of all target nodes)
    features: np.ndarray,  # [N, d] feature array, where N is the number of nodes and d the initial feature dimension
    targets: np.ndarray,  # [N, ] label column
    node_ids: np.ndarray,  # unique node ids
    # masks for splitting the nodes into train/val/test subsets during training:
    train_mask: np.ndarray,
    val_mask: np.ndarray,
    test_mask: np.ndarray,
):
    row_coordinates, col_coordinates = (
        adjacency_matrix_rows_cols["row_coords"],
        adjacency_matrix_rows_cols["col_coords"],
    )

    row_coordinates = torch.tensor(row_coordinates).long()
    col_coordinates = torch.tensor(col_coordinates).long()
    graph = dgl.graph(data=(row_coordinates, col_coordinates), idtype=torch.int32, num_nodes=len(node_ids))
    graph.ndata[FEATURES_DATA_NAME] = torch.tensor(features, dtype=torch.float32)
    graph.ndata[LABELS_DATA_NAME] = torch.tensor(targets, dtype=torch.float32).reshape(-1, 1)

    graph.ndata[TRAIN_MASK_DATA_NAME] = torch.tensor(train_mask, dtype=torch.bool).reshape(-1, 1)
    graph.ndata[VAL_MASK_DATA_NAME] = torch.tensor(val_mask, dtype=torch.bool).reshape(-1, 1)
    graph.ndata[TEST_MASK_DATA_NAME] = torch.tensor(test_mask, dtype=torch.bool).reshape(-1, 1)

    graph.ndata[NODE_ID_DATA_NAME] = torch.tensor(node_ids, dtype=torch.long).reshape(-1, 1)

    return graph
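
For completeness, a toy invocation looks roughly like this (dummy data and placeholder key names instead of the real tables and constants):

FEATURES_DATA_NAME = "feat"          # placeholders for the string constants
LABELS_DATA_NAME = "label"           # used inside the helper above
TRAIN_MASK_DATA_NAME = "train_mask"
VAL_MASK_DATA_NAME = "val_mask"
TEST_MASK_DATA_NAME = "test_mask"
NODE_ID_DATA_NAME = "node_id"

num_nodes, feature_dim = 5, 8
toy_graph = _construct_dgl_graph(
    adjacency_matrix_rows_cols={
        "row_coords": np.array([0, 1, 2, 3, 4]),
        "col_coords": np.array([1, 2, 3, 4, 4]),  # note: (4, 4) is a self-loop
    },
    features=np.random.rand(num_nodes, feature_dim),
    targets=np.zeros(num_nodes),
    node_ids=np.arange(num_nodes),
    train_mask=np.array([1, 1, 1, 0, 0], dtype=bool),
    val_mask=np.array([0, 0, 0, 1, 0], dtype=bool),
    test_mask=np.array([0, 0, 0, 0, 1], dtype=bool),
)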

After that, I remove self loops from the graph and get the error:

if config.remove_self_loops:  # this option is currently True
    graph = remove_self_loop(graph)

The error trace is the following:

Traceback (most recent call last):
  File "/slot/sandbox/d/in/script/0_script_unpacked/code/main.py", line 397, in <module>
    main()
  File "/slot/sandbox/d/in/script/0_script_unpacked/code/main.py", line 330, in main
    graph = remove_self_loop(graph)
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/envs/main/lib/python3.11/site-packages/dgl/transforms/functional.py", line 2117, in remove_self_loop
    u, v = g.edges(form="uv", order="eid", etype=etype)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/envs/main/lib/python3.11/site-packages/dgl/view.py", line 179, in __call__
    return self._graph.all_edges(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/envs/main/lib/python3.11/site-packages/dgl/heterograph.py", line 3591, in all_edges
    src, dst, eid = self._graph.edges(self.get_etype_id(etype), order)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/conda/envs/main/lib/python3.11/site-packages/dgl/heterograph_index.py", line 696, in edges
    edge_array = _CAPI_DGLHeteroEdges(self, int(etype), order)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [21:55:09] /opt/dgl/src/array/cpu/array_op_impl.cc:268: Check failed: high >= low: high must be bigger than low
Stack trace:
  [bt] (0) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7f8d4140e9dc]
  [bt] (1) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::Range<(DGLDeviceType)1, int>(int, int, DGLContext)+0x82) [0x7f8d41412ef2]
  [bt] (2) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::Range(long, long, unsigned char, DGLContext)+0x1fb) [0x7f8d413b635b]
  [bt] (3) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x8e) [0x7f8d41a5f14e]
  [bt] (4) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x9f) [0x7f8d41a4e4ff]
  [bt] (5) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::HeteroGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x2c) [0x7f8d41952b5c]
  [bt] (6) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(+0x95f772) [0x7f8d4195f772]
  [bt] (7) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(DGLFuncCall+0x4f) [0x7f8d418e839f]
  [bt] (8) /conda/envs/main/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so(+0x1e63d) [0x7f8d981c063d]

I was able to track down the source location that raises the exception, but I missed the intermediate transitions, and I struggle to understand the meaning of the error message and the context in which it occurs. I would be grateful if someone could shed light on what might cause the problem here. Sorry for the vague context; I'm afraid I know nothing about the graph structure, so there could be anything in there. I haven't found any substantial information about this error online yet.

TLDR:

  • What can cause this error to occur?
  • Can it be caused by multiple edges in the edge index?

I am using torch==2.3.0, dgl==2.3, cuda==12.1

Can you try changing the idtype to int64?
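
That is, something along these lines (the rest of the call stays as in your snippet):

graph = dgl.graph(
    data=(row_coordinates, col_coordinates),
    idtype=torch.int64,  # 64-bit ids, so counts above the int32 range do not overflow
    num_nodes=len(node_ids),
)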

Actually, I think that might indeed be the problem: I stored the node indices of the edge index as int32, which could lead to an overflow. I have changed it to int64 and am currently monitoring the run, thank you! I'll be back with updates.
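
Looking at the trace, the failing call is dgl::aten::Range over edge ids with 32-bit ints, so with ~2.5B edges the upper bound presumably wraps to a negative value, which would explain the "high >= low" check failing. A quick guard one could add before building the graph is roughly:

INT32_MAX = np.iinfo(np.int32).max  # 2_147_483_647

def check_idtype_fits(num_nodes: int, num_edges: int) -> None:
    # With ~2.5B edges the edge count alone exceeds the int32 range,
    # so 32-bit graph ids are not safe here.
    if max(num_nodes, num_edges) > INT32_MAX:
        raise ValueError(
            f"Graph too large for int32 ids (nodes={num_nodes}, edges={num_edges}); "
            "build the dgl.graph with idtype=torch.int64."
        )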

Update: for now I haven't been able to reproduce the issue, as the graphs I am working with have become significantly smaller, but it seems there was indeed an int32 overflow.