Hi! I have launched a node classification baseline on a graph with ~8M nodes and ~2.5B edges. Training runs on a remote cluster in an online manner (the graph is assembled through a series of SQL-like queries), so I am unable to inspect the graph structure directly. The graph is built from an edge index table read from the database. I have code that filters out edges whose endpoints are not present in the node set (defined in another table), so every node referenced in the edge index is present in the graph (this was just a precautionary measure).
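For context, the precautionary filter is roughly the following (a minimal sketch with hypothetical names, assuming plain numpy arrays; my real code runs as SQL-like queries):

```python
import numpy as np

def filter_dangling_edges(row_coords, col_coords, node_ids):
    """Keep only edges whose both endpoints appear in the node table."""
    valid = np.isin(row_coords, node_ids) & np.isin(col_coords, node_ids)
    return row_coords[valid], col_coords[valid]

rows = np.array([0, 1, 2, 5])
cols = np.array([1, 2, 9, 0])
node_ids = np.array([0, 1, 2, 5])
rows, cols = filter_dangling_edges(rows, cols, node_ids)
# edge (2, 9) is dropped because node 9 is absent from the node table
```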
Back to the problem: I build the graph with the dgl.graph function using the following code:
def _construct_dgl_graph(
    adjacency_matrix_rows_cols,  # dict with "row_coords" (indices of all source nodes) and "col_coords" (indices of all target nodes)
    features: np.ndarray,  # [N, d] feature array, where N is the number of nodes and d is the initial feature dimension
    targets: np.ndarray,  # [N,] label column
    node_ids: np.ndarray,  # unique node ids
    # masks separating nodes into train/val/test subsets during training:
    train_mask: np.ndarray,
    val_mask: np.ndarray,
    test_mask: np.ndarray,
):
    row_coordinates, col_coordinates = (
        adjacency_matrix_rows_cols["row_coords"],
        adjacency_matrix_rows_cols["col_coords"],
    )
    row_coordinates = torch.tensor(row_coordinates).long()
    col_coordinates = torch.tensor(col_coordinates).long()
    graph = dgl.graph(
        data=(row_coordinates, col_coordinates),
        idtype=torch.int32,
        num_nodes=len(node_ids),
    )
    graph.ndata[FEATURES_DATA_NAME] = torch.tensor(features, dtype=torch.float32)
    graph.ndata[LABELS_DATA_NAME] = torch.tensor(targets, dtype=torch.float32).reshape(-1, 1)
    graph.ndata[TRAIN_MASK_DATA_NAME] = torch.tensor(train_mask, dtype=torch.bool).reshape(-1, 1)
    graph.ndata[VAL_MASK_DATA_NAME] = torch.tensor(val_mask, dtype=torch.bool).reshape(-1, 1)
    graph.ndata[TEST_MASK_DATA_NAME] = torch.tensor(test_mask, dtype=torch.bool).reshape(-1, 1)
    graph.ndata[NODE_ID_DATA_NAME] = torch.tensor(node_ids, dtype=torch.long).reshape(-1, 1)
    return graph
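For what it's worth, here is the kind of cheap sanity check I can run on the raw COO arrays before handing them to dgl.graph (a sketch in pure numpy; the function and key names are mine):

```python
import numpy as np

INT32_MAX = 2**31 - 1  # relevant because dgl.graph above uses idtype=torch.int32

def edge_index_report(rows: np.ndarray, cols: np.ndarray, num_nodes: int) -> dict:
    """Cheap diagnostics for a COO edge index."""
    pairs = np.stack([rows, cols], axis=1)   # shape [E, 2], one row per edge
    unique_pairs = np.unique(pairs, axis=0)  # deduplicated edges
    return {
        "indices_in_range": bool(max(rows.max(), cols.max()) < num_nodes),
        "num_self_loops": int((rows == cols).sum()),
        "num_duplicate_edges": int(len(rows) - len(unique_pairs)),
        "num_edges_fits_int32": len(rows) <= INT32_MAX,  # ~2.5B edges would not
    }

report = edge_index_report(np.array([0, 1, 1]), np.array([0, 2, 2]), num_nodes=3)
```

On the full graph I would only run the scalar checks (index range and edge count), since materializing and deduplicating 2.5B pairs is expensive.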
After that, I remove self-loops from the graph and get the error:
if config.remove_self_loops:  # this option is currently True
    graph = remove_self_loop(graph)
The error trace is the following:
Traceback (most recent call last):
File "/slot/sandbox/d/in/script/0_script_unpacked/code/main.py", line 397, in <module>
main()
File "/slot/sandbox/d/in/script/0_script_unpacked/code/main.py", line 330, in main
graph = remove_self_loop(graph)
^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/main/lib/python3.11/site-packages/dgl/transforms/functional.py", line 2117, in remove_self_loop
u, v = g.edges(form="uv", order="eid", etype=etype)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/main/lib/python3.11/site-packages/dgl/view.py", line 179, in __call__
return self._graph.all_edges(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/main/lib/python3.11/site-packages/dgl/heterograph.py", line 3591, in all_edges
src, dst, eid = self._graph.edges(self.get_etype_id(etype), order)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/conda/envs/main/lib/python3.11/site-packages/dgl/heterograph_index.py", line 696, in edges
edge_array = _CAPI_DGLHeteroEdges(self, int(etype), order)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
File "dgl/_ffi/_cython/./function.pxi", line 227, in dgl._ffi._cy3.core.FuncCall
File "dgl/_ffi/_cython/./function.pxi", line 217, in dgl._ffi._cy3.core.FuncCall3
dgl._ffi.base.DGLError: [21:55:09] /opt/dgl/src/array/cpu/array_op_impl.cc:268: Check failed: high >= low: high must be bigger than low
Stack trace:
[bt] (0) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x6c) [0x7f8d4140e9dc]
[bt] (1) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::Range<(DGLDeviceType)1, int>(int, int, DGLContext)+0x82) [0x7f8d41412ef2]
[bt] (2) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::Range(long, long, unsigned char, DGLContext)+0x1fb) [0x7f8d413b635b]
[bt] (3) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x8e) [0x7f8d41a5f14e]
[bt] (4) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::UnitGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x9f) [0x7f8d41a4e4ff]
[bt] (5) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(dgl::HeteroGraph::Edges(unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x2c) [0x7f8d41952b5c]
[bt] (6) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(+0x95f772) [0x7f8d4195f772]
[bt] (7) /conda/envs/main/lib/python3.11/site-packages/dgl/libdgl.so(DGLFuncCall+0x4f) [0x7f8d418e839f]
[bt] (8) /conda/envs/main/lib/python3.11/site-packages/dgl/_ffi/_cy3/core.cpython-311-x86_64-linux-gnu.so(+0x1e63d) [0x7f8d981c063d]
I was able to track down the source location that raises the exception, but I missed the intermediate transitions, and I struggle to understand the meaning of the error message and the context in which it occurs. I would be grateful if someone could shed some light on what might cause the problem here. I'm sorry for the vague context: I'm afraid I know nothing about the graph structure, so there could be anything in there. I haven't found any substantial information about this error on the internet yet.
TLDR:
- What can cause this error to occur?
- Can it be caused by multiple edges in the edge index?
I am using torch==2.3.0, dgl==2.3, cuda==12.1