DGLError: Caught DGLError in DataLoader worker process 0

When trying to use EdgeDataLoader with a subgraph of my original graph, I encounter this error, which doesn't appear with the original graph:

DGLError: Caught DGLError in DataLoader worker process 0.

I created the subgraph by doing this:

import dgl
import sklearn.model_selection

train_size = 0.8
test_size = 1 - train_size

train_dict = {}

for etype in g.canonical_etypes:
    edge_ids = g.edges(form='eid', etype=etype)
    train_edges, test_edges = sklearn.model_selection.train_test_split(
        edge_ids,
        test_size=test_size,
        train_size=train_size,
        random_state=10,
    )
    train_dict[etype] = sorted(train_edges)

subgraph = dgl.edge_subgraph(graph=g, edges=train_dict, preserve_nodes=True)
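For reference, the same 80/20 per-etype split can be sketched without sklearn. This is a pure-Python illustration, not the code from the thread; the `range(100)` stands in for the per-etype edge IDs returned by `g.edges(form='eid', etype=etype)`, and `random.seed(10)` plays the role of `random_state=10`:

```python
import random

# Pure-Python sketch of an 80/20 edge-ID split (stand-in for train_test_split).
random.seed(10)  # plays the role of random_state=10

edge_ids = list(range(100))  # stand-in for one etype's edge IDs
shuffled = edge_ids[:]
random.shuffle(shuffled)

cut = int(0.8 * len(edge_ids))
train_edges = sorted(shuffled[:cut])  # 80% of the IDs, sorted
test_edges = sorted(shuffled[cut:])   # the remaining 20%
```

The split is disjoint and covers all IDs, just like `train_test_split` with fixed `train_size`/`test_size`.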

How could I fix this? Thanks.

Hi,

I’m not sure about the exact problem. Could you post more error messages?

Sure. This is the full traceback:

DGLError: Caught DGLError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.7/dist-packages/dgl/dataloading/pytorch/dataloader.py", line 294, in collate
    result = super().collate(items)
  File "/usr/local/lib/python3.7/dist-packages/dgl/dataloading/dataloader.py", line 888, in collate
    return self._collate_with_negative_sampling(items)
  File "/usr/local/lib/python3.7/dist-packages/dgl/dataloading/dataloader.py", line 812, in _collate_with_negative_sampling
    pair_graph = self.g.edge_subgraph(items, relabel_nodes=False)
  File "/usr/local/lib/python3.7/dist-packages/dgl/utils/internal.py", line 904, in _fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/dgl/subgraph.py", line 299, in edge_subgraph
    sgi = graph._graph.edge_subgraph(induced_edges, not relabel_nodes)
  File "/usr/local/lib/python3.7/dist-packages/dgl/heterograph_index.py", line 885, in edge_subgraph
    return _CAPI_DGLHeteroEdgeSubgraph(self, eids, preserve_nodes)
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [07:37:31] /opt/dgl/src/array/cpu/array_index_select.cc:25: Check failed: idx_data[i] < arr_len (23784 vs. 22684) : Index out of range.
Stack trace:
  [bt] (0) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f1260f24aef]
  [bt] (1) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dgl::runtime::NDArray dgl::aten::impl::IndexSelect<(DLDeviceType)1, long, long>(dgl::runtime::NDArray, dgl::runtime::NDArray)+0x2bb) [0x7f1260f3a59b]
  [bt] (2) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dgl::aten::IndexSelect(dgl::runtime::NDArray, dgl::runtime::NDArray)+0x863) [0x7f1260f11f83]
  [bt] (3) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dgl::UnitGraph::COO::EdgeSubgraph(std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> > const&, bool) const+0x51b) [0x7f126130c6cb]
  [bt] (4) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dgl::UnitGraph::EdgeSubgraph(std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> > const&, bool) const+0x69) [0x7f12613011a9]
  [bt] (5) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(dgl::HeteroGraph::EdgeSubgraph(std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> > const&, bool) const+0x449) [0x7f12611fb939]
  [bt] (6) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(+0x40148b) [0x7f126120b48b]
  [bt] (7) /usr/local/lib/python3.7/dist-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f126118f338]
  [bt] (8) /usr/local/lib/python3.7/dist-packages/dgl/_ffi/_cy3/core.cpython-37m-x86_64-linux-gnu.so(+0x16633) [0x7f12601fe633]

Could you post more code around the edge dataloader? I can't tell what the problem is just from the current messages.

Sure, sorry and thank you.

Original graph:

Graph(num_nodes={'ent': 30047},
      num_edges={('ent', 'link1', 'ent'): 28356, ('ent', 'link2', 'ent'): 136976, ('ent', 'link3', 'ent'): 19494})

Train-test splits and edge IDs:

train_dict = {}
test_dict = {}
train_size = 0.8
test_size = 1 - train_size

for etype in g.canonical_etypes:
    edge_ids = g.edges(form='eid', etype=etype)
    train_edges, test_edges = sklearn.model_selection.train_test_split(
        edge_ids,
        test_size=test_size,
        train_size=train_size,
        random_state=10,
    )
    train_dict[etype] = sorted(train_edges)
    test_dict[etype] = sorted(test_edges)

train_subgraph = dgl.edge_subgraph(graph=g, edges=train_dict, preserve_nodes=True)
test_subgraph = dgl.edge_subgraph(graph=g, edges=test_dict, preserve_nodes=True)

edge_id_train = {}
for etype in train_dict:
    edge_id_train[etype] = torch.stack(train_dict[etype])

edge_id_test = {}
for etype in test_dict:
    edge_id_test[etype] = torch.stack(test_dict[etype])

Dataloader:

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
train_dataloader = dgl.dataloading.EdgeDataLoader(
    train_subgraph, edge_id_train, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=2)

Error appears, for example, at:

for input_nodes, positive_graph, negative_graph, blocks in train_dataloader:
  print(input_nodes)

Edge IDs are relabeled in train_subgraph. You need to use the edge IDs of the subgraph, not those of the original graph.
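To make the relabeling concrete with this thread's numbers: the 'link1' etype has 28356 edges in the parent graph, and an 80% split keeps 22684 of them. `dgl.edge_subgraph` relabels those kept edges 0..22683, so a parent-graph ID such as 23784 is out of range for the subgraph, which is exactly the "23784 vs 22684" check failure in the traceback. A pure-Python sketch (the every-fifth-edge split is illustrative, not the thread's actual random split):

```python
# Why parent-graph edge IDs overflow the subgraph's edge array.
num_parent_edges = 28356            # 'link1' edges in the parent graph

# Keep 4 out of every 5 edges (an 80% split); these are PARENT-graph IDs.
parent_eids = [i for i in range(num_parent_edges) if i % 5 != 0]
num_sub_edges = len(parent_eids)    # 22684 edges survive in the subgraph

# Inside the subgraph, the kept edges are relabeled 0 .. num_sub_edges - 1.
subgraph_eids = list(range(num_sub_edges))

# Parent IDs can exceed the subgraph's edge count -> "Index out of range".
assert max(parent_eids) >= num_sub_edges
# The subgraph's own IDs never can.
assert max(subgraph_eids) < num_sub_edges
```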


Thank you!!

Sorry, but I don't fully understand. Then, what part of my code should I change?

If you want to run on all the edges in the subgraph, your edge_id_train should be something like torch.arange(g.num_edges()).
The edge_id_train in your code currently holds edge IDs on the parent graph, not on train_subgraph. They are different.

Mmm, I don't know if I'm getting this right, but I want to split the edges of the original graph into 90% for training and 10% for testing, so I don't run on all the edges in the subgraph, right?

But it's just 90% of those IDs, right?

Sorry, I still don't see it.

Since this is hetero, I’ve tried:

train_eid_dict = {canonical_etype: torch.arange(g.num_edges(canonical_etype[1]), dtype=torch.int64) for canonical_etype in g.canonical_etypes} 

but still same Error.

My bad, it should be torch.arange(train_subgraph.num_edges()).

Or you can change train_subgraph to g in

train_dataloader = dgl.dataloading.EdgeDataLoader(
    train_subgraph, edge_id_train, sampler,
    negative_sampler=dgl.dataloading.negative_sampler.Uniform(5),
    batch_size=1024,
    shuffle=True,
    drop_last=False,
    num_workers=2)

Ohh ok, now I see. Thank you so much. That was it. As said before, since it's hetero, this:

train_eid_dict = {canonical_etype: torch.arange(g.num_edges(canonical_etype[1]), dtype=torch.int64) for canonical_etype in g.canonical_etypes}

should be replaced by:

train_eid_dict = {canonical_etype: torch.arange(train_subgraph.num_edges(canonical_etype[1]), dtype=torch.int64) for canonical_etype in train_subgraph.canonical_etypes}
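One related note for later evaluation: DGL stores the induced parent edge IDs of a subgraph in its edata under dgl.EID (per edge type for heterographs), so subgraph edge i corresponds to parent edge induced_eids[i]. Conceptually the mapping is just the sorted list of parent IDs passed to edge_subgraph. A pure-Python sketch with illustrative numbers (not the thread's real IDs):

```python
# Parent edge IDs kept in the subgraph (sorted, as passed to edge_subgraph).
# Subgraph edge i corresponds to parent edge induced_eids[i].
induced_eids = [3, 7, 9, 12]

# What EdgeDataLoader should receive: the subgraph's own IDs, 0..3.
sub_eids = list(range(len(induced_eids)))

# Mapping a batch of subgraph edge IDs back to parent-graph IDs:
batch = [0, 2]
parent_batch = [induced_eids[i] for i in batch]
print(parent_batch)  # [3, 9]
```

This is handy when you want to look up features or labels stored on the original graph for edges sampled from the subgraph.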

Thank you for your continued help and have a nice day, @VoVAllen