Context error with EdgeDataLoader

sopkri · October 16, 2020, 8:45am

Hi there!

I am using the EdgeDataLoader to train an RGCN on GPU. Initially, I moved the DGLHeteroGraph I created to the GPU by specifying the device parameter:

het_graph = dgl.heterograph(data_dict=graph_dict, num_nodes_dict=num_nodes_dict, device=device)

I initialised the EdgeDataLoader instance (called train_loader) as follows.
(Background info: training_graph is a DGLHeteroGraph that was moved to CUDA.)

train_eid_dict = {
    canonical_etype: torch.arange(training_graph.num_edges(canonical_etype[1]), dtype=torch.int64).to(device)
    for canonical_etype in training_graph.canonical_etypes
}

sampler = dgl.dataloading.MultiLayerNeighborSampler([fanout] * n_layers)
neg_sampler = dgl.dataloading.negative_sampler.Uniform(1)

train_loader = dgl.dataloading.EdgeDataLoader(
    g=het_graph,
    eids=train_eid_dict,
    block_sampler=sampler,
    batch_size=batch_size,
    g_sampling=training_graph,
    negative_sampler=neg_sampler,
    shuffle=True,
)

Yet, I receive an error that the so-called relation graphs don’t have the same context.

  File "/.../lib/python3.7/site-packages/dgl/heterograph_index.py", line 1054, in create_heterograph_from_relations
    metagraph, rel_graphs, num_nodes_per_type.todgltensor())
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [09:59:16] /opt/dgl/src/graph/heterograph.cc:129: Check failed: rg->Context() == ctx (cuda:0 vs. cpu:0) : Each relation graph must have the same context.

Do you have any idea why that could be?

mufeili · October 16, 2020, 5:13pm

This might be a bug.

Meanwhile, you don’t need to create the graph on GPU for sampling-based training. You can keep it on CPU, use EdgeDataLoader to construct the blocks, and move the blocks to GPU in each iteration.
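
A minimal sketch of that pattern, in case it helps. It assumes the loader yields (input_nodes, pos_graph, neg_graph, blocks), which is what EdgeDataLoader produces when a negative sampler is set. batches_on_device is a hypothetical helper name, not part of DGL:

```python
def batches_on_device(loader, device):
    """Yield each sampled mini-batch with its graphs moved to `device`.

    Assumption: `loader` yields (input_nodes, pos_graph, neg_graph, blocks),
    as an EdgeDataLoader with a negative sampler does.
    """
    for input_nodes, pos_graph, neg_graph, blocks in loader:
        # Only the small sampled subgraphs are copied to the device;
        # the full graph stays wherever the loader keeps it (here: CPU).
        blocks = [block.to(device) for block in blocks]
        pos_graph = pos_graph.to(device)
        neg_graph = neg_graph.to(device)
        yield input_nodes, pos_graph, neg_graph, blocks
```

With the graph built on CPU (no device argument), the training loop would then iterate over batches_on_device(train_loader, device) instead of train_loader directly.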

sopkri · October 19, 2020, 8:58am

@mufeili Good to know, thanks for the reply. Do you have an estimate of when this will be fixed?

mufeili · October 19, 2020, 2:49pm

For sampling-based training, it is generally not recommended to put the full graph on GPU, since the computation is performed on blocks. You can move the blocks to GPU in each iteration.

sopkri · October 19, 2020, 2:50pm

@mufeili Thank you for the recommendation. I put the individual blocks on GPU and it worked!

ayotme · March 22, 2022, 2:35am

It seems the bug has not been fixed. Sampling on CPU and sending the blocks to GPU would be a good workaround, but it is somewhat time-consuming.

BarclayII · March 28, 2022, 6:14am

You will need to either put the graph onto GPU or set use_uva=True.
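
For the use_uva route, a rough sketch of the loader arguments involved. This assumes a DGL release whose data loaders accept the device and use_uva keyword arguments (0.8 or later, as far as I know), with the graph left on CPU. uva_loader_kwargs is a hypothetical helper name, not part of DGL:

```python
def uva_loader_kwargs(device):
    """Keyword arguments for sampling a CPU-resident graph from the GPU.

    Assumption: the data loader accepts `device`, `use_uva` and
    `num_workers`; check the documentation of your DGL version.
    """
    return {
        "device": device,   # sampled blocks are produced on this device
        "use_uva": True,    # the GPU reads the pinned CPU graph directly
        "num_workers": 0,   # UVA sampling runs in the main process only
    }
```

These would be passed alongside the usual arguments, e.g. **uva_loader_kwargs("cuda:0"). To my understanding, the edge IDs given to the loader also need to be on that same GPU when use_uva is set.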