Context error with EdgeDataLoader

Hi there!

I am using the EdgeDataLoader to train an RGCN on GPU. Initially, I moved the DGLHeteroGraph I created to GPU by specifying the device parameter:

het_graph = dgl.heterograph(data_dict=graph_dict, num_nodes_dict=num_nodes_dict, device=device)

I initialised the EdgeDataLoader instance (called train_loader) as follows.
(Background info: training_graph is a DGLHeteroGraph that was also moved to CUDA.)

train_eid_dict = {
    canonical_etype: torch.arange(training_graph.num_edges(canonical_etype[1]), dtype=torch.int64).to(device)
    for canonical_etype in training_graph.canonical_etypes
}

sampler = dgl.dataloading.MultiLayerNeighborSampler([fanout] * n_layers)
neg_sampler = dgl.dataloading.negative_sampler.Uniform(1)

train_loader = dgl.dataloading.EdgeDataLoader(
    g=het_graph,
    eids=train_eid_dict,
    block_sampler=sampler,
    batch_size=batch_size,
    g_sampling=training_graph,
    negative_sampler=neg_sampler,
    shuffle=True,
)

Yet, I receive an error that the so-called relation graphs don’t have the same context.

  File "/.../lib/python3.7/site-packages/dgl/heterograph_index.py", line 1054, in create_heterograph_from_relations
    metagraph, rel_graphs, num_nodes_per_type.todgltensor())
  File "dgl/_ffi/_cython/./function.pxi", line 287, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 222, in dgl._ffi._cy3.core.FuncCall
  File "dgl/_ffi/_cython/./function.pxi", line 211, in dgl._ffi._cy3.core.FuncCall3
  File "dgl/_ffi/_cython/./base.pxi", line 155, in dgl._ffi._cy3.core.CALL
dgl._ffi.base.DGLError: [09:59:16] /opt/dgl/src/graph/heterograph.cc:129: Check failed: rg->Context() == ctx (cuda:0 vs. cpu:0) : Each relation graph must have the same context.

Do you have any idea why that could be?

This might be a bug.

Meanwhile, you don’t need to create the graph on GPU for sampling-based training. You can keep the graph on CPU, use EdgeDataLoader to construct the blocks, and move only the blocks to GPU in each iteration.
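
A minimal sketch of that per-iteration move, assuming the loader was built with a negative sampler and therefore yields (input_nodes, pos_graph, neg_graph, blocks) tuples. move_batch_to_device is a hypothetical helper name, not part of DGL:

```python
def move_batch_to_device(batch, device):
    """Move one sampled mini-batch to `device`; the full graph stays on CPU.

    Assumes `batch` is the 4-tuple produced by an EdgeDataLoader that was
    given a negative sampler: (input_nodes, pos_graph, neg_graph, blocks).
    """
    input_nodes, pos_graph, neg_graph, blocks = batch
    # Only the sampled blocks are copied, never the whole graph.
    blocks = [block.to(device) for block in blocks]
    # The positive and negative edge graphs are moved as well, since the
    # link-prediction loss is computed on them.
    return input_nodes, pos_graph.to(device), neg_graph.to(device), blocks


# Hypothetical usage inside the training loop:
# for batch in train_loader:
#     input_nodes, pos_graph, neg_graph, blocks = move_batch_to_device(batch, device)
#     ...forward pass on `blocks`, loss on `pos_graph` / `neg_graph`...
```

With this pattern, het_graph, training_graph and the tensors in train_eid_dict would presumably all be created without the device argument or .to(device), so that everything handed to the loader lives on CPU.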

@mufeili Good to know, thanks for the reply. Do you have an estimate of when this will be fixed?

For sampling-based training, it is generally not recommended to put the full graph on GPU, since the computation is performed on blocks. You can move the blocks to GPU in each iteration.

@mufeili thank you for the recommendation, I put the individual blocks on GPU and it worked!

It seems the bug has not been fixed yet. Sampling on CPU and sending the blocks to GPU is a good workaround, but it is somewhat time-consuming.

To avoid sampling on CPU, you will need to either put the graph onto GPU or set use_uva=True on the data loader.
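
For reference, a rough sketch of the use_uva route. It assumes a DGL release (0.8 or later, as far as I know) in which dgl.dataloading.DataLoader accepts the device and use_uva arguments, and an edge sampler built with dgl.dataloading.as_edge_prediction_sampler. make_uva_edge_loader is a made-up helper name:

```python
def make_uva_edge_loader(graph, train_eids, edge_sampler, device, batch_size):
    """Build a loader that samples a CPU-resident graph from the GPU via UVA.

    Assumptions (check them against your DGL version):
      * `graph` lives on CPU; DGL pins it when use_uva=True.
      * `train_eids` are on the same device as `device`.
      * `edge_sampler` comes from dgl.dataloading.as_edge_prediction_sampler.
    """
    # Imported inside the function so that merely defining it does not
    # require DGL to be installed.
    import dgl

    return dgl.dataloading.DataLoader(
        graph,
        train_eids,
        edge_sampler,
        device=device,      # sampled blocks are produced on this device
        use_uva=True,       # sample the pinned CPU graph directly from the GPU
        batch_size=batch_size,
        shuffle=True,
        num_workers=0,      # UVA sampling is not combined with worker processes
    )
```

The other option from the reply, putting the graph onto GPU, would instead mean passing a graph that is already on the GPU and leaving use_uva at its default.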