Problem running DGL on GPU

Over the last few weeks I was running my code on CPU with a toy graph, but after refactoring it to run on a larger graph using the GPU, I’m getting the following error:

RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)

The error is raised at the F.relu call in the following class:

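from typing import Tuple

import dgl
import dgl.function as fn
import torch
import torch.nn as nn
import torch.nn.functional as F


class PongConv(nn.Module):
    """Definition of convolution in Pong model."""

    def __init__(self, src_dim: int, dest_dim: int):
        super().__init__()  # required before registering submodules on an nn.Module
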
        self.linear_src = nn.Linear(in_features=src_dim, out_features=src_dim, bias=True)
        self.linear_dst = nn.Linear(in_features=dest_dim, out_features=src_dim, bias=True)

    def forward(self, graph: dgl.DGLGraph, node_features: Tuple[torch.FloatTensor, torch.FloatTensor]) -> torch.FloatTensor:
        with graph.local_scope():
            src_features, dst_features = node_features
            graph.srcdata['h'] = src_features
            graph.dstdata['h'] = dst_features

            # optimized implementation for a weighted average
            graph.update_all(fn.copy_e('weight', 'm'), fn.sum('m', 'sum_weight'))
            graph.apply_edges(fn.e_div_v('weight', 'sum_weight', 'normalized_weight'))

            # average neighbors embeddings
            graph.update_all(message_func=fn.u_mul_e('h', 'normalized_weight', 'h_ngh'),
                             reduce_func=fn.sum('h_ngh', 'neighbors_avg'))

            result = F.relu(
            return result

I think the issue is with the nn.Linear layers, but I have no idea how to solve it. Does anyone have an idea why this is happening?

I’m using PyTorch and CUDA 11.0.

Have you checked the device of graph, node_features, and the model?
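For reference, a minimal sketch of such a check, assuming plain PyTorch objects. `report_devices` is a hypothetical helper written for this illustration, not part of PyTorch or DGL. For a `DGLGraph`, the `graph.device` property can be inspected in the same way as a tensor's `.device`.

```python
import torch
import torch.nn as nn


def report_devices(model: nn.Module, **tensors) -> dict:
    """Map each named tensor, and the model, to the device(s) it lives on."""
    devices = {name: str(t.device) for name, t in tensors.items()}
    # A module has no single device attribute, so collect the devices
    # of all of its parameters instead.
    devices["model"] = sorted({str(p.device) for p in model.parameters()})
    return devices


# Hypothetical stand-ins for the real model and node features.
model = nn.Linear(4, 2)
features = torch.randn(3, 4)
print(report_devices(model, features=features))
```

If the model entry lists `cpu` while the features are on `cuda:0`, the mismatch matches the `addmm` error above.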

Using the command:


I see that the graph and the features are on the GPU, but the model is not.

I think I managed to solve it by changing the linear definitions in my convolution to the following:

self.linear_src = nn.Linear(in_features=src_dim, out_features=src_dim, bias=True).to(torch.device('cuda:0'), non_blocking=True)
self.linear_dst = nn.Linear(in_features=dest_dim, out_features=src_dim, bias=True).to(torch.device('cuda:0'), non_blocking=True)

But I still have the original scaling problem that motivated the move from CPU to GPU. The training loop gets stuck in the backward step for several minutes, even on a Tesla T4 GPU. My graph has the following characteristics:

Graph(num_nodes={'item': 4005, 'user': 16020},
      num_edges={('item', 'watched-by', 'user'): 256068, ('user', 'watched', 'item'): 256068},
      metagraph=[('item', 'user', 'watched-by'), ('user', 'item', 'watched')])

Note that I am not using mini-batches at the moment. Are they needed even for such a small graph?

I think you can directly call model.to('cuda:0'). Did you manage to get past the backward step? If so, since you did not encounter an OOM error, I guess you have enough GPU memory. Have you tried profiling the training with a tool like line_profiler?
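A minimal sketch of moving the whole model at once, assuming the layers are registered as attributes of an `nn.Module` whose `__init__` calls `super().__init__()`. `TwoLinear` is a hypothetical stand-in for PongConv, and the dimensions are made up for the illustration. The sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn


class TwoLinear(nn.Module):
    """Hypothetical stand-in with the same two linear layers as PongConv."""

    def __init__(self, src_dim: int, dest_dim: int):
        super().__init__()
        # No device is hard-coded here; the layers are created on the CPU.
        self.linear_src = nn.Linear(src_dim, src_dim)
        self.linear_dst = nn.Linear(dest_dim, src_dim)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# nn.Module.to moves every registered parameter and buffer of the module,
# including those of its submodules.
model = TwoLinear(src_dim=8, dest_dim=4).to(device)

param_devices = {p.device.type for p in model.parameters()}
print(param_devices)
```

Keeping the device out of the layer definitions means the same class runs unchanged on CPU and GPU.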

I successfully moved the model to the GPU. Thanks!

But I still have scaling problems. I’m wondering whether they are caused by PinSAGESampler, which I use every epoch. PinSAGESampler only works if I keep the graph on the CPU; otherwise, it returns “Graph must be in CPU.”

        sampler = dgl.sampling.PinSAGESampler(

Have you tried tuning num_workers as here? Currently DGL does not have support for PinSAGESampler on GPU.
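To find out which phase of an epoch dominates, a rough alternative to a full profiler is to time each phase separately. The sketch below is only an illustration: the model, data, and the `timed` helper are hypothetical, and sampling could be wrapped in the same helper as one more phase. CUDA kernels run asynchronously, so the helper calls `torch.cuda.synchronize()` before reading the clock when a GPU is in use.

```python
import time

import torch
import torch.nn as nn


def timed(label, fn, timings):
    """Run fn(), add its wall-clock time to timings[label], return its result."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish pending GPU work before starting the clock
    start = time.perf_counter()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for the kernels launched by fn()
    timings[label] = timings.get(label, 0.0) + time.perf_counter() - start
    return result


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the real model and training data.
model = nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(256, 16, device=device)
y = torch.randn(256, 1, device=device)

timings = {}
for epoch in range(3):
    optimizer.zero_grad()
    loss = timed("forward", lambda: nn.functional.mse_loss(model(x), y), timings)
    timed("backward", loss.backward, timings)
    timed("step", optimizer.step, timings)

print(timings)
```

If a sampling phase accounts for most of the time, the bottleneck is on the CPU side rather than in the backward step.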