Error while trying to run GraphSAGE train_sampling_unsupervised.py example

I was trying to run the example training script on CPU only, but when it reaches the part of the main method that checks if n_gpus == 1 and executes run(0, n_gpus, args, devices, data), it fails with RuntimeError: Device index must not be negative. Could it be that instead of if n_gpus == 1 it should be elif n_gpus == 1? The first if clause already checks devices[0] == -1, meaning there is no GPU and only the CPU should be used.

Also, should the line run(args, device, data) that comes after the following else clause be indented so that it is inside the else clause? The device variable on that line is not initialized anywhere; is it supposed to be devices instead?

Thank you so much!

I think this should execute run(0, 0, args, ['cpu'], data) rather than run(0, n_gpus, args, devices, data).

So the if/else statements are correct? I just have to change the run call inside if n_gpus == 1, meaning that the model will be trained twice? And what about the last line in the main method, run(args, device, data)?

I see. That’s indeed a bug and thank you for the report. I’ve fixed it in PR 1973.
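For anyone hitting the same issue, this is roughly the behaviour being discussed above (a simplified sketch, not the exact code in the example or the PR; names follow the example's main method):

# Simplified sketch of the CPU/GPU dispatch discussed in this thread.
if devices[0] == -1:
    # CPU only: never pass a negative device index to run()
    run(0, 0, args, ['cpu'], data)
elif n_gpus == 1:
    run(0, n_gpus, args, devices, data)
else:
    # multi-GPU: spawn one worker process per device
    ...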

Hi @mufeili! I am now trying to run this script with CUDA, but it gives me errors, so I have been making some modifications to the code.

As the --gpu argument only takes a list of ints, device = devices[proc_id] in the run method results in device being an int. So I manually changed it to device = 'cuda' (I only have one CUDA device), so that all the .to(device) calls actually move the variables to CUDA.
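(A cleaner way, I suppose, would be to build a torch.device from the integer index instead of hard-coding 'cuda'; this is just a sketch, and pick_device is a helper name I made up, with devices/proc_id following the example's naming.)

import torch

def pick_device(devices, proc_id):
    """Map the --gpu argument (a list of ints, -1 meaning CPU) to a torch.device."""
    dev_id = devices[proc_id]
    return torch.device('cpu') if dev_id < 0 else torch.device('cuda', dev_id)

# e.g. inside run(): device = pick_device(devices, proc_id), then use .to(device) everywhere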

Then errors about the graph being on CPU instead of GPU appeared, so I used graph.to('cuda') to move it to CUDA. This works for the most part, except that I got stuck at this line in the run method

   batch_pred = model(blocks, batch_inputs)

with this error

 The size of tensor a (239) must match the size of tensor b (238) at non-singleton dimension 0

so I tried to look at the sizes of blocks and batch_inputs:

blocks:  
[Block(num_src_nodes=239, num_dst_nodes=239, num_edges=2209), Block(num_src_nodes=239, num_dst_nodes=238, num_edges=4877)] 
batch_inputs: 
tensor([[ 0.1176,  0.0267,  0.0682,  ..., -0.0811,  0.1327, -0.0654],
    [ 0.0055, -0.0862,  0.0992,  ...,  0.1190,  0.0212, -0.0026],
    [-0.1085,  0.1328,  0.0735,  ..., -0.1280,  0.0788, -0.0433],
    ...,
    [ 0.1032, -0.0583,  0.1064,  ...,  0.0943,  0.0396,  0.0045],
    [ 0.1174,  0.1042, -0.0406,  ...,  0.0431,  0.1444, -0.0271],
    [-0.1012, -0.0311, -0.0439,  ..., -0.1308,  0.1054, -0.0702]],
   device='cuda:0')
length of batch_inputs: 239

Is the second block causing the error? I imagine it is because it has 239 src nodes but 238 dst nodes, but I'm not sure what I did wrong since it was the dataloader that created the blocks.
The dataloader is initialized with

dataloader = dgl.dataloading.EdgeDataLoader(
    g, train_seeds, sampler, 
    negative_sampler=NegativeSampler(g, args.num_negs),
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=False,
    pin_memory=True,
    num_workers=args.num_workers)

Thank you!

num_src_nodes and num_dst_nodes represent the number of input nodes and output nodes necessary for a single GNN layer, so it should be fine if the two numbers don’t match.
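Here is a standalone toy sketch (not code from the example) showing the same thing on a small graph:

import dgl
import torch

# Toy graph: a 4-node cycle.
g = dgl.graph((torch.tensor([0, 1, 2, 3]), torch.tensor([1, 2, 3, 0])))
sampler = dgl.dataloading.MultiLayerNeighborSampler([2])
loader = dgl.dataloading.NodeDataLoader(
    g, torch.tensor([0, 1]), sampler, batch_size=2, shuffle=False, drop_last=False)

input_nodes, output_nodes, blocks = next(iter(loader))
block = blocks[0]
# The destination nodes are the seed nodes; the source nodes are the seeds plus
# their sampled neighbours, so num_src_nodes() is usually larger than num_dst_nodes().
print(block.num_src_nodes(), block.num_dst_nodes())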

What was the call stack of the message? Could you show the complete message? Also, it would be helpful if you could tell us what changes you have made (other than the ones above). Thanks!

Never mind, I didn't realise SAGEConv had been updated. I am defining a new layer based on SAGEConv that adds edge features to it, so that at the end of the forward method, instead of

rst = self.fc_self(h_self) + self.fc_neigh(h_neigh)

it will have

rst = self.fc_self(h_self) + self.fc_neigh(h_neigh) + self.fc_edges(h_e)

where self.fc_edges is just an nn.Linear layer mapping the input edge feature size to the output feature size. For self._aggre_type == 'mean', h_e is aggregated with

if self._aggre_type == 'mean':
    print(feat_src.is_cuda)
    graph.srcdata['h'] = feat_src
    graph.update_all(fn.copy_e('h', 'm_e'), fn.mean('m_e', 'h_e'))
    graph.update_all(fn.copy_src('h', 'm_n'), fn.mean('m_n', 'neigh'))
    h_neigh = graph.dstdata['neigh']
    h_e = graph.dstdata['h_e'] 

However, I now get a new error from the code above:

CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'dmlc::Error'
 what():  [16:48:58] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: an illegal memory access was encountered
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fa8d5c78bdf]
  [bt] (1) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0x15d) [0x7fa8d646ad8d]
  [bt] (2) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*)+0x1ad) [0x7fa8d632c7bd]
  [bt] (3) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::UnitGraph::COO::~COO()+0x127) [0x7fa8d6445ad7]
  [bt] (4) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::UnitGraph::~UnitGraph()+0x1ba) [0x7fa8d644586a]
  [bt] (5) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::HeteroGraph::~HeteroGraph()+0x119) [0x7fa8d6344559]
  [bt] (6) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(DGLObjectFree+0xb5) [0x7fa8d63037f5]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fa9304e5dae]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7fa9304e571f]

I then tried moving the graph to CUDA, and this error occurred.

Traceback (most recent call last):
  File "/content/drive/My Drive/Google Colab/nycTaxi/ESAGEConv.py", line 109, in forward
    graph.srcdata['h'] = feat_src
  File "/usr/local/lib/python3.6/dist-packages/dgl/view.py", line 81, in __setitem__
    self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
  File "/usr/local/lib/python3.6/dist-packages/dgl/heterograph.py", line 3811, in _set_n_repr
    ' same device.'.format(key, F.context(val), self.device))
dgl._ffi.base.DGLError: Cannot assign node feature "h" on device cpu to a graph on device cuda:0. Call DGLGraph.to() to copy the graph to the same device.

I’m not sure if this is what I should be doing.

What is your PyTorch version? DGL now requires PyTorch 1.5.0+.
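You can check with, for example:

import torch
print(torch.__version__)   # should print 1.5.0 or later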

I just checked: the PyTorch version in my Google Colab is 1.6.0+cu101. Does it help to add that the error always occurs after step 0 of the last epoch? This is the output, and it is followed by the traceback above.

[0]Epoch 00000 | Step 00000 | Loss 899041920.0000 | Speed (samples/sec) nan|nan | Load nan| train nan | GPU 1.0 MiB
[0]Epoch 00001 | Step 00000 | Loss 703352960.0000 | Speed (samples/sec) nan|nan | Load nan| train nan | GPU 1.0 MiB
[0]Epoch 00002 | Step 00000 | Loss 564441344.0000 | Speed (samples/sec) 307461.3790|307461.3790 | Load 0.0133| train 0.0175 | GPU 1.0 MiB
[0]Epoch 00003 | Step 00000 | Loss 613482176.0000 | Speed (samples/sec) 308538.3338|308538.3338 | Load 0.0132| train 0.0175 | GPU 1.0 MiB
[0]Epoch 00004 | Step 00000 | Loss 408088672.0000 | Speed (samples/sec) 301500.0920|301500.0920 | Load 0.0139| train 0.0176 | GPU 1.0 MiB
[0]Epoch 00005 | Step 00000 | Loss 389834688.0000 | Speed (samples/sec) 303131.9649|303131.9649 | Load 0.0137| train 0.0176 | GPU 1.0 MiB
[0]Epoch 00006 | Step 00000 | Loss 353877024.0000 | Speed (samples/sec) 303210.1286|303210.1286 | Load 0.0137| train 0.0176 | GPU 1.0 MiB
[0]Epoch 00007 | Step 00000 | Loss 291229696.0000 | Speed (samples/sec) 301813.4902|301813.4902 | Load 0.0136| train 0.0178 | GPU 1.0 MiB
[0]Epoch 00008 | Step 00000 | Loss 222750144.0000 | Speed (samples/sec) 303438.4107|303438.4107 | Load 0.0135| train 0.0177 | GPU 1.0 MiB
[0]Epoch 00009 | Step 00000 | Loss 188884032.0000 | Speed (samples/sec) 304053.4255|304053.4255 | Load 0.0135| train 0.0177 | GPU 1.0 MiB
  0% 0/1 [00:00<?, ?it/s]

Could you show the modified code of SAGEConv?

You can just look at the 'mean' aggregator since I have not moved on to testing the other ones yet. It is basically the same as the original SAGEConv except that it also considers the edge features.

import traceback

import torch
from torch import nn
import torch.nn.functional as F

import dgl.function as fn
from dgl.utils import expand_as_pair, check_eq_shape   # same helpers the built-in SAGEConv imports


class ESAGEConv(nn.Module):
    def __init__(self,
                 in_feats,
                 e_feats,       # edge feature size, same as in_feats
                 out_feats,
                 aggregator_type,
                 feat_drop=0.,
                 bias=True,
                 norm=None,
                 activation=None):
        super(ESAGEConv, self).__init__()

        # Return a pair of same element if the input is not a pair.
        self._in_src_feats, self._in_dst_feats = expand_as_pair(in_feats)
        self._in_e_feats = e_feats
        self._out_feats = out_feats
        self._aggre_type = aggregator_type
        self.norm = norm
        self.feat_drop = nn.Dropout(feat_drop)
        self.activation = activation
        # aggregator type: mean/pool/lstm/gcn
        if aggregator_type == 'pool':
            self.fc_pool = nn.Linear(self._in_src_feats, self._in_src_feats)
        if aggregator_type == 'lstm':
            self.lstm = nn.LSTM(self._in_src_feats, self._in_src_feats, batch_first=True)
        if aggregator_type != 'gcn':
            self.fc_self = nn.Linear(self._in_dst_feats, out_feats, bias=bias)
        self.fc_neigh = nn.Linear(self._in_src_feats, out_feats, bias=bias)
        #   added this
        self.fc_edges = nn.Linear(self._in_e_feats, out_feats, bias=bias)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        if self._aggre_type == 'pool':
            nn.init.xavier_uniform_(self.fc_pool.weight, gain=gain)
        if self._aggre_type == 'lstm':
            self.lstm.reset_parameters()
        if self._aggre_type != 'gcn':
            nn.init.xavier_uniform_(self.fc_self.weight, gain=gain)
        nn.init.xavier_uniform_(self.fc_neigh.weight, gain=gain)
        nn.init.xavier_uniform_(self.fc_edges.weight, gain=gain)   # also initialize the added edge layer

    def _lstm_reducer(self, nodes):
        """LSTM reducer
        NOTE(zihao): lstm reducer with default schedule (degree bucketing)
        is slow, we could accelerate this with degree padding in the future.
        """
        m = nodes.mailbox['m']  # (B, L, D)
        batch_size = m.shape[0]
        h = (m.new_zeros((1, batch_size, self._in_src_feats)),
             m.new_zeros((1, batch_size, self._in_src_feats)))
        _, (rst, _) = self.lstm(m, h)
        return {'neigh': rst.squeeze(0)}

    def forward(self, graph, feat): # , efeat):
        # added this so CUDA error mentioned earlier is handled?
        graph = graph.to('cuda')

        with graph.local_scope():
            if isinstance(feat, tuple):
                feat_src = self.feat_drop(feat[0])
                feat_dst = self.feat_drop(feat[1])
            else:
                feat_src = feat_dst = self.feat_drop(feat)
                if graph.is_block:
                    feat_dst = feat_src[:graph.number_of_dst_nodes()]

            h_self = feat_dst

            # Handle the case of graphs without edges
            if graph.number_of_edges() == 0:
                graph.dstdata['neigh'] = torch.zeros(
                    feat_dst.shape[0], self._in_src_feats).to(feat_dst)

            if self._aggre_type == 'mean':
                try:
                    graph.srcdata['h'] = feat_src
                    graph.update_all(fn.copy_e('h', 'm_e'), fn.mean('m_e', 'h_e'))
                    graph.update_all(fn.copy_src('h', 'm_n'), fn.mean('m_n', 'neigh'))

                    # does not work in a list
                    # graph.update_all([fn.copy_src('h', 'm_n'), fn.copy_edge('h', 'm_e')],
                                    #  [fn.mean('m_n', 'neigh'), fn.mean('m_e', 'h_e')])  # added edge feat
                except (BaseException, Exception) as e:
                    tb = traceback.format_exc()
                    print(tb)
                h_neigh = graph.dstdata['neigh']
                h_e = graph.dstdata['h_e']      # aggregated edge feat
            elif self._aggre_type == 'gcn':
                # If input is a pair of features, check if the feature shape of sourcenodes
                # is equal to the feature shape of destination nodes
                check_eq_shape(feat)
                graph.srcdata['h'] = feat_src
                graph.dstdata['h'] = feat_dst  # same as above if homogeneous
                graph.update_all([fn.copy_src('h', 'm_n'), fn.copy_e('h', 'm_e')],
                                [fn.sum('m_n', 'neigh'), fn.sum('m_e', 'h_e')])  # added edge feat
                # divide in_degrees 
                degs = graph.in_degrees().to(feat_dst)
                h_neigh = (graph.dstdata['neigh'] + graph.dstdata['h'] + graph.dstdata['h_e']) / (degs.unsqueeze(-1) + 1)
            elif self._aggre_type == 'pool':
                graph.srcdata['h'] = F.relu(self.fc_pool(feat_src))
                # graph.edata['h'] = F.relu(self.fc_pool(feat_e))
                graph.update_all([fn.copy_src('h', 'm_n'), fn.copy_e('h', 'm_e')],
                                [fn.max('m_n', 'neigh'), fn.max('m_e', 'h_e')])
                h_neigh = graph.dstdata['neigh']
                h_e = graph.dstdata['h_e']
            elif self._aggre_type == 'lstm':
                graph.srcdata['h'] = feat_src
                graph.update_all(fn.copy_src('h', 'm'), self._lstm_reducer)
                h_neigh = graph.dstdata['neigh']
            else:
                raise KeyError('Aggregator type {} not recognized.'.format(self._aggre_type))
            # GraphSAGE GCN does not require fc_self.
            if self._aggre_type == 'gcn':
                rst = self.fc_neigh(h_neigh)
            else:
                # changed this
                rst = self.fc_self(h_self) + self.fc_neigh(h_neigh) + self.fc_edges(h_e)
            # activation
            if self.activation is not None:
                rst = self.activation(rst)
            # normalization
            if self.norm is not None:
                rst = self.norm(rst)
            
            return rst

Thank you so much!!

Okay, so I added feat = feat.to('cuda') in the forward method along with graph = graph.to('cuda'), and it worked. Thank you for all your help!!
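In case it helps anyone else, the top of forward now looks roughly like this (single-GPU only; moving a tuple input the same way is just a guess, since I only hit the non-tuple case):

def forward(self, graph, feat):
    # Workaround from this thread: put the block and its input features on the
    # same device before message passing (single-GPU case only).
    graph = graph.to('cuda')
    if isinstance(feat, tuple):
        feat = tuple(f.to('cuda') for f in feat)
    else:
        feat = feat.to('cuda')
    # ... rest of ESAGEConv.forward as posted above ...

A tidier alternative would be to move the blocks and features to the device once per batch in the training loop instead, so the layer itself doesn't hard-code 'cuda'.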
