Error when training DistSAGE with OGB-Products

Hi, I ran into a bug when training on OGB-Products in an 8-node cluster.

If I set a small batch size, e.g., 1024, the last worker finishes the first epoch with correct output. However, all the other workers fail with the following error message:

 ** On entry to cusparseCreateCsr() dimension mismatch: nnz > rows * cols

Traceback (most recent call last):
  File "baseline/graphsage/train_dist_sage.py", line 392, in <module>
    main(args)
  File "baseline/graphsage/train_dist_sage.py", line 349, in main
    run(args, device, data)
  File "baseline/graphsage/train_dist_sage.py", line 250, in run
    batch_pred = model(blocks, batch_inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "baseline/graphsage/train_dist_sage.py", line 88, in forward
    h = layer(block, h)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dgl/nn/pytorch/conv/sageconv.py", line 231, in forward
    graph.update_all(msg_fn, fn.mean('m', 'neigh'))
  File "/usr/local/lib/python3.6/dist-packages/dgl/heterograph.py", line 4686, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/usr/local/lib/python3.6/dist-packages/dgl/core.py", line 283, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/usr/local/lib/python3.6/dist-packages/dgl/core.py", line 258, in invoke_gspmm
    z = op(graph, x)
  File "/usr/local/lib/python3.6/dist-packages/dgl/ops/spmm.py", line 170, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/usr/local/lib/python3.6/dist-packages/dgl/ops/spmm.py", line 64, in gspmm
    lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/sparse.py", line 307, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/amp/autocast_mode.py", line 217, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/dgl/backend/pytorch/sparse.py", line 87, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/usr/local/lib/python3.6/dist-packages/dgl/sparse.py", line 162, in _gspmm
    arg_e_nd)
  File "/usr/local/lib/python3.6/dist-packages/dgl/_ffi/_ctypes/function.py", line 190, in __call__
    ctypes.byref(ret_val), ctypes.byref(ret_tcode)))
  File "/usr/local/lib/python3.6/dist-packages/dgl/_ffi/base.py", line 64, in check_call
    raise DGLError(py_str(_LIB.DGLGetLastError()))
dgl._ffi.base.DGLError: [11:01:50] /opt/dgl/src/array/cuda/spmm.cu:233: Check failed: e == CUSPARSE_STATUS_SUCCESS: CUSPARSE ERROR: 3
Stack trace:
  [bt] (0) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f4884d1f32f]
  [bt] (1) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(void dgl::aten::cusparse::CusparseCsrmm2<float, long>(DLContext const&, dgl::aten::CSRMatrix const&, float const*, float const*, float*, int)+0x160) [0x7f4885925300]
  [bt] (2) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(void dgl::aten::SpMMCsr<2, long, 32>(std::string const&, std::string const&, dgl::BcastOff const&, dgl::aten::CSRMatrix const&, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0xdc) [0x7f488596e95c]
  [bt] (3) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(dgl::aten::SpMM(std::string const&, std::string const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >)+0x2633) [0x7f4884e7e593]
  [bt] (4) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(+0x6b6838) [0x7f4884e8a838]
  [bt] (5) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(+0x6b6dd1) [0x7f4884e8add1]
  [bt] (6) /usr/local/lib/python3.6/dist-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7f4885423538]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f498946ddae]
  [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f498946d71f]

But things work fine with a large batch size, e.g., 16384.

Could anyone please give me a suggestion?

Is this happening during training or during validation?

It’s a bit weird. The last worker has finished the training loop and printed its epoch information.

However, all the other workers are still in the training loop and hit the error during prediction.

But I’ve set train_nid to be split evenly across the workers, roughly as sketched below.
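For reference, this is roughly how the split is done (only a sketch following the standard DistGraph setup; the config path and graph name are illustrative):

import dgl

# Sketch: connect to the distributed graph and split the training node IDs
# so that every worker gets (almost) the same number of seed nodes.
dgl.distributed.initialize('ip_config.txt')        # illustrative config path
g = dgl.distributed.DistGraph('ogbn-products')     # illustrative graph name
pb = g.get_partition_book()
train_nid = dgl.distributed.node_split(g.ndata['train_mask'], pb,
                                       force_even=True)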

Hi,

Could you try to dump the failing block using a try/except? Or save its edge lists for us to debug? Thanks.
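Something along these lines should do it (only a sketch; batch_pred, blocks, and batch_inputs follow the names in the training script from the traceback, and the dump path is arbitrary):

import dgl
import torch as th

try:
    batch_pred = model(blocks, batch_inputs)
except dgl.DGLError:
    # Dump the edge lists of the offending blocks so the failure can be
    # reproduced offline.
    for i, block in enumerate(blocks):
        src, dst = block.edges()
        th.save({'src': src.cpu(), 'dst': dst.cpu(),
                 'num_src_nodes': block.number_of_src_nodes(),
                 'num_dst_nodes': block.number_of_dst_nodes()},
                'bad_block_%d.pt' % i)
    raise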

Also, what’s your CUDA version?

@VoVAllen

Thanks a lot for your reply. I’ll upload the try/except output later.

As for the versions: CUDA 11.0, DGL 0.6.1, Python 3.6.9.

CUDA 11.0 is a buggy version. Could you try a later CUDA version such as 11.1, 11.2, or 11.3?

Also this seems related to [Bugfix] Fix CUDA 11.1 crashing when number of edges is larger than number of node pairs by BarclayII · Pull Request #3265 · dmlc/dgl · GitHub

So it’s probably not really related to the CUDA version after all.
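If you want to check whether your sampled blocks hit that case, a quick sanity check like this (a sketch, using the same blocks as in the training loop) would flag any block whose edge count exceeds the number of src/dst node pairs:

# Sketch: flag blocks where nnz (edges) exceeds rows * cols (node pairs),
# which is the condition the cuSPARSE error above complains about.
for i, block in enumerate(blocks):
    n_edges = block.number_of_edges()
    n_pairs = block.number_of_src_nodes() * block.number_of_dst_nodes()
    if n_edges > n_pairs:
        print('block %d: %d edges > %d src/dst node pairs'
              % (i, n_edges, n_pairs))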

Also, is it possible for you to use a newer DGL version?
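For example, something like pip install --upgrade dgl-cu111 -f https://data.dgl.ai/wheels/repo.html should pull a newer CUDA build (the exact wheel name depends on your CUDA version; please check the DGL install page for the right one).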