Backward timeout after 1800000ms

When I train on a graph with about ten billion edges, I get the error below.
Environment:
CentOS 7
Python 3.6.8
torch 1.9.0
dgl 0.7.0

    loss.backward()
  File "/usr/local/lib64/python3.6/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [/sources/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

Hi,

Could you post the training script you used? Also, could you try our nightly release?
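
One thing that may be worth checking in the meantime: 1800000 ms is 30 minutes, which matches the default `timeout` of `torch.distributed.init_process_group`. The sketch below assumes your script creates the gloo process group itself and that the launcher already sets the usual rendezvous environment variables; the helper names and the 2-hour value are hypothetical, chosen only for illustration. A longer timeout may simply hide a stalled or very slow rank, so treat it as a diagnostic step rather than a fix.

```python
from datetime import timedelta


def timeout_from_ms(ms):
    """Convert a millisecond count (as printed in the gloo error) to a timedelta."""
    return timedelta(milliseconds=ms)


# The value from the error message: 1800000 ms is 30 minutes.
REPORTED_TIMEOUT = timeout_from_ms(1800000)


def init_gloo_with_longer_timeout(hours=2):
    """Hypothetical helper: create the gloo process group with a longer timeout.

    Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are already set
    by whatever launches the trainers.
    """
    # Imported here so the arithmetic above can be checked without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend="gloo", timeout=timedelta(hours=hours))
```

If the backward pass then fails after the new limit instead of after 30 minutes, that would suggest one of the trainers is not reaching the gradient synchronization step at all.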
