Backward timeout after 1800000ms

August · November 3, 2021, 5:52am

When I train on a graph with about ten billion edges, I get the error below.
env:
centos7
python 3.6.8
torch 1.9.0
dgl 0.7.0

   loss.backward()
  File "/usr/local/lib64/python3.6/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: [/sources/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

VoVAllen · November 3, 2021, 6:21am

Hi,

Could you post the training script used? Also could you try our nightly release?
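
For context, the 1800000 ms in the error equals 30 minutes, which is the default timeout for `torch.distributed` process groups. Below is a minimal sketch of raising that timeout, assuming the training script calls `torch.distributed.init_process_group` directly with the gloo backend (the actual script was not posted). The helper names `to_ms` and `init_gloo`, and the two-hour value, are made up for illustration. A longer timeout only helps if the backward pass is slow rather than stuck, so it may hide the real cause instead of fixing it.

```python
from datetime import timedelta

# PyTorch's default process-group timeout: 30 minutes.
DEFAULT_TIMEOUT = timedelta(minutes=30)


def to_ms(t):
    """Convert a timedelta to whole milliseconds (hypothetical helper)."""
    return int(t.total_seconds() * 1000)


def init_gloo(rank, world_size, init_method, timeout=timedelta(hours=2)):
    """Initialise a gloo process group with a longer timeout.

    Hypothetical wrapper; rank, world_size and init_method must match
    whatever the real training script already uses.
    """
    # Imported lazily so the helpers above can be used without torch.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
        timeout=timeout,
    )


if __name__ == "__main__":
    # The default timeout expressed in ms matches the number in the error.
    print(to_ms(DEFAULT_TIMEOUT))
```

If the error persists with a larger timeout, one of the ranks is probably hanging or has crashed, and the logs of the other trainers are worth checking.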
