DGL training with Gloo timeout

```
  File "/usr/local/lib64/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
```

This happens when I train on a large-scale cluster.
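For reference, the 1800000 ms in the error is Gloo's default 30-minute process-group timeout. If some steps legitimately take longer than that, one stopgap is to raise the timeout when creating the process group. A rough sketch, assuming the usual `env://` rendezvous that the launch script already sets up:

```python
# Sketch: raise the Gloo timeout above the default 30 minutes.
# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend='gloo',
    init_method='env://',
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30), i.e. the 1800000 ms above
)
```

This only hides the symptom if one node is genuinely slow; the slow step itself still needs to be fixed.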

Did you run the code from a particular example?

Hi,

This is probably because one of your workers exited unexpectedly.

Oh, sorry for the late reply. I found it's caused by the embedding lookup taking too much time on some nodes, so the other nodes hit the timeout during the backward pass (I use 100 machine nodes to run a big graph).

So I think I need stronger parameter servers and an asynchronous (ASP) synchronization scheme.

If you are using embeddings, please try out DGL's distributed embeddings.
Here is the API doc: dgl.distributed — DGL 0.6.0post1 documentation
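For context, a minimal sketch of what that might look like with `dgl.distributed.DistEmbedding` and a sparse optimizer. The graph name, `ip_config.txt` path, and sizes are placeholders, and the exact module path of the sparse optimizer can differ across DGL versions, so check the linked docs for your release:

```python
# Sketch only: node embeddings stored on DGL's KVStore servers, so lookups and
# sparse updates are served by the servers instead of one overloaded trainer.
# 'ip_config.txt', 'my_graph', and all sizes below are placeholders.
import torch
import dgl

dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('my_graph')

def init_emb(shape, dtype):
    # Called on the servers to initialize each embedding partition.
    return torch.empty(shape, dtype=dtype).uniform_(-0.05, 0.05)

emb = dgl.distributed.DistEmbedding(g.num_nodes(), 128,
                                    name='node_emb', init_func=init_emb)
# Sparse optimizer that only pushes gradients for the looked-up rows
# (in some DGL versions this lives under dgl.distributed.optim).
optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.01)

# Inside the training loop: look up only the mini-batch node IDs.
nids = torch.arange(1024)   # placeholder mini-batch of node IDs
feats = emb(nids)
loss = feats.sum()          # placeholder loss
loss.backward()
optimizer.step()            # applies the sparse updates on the servers
```

Because only the rows for the sampled nodes are fetched and updated, the per-step lookup cost no longer scales with the full graph size, which is what was causing the other workers to wait past the timeout.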
