DGL training with Gloo timeout

```
  File "/usr/local/lib64/python3.6/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib64/python3.6/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
```

This happens when I train on a large-scale cluster.
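For reference, the 1800000 ms in the error is Gloo's default 30-minute process-group timeout. If some steps legitimately take longer than that, one stopgap is to raise the timeout when creating the process group. A rough sketch, assuming the usual `env://` rendezvous that the launch script already sets up:

```python
# Sketch: raise the Gloo timeout above the default 30 minutes.
# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set by the launcher.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend='gloo',
    init_method='env://',
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30), i.e. the 1800000 ms above
)
```

This only hides the symptom if one node is genuinely slow; the slow step itself still needs to be fixed.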

Did you run the code from a particular example?

Hi,

This is probably because one of your workers exited unexpectedly.

Oh, sorry for the late reply. I found it's caused by the embedding lookup taking too much time on some nodes, so the other nodes hit the timeout during the backward pass (I use 100 machine nodes to run a big graph).

So I think I need stronger parameter servers and an asynchronous (ASP) synchronization scheme.

If you are using embeddings, please try out DGL's distributed embeddings.
Here is the API doc: dgl.distributed — DGL 0.6.0post1 documentation
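For context, a minimal sketch of what that might look like with `dgl.distributed.DistEmbedding` and a sparse optimizer. The graph name, `ip_config.txt` path, and sizes are placeholders, and the exact module path of the sparse optimizer can differ across DGL versions, so check the linked docs for your release:

```python
# Sketch only: node embeddings stored on DGL's KVStore servers, so lookups and
# sparse updates are served by the servers instead of one overloaded trainer.
# 'ip_config.txt', 'my_graph', and all sizes below are placeholders.
import torch
import dgl

dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('my_graph')

def init_emb(shape, dtype):
    # Called on the servers to initialize each embedding partition.
    return torch.empty(shape, dtype=dtype).uniform_(-0.05, 0.05)

emb = dgl.distributed.DistEmbedding(g.num_nodes(), 128,
                                    name='node_emb', init_func=init_emb)
# Sparse optimizer that only pushes gradients for the looked-up rows
# (in some DGL versions this lives under dgl.distributed.optim).
optimizer = dgl.distributed.optim.SparseAdagrad([emb], lr=0.01)

# Inside the training loop: look up only the mini-batch node IDs.
nids = torch.arange(1024)   # placeholder mini-batch of node IDs
feats = emb(nids)
loss = feats.sum()          # placeholder loss
loss.backward()
optimizer.step()            # applies the sparse updates on the servers
```

Because only the rows for the sampled nodes are fetched and updated, the per-step lookup cost no longer scales with the full graph size, which is what was causing the other workers to wait past the timeout.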
