Hang in "loss.backward" when running the distributed training example

Problem

I’m trying to run the distributed training example from: https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/experimental

I set up the cluster information properly, but the program hangs.

Digging into the code, I found it hangs at loss.backward: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/experimental/train_dist.py#L226

which indicates that the program hangs during backward propagation. See the logs below:

Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Namespace(batch_size=1000, batch_size_eval=10000, close_profiler=False, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='reddit', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=1, num_hidden=16, num_layers=2, num_servers=2, num_workers=4, part_config=None, standalone=False)
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (0) connect to server successfuly!
Machine (0) client (5) connect to server successfuly!
rank: 1
rank: 0
part 1, train: 76715 (local: 76715), val: 11915 (local: 11915), test: 27851 (local: 27851)
part 0, train: 76716 (local: 74411), val: 11916 (local: 11596), test: 27852 (local: 27015)
#labels: 41
#labels: 41

Note that the hang isn’t caused by the dataset being large: I waited a whole day to see whether the program made any progress, and it never did.

My cluster information is listed below:

  • Two machines, each equipped with a V100 GPU.
  • The content of “ip_config.txt” is:
192.168.1.215
192.168.1.211

I urgently need to get this example running, so any help with this issue would be much appreciated. Thanks.

Hi,

I believe the block is on the PyTorch side: loss.backward() invokes an allreduce operation across machines. I’m not sure what the root cause is. Maybe you could try the simplest possible distributed model with PyTorch to see whether it also blocks, as in the sketch below.
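
For example, something like this minimal all-reduce check (just a sketch: the address 192.168.1.215, the port 29500, and the RANK/WORLD_SIZE environment variables are placeholders for your own setup; run one copy on each machine):

import os
import torch
import torch.distributed as dist

def main():
    # RANK is 0 on one machine and 1 on the other; WORLD_SIZE is 2.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(
        backend="gloo",  # also try "nccl" to isolate NCCL-specific issues
        init_method="tcp://192.168.1.215:29500",  # placeholder rendezvous address/port
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # sums across ranks; both machines should print 3.0
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this also blocks, the problem is in the network/communication layer rather than in DGL.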

@VoVAllen Thanks for your suggestion! I ran a distributed training job with PyTorch and found that there may be something wrong with NCCL and RDMA. So I set “NCCL_IB_DISABLE=1” to temporarily fall back to TCP for communication, and the PyTorch code then ran and finished successfully.
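
For reference, this is roughly how I set it before initializing the process group (a sketch; NCCL_SOCKET_IFNAME is optional and “enp1s0f0” is just an illustrative interface name):

import os
import torch.distributed as dist

# Disable NCCL's InfiniBand/RDMA transport and fall back to TCP sockets.
os.environ["NCCL_IB_DISABLE"] = "1"
# Optionally pin NCCL to the NIC that carries 192.168.1.x
# ("enp1s0f0" is a hypothetical name -- replace with the real one).
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0"

# The usual initialization follows; with init_method="env://" this assumes
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment.
dist.init_process_group(backend="nccl", init_method="env://")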

Now the issue is that the DGL training runs for several seconds and then hangs; the log is:

Part 0 | Epoch 00011 | Step 00000 | Loss 1.8278 | Train Acc 0.4000 | Speed (samples/sec) 5300.1638 | GPU 25.7 MiB | time 0.013 s
Part 0, Epoch Time(s): 0.0690, sample+data_copy: 0.0467, forward: 0.0071, backward: 0.0037, update: 0.0010, #seeds: 70, #inputs: 805
Part 1 | Epoch 00011 | Step 00000 | Loss 1.8344 | Train Acc 0.3714 | Speed (samples/sec) 5122.9244 | GPU 25.9 MiB | time 0.016 s
Part 1, Epoch Time(s): 0.0690, sample+data_copy: 0.0527, forward: 0.0094, backward: 0.0037, update: 0.0009, #seeds: 70, #inputs: 796
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (0) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (4) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (3) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (2) connect to server successfuly!

I compared this log with the previous one:

Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (1) connect to server successfuly!
Machine (0) client (5) connect to server successfuly!
rank: 1
rank: 0

I think this is because one of the machines (Machine (0)) cannot successfully connect to the other server. I traced the warning back through the DGL code to the RPC section. Could you please give me some suggestions?

Sorry for the late reply; I’m not sure what the problem is. It seems one machine is not properly connected. Could you check your PyTorch/DGL/Python versions?

Yes, the versions are listed below:

  • Python 3.7.0
  • PyTorch 1.6.0
  • dgl-cuda10.2

Could you please tell me how to check which machines are connected when running DGL, so that I can debug quickly? (I sketched a basic TCP reachability check below.) The two machines do run distributed PyTorch successfully.

Besides, I’m using CUDA runtime 10.2 with CUDA driver 11.1.
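
For what it’s worth, here is the minimal check I was thinking of running to verify raw TCP reachability of the addresses in ip_config.txt (a sketch; the port below is only a placeholder, substitute the server port the DGL launch script actually uses):

import socket

HOSTS = ["192.168.1.215", "192.168.1.211"]
PORT = 30050  # placeholder -- use the real DGL server port here

for host in HOSTS:
    try:
        # Attempt a plain TCP connection with a short timeout.
        with socket.create_connection((host, PORT), timeout=5):
            print(f"{host}:{PORT} reachable")
    except OSError as err:
        print(f"{host}:{PORT} NOT reachable: {err}")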