Hang in "loss.backward" when running the distributed training example

Problem

I’m trying to run the distributed training example from: https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/experimental

I set up the cluster information properly, but the program hangs.

Digging into the code, I found it hangs at loss.backward: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/experimental/train_dist.py#L226

which indicates that the program hangs during backward propagation. See the logs below:

Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Namespace(batch_size=1000, batch_size_eval=10000, close_profiler=False, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='reddit', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=1, num_hidden=16, num_layers=2, num_servers=2, num_workers=4, part_config=None, standalone=False)
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (0) connect to server successfuly!
Machine (0) client (5) connect to server successfuly!
rank: 1
rank: 0
part 1, train: 76715 (local: 76715), val: 11915 (local: 11915), test: 27851 (local: 27851)
part 0, train: 76716 (local: 74411), val: 11916 (local: 11596), test: 27852 (local: 27015)
#labels: 41
#labels: 41

Note that the hang isn’t caused by the dataset being large: I waited a whole day to see whether the program made any progress, and it never did.

My cluster information is listed below:

  • Two machines, each equipped with a V100 GPU.
  • The content of “ip_config.txt” is:
192.168.1.215
192.168.1.211

I urgently need to get this example running, so any help with this issue would be much appreciated. Thanks.

Hi,

I believe the block is on the PyTorch side: loss.backward() invokes an allreduce operation across machines. I’m not sure what the root cause is. Maybe you could try the simplest possible distributed model with PyTorch to see whether it also blocks, as in the sketch below.
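
For example, something like this minimal all-reduce check (just a sketch: the address 192.168.1.215, the port 29500, and the RANK/WORLD_SIZE environment variables are placeholders for your own setup; run one copy on each machine):

import os
import torch
import torch.distributed as dist

def main():
    # RANK is 0 on one machine and 1 on the other; WORLD_SIZE is 2.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(
        backend="gloo",  # also try "nccl" to isolate NCCL-specific issues
        init_method="tcp://192.168.1.215:29500",  # placeholder rendezvous address/port
        rank=rank,
        world_size=world_size,
    )
    t = torch.ones(1) * (rank + 1)
    dist.all_reduce(t)  # sums across ranks; both machines should print 3.0
    print(f"rank {rank}: all_reduce result = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this also blocks, the problem is in the network/communication layer rather than in DGL.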

@VoVAllen Thanks for your suggestion! I ran a distributed training job with PyTorch and found that there may be something wrong with NCCL and RDMA. So I set “NCCL_IB_DISABLE=1” to temporarily fall back to TCP for communication, and the PyTorch code then ran and finished successfully.
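
For reference, this is roughly how I set it before initializing the process group (a sketch; NCCL_SOCKET_IFNAME is optional and “enp1s0f0” is just an illustrative interface name):

import os
import torch.distributed as dist

# Disable NCCL's InfiniBand/RDMA transport and fall back to TCP sockets.
os.environ["NCCL_IB_DISABLE"] = "1"
# Optionally pin NCCL to the NIC that carries 192.168.1.x
# ("enp1s0f0" is a hypothetical name -- replace with the real one).
os.environ["NCCL_SOCKET_IFNAME"] = "enp1s0f0"

# The usual initialization follows; with init_method="env://" this assumes
# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment.
dist.init_process_group(backend="nccl", init_method="env://")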

Now the issue is that the DGL training runs for several seconds and then hangs; the log is:

Part 0 | Epoch 00011 | Step 00000 | Loss 1.8278 | Train Acc 0.4000 | Speed (samples/sec) 5300.1638 | GPU 25.7 MiB | time 0.013 s
Part 0, Epoch Time(s): 0.0690, sample+data_copy: 0.0467, forward: 0.0071, backward: 0.0037, update: 0.0010, #seeds: 70, #inputs: 805
Part 1 | Epoch 00011 | Step 00000 | Loss 1.8344 | Train Acc 0.3714 | Speed (samples/sec) 5122.9244 | GPU 25.9 MiB | time 0.016 s
Part 1, Epoch Time(s): 0.0690, sample+data_copy: 0.0527, forward: 0.0094, backward: 0.0037, update: 0.0009, #seeds: 70, #inputs: 796
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (0) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (4) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (3) connect to server successfuly!
Using backend: pytorch
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (2) connect to server successfuly!

I compared this log with the previous one:

Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (1) connect to server successfuly!
Machine (0) client (5) connect to server successfuly!
rank: 1
rank: 0

I think this is because one of the machines (Machine (0)) cannot successfully connect to the other server. I traced the warning back through the DGL code to the RPC section. Could you please give me some suggestions?

Sorry for the late reply; I’m not sure what the problem is. It seems one machine is not properly connected. Could you check your PyTorch/DGL/Python versions?

Yes, the versions are listed below:

  • Python 3.7.0
  • PyTorch 1.6.0
  • dgl-cuda10.2

Could you please tell me how to check which machines are connected when running DGL, so that I can debug quickly? (I sketched a basic TCP reachability check below.) The two machines do run distributed PyTorch successfully.

Besides, I’m using CUDA runtime 10.2 with CUDA driver 11.1.
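
For what it’s worth, here is the minimal check I was thinking of running to verify raw TCP reachability of the addresses in ip_config.txt (a sketch; the port below is only a placeholder, substitute the server port the DGL launch script actually uses):

import socket

HOSTS = ["192.168.1.215", "192.168.1.211"]
PORT = 30050  # placeholder -- use the real DGL server port here

for host in HOSTS:
    try:
        # Attempt a plain TCP connection with a short timeout.
        with socket.create_connection((host, PORT), timeout=5):
            print(f"{host}:{PORT} reachable")
    except OSError as err:
        print(f"{host}:{PORT} NOT reachable: {err}")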