Problem
I’m trying to run the example of distributed training from: https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/experimental
I set up the cluster information correctly, but the program hangs.
Digging into the code, I found that it hangs at loss.backward (https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/experimental/train_dist.py#L226), which indicates that the program stalls during backward propagation. See the logs below:
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Namespace(batch_size=1000, batch_size_eval=10000, close_profiler=False, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='reddit', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=1, num_hidden=16, num_layers=2, num_servers=2, num_workers=4, part_config=None, standalone=False)
Warning!
Interface: enp1s0f1
IP address not available for interface.
Warning!
Interface: rdma1
IP address not available for interface.
Warning!
Interface: rdma3
IP address not available for interface.
Warning!
Interface: virbr0-nic
IP address not available for interface.
Machine (1) client (0) connect to server successfuly!
Machine (0) client (5) connect to server successfuly!
rank: 1
rank: 0
part 1, train: 76715 (local: 76715), val: 11915 (local: 11915), test: 27851 (local: 27851)
part 0, train: 76716 (local: 74411), val: 11916 (local: 11596), test: 27852 (local: 27015)
#labels: 41
#labels: 41
Note that the hang isn’t caused by the size of the dataset: I left the program running for a whole day to see whether it would make any progress, and it produced nothing.
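For anyone trying to reproduce this, here is a minimal sketch of one way to confirm where a training step is stuck, using Python's standard faulthandler module. The helper name run_with_watchdog and the timeout value are my own assumptions for illustration; they are not part of the DGL example.

```python
import faulthandler
import sys


def run_with_watchdog(step_fn, timeout_s=60.0, stream=None):
    """Run step_fn(); while it is still running after timeout_s seconds,
    periodically dump the stack of every thread to `stream`.

    `stream` must be a real file object (it needs a file descriptor);
    it defaults to sys.stderr.
    """
    out = stream if stream is not None else sys.stderr
    # Arm the watchdog: after timeout_s (and every timeout_s after that),
    # faulthandler writes all thread tracebacks without stopping the program.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        return step_fn()
    finally:
        # Disarm the watchdog once the step returns or raises.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside the training loop (loss comes from the example):
#   run_with_watchdog(lambda: loss.backward(), timeout_s=120.0)
```

If the step never returns, the repeated stack dumps show which call each thread is blocked in, which is more precise than adding print statements around the suspected line.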
My cluster information is listed below:
- Two machines, each equipped with a V100 GPU.
- The content of “ip_config.txt” is:
192.168.1.215
192.168.1.211
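As a sanity check on the cluster setup, a small script like the following could be run on each machine to verify that every address in ip_config.txt accepts TCP connections. This is only a sketch: the function names are hypothetical, and the port has to be supplied by the caller, since it depends on which port the servers actually listen on in a given setup.

```python
import socket


def read_ip_config(path):
    """Return the first whitespace-separated field (the IP) of each
    non-empty line in the config file."""
    with open(path) as f:
        return [line.split()[0] for line in f if line.strip()]


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_cluster(path, port):
    """Map every IP listed in the config file to its reachability on `port`."""
    return {ip: can_connect(ip, port) for ip in read_ip_config(path)}
```

Calling check_cluster("ip_config.txt", port) from both machines, with port set to whatever the servers are listening on, would show whether each node can reach the other over the addresses listed above. A False entry would point to a network or firewall problem rather than a problem in the training code.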
I urgently need to get this example running, so please help me with this issue. Thanks.