Distributed training hangs when using the InfiniBand network interface

I am trying to do distributed training with DGL and the PyTorch backend, but training hangs and eventually times out. Below is the trace:

ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.74 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=0; module load miniconda;     conda activate gnn;     python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'


ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.75 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=1; module load miniconda;     conda activate gnn;     python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'


ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.77 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=2; module load miniconda;     conda activate gnn;     python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'


ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.78 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_KEEP_ALIVE=0  DGL_SERVER_ID=3; module load miniconda;     conda activate gnn;     python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'

('172.20.8.74', 1234)

ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.74 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda;     conda activate gnn;     python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=0 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'

('172.20.8.75', 1234)

ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.75 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda;     conda activate gnn;     python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=1 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'

('172.20.8.77', 1234)

ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.77 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda;     conda activate gnn;     python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=2 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'

('172.20.8.78', 1234)

ssh -o StrictHostKeyChecking=no -p 22 ksharma2@172.20.8.78 'cd /home/ksharma2/; (export GLOO_SOCKET_IFNAME=ib0; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=netsec/4part_data/twibot-dataset.json DGL_IP_CONFIG=RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=14 DGL_GROUP_ID=0 ; module load miniconda;     conda activate gnn;     python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=4 --node_rank=3 --master_addr=172.20.8.74 --master_port=1234 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py))'

Warning: Permanently added '172.20.8.78' (ECDSA) to the list of known hosts.
cleanupu process runs

Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".


Lmod is automatically replacing "intel/17" with "gcc/7.3".

[05:10:31] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.74:30050].
Warning! Interface: em2 
IP address not available for interface.
[05:10:31] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.78:30050].
Warning! Interface: em2 
IP address not available for interface.
[05:10:32] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.77:30050].
Warning! Interface: em2 
IP address not available for interface.
[05:10:33] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.75:30050].
Warning! Interface: em2 
IP address not available for interface.
[05:10:33] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.75:43344].
Client [29424] waits on 172.20.8.75:43344
[05:10:34] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.77:50650].
Client [31818] waits on 172.20.8.77:50650
[05:10:35] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.78:47348].
Client [10965] waits on 172.20.8.78:47348
Client [4918] waits on 172.20.8.74:46805
[05:10:35] /opt/dgl/src/rpc/tensorpipe/tp_communicator.cc:98: TPReceiver starts to wait on [tcp://172.20.8.74:46805].
Machine (1) group (0) client (1) connect to server successfuly!
Machine (0) group (0) client (0) connect to server successfuly!
Machine (2) group (0) client (2) connect to server successfuly!
Machine (3) group (0) client (3) connect to server successfuly!
Traceback (most recent call last):
  File "/home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py", line 87, in <module>
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/ksharma2/.conda/envs/gnn/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
Client[0] in group[0] is exiting...
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

This is the command I use to launch the training from the command line:

python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/launch.py \
    --workspace /home/ksharma2/ \
    --num_trainers 1 \
    --ssh_username=ksharma2 \
    --num_samplers 0 \
    --num_servers 1 \
    --part_config netsec/4part_data/twibot-dataset.json \
    --extra_envs GLOO_SOCKET_IFNAME=ib0 \
    --ip_config RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt \
    "module load miniconda;\
     conda activate gnn;\
     python3 /home/ksharma2/RHGNN/test/kart/dgl-gan-conv-dist/run.py"

I am using DGL 0.8.0 and torch 1.10.2+cu102. My OS is Red Hat Enterprise Linux Server release 7.9 (Maipo).

Here is the ibstat output, in case it helps:

CA 'mlx5_0'
        CA type: MT4115
        Number of ports: 1
        Firmware version: 12.28.4512
        Hardware version: 0
        Node GUID: 0x7cfe9003002691c0
        System image GUID: 0x7cfe9003002691c0
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 137
                LMC: 0
                SM lid: 11
                Capability mask: 0x2651e848
                Port GUID: 0x7cfe9003002691c0
                Link layer: InfiniBand

This seems to be an issue within torch.distributed. Which backend are you using when you initialize the process group with torch.distributed.init_process_group(backend='gloo')? gloo, nccl, or mpi? Please refer to Distributed communication package - torch.distributed — PyTorch 1.11.0 documentation for more details.

Also, could you verify that torch.distributed works as expected over InfiniBand without DGL?
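
For example, a minimal gloo-only sanity check, launched on each node with the same torch.distributed.launch arguments as run.py, would rule DGL out entirely. A rough sketch (the gloo_sanity.py name is just a placeholder):

# gloo_sanity.py: standalone all_reduce over gloo, no DGL involved
import torch
import torch.distributed as dist

# GLOO_SOCKET_IFNAME=ib0 should be exported by the launcher; rank, world size and
# master address come from the env variables set by torch.distributed.launch.
dist.init_process_group(backend="gloo")

rank = dist.get_rank()
world = dist.get_world_size()

# Every rank contributes its rank id; each rank should print world*(world-1)/2.
t = torch.tensor([float(rank)])
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world}: all_reduce sum = {t.item()}")

dist.destroy_process_group()

If this also hangs or times out, the problem is in the gloo / IP-over-IB setup rather than in DGL.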

Yes, I am using gloo, and the docs say that if the IB network has IP-over-IB enabled, you should use gloo. Are there any debugging utilities provided by DGL, or should I rely on the ones provided by PyTorch?
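
A quick way to confirm that ib0 actually has an IPv4 address assigned (just a rough check; it assumes psutil is installed in the gnn environment):

# Check that ib0 has an IPv4 address (IP-over-IB); gloo needs one to bind to the interface.
import socket
import psutil  # assumption: available in the environment

addrs = psutil.net_if_addrs().get("ib0", [])
ipv4 = [a.address for a in addrs if a.family == socket.AF_INET]
print("ib0 IPv4 addresses:", ipv4 if ipv4 else "none (gloo cannot use this interface)")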

Could you try setting the environment variable TORCH_DISTRIBUTED_DEBUG=INFO in the script to get a more detailed log? The error seems more related to PyTorch than to DGL.
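
For instance, near the top of run.py, before the process group is created (just a sketch; exporting it through the launcher's --extra_envs should also work):

import os

# Must be set before torch.distributed.init_process_group is called.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"  # "DETAIL" gives even more verbose output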

I did, but it didn't output any information at all. It seems like the program crashes on one node and then hangs right after that. I used strace as well, but didn't find anything substantial there either. By the way, this is my training script; does it look OK, or is there anything wrong with it?

import dgl
import torch
from model import HANBotDetector
from torch.nn import CrossEntropyLoss
import os
import time
import datetime
import tqdm
from sklearn.metrics import accuracy_score



dgl.distributed.initialize(ip_config='RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt')
torch.distributed.init_process_group(backend='gloo')
# Set the device first
#device = torch.device("cuda")

def F1_score(pred, truth):
    pred_v=torch.argmax(pred, dim = 1)
    tp,fp,fn = 0,0,0

    tp = (pred_v * truth).sum()
    fp = (pred_v * (1 - truth)).sum()
    fn = ((1 - pred_v) * truth).sum()

    try:
        precision = tp / (tp+fp)
    except:
        precision = tp/(tp + fp + 1)
    try:
        recall = tp / (tp+fn)
    except:
        recall = tp/ (tp + fn + 1)

    try:
        f1 = 2*(precision * recall) / (precision + recall)
    except:
        f1 = 2*(precision * recall) / (precision + recall + 1)

    return f1



graph = dgl.distributed.DistGraph('twibot-dataset')

train_nids = dgl.distributed.node_split(graph.ndata['train_mask'])
valid_nids = dgl.distributed.node_split(graph.ndata['val_mask'])

sampler = dgl.dataloading.MultiLayerFullNeighborSampler(2)
train_dataloader = dgl.dataloading.DistNodeDataLoader(
                             graph, train_nids, sampler, batch_size=1024,
                             shuffle=True, drop_last=False)
valid_dataloader = dgl.dataloading.DistNodeDataLoader(
                             graph, valid_nids, sampler, batch_size=1024,
                             shuffle=False, drop_last=False)

########### Hyperparameters ################

LEARNING_RATE = 1e-3
WD = 3e-5  # Weight Decay
DROPOUT_RATE = 0.5

usr_feat = 7
bool_feat = 7
desc_feat = 768
tw_feat = 768

IN_FEATURES = usr_feat+bool_feat+desc_feat+tw_feat

aggregator = 'semantic_attention'

HIDDEN_OUT_FEATURES = 128
MAX_EPOCHS = 40

# Attention heads
NUM_HEAD = [8, 8] #both layer consists of 8 attention heads
#####################################################

# Initialize model
model = HANBotDetector(
    HAN_hid_out=HIDDEN_OUT_FEATURES,
    linear_in_features=IN_FEATURES,
    linear_out_channels=128,
    num_heads=NUM_HEAD,
    dropout_rate=DROPOUT_RATE,
    agg = aggregator,
)
model = torch.nn.parallel.DistributedDataParallel(model)


opt = torch.optim.Adam(model.parameters(),lr=LEARNING_RATE, weight_decay=WD, amsgrad=False)
CELoss = CrossEntropyLoss()

current_dir = os.path.dirname(__file__)

best_accuracy = 0
#best_model_path = f"{current_dir}/model.pt"
best_epoch = -1

start_time = time.time()
# Stochastic Training on Large Graphs 
for epoch in range(MAX_EPOCHS):
    print(epoch)
    model.train()

    with tqdm.tqdm(train_dataloader) as tq:
        for step, (input_nodes, output_nodes, mfgs) in enumerate(tq):
            # feature copy from CPU to GPU takes place here
            inputs = mfgs[0].srcdata["feat"]
            labels = mfgs[-1].dstdata["label"]
            predictions = model(mfgs, inputs)
            loss = CELoss(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

            accuracy = accuracy_score(
                labels.cpu(), predictions.argmax(1).cpu()
            )

            tq.set_postfix(
                {"loss": "%.03f" % loss.item(), "acc": "%.03f" % accuracy},
                refresh=False,
            )

    model.eval()

    predictions = []
    labels = []
    with tqdm.tqdm(valid_dataloader) as tq, torch.no_grad():
        for input_nodes, output_nodes, mfgs in tq:
            inputs = mfgs[0].srcdata["feat"]
            label = mfgs[-1].dstdata["label"]
            
            labels.append(label)
            p=model(mfgs, inputs).argmax(1)
            predictions.append(p)
        predictions = torch.cat(predictions)
        labels = torch.cat(labels)
        accuracy = accuracy_score(labels.cpu(), predictions.cpu())
        print("Epoch {} Validation Accuracy {}".format(epoch, accuracy))
        if best_accuracy < accuracy:
            best_accuracy = accuracy
            best_epoch = epoch
            #torch.save(model.state_dict(), best_model_path)
end_time = time.time()
total_time = end_time - start_time
print(f"Best validation accuracy ----> {best_accuracy}")
print(f"Best Epoch ------> {best_epoch}")
print(f"Total Training Time ----> {str(datetime.timedelta(seconds=total_time))} H:MM:SS:μs")
print('\n-------------------------------------------------------------------\n')

What if you try moving torch.distributed.init_process_group(backend='gloo') before dgl.distributed.initialize?
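
That is, something like this at the top of the script (a sketch showing only the reordering):

import dgl
import torch

# Suggested order: create the PyTorch process group first, then initialize DGL's RPC layer.
torch.distributed.init_process_group(backend='gloo')
dgl.distributed.initialize(ip_config='RHGNN/test/kart/dgl-gan-conv-dist/ip_config.txt')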

Sorry, I was on spring break! I did try that, but it still didn't work; init_process_group(backend='gloo') ends up being called twice when it is placed before dgl.distributed.initialize.

I also tried running that code, but it didn't work either and gave me the same error as I described in my first post.

I do have strace output files, which I can send to you by email since they are too big to attach here.
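
As a side note, I suppose the double call from my earlier attempt could be avoided with a guard like the one below, though this is just a rough sketch I have not verified:

import torch.distributed as dist

# Only create the default process group if it does not exist yet.
if not dist.is_initialized():
    dist.init_process_group(backend='gloo')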

Hi,

How large is your dataset?

Could you try setting the network interface names directly with --extra_envs NCCL_DEBUG=INFO GLOO_SOCKET_IFNAME=bond0 TP_SOCKET_IFNAME=bond0 NCCL_SOCKET_IFNAME=bond0, changing bond0 to the NIC you want to use?
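
If editing the launch command is inconvenient, setting the same variables at the very top of run.py, before dgl.distributed.initialize and init_process_group run, should presumably have the same effect. A sketch, using ib0 since that is the interface in your launch command:

import os

# Pin all transports (gloo, DGL's tensorpipe RPC, nccl) to the IP-over-IB interface.
for var in ("GLOO_SOCKET_IFNAME", "TP_SOCKET_IFNAME", "NCCL_SOCKET_IFNAME"):
    os.environ[var] = "ib0"
os.environ["NCCL_DEBUG"] = "INFO"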

OK, finally some good news. It worked after I bumped my Python version from 3.8 to 3.9 and reinstalled the DGL and PyTorch libraries built for CUDA 10.2.

Here are the versions:

Python 3.9.12
DGL 0.8.0post2
PyTorch 1.11.0+cu102

Thanks for all the help.
