Time out when use distributed setting

Hi,

I followed steps in README(and passwordless SSH login setting), trying to do distributed training with two virtual machines each with one nvidia tesla p100 gpu. However, the connection timed out error keeps coming up.

python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/  --num_trainers 2  --num_samplers 0  --num_servers 1 --part_config data/ogb-product.json --ip_config ip_config.txt --keep_alive --server_name long_live "python3 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1"
Servers will keep alive even clients exit...
The number of OMP threads per trainer is set to 4
Monitor file for alive servers already exist: /tmp/dgl_dist_monitor_long_live.
Use running server long_live.
cleanupu process runs
ssh: connect to host 172.31.2.66 port 22: Connection timed out
ssh: connect to host 172.31.1.191 port 22: Connection timed out
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
    self._target(*self._args, **self._kwargs)
  File "/home/luyc/workspace/dgl/tools/launch.py", line 109, in run
  File "/home/luyc/workspace/dgl/tools/launch.py", line 109, in run
    subprocess.check_call(ssh_cmd, shell=True)
    subprocess.check_call(ssh_cmd, shell=True)
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
  File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 172.31.2.66 'cd /home/luyc/workspace/dgl/examples/pytorch/graphsage/dist/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=data/ogb-product.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=1 ; python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=172.31.2.66 --master_port=1234 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1)'' returned non-zero exit status 255.
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 172.31.1.191 'cd /home/luyc/workspace/dgl/examples/pytorch/graphsage/dist/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=4 DGL_CONF_PATH=data/ogb-product.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=1 ; python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=172.31.2.66 --master_port=1234 train_dist.py --graph-name ogb-product --ip_config ip_config.txt --num-epochs 5 --batch-size 1000 --num_workers 0 --num_gpus 1)'' returned non-zero exit status 255.

OS: ubuntu 20.04
Python: v3.8.10
When running ping <nfs-server-ip> on the client machine, it shows it does connect to the server machine. I’m wondering how I can address the error. Any help is appreciated. Thank you!

Ah I figured it out. The error came from me adding ip_config.txt to dir ~/workspace/dgl instead of editing the one in the dir dgl/examples/pytorch/graphsage/dist.

May I ask what is the expected runtime for python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_trainers 2 --num_samplers 0 --num_servers 1 --part_config data/ogb-product.json --ip_config ip_config.txt --keep_alive --server_name long_live "python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 5 --batch_size 1000 --num_gpus 2"?
It seems it takes way too long here

Servers will keep alive even clients exit...
The number of OMP threads per trainer is set to 4
Monitor file for alive servers already exist: /tmp/dgl_dist_monitor_long_live.
Use running server long_live.
cleanupu process runs
Namespace(backend='gloo', batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=1, log_every=20, lr=0.003, n_classes=None, net_type='socket', num_clients=None, num_epochs=5, num_gpus=2, num_hidden=16, num_layers=2, pad_data=False, part_config=None, standalone=False)
dgltest Initializing DGL dist
Namespace(backend='gloo', batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, net_type='socket', num_clients=None, num_epochs=5, num_gpus=2, num_hidden=16, num_layers=2, pad_data=False, part_config=None, standalone=False)
dgltest Initializing DGL dist
[21:06:55[] /opt/dgl/src/rpc/rpc.cc:21:06:55] 128/opt/dgl/src/rpc/rpc.cc: :Sender with NetType~socket is created.128
: Sender with NetType~socket is created.
[21:06:55[] 21:06:55/opt/dgl/src/rpc/rpc.cc] :/opt/dgl/src/rpc/rpc.cc147:: 147Receiver with NetType~: socketReceiver with NetType~ is created.socket
 is created.
Namespace(backend='gloo', batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=1, log_every=20, lr=0.003, n_classes=None, net_type='socket', num_clients=None, num_epochs=5, num_gpus=2, num_hidden=16, num_layers=2, pad_data=False, part_config=None, standalone=False)
distdgltest Initializing DGL dist
Namespace(backend='gloo', batch_size=1000, batch_size_eval=100000, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='ogb-product', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, net_type='socket', num_clients=None, num_epochs=5, num_gpus=2, num_hidden=16, num_layers=2, pad_data=False, part_config=None, standalone=False)
distdgltest Initializing DGL dist
[21:06:55] /opt/dgl/src/rpc/rpc.cc:128: Sender with NetType~socket is created.
[21:06:55] /opt/dgl/src/rpc/rpc.cc:147: Receiver with NetType~socket is created.
[21:06:55] /opt/dgl/src/rpc/rpc.cc:128: Sender with NetType~socket is created.
[21:06:55] /opt/dgl/src/rpc/rpc.cc:147: Receiver with NetType~socket is created.
[21:16:55] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.142.0.3:30050
...
[21:27:07] /opt/dgl/src/rpc/network/socket_communicator.cc:80: Trying to connect receiver: 10.142.0.3:30050

It does not seem right. (if I deleted the keep_alive and server_name flag, there is a cuda error: invalid ordinal)

This might be related to [DistDGL] Program hangs/crashes sometimes due to unknown reason Β· Issue #3881 Β· dmlc/dgl Β· GitHub if you are using an old version. Did you upgrade to 0.9?

please remove them. I don’t think you need keep alive feature in your case. And you said cuda error invalid ordinal is thrown. could you share the callstacks?

I did not, but it seems to work after removing --keep alive with python3 ~/workspace/dgl/tools/launch.py --workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ --num_trainers 2 --num_samplers 0 --num_servers 1 --part_config data/ogb-product.json --ip_config ip_config.txt --server_name long_live "python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 5 --batch_size 1000 --num_gpus 1". Avg runtime per epoch for distdgl was about 35sec. Does this seem right?

After removing only --keep alive, it seems to be fine. I think the cuda error invalid ordinal came from --num_gpus 2 at the end of my previous command since my machine only has one gpu. Just to clarify, is the command in the double quotation used as command that we normally used in terminal when it is single machine training?

the quoted part is for launching each trainers/servers and train_dist.py is created for distributed(including standalone mode) on purpose, so it’s not exactly the one we usually for single machine though they’re very like.

I think it looks ok. and pls launch via --num_trainers=1 as you have only 1 gpu on each machine.

sounds great! Thank you so much for your help.

One quick question, I also wanna try papers100M on single machine dgl and mag240m on distdgl. Is there any runnable examples provided? I was not able to find one on github.

dgl/examples/pytorch/graphsage/dist at master Β· dmlc/dgl Β· GitHub train_dist.py could be used for dist. As for non-dist ones, just refer to upper directory.

Ah I modified DglNodePropPredDataset('ogbn-products') in line 130 of node_classification.py to DglNodePropPredDataset('ogbn-papers100M') and tried to use python3 dgl/examples/pytorch/graphsage/node_classification.py for papers100M but it was killed after running about 4 hours.
Terminal output:

Training in mixed mode.
Loading data
Loading necessary files...
This might take a while.
Killed

The results of sudo dmesg -T| grep -E -i -B100 'killed process' is

[Wed Jul 27 22:11:46 2022] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1005.slice/session-2.scope,task=python3,pid=1791,uid=1005
[Wed Jul 27 22:11:46 2022] Out of memory: Killed process 1791 (python3) total-vm:89047608kB, anon-rss:60503696kB, file-rss:1980kB, shmem-rss:0kB, UID:1005 pgtables:119488kB oom_score_adj:0

You need to have large enough RAM both for CPU/GPU. below is the data size of ogbn-papers100M after converting to dgl format.

79G Jul 26 07:00 dgl_data_processed

Ah that makes sense. Now I have MemAvailable: 60455684 kB and about 15.9GB memory for GPU. I tried again, which gives in about 10 mins

Training in mixed mode.
Loading data
Loading necessary files...
This might take a while.
Processing graphs...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 13231.24it/s]
Converting graphs into DGL objects...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:03<00:00,  3.86s/it]
Saving...
Killed

Then, I run the same command again, it gave seg fault now. I’m wondering if this is from me not modifying the script correctly or it’s the problem of not having enough memory?

Training in mixed mode.
Loading data
Traceback (most recent call last):
  File "dgl/examples/pytorch/graphsage/node_classification.py", line 128, in <module>
    dataset = AsNodePredDataset(DglNodePropPredDataset('ogbn-papers100M'))
  File "/home/luyc/.local/lib/python3.8/site-packages/ogb/nodeproppred/dataset_dgl.py", line 69, in __init__
    self.pre_process()
  File "/home/luyc/.local/lib/python3.8/site-packages/ogb/nodeproppred/dataset_dgl.py", line 76, in pre_process
    self.graph, label_dict = load_graphs(pre_processed_file_path)
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 182, in load_graphs
    return load_graph_v2(filename, idx_list)
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 194, in load_graph_v2
    return [gdata.get_graph() for gdata in heterograph_list], label_dict
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/data/graph_serialize.py", line 194, in <listcomp>
    return [gdata.get_graph() for gdata in heterograph_list], label_dict
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/data/heterograph_serialize.py", line 54, in get_graph
    ndict = {ntensor[i]: F.zerocopy_from_dgl_ndarray(ntensor[i+1]) for i in range(0, len(ntensor), 2)}
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/data/heterograph_serialize.py", line 54, in <dictcomp>
    ndict = {ntensor[i]: F.zerocopy_from_dgl_ndarray(ntensor[i+1]) for i in range(0, len(ntensor), 2)}
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/backend/pytorch/tensor.py", line 359, in zerocopy_from_dgl_ndarray
    if data.shape == (0,):
  File "/home/luyc/.local/lib/python3.8/site-packages/dgl/_ffi/ndarray.py", line 177, in shape
    return tuple(self.handle.contents.shape[i] for i in range(self.handle.contents.ndim))
AttributeError: 'NoneType' object has no attribute 'contents'
Segmentation fault (core dumped)

I tried from my side and it needs about 206G in peak. I checked via below utility:

I have also finished the train and you need to made below modification for train/test or you will hit dtype error:
y = blocks[-1].dstdata['label'].long()
ys.append(blocks[-1].dstdata['label'].long())

Thank you! I increased the memory to 240GB and the training stage works. But the testing stage gives CUDA oom error which I think it may be my GPU has just 16GB memory.

1 Like