I am following the tutorial here, but I cannot get it to work. Every time I run the command from my master computer, I get the following error:
(dist-gnn) ccsp-admin@CCSPadminsiMac workspace % python3 ~/workspace/dgl/tools/launch.py --ssh_username ccsp-admin --workspace ~/workspace/ --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config 2part_data/ogbn-proteins.json --ip_config ip_config.txt "python3 train_dist.py"
The number of OMP threads per trainer is set to 4
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=1; python3 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'
cleanupu process runs
zsh:1: command not found: torchrun
zsh:1: command not found: conda
Exception in thread Thread-3:
Traceback (most recent call last):
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'' returned non-zero exit status 127.
here
Traceback (most recent call last):
File "train_dist.py", line 3, in <module>
import torch as th
ModuleNotFoundError: No module named 'torch'
Exception in thread Thread-1:
Traceback (most recent call last):
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 train_dist.py)'' returned non-zero exit status 1.
zsh:1: command not found: conda
zsh:1: command not found: torchrun
Exception in thread Thread-4:
Traceback (most recent call last):
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'' returned non-zero exit status 127.
Traceback (most recent call last):
File "train_dist.py", line 2, in <module>
import torch as th
ModuleNotFoundError: No module named 'torch'
Exception in thread Thread-2:
Traceback (most recent call last):
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
subprocess.check_call(ssh_cmd, shell=True)
File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=1; python3 train_dist.py)'' returned non-zero exit status 1.
My devices are two iMacs (M1, 2021) running macOS Monterey.
I am not using nfsd because both devices already have the files stored locally.
This is how I installed dgl and torch on both of my devices:
conda create --name dist-gnn python=3.8
conda install pytorch torchvision -c pytorch
conda install -c dglteam dgl
I have set up passwordless ssh on both the devices.
The strange thing is that when I log in to the devices manually, I can import torch without any problem, but I am not sure why the subprocess is failing. Any help would be appreciated!
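The log above shows "zsh:1: command not found" for both conda and torchrun, followed by the ModuleNotFoundError for torch. One possible explanation (an assumption, not confirmed here) is that a command passed to ssh is run by a non-interactive shell, which may not load the same startup files as a manual login. A minimal sketch for checking what such a shell sees is below. The helper names build_probe_cmd and run_probe are hypothetical and are not part of DGL or launch.py.

```python
import subprocess


def build_probe_cmd(user, host, port=22):
    """Build an ssh command that reports what a non-interactive remote shell sees.

    Hypothetical helper for debugging; it mirrors the shape of the ssh
    commands printed by launch.py (one command string run on the remote side).
    """
    # `command -v NAME` prints how the shell resolves NAME, or nothing if it cannot.
    remote = (
        "echo PATH=$PATH; "
        "command -v conda; "
        "command -v torchrun; "
        "command -v python3"
    )
    return [
        "ssh",
        "-o", "StrictHostKeyChecking=no",
        "-p", str(port),
        "{}@{}".format(user, host),
        remote,
    ]


def run_probe(user, host, port=22):
    """Run the probe over ssh and return (stdout, stderr) as text."""
    result = subprocess.run(
        build_probe_cmd(user, host, port),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout, result.stderr
```

Calling run_probe("ccsp-admin", "192.168.50.200") and comparing its output with the same four commands typed after a manual login would show whether the two shells resolve conda, torchrun, and python3 differently.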
PS: I updated the original launch script to call conda activate dist-gnn and to print some things for debugging.
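In the commands printed above, only the server lines (DGL_ROLE=server) include conda activate dist-gnn, while the client lines call torchrun directly, and the conda call itself fails with "command not found". A minimal sketch of one way the remote command could be prefixed is below, assuming the conda function is defined by the conda.sh script under the install root and that both machines use the same root as the one in the traceback (/Users/ccsp-admin/opt/anaconda3). The helper name with_conda_env is hypothetical and is not part of launch.py.

```python
def with_conda_env(remote_cmd, env_name, conda_root):
    """Prefix a remote command so a non-interactive shell can activate a conda env.

    Hypothetical helper. Assumes conda_root and env_name contain no spaces or
    quotes, because the result is meant to sit inside a single-quoted ssh command.
    """
    # Sourcing conda.sh defines the `conda` shell function in the current shell,
    # which is what `conda activate` needs in order to work.
    setup = "source {}/etc/profile.d/conda.sh; conda activate {}".format(
        conda_root, env_name
    )
    return "{}; {}".format(setup, remote_cmd)
```

For example, with_conda_env("python3 train_dist.py", "dist-gnn", "/Users/ccsp-admin/opt/anaconda3") returns a single string that could replace the plain "conda activate dist-gnn; ..." prefix, and the same wrapper could be applied to the torchrun client commands.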