Distributed training not working on two Apple iMac devices

I am following the tutorial here, but I am not able to get it to work. Every time I try to run the command from my master computer, I get the following error:

(dist-gnn) ccsp-admin@CCSPadminsiMac workspace % python3 ~/workspace/dgl/tools/launch.py  --ssh_username ccsp-admin --workspace ~/workspace/   --num_trainers 1   --num_samplers 0   --num_servers 1   --part_config 2part_data/ogbn-proteins.json   --ip_config ip_config.txt   "python3 train_dist.py"
The number of OMP threads per trainer is set to 4
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=0; python3 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=1; python3 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'
cleanupu process runs
zsh:1: command not found: torchrun
zsh:1: command not found: conda
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'' returned non-zero exit status 127.

here
Traceback (most recent call last):
  File "train_dist.py", line 3, in <module>
    import torch as th
ModuleNotFoundError: No module named 'torch'
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=0; python3 train_dist.py)'' returned non-zero exit status 1.

zsh:1: command not found: conda
zsh:1: command not found: torchrun
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 ; torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=192.168.50.200 --master_port=1234 train_dist.py)'' returned non-zero exit status 127.

Traceback (most recent call last):
  File "train_dist.py", line 2, in <module>
    import torch as th
ModuleNotFoundError: No module named 'torch'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.134 'cd /Users/ccsp-admin/workspace/; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=1; python3 train_dist.py)'' returned non-zero exit status 1.

My devices are:

Two iMac (M1, 2021) with macOS Monterey

I am not using nfsd, since both devices have their own copy of the files.

This is how I installed DGL and PyTorch on both of my devices:

conda create --name dist-gnn python=3.8
conda install pytorch torchvision -c pytorch
conda install -c dglteam dgl

I have set up passwordless ssh on both the devices.

The strange thing is that when I manually log in to the devices I can import torch without any problem, so I am not sure why the subprocess is failing. Any help would be much appreciated!

PS: I did update the original launch script to call conda activate dist-gnn and to print some things for debugging.

It seems that conda cannot be activated when the command is run remotely over ssh. Could you verify that first?

When I manually ssh into machine2 from machine1, I can activate conda just fine, and I can also import the torch module and verify it just like here.

How about something like:
ssh -o StrictHostKeyChecking=no -p 22 ccsp-admin@192.168.50.200 "conda activate dist-gnn; python -c 'import torch'"

If the above fails, then the problem is probably related to the ssh settings rather than DGL.
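To run that check against both machines in one go, something like the following could help. It is only a sketch: the helper names build_check_command and check_host are made up for illustration, and the runner parameter exists only so the function can be exercised without a real ssh connection. The user name, host, and environment name would be the ones from the logs above.

```python
import subprocess


def build_check_command(user, host, env_name, port=22):
    """Return the ssh argv that tries to activate conda and import torch remotely."""
    remote = f"conda activate {env_name}; python -c 'import torch'"
    return ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(port),
            f"{user}@{host}", remote]


def check_host(user, host, env_name, runner=subprocess.run):
    """Run the check on one host and return (exit_code, stderr_text).

    `runner` defaults to subprocess.run; a stand-in can be passed for dry runs.
    """
    result = runner(build_check_command(user, host, env_name),
                    capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


# Example call (needs working passwordless ssh, so it is left commented out):
# code, err = check_host("ccsp-admin", "192.168.50.200", "dist-gnn")
# print(code, err)
```

A non-zero exit code together with "command not found: conda" on stderr would point at the shell setup on the remote side rather than at DGL.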

OK, I was able to make that run. I just needed to add . ~/.zshrc before my conda command.
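For anyone hitting the same thing: a likely explanation is that ssh host 'command' starts a non-interactive shell, zsh only reads ~/.zshrc for interactive shells, and conda init typically puts its setup code in that file. Below is a rough sketch of how the remote command can be wrapped so the rc file is sourced first. The function names wrap_remote_command and build_ssh_command are hypothetical and are not part of DGL's launch.py; the resulting string just mirrors the shape of the commands shown in the logs.

```python
import shlex


def wrap_remote_command(cmd, workspace, env_name, rc_file="~/.zshrc"):
    """Prefix cmd so it runs inside the conda env in a non-interactive ssh shell."""
    # Sourcing the rc file makes the `conda` shell function available
    # before `conda activate` is called.
    return f"cd {workspace}; . {rc_file}; conda activate {env_name}; ({cmd})"


def build_ssh_command(host, remote_cmd, port=22, username=""):
    """Build the full ssh command line as a single string for shell=True."""
    target = f"{username}@{host}" if username else host
    # shlex.quote keeps the remote command as one argument for the local shell.
    return (f"ssh -o StrictHostKeyChecking=no -p {port} {target} "
            f"{shlex.quote(remote_cmd)}")


remote = wrap_remote_command("python3 train_dist.py",
                             "/Users/ccsp-admin/workspace/", "dist-gnn")
print(build_ssh_command("192.168.50.200", remote, username="ccsp-admin"))
```

Putting the conda setup in ~/.zshenv, which zsh reads for every shell, is another option, at the cost of running it in every zsh invocation.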

But when it ran, it gave me another error, and I am not sure why it is happening. The error is below. Do you know the reason behind it? Thanks.

Traceback (most recent call last):
  File "train_dist.py", line 13, in <module>
    dgl.distributed.initialize(ip_config='ip_config.txt')
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 261, in initialize
    connect_to_server(ip_config, num_servers, max_queue_size, net_type)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/rpc_client.py", line 141, in connect_to_server
    rpc.register_sig_handler()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 992, in register_sig_handler
    _CAPI_DGLRPCHandleSignal()
NameError: name '_CAPI_DGLRPCHandleSignal' is not defined
Traceback (most recent call last):
  File "train_dist.py", line 14, in <module>
    dgl.distributed.initialize(ip_config='ip_config.txt')
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/dist_context.py", line 261, in initialize
    connect_to_server(ip_config, num_servers, max_queue_size, net_type)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/rpc_client.py", line 141, in connect_to_server
    rpc.register_sig_handler()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/site-packages/dgl/distributed/rpc.py", line 992, in register_sig_handler
    _CAPI_DGLRPCHandleSignal()
NameError: name '_CAPI_DGLRPCHandleSignal' is not defined
libc++abi: terminating with uncaught exception of type dmlc::Error: [16:43:32] /tmp/dgl_src/src/runtime/shared_mem.cc:55: Check failed: munmap(ptr_, size_) != -1: Invalid argument
Stack trace:
  [bt] (0) 1   libdgl.dylib                        0x000000013755a14f dmlc::LogMessageFatal::~LogMessageFatal() + 111
  [bt] (1) 2   libdgl.dylib                        0x0000000137f2f516 dgl::runtime::SharedMemory::~SharedMemory() + 150
  [bt] (2) 3   libdgl.dylib                        0x0000000137f738c0 dgl::HeteroGraph::CopyToSharedMem(std::__1::shared_ptr<dgl::BaseHeteroGraph>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) + 7104
  [bt] (3) 4   libdgl.dylib                        0x0000000137f8ab15 std::__1::__function::__func<dgl::$_45, std::__1::allocator<dgl::$_45>, void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)>::operator()(dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) + 677
  [bt] (4) 5   libdgl.dylib                        0x0000000137f12fb8 DGLFuncCall + 72
  [bt] (5) 6   core.cpython-37m-darwin.so          0x000000010eea01a5 __pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) + 965
  [bt] (6) 7   core.cpython-37m-darwin.so          0x000000010eea43f4 __pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) + 52
  [bt] (7) 8   python3.7                           0x0000000104071bbb _PyObject_FastCallKeywords + 683
  [bt] (8) 9   python3.7                           0x000000010417a565 call_function + 725


Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.1.98 'cd /Users/ccsp-admin/workspace/; . ~/.zshrc; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=1; python3 train_dist.py)'' returned non-zero exit status 255.

libc++abi: terminating with uncaught exception of type dmlc::Error: [16:43:33] /tmp/dgl_src/src/runtime/shared_mem.cc:55: Check failed: munmap(ptr_, size_) != -1: Invalid argument
Stack trace:
  [bt] (0) 1   libdgl.dylib                        0x000000016b37714f dmlc::LogMessageFatal::~LogMessageFatal() + 111
  [bt] (1) 2   libdgl.dylib                        0x000000016bd4c516 dgl::runtime::SharedMemory::~SharedMemory() + 150
  [bt] (2) 3   libdgl.dylib                        0x000000016bd908c0 dgl::HeteroGraph::CopyToSharedMem(std::__1::shared_ptr<dgl::BaseHeteroGraph>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, std::__1::set<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) + 7104
  [bt] (3) 4   libdgl.dylib                        0x000000016bda7b15 std::__1::__function::__func<dgl::$_45, std::__1::allocator<dgl::$_45>, void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)>::operator()(dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) + 677
  [bt] (4) 5   libdgl.dylib                        0x000000016bd2ffb8 DGLFuncCall + 72
  [bt] (5) 6   core.cpython-37m-darwin.so          0x000000012bc591a5 __pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) + 965
  [bt] (6) 7   core.cpython-37m-darwin.so          0x000000012bc5d3f4 __pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) + 52
  [bt] (7) 8   python3.7                           0x00000001021f4bbb _PyObject_FastCallKeywords + 683
  [bt] (8) 9   python3.7                           0x00000001022fd565 call_function + 725


Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/ccsp-admin/workspace/dgl/tools/launch.py", line 112, in run
    subprocess.check_call(ssh_cmd, shell=True)
  File "/Users/ccsp-admin/opt/anaconda3/envs/dist-gnn/lib/python3.7/subprocess.py", line 363, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.1.237 'cd /Users/ccsp-admin/workspace/; . ~/.zshrc; conda activate dist-gnn; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=2part_data/ogbn-proteins.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=0; python3 train_dist.py)'' returned non-zero exit status 255.

Which version of DGL are you using?

I am using DGL 0.7.2 with PyTorch as the backend.

Hi @kartikeyas00, after digging into the code a bit, we found that the error is caused by dgl/rpc.cc at d798280f198ae17ca39680d6167d3e6b1b6b43e1 · dmlc/dgl · GitHub, where registering signal handlers is skipped on non-Linux operating systems. Unfortunately, that is currently a limitation. We will see whether we can improve it in future releases.
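Until that changes, one way to fail early with a clearer message is a small platform check at the top of the training script, before dgl.distributed.initialize is called. This is only a sketch under the assumption, taken from the reply above, that the distributed RPC code in this DGL version is usable on Linux only. The helper names dist_rpc_supported and require_linux are made up for illustration and are not DGL APIs.

```python
import platform


def dist_rpc_supported(system_name=None):
    """Return True only on Linux (assumption based on the limitation described above)."""
    if system_name is None:
        system_name = platform.system()  # e.g. "Linux", "Darwin", "Windows"
    return system_name == "Linux"


def require_linux(system_name=None):
    """Raise a readable error instead of failing later with a NameError."""
    name = system_name if system_name is not None else platform.system()
    if not dist_rpc_supported(name):
        raise RuntimeError(
            f"Distributed training is assumed to need Linux, but this host runs {name}."
        )


# Hypothetical usage at the top of train_dist.py:
# require_linux()
# dgl.distributed.initialize(ip_config='ip_config.txt')
```

On the iMacs in this thread platform.system() reports "Darwin", so the check would stop the script before the RPC setup is reached.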

Should I open an issue on GitHub? Also, is there a timeline for when it will be resolved? I would appreciate it!

Yeah, opening an issue would be great. We'll add it to the project tracker and will let you know when it is scheduled.

I have submitted a new issue here