Examples of distributed training

I want to run the GraphSAGE distributed code in the examples/distributed directory, but I don't have a real cluster, so I used VMware to build three virtual machines as nodes for distributed training. I followed the README to deploy the environment, set up NFS, and so on, but running the code on node0 (the main node) fails with:
```
(fordgl) wiley@wiley-virtual-machine:/home/ubuntu/workspace$ python /home/ubuntu/workspace/dgl/tools/launch.py --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config data/reddit.json --ip_config ip_config.txt "python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 2
/home/ubuntu/workspace/dgl/tools/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
Traceback (most recent call last):
  File "node_classification.py", line 5, in <module>
    import dgl
ModuleNotFoundError: No module named 'dgl'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[the same "No module named 'dgl'" traceback and exit status 1 are reported for the server processes on 192.168.85.130 (DGL_SERVER_ID=1) and 192.168.85.131 (DGL_SERVER_ID=2)]
/usr/bin/python3: Error while finding module specification for 'torch.distributed.run' (ModuleNotFoundError: No module named 'torch')
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=2 DGL_GROUP_ID=0 ; python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.85.128 --master_port=1234 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[the same "No module named 'torch'" error and exit status 1 are reported for the client processes on 192.168.85.130 (--node_rank=1) and 192.168.85.131 (--node_rank=2)]
cleanup process runs
Task failed
```
I did some preliminary investigation: the error messages say the dgl and torch packages cannot be found. It turns out that after the launch command runs, the processes use the Python interpreter in /usr/bin instead of the conda environment named fordgl that I created (which has all the packages installed). I tried setting environment variables, but it didn't help; every time an error is reported, the /usr/bin interpreter is used rather than the one in the fordgl environment. Now I am stuck. How can I solve this problem?
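One way to see why this happens: launch.py runs its per-node commands via `ssh host '<command>'`, which starts a non-interactive shell, and on many distros `~/.bashrc` returns early for non-interactive shells, so the `conda init` block never runs and PATH falls back to the system default. A quick local check (a sketch; the worker IP in the commented line is taken from the log above):

```shell
# Compare what a login shell resolves (PATH typically includes conda)...
bash -lc 'command -v python3'
# ...with what a plain non-interactive shell, like the one ssh spawns, resolves
# (often /usr/bin/python3):
bash -c 'command -v python3'

# To check a worker node directly, run something like:
# ssh 192.168.85.130 'command -v python3; python3 -c "import dgl"'
```

If the two outputs differ, the SSH-spawned processes are not seeing the fordgl environment.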

Does the launch.py script use the Python interpreter in /usr/bin by default? How can this be fixed?

The conda env is not activated when distributed training is launched. If you want to use a conda env, you could try specifying the conda env's Python explicitly, like `python launch.py … "conda_python node_classification.py xxx"`, where `conda_python` is the path to the Python binary inside the env. I am not sure if this works.
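Concretely, that suggestion amounts to putting the absolute path of the env's interpreter inside the quoted command, so every SSH-spawned process uses it regardless of PATH. This is a sketch under assumptions: the conda install location (`~/miniconda3`) and the resulting path to the fordgl interpreter are guesses for your machines, and the env must exist at the same path on every node in ip_config.txt:

```shell
# Assumed path; find yours with: conda activate fordgl && which python3
CONDA_PY=$HOME/miniconda3/envs/fordgl/bin/python3

# Same launch command as before, but with the env's interpreter instead of
# the bare "python3" inside the quoted per-node command.
python /home/ubuntu/workspace/dgl/tools/launch.py \
    --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ \
    --num_trainers 1 --num_samplers 0 --num_servers 1 \
    --part_config data/reddit.json \
    --ip_config ip_config.txt \
    "$CONDA_PY node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```

I am not certain launch.py propagates a non-`python3` interpreter into the `torch.distributed.run` invocation it builds; if it does not, an alternative is to prepend the env's `bin` directory to PATH before the interactive-shell guard in `~/.bashrc` on every node, so non-interactive SSH shells also find it.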

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.