Examples of distributed training

I want to run the GraphSAGE distributed code in the examples/distributed directory, but I don't have a real cluster, so I used VMware to build three virtual machines as nodes for distributed training. I followed the README to deploy the environment, set up NFS, and so on, but running the code on node0 (the main node) fails with:
```
(fordgl) wiley@wiley-virtual-machine:/home/ubuntu/workspace$ python /home/ubuntu/workspace/dgl/tools/launch.py --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ --num_trainers 1 --num_samplers 0 --num_servers 1 --part_config data/reddit.json --ip_config ip_config.txt "python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
The number of OMP threads per trainer is set to 2
/home/ubuntu/workspace/dgl/tools/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
Traceback (most recent call last):
  File "node_classification.py", line 5, in <module>
    import dgl
ModuleNotFoundError: No module named 'dgl'
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=0; python3 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[the same "No module named 'dgl'" traceback and exit status 1 are reported for the server processes on 192.168.85.130 (DGL_SERVER_ID=1) and 192.168.85.131 (DGL_SERVER_ID=2)]
/usr/bin/python3: Error while finding module specification for 'torch.distributed.run' (ModuleNotFoundError: No module named 'torch')
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 192.168.85.128 'cd /home/ubuntu/workspace/dgl/examples/distributed/graphsage/; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=3 DGL_CONF_PATH=data/reddit.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=2 DGL_GROUP_ID=0 ; python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=3 --node_rank=0 --master_addr=192.168.85.128 --master_port=1234 node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
[the same "No module named 'torch'" error and exit status 1 are reported for the client processes on 192.168.85.130 (--node_rank=1) and 192.168.85.131 (--node_rank=2)]
cleanup process runs
Task failed
```
I did some preliminary investigation: the error messages say the dgl and torch packages cannot be found. It turns out that after the launch command runs, the processes use the Python interpreter in /usr/bin instead of the conda environment named fordgl that I created (which has all the packages installed). I tried setting environment variables, but it didn't help; every time an error is reported, the /usr/bin interpreter is used rather than the one in the fordgl environment. Now I am stuck. How can I solve this problem?
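One way to see why this happens: launch.py runs its per-node commands via `ssh host '<command>'`, which starts a non-interactive shell, and on many distros `~/.bashrc` returns early for non-interactive shells, so the `conda init` block never runs and PATH falls back to the system default. A quick local check (a sketch; the worker IP in the commented line is taken from the log above):

```shell
# Compare what a login shell resolves (PATH typically includes conda)...
bash -lc 'command -v python3'
# ...with what a plain non-interactive shell, like the one ssh spawns, resolves
# (often /usr/bin/python3):
bash -c 'command -v python3'

# To check a worker node directly, run something like:
# ssh 192.168.85.130 'command -v python3; python3 -c "import dgl"'
```

If the two outputs differ, the SSH-spawned processes are not seeing the fordgl environment.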

Does the launch.py script use the Python interpreter in /usr/bin by default? How can this be fixed?

The conda env is not activated when distributed training is launched. If you want to use a conda env, you could try specifying the conda env's Python explicitly, like `python launch.py … "conda_python node_classification.py xxx"`, where `conda_python` is the path to the Python binary inside the env. I am not sure if this works.
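Concretely, that suggestion amounts to putting the absolute path of the env's interpreter inside the quoted command, so every SSH-spawned process uses it regardless of PATH. This is a sketch under assumptions: the conda install location (`~/miniconda3`) and the resulting path to the fordgl interpreter are guesses for your machines, and the env must exist at the same path on every node in ip_config.txt:

```shell
# Assumed path; find yours with: conda activate fordgl && which python3
CONDA_PY=$HOME/miniconda3/envs/fordgl/bin/python3

# Same launch command as before, but with the env's interpreter instead of
# the bare "python3" inside the quoted per-node command.
python /home/ubuntu/workspace/dgl/tools/launch.py \
    --workspace /home/ubuntu/workspace/dgl/examples/distributed/graphsage/ \
    --num_trainers 1 --num_samplers 0 --num_servers 1 \
    --part_config data/reddit.json \
    --ip_config ip_config.txt \
    "$CONDA_PY node_classification.py --graph_name reddit --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
```

I am not certain launch.py propagates a non-`python3` interpreter into the `torch.distributed.run` invocation it builds; if it does not, an alternative is to prepend the env's `bin` directory to PATH before the interactive-shell guard in `~/.bashrc` on every node, so non-interactive SSH shells also find it.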

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.