I followed the instructions at https://github.com/dmlc/dgl/blob/master/examples/distributed/graphsage/README.md, specifically Step 3 (Launch distributed jobs). The command below launches one process per machine for both sampling and training:
python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogbn-products.json \
--ip_config ip_config.txt \
"python3 node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
I ran distributed training with a similar command:
python3 /home/yw8143/GNN/GNN_acceleration/dist/launch.py \
--workspace /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json \
--ip_config ip_config.txt \
"/scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"
However, I received the following error, which seems to suggest that launch.py is now deprecated by PyTorch. I'm not very familiar with distributed training; can anyone help me with this error?
The number of OMP threads per trainer is set to 80
/home/yw8143/GNN/GNN_acceleration/dist/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
thread.setDaemon(True)
cleanupu process runs
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[18:08:23] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[18:08:23] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
bash: line 1: 3631882 Bus error (core dumped) /scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.0.3.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc PYTHONPATH=:.. DGL_SERVER_ID=1; /scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 135.
usage: train_dist.py [-h] [--graph_name GRAPH_NAME] [--id ID]
[--ip_config IP_CONFIG] [--part_config PART_CONFIG]
[--n_classes N_CLASSES] [--backend BACKEND]
[--num_gpus NUM_GPUS] [--num_epochs NUM_EPOCHS]
[--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
[--fan_out FAN_OUT] [--batch_size BATCH_SIZE]
[--batch_size_eval BATCH_SIZE_EVAL]
[--log_every LOG_EVERY] [--eval_every EVAL_EVERY]
[--lr LR] [--dropout DROPOUT] [--local_rank LOCAL_RANK]
[--standalone] [--pad-data]
train_dist.py: error: unrecognized arguments: --local-rank=0
usage: train_dist.py [-h] [--graph_name GRAPH_NAME] [--id ID]
[--ip_config IP_CONFIG] [--part_config PART_CONFIG]
[--n_classes N_CLASSES] [--backend BACKEND]
[--num_gpus NUM_GPUS] [--num_epochs NUM_EPOCHS]
[--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
[--fan_out FAN_OUT] [--batch_size BATCH_SIZE]
[--batch_size_eval BATCH_SIZE_EVAL]
[--log_every LOG_EVERY] [--eval_every EVAL_EVERY]
[--lr LR] [--dropout DROPOUT] [--local_rank LOCAL_RANK]
[--standalone] [--pad-data]
train_dist.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3631958) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3631959) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
Traceback (most recent call last):
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 196, in _run_module_as_main
Traceback (most recent call last):
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-15_18:08:26
host : ga028.hpc.nyu.edu
rank : 1 (local_rank: 0)
exitcode : 2 (pid: 3631959)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
return _run_code(code, main_globals, None,
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-07-15_18:08:26
host : ga028.hpc.nyu.edu
rank : 0 (local_rank: 0)
exitcode : 2 (pid: 3631958)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.32.35.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=80 DGL_GROUP_ID=0 PYTHONPATH=:.. ; /scratch/yw8143/miniconda3/envs/GNNN/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=10.32.35.204 --master_port=1234 train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.0.3.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=80 DGL_GROUP_ID=0 PYTHONPATH=:.. ; /scratch/yw8143/miniconda3/envs/GNNN/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=10.32.35.204 --master_port=1234 train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
^C2023-07-15 18:08:57,407 INFO Stop launcher
^C2023-07-15 18:08:58,249 INFO Stop launcher
Exception ignored in atexit callback: <function _exit_function at 0x7fd1b0173400>
Traceback (most recent call last):
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
p.join()
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)
File "/home/yw8143/GNN/GNN_acceleration/dist/launch.py", line 636, in signal_handler
sys.exit(0)
SystemExit: 0
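Looking more closely at the log, the part that actually kills the trainer processes seems to be train_dist.py: error: unrecognized arguments: --local-rank=0. My understanding (please correct me if I'm wrong) is that the newer torch.distributed.launch passes the rank as --local-rank (hyphen) and also exports LOCAL_RANK, while train_dist.py only defines --local_rank (underscore). Below is a minimal sketch of the kind of change I think the FutureWarning is asking for in train_dist.py's argument parsing; the argument names are taken from the usage message above, and I haven't verified that this is the right fix (there is also the separate "Bus error" on the server side).

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--graph_name", type=str, default="ogb-arxiv")
parser.add_argument("--ip_config", type=str, default="ip_config.txt")
parser.add_argument("--num_epochs", type=int, default=30)
parser.add_argument("--batch_size", type=int, default=1000)
# Accept both spellings: older torch.distributed.launch passed --local_rank,
# the newer one passes --local-rank (and always exports LOCAL_RANK).
parser.add_argument("--local_rank", "--local-rank", type=int, default=None)
args = parser.parse_args()

# Fall back to the environment variable, as the FutureWarning suggests.
if args.local_rank is None:
    args.local_rank = int(os.environ.get("LOCAL_RANK", 0))

Does this look like the right direction, or is the Bus error on the other machine the real problem?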