A problem I met when setting up distributed training

I followed the instructions here https://github.com/dmlc/dgl/blob/master/examples/distributed/graphsage/README.md, specifically Step 3: Launch distributed jobs. The command below launches one process per machine for both sampling and training.

python3 ~/workspace/dgl/tools/launch.py \
--workspace ~/workspace/dgl/examples/pytorch/graphsage/dist/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogbn-products.json \
--ip_config ip_config.txt \
"python3 node_classification.py --graph_name ogbn-products --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"

I ran distributed training with a similar command:

python3 /home/yw8143/GNN/GNN_acceleration/dist/launch.py \
--workspace /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json \
--ip_config ip_config.txt \
"/scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000"

However, I received the following error, which seems to suggest that launch.py is now deprecated by PyTorch. I’m not very familiar with distributed training; can anyone help me with this error?


The number of OMP threads per trainer is set to 80
/home/yw8143/GNN/GNN_acceleration/dist/launch.py:148: DeprecationWarning: setDaemon() is deprecated, set the daemon attribute instead
  thread.setDaemon(True)
cleanupu process runs
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[18:08:23] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[18:08:23] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
bash: line 1: 3631882 Bus error               (core dumped) /scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.0.3.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc PYTHONPATH=:..  DGL_SERVER_ID=1; /scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 135.
usage: train_dist.py [-h] [--graph_name GRAPH_NAME] [--id ID]
                     [--ip_config IP_CONFIG] [--part_config PART_CONFIG]
                     [--n_classes N_CLASSES] [--backend BACKEND]
                     [--num_gpus NUM_GPUS] [--num_epochs NUM_EPOCHS]
                     [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                     [--fan_out FAN_OUT] [--batch_size BATCH_SIZE]
                     [--batch_size_eval BATCH_SIZE_EVAL]
                     [--log_every LOG_EVERY] [--eval_every EVAL_EVERY]
                     [--lr LR] [--dropout DROPOUT] [--local_rank LOCAL_RANK]
                     [--standalone] [--pad-data]
train_dist.py: error: unrecognized arguments: --local-rank=0
usage: train_dist.py [-h] [--graph_name GRAPH_NAME] [--id ID]
                     [--ip_config IP_CONFIG] [--part_config PART_CONFIG]
                     [--n_classes N_CLASSES] [--backend BACKEND]
                     [--num_gpus NUM_GPUS] [--num_epochs NUM_EPOCHS]
                     [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                     [--fan_out FAN_OUT] [--batch_size BATCH_SIZE]
                     [--batch_size_eval BATCH_SIZE_EVAL]
                     [--log_every LOG_EVERY] [--eval_every EVAL_EVERY]
                     [--lr LR] [--dropout DROPOUT] [--local_rank LOCAL_RANK]
                     [--standalone] [--pad-data]
train_dist.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3631958) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 3631959) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
Traceback (most recent call last):
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 196, in _run_module_as_main
Traceback (most recent call last):
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-15_18:08:26
  host      : ga028.hpc.nyu.edu
  rank      : 1 (local_rank: 0)
  exitcode  : 2 (pid: 3631959)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
    return _run_code(code, main_globals, None,
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dist.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-15_18:08:26
  host      : ga028.hpc.nyu.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 3631958)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.32.35.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=80 DGL_GROUP_ID=0 PYTHONPATH=:.. ; /scratch/yw8143/miniconda3/envs/GNNN/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=10.32.35.204 --master_port=1234 train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 10.0.3.204 'cd /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/data/ogb-arxiv.json DGL_IP_CONFIG=ip_config.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=80 DGL_GROUP_ID=0 PYTHONPATH=:.. ; /scratch/yw8143/miniconda3/envs/GNNN/bin/python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=10.32.35.204 --master_port=1234 train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000)'' returned non-zero exit status 1.
^C2023-07-15 18:08:57,407 INFO Stop launcher
^C2023-07-15 18:08:58,249 INFO Stop launcher
Exception ignored in atexit callback: <function _exit_function at 0x7fd1b0173400>
Traceback (most recent call last):
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/home/yw8143/GNN/GNN_acceleration/dist/launch.py", line 636, in signal_handler
    sys.exit(0)
SystemExit: 0

I am conducting experiments on HPC, so I believe my configuration is single-machine multi-GPU. Therefore, my ip_config.txt is as follows (I do not know if it is right):

127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1

I’m using the latest DGL, Python 3.10, and the command is:


python3 launch.py \
--workspace /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-arxiv.json \
--ip_config ip_config.txt \
"/scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000 --num_gpu=4"

But I’m getting a lot of error messages, like:

start graph service on server 3 for part 3
Server is waiting for connections on [127.0.0.1:30050]...
[22:29:10] /opt/dgl/src/rpc/rpc.cc:141: Sender with NetType~socket is created.
[22:29:10] /opt/dgl/src/rpc/rpc.cc:161: Receiver with NetType~socket is created.
[22:29:10] /opt/dgl/src/rpc/network/tcp_socket.cc:86: Failed bind on 127.0.0.1:30050 , error: Address already in use
Traceback (most recent call last):
  File "/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/train_dist.py", line 417, in <module>
    main(args)
  File "/home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist/train_dist.py", line 295, in main
    dgl.distributed.initialize(args.ip_config)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/distributed/dist_context.py", line 278, in initialize
    serv.start()
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/distributed/dist_graph.py", line 471, in start
    start_server(
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/distributed/rpc_server.py", line 101, in start_server
    rpc.wait_for_senders(
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/distributed/rpc.py", line 195, in wait_for_senders
    _CAPI_DGLRPCWaitForSenders(ip_addr, int(port), int(num_senders), blocking)
  File "dgl/_ffi/_cython/./function.pxi", line 295, in dgl._ffi._cy3.core.FunctionBase.__call__
  File "dgl/_ffi/_cython/./function.pxi", line 241, in dgl._ffi._cy3.core.FuncCall
dgl._ffi.base.DGLError: [22:29:10] /opt/dgl/src/rpc/network/socket_communicator.cc:240: Cannot bind to 127.0.0.1:30050
Stack trace:
  [bt] (0) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x75) [0x7fa7f23323b5]
  [bt] (1) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/libdgl.so(dgl::network::SocketReceiver::Wait(std::string const&, int, bool)+0x33c) [0x7fa7f2848a9c]
  [bt] (2) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/libdgl.so(+0x8a7d48) [0x7fa7f2852d48]
  [bt] (3) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/libdgl.so(DGLFuncCall+0x48) [0x7fa7f26c1108]
  [bt] (4) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x155e3) [0x7fa7f1b975e3]
  [bt] (5) /scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so(+0x15c0b) [0x7fa7f1b97c0b]
  [bt] (6) /scratch/yw8143/miniconda3/envs/GNNN/bin/python(_PyObject_MakeTpCall+0x25b) [0x4f6c5b]
  [bt] (7) /scratch/yw8143/miniconda3/envs/GNNN/bin/python(_PyEval_EvalFrameDefault+0x4dde) [0x4f271e]
  [bt] (8) /scratch/yw8143/miniconda3/envs/GNNN/bin/python(_PyFunction_Vectorcall+0x6f) [0x4fd90f]
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
Distributed communication package - torch.distributed — PyTorch 2.0 documentation for
further instructions

warnings.warn(
/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
Distributed communication package - torch.distributed — PyTorch 2.0 documentation for
further instructions
train_dist.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 867166) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 867165) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 867168) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 867167) of binary: /scratch/yw8143/miniconda3/envs/GNNN/bin/python
Traceback (most recent call last):
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/scratch/yw8143/miniconda3/envs/GNNN/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_dist.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-07-15_22:29:21
host : ga011.hpc.nyu.edu
rank : 1 (local_rank: 0)
exitcode : 2 (pid: 867165)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.0 documentation

By the way, when I look at DGL’s partition_graph, it seems that apart from the inner nodes, all the sampled nodes are 1-hop neighbors. Wouldn’t this mean that it can’t support GNNs with more than 2 layers? (Otherwise, it would need to visit node information from other partitions during training.)

launch.py is now deprecated by PyTorch

This is just a warning. launch.py is a launch script that starts processes on the set of machines provided in the ip_config file using torch.distributed.launch, which is now deprecated by PyTorch. It doesn’t affect the distributed training. As per the PyTorch documentation, torchrun is recommended for launching distributed process groups instead of torch.distributed.launch, but torch.distributed.launch still works fine.
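
For reference, torchrun reads the local rank from the LOCAL_RANK environment variable instead of passing a --local-rank argument. A torchrun equivalent of the per-machine trainer command that launch.py assembles (see the ssh command in the log above) would look roughly like the sketch below; this is only an illustration of the recommended launcher, not what the current launch.py emits:

torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
--master_addr=10.32.35.204 --master_port=1234 \
train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000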

I am conducting experiments on HPC, so I believe my configuration is single-machine multi-GPU.
The way DGL distributed training works is that the server processes are launched first on each machine. You are getting the error tcp_socket.cc:86: Failed bind on 127.0.0.1:30050, error: Address already in use because you are using a single machine for distributed training. When the first server process is launched, it binds to 127.0.0.1:30050; when the second server process is launched, it tries to bind to the same address and port, which are already in use by the first server process. That’s why you are getting this error. I haven’t tried distributed training on a single machine, but in my opinion this approach won’t work unless you modify the DGL server process code to use a different port for each process, or you provide different ports in the ip_config file, for example 192.168.0.1 30050, 192.168.0.1 30051, 192.168.0.1 30052, and so on (see the sketch below).
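
For illustration only, an ip_config.txt that gives each server line its own port might look like the sketch below (placeholder IP and ports; a standard DistDGL setup expects one line per machine, and I have not verified that running several servers on one machine this way works):

192.168.0.1 30050
192.168.0.1 30051
192.168.0.1 30052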

What you can try instead is to create VMs on a single machine, if you have enough resources, and then launch the distributed training on the VMs.

So does this mean that the current DGL does not support single-node multi-GPU training? This is the topology of my cluster:

nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity  NUMA Affinity
GPU0     X      SYS     SYS     SYS     SYS     0-3           0-1
GPU1    SYS      X      SYS     SYS     SYS     0-3           0-1
GPU2    SYS     SYS      X      SYS     SYS     0-3           0-1
GPU3    SYS     SYS     SYS      X      SYS     0-3           0-1
mlx5_0  SYS     SYS     SYS     SYS      X

Also, the local_rank problem is an error rather than a warning, as I can see in the error message: train_dist.py: error: unrecognized arguments: --local-rank=0

I think the problem comes from a PyTorch compatibility issue: launch.py does not seem to be compatible with the latest torch, or maybe with Python 3.10…

This is the argv I received in train_dist.py: ['--local-rank=1', '--graph_name', 'ogb-arxiv', '--ip_config', 'ip_config.txt', '--num_epochs', '30', '--batch_size', '1000', '--num_gpus', '4']
Here is the complete error message:
usage: train_dist.py [-h] [--graph_name GRAPH_NAME] [--id ID]
                     [--ip_config IP_CONFIG] [--part_config PART_CONFIG]
                     [--n_classes N_CLASSES] [--backend BACKEND]
                     [--num_gpus NUM_GPUS] [--num_epochs NUM_EPOCHS]
                     [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                     [--fan_out FAN_OUT] [--batch_size BATCH_SIZE]
                     [--batch_size_eval BATCH_SIZE_EVAL]
                     [--log_every LOG_EVERY] [--eval_every EVAL_EVERY]
                     [--lr LR] [--dropout DROPOUT] [--local_rank KEY=VAL]
                     [--standalone] [--pad-data]
train_dist.py: error: unrecognized arguments: --local-rank=1

And the --local-rank=1 argument is very strange. Do you think we should modify launch.py?
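
One possible workaround on the train_dist.py side, rather than modifying launch.py, is to accept both spellings of the flag and fall back to the LOCAL_RANK environment variable that newer PyTorch launchers set. A minimal sketch (the script’s other arguments are omitted):

import argparse
import os

parser = argparse.ArgumentParser()
# Newer torch.distributed.launch/torchrun pass --local-rank, while the script
# originally only defined --local_rank; accept both spellings.
parser.add_argument("--local_rank", "--local-rank", dest="local_rank", type=int, default=None)
# ... the script's other arguments go here ...
args, _ = parser.parse_known_args()

# Fall back to the environment variable set when --use-env / torchrun is used.
local_rank = args.local_rank
if local_rank is None:
    local_rank = int(os.environ.get("LOCAL_RANK", 0))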

Does this mean that the current DGL does not support single node multi-GPU training?

DGL supports single-node multi-GPU training. You don’t need DistDGL for single-node multi-GPU training. Refer to the following documentation:
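
To make that concrete, here is a minimal sketch of single-node multi-GPU training with plain DGL and PyTorch DDP, in the spirit of DGL’s multi-GPU GraphSAGE example (the dataset, fan-outs, port, and hyperparameters are placeholders, and the exact API may differ between DGL versions):

import dgl
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F
from dgl.nn import SAGEConv


class SAGE(torch.nn.Module):
    def __init__(self, in_feats, hidden, n_classes):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [SAGEConv(in_feats, hidden, "mean"), SAGEConv(hidden, n_classes, "mean")]
        )

    def forward(self, blocks, x):
        for i, (layer, block) in enumerate(zip(self.layers, blocks)):
            x = layer(block, x)
            if i != len(self.layers) - 1:
                x = F.relu(x)
        return x


def run(rank, world_size, g, train_nids, n_classes):
    # One process per GPU; the rendezvous port below is an arbitrary placeholder.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        "nccl", init_method="tcp://127.0.0.1:12345", world_size=world_size, rank=rank
    )
    device = torch.device(f"cuda:{rank}")
    sampler = dgl.dataloading.NeighborSampler(
        [10, 10], prefetch_node_feats=["feat"], prefetch_labels=["label"]
    )
    # use_ddp=True splits the training node IDs across the spawned processes.
    dataloader = dgl.dataloading.DataLoader(
        g, train_nids, sampler, device=device, use_ddp=True,
        batch_size=1000, shuffle=True, drop_last=False, num_workers=0,
    )
    model = SAGE(g.ndata["feat"].shape[1], 256, n_classes).to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=0.003)
    for epoch in range(30):
        for input_nodes, output_nodes, blocks in dataloader:
            x = blocks[0].srcdata["feat"]
            y = blocks[-1].dstdata["label"]
            loss = F.cross_entropy(model(blocks, x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    dist.destroy_process_group()


if __name__ == "__main__":
    # Placeholder dataset; substitute your own in-memory graph (e.g. ogbn-arxiv).
    dataset = dgl.data.AsNodePredDataset(dgl.data.CoraGraphDataset())
    g = dataset[0]
    g.create_formats_()  # build sparse formats once, before forking workers
    train_nids = torch.nonzero(g.ndata["train_mask"], as_tuple=True)[0]
    n_gpus = torch.cuda.device_count()
    mp.spawn(run, args=(n_gpus, g, train_nids, dataset.num_classes), nprocs=n_gpus)

No launch.py or DistDGL servers are involved here; each spawned process drives one GPU, and DDP keeps the model replicas in sync.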


Yes. We have a ticket tracking it: [DistDGL] Update the launch script to use `torchrun` · Issue #5493 · dmlc/dgl · GitHub.

@Rhett-Ying @pubu Thank you very much. Additionally, I would like to ask: if I am using a multi-machine multi-GPU setup, should I use DistDGL, or is it better to use GitHub - awslabs/graphstorm: Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists?

If you are training on multiple machines with multiple GPUs, you can use DistDGL. It’s very easy to set up; the launch script takes care of everything. For multiple GPUs, all you need to do is specify the number of GPUs in the launch command.
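
For example, with 4 GPUs per machine, the command from earlier in this thread could be adapted roughly as follows (a sketch: --num_trainers is the number of trainer processes per machine, and the GPU count is assumed to be passed to the training script via its --num_gpus flag):

python3 launch.py \
--workspace /home/yw8143/GNN/GNN_acceleration/dist/DGLexample/dist \
--num_trainers 4 \
--num_samplers 0 \
--num_servers 1 \
--part_config data/ogb-arxiv.json \
--ip_config ip_config.txt \
"/scratch/yw8143/miniconda3/envs/GNNN/bin/python train_dist.py --graph_name ogb-arxiv --ip_config ip_config.txt --num_epochs 30 --batch_size 1000 --num_gpus 4"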

P.S. I don’t have experience with GraphStorm, so I can’t comment on that.


DistDGL is the native distributed implementation of DGL, while GraphStorm is built on top of DistDGL and offers higher-level abstractions that make graph machine learning easier. You could start with DistDGL and the related examples, then try GraphStorm to see if it fits your needs better.

