Hi,
I have been trying to run DistDGL using the NVIDIA DGL docker container (nvcr.io/nvidia/dgl:24.07-py3), but I am having trouble figuring out the run command. I am using Slurm on AWS ParallelCluster with 2 g4dn.2xlarge compute nodes and a t2.micro head node running alinux2, all in private subnets. This is the docker command I am using to launch the job:
docker run --gpus all \
--network=host \
--ipc=host \
--privileged \
-v /home/ec2-user:/home/ec2-user \
-v /home/ec2-user/.ssh:/root/.ssh \
-w $PROJ_PATH \
--name dgl_container \
--rm nvcr.io/nvidia/dgl:24.07-py3 \
python3 $PROJ_PATH/launch.py \
--ssh_username ec2-user \
--workspace $PROJ_PATH \
--num_trainers $GPUS_PER_NODE \
--num_samplers $SAMPLER_PROCESSES \
--num_servers 1 \
--part_config $PARTITION_DIR \
--ip_config $IP_CONFIG_FILE \
"docker exec dgl_container python3 baseline/node_classification.py --graph_name $DATASET_NAME \
--backend $BACKEND \
--ip_config $IP_CONFIG_FILE --num_epochs 100 --batch_size 2000 \
--num_gpus $GPUS_PER_NODE --summary_filepath $SUMMARYFILE \
--profile_dir $PROFILE_DIR \
--part_config $PARTITION_DIR \
--rpc_log_dir $RPC_LOG_DIR"
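For reference, launch.py SSHes from inside this container to each compute node listed in $IP_CONFIG_FILE as ec2-user (hence the ~/.ssh mount into /root/.ssh). A quick check along these lines, using one of the compute-node IPs that appears in the logs further down, suggests the SSH hop itself is fine, since launch.py does reach the compute nodes:
# Sanity check only, not part of the job: confirm the container can SSH to a
# compute node with the mounted key, the same way launch.py does.
docker run --rm --network=host \
  -v /home/ec2-user/.ssh:/root/.ssh \
  nvcr.io/nvidia/dgl:24.07-py3 \
  ssh -o StrictHostKeyChecking=no ec2-user@10.15.216.47 hostname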
This run command seems to fail to set local_rank (and, per the traceback below, the RANK environment variable is never set either), and I suspect this is because of how I am launching node_classification.py through docker. I did not have this issue previously when running from a conda environment. Here is the relevant output from launch.py and the training script:
The number of OMP threads per trainer is set to 4
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 ec2-user@10.15.216.47 'cd /home/ec2-user/MassiveGNN; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json DGL_IP_CONFIG=/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc DGL_SERVER_ID=1; docker exec dgl_container python3 baseline/node_classification.py --graph_name ogbn-arxiv --backend nccl --ip_config /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --num_epochs 100 --batch_size 2000 --num_gpus 1 --summary_filepath /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --profile_dir /home/ec2-user/MassiveGNN/profiles --part_config /home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json --rpc_log_dir /home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40)'' returned non-zero exit status 1.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 ec2-user@10.15.216.47 'cd /home/ec2-user/MassiveGNN; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json DGL_IP_CONFIG=/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=0 ; docker exec dgl_container python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=10.15.216.25 --master_port=1234 baseline/node_classification.py --graph_name ogbn-arxiv --backend nccl --ip_config /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --num_epochs 100 --batch_size 2000 --num_gpus 1 --summary_filepath /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --profile_dir /home/ec2-user/MassiveGNN/profiles --part_config /home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json --rpc_log_dir /home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40)'' returned non-zero exit status 1.
Arguments: Namespace(graph_name='ogbn-arxiv', ip_config='/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt', part_config='/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json', n_classes=0, backend='nccl', num_gpus=1, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, summary_filepath='/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt', profile_dir='/home/ec2-user/MassiveGNN/profiles', rpc_log_dir='/home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40', model='sage', num_heads=1)
g4q-st-g4q-cr-0-1: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
g4q-st-g4q-cr-0-1: Initializing PyTorch process group.
Traceback (most recent call last):
File "/home/ec2-user/MassiveGNN/baseline/node_classification.py", line 505, in <module>
main(args)
File "/home/ec2-user/MassiveGNN/baseline/node_classification.py", line 283, in main
th.distributed.init_process_group(backend=args.backend)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 78, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 92, in wrapper
func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1360, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 235, in _env_rendezvous_handler
rank = int(_get_env_or_raise("RANK"))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 220, in _get_env_or_raise
raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
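If I understand docker exec correctly, the exec'd process gets the container's own environment, not the environment of the SSH shell that invokes it, so the DGL_* variables (and anything RANK-related) that launch.py exports on the compute node never reach node_classification.py inside the container. A minimal check that illustrates what I mean, using the same container name as in my run command:
# Exported in the SSH session on the compute node, i.e. on the host side:
export DGL_ROLE=server DGL_NUM_SERVER=1
# The process started by docker exec only sees the container's environment,
# so this should print nothing in my setup:
docker exec dgl_container env | grep -E 'DGL_|RANK'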
Initially, when I passed the command in the usual way shown below, node_classification.py failed to import any of the packages installed in the container, presumably because the SSH session runs on the host rather than inside the container. Adding docker exec solved that issue, and the script was able to run single-rank jobs; the same error still occurred, but it was not in the critical path.
"python3 baseline/node_classification.py --graph_name $DATASET_NAME \
--backend $BACKEND \
--ip_config $IP_CONFIG_FILE --num_epochs 100 --batch_size 2000 \
--num_gpus $GPUS_PER_NODE --summary_filepath $SUMMARYFILE \
--profile_dir $PROFILE_DIR \
--part_config $PARTITION_DIR \
--rpc_log_dir $RPC_LOG_DIR"
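One direction I am considering is forwarding the exported variables into the container explicitly with docker exec -e. A rough, untested sketch of what the command executing on the compute node (i.e. what launch.py's ssh invocation runs after its exports) might need to look like; the variable list just mirrors the exports visible in the error logs above and is probably incomplete, and I am not sure yet how to get the quoting right when passing this string through launch.py:
# Sketch only: forward the variables launch.py exports in the remote shell
# into the container before running the training script.
docker exec \
  -e DGL_DIST_MODE="$DGL_DIST_MODE" -e DGL_ROLE="$DGL_ROLE" \
  -e DGL_NUM_SAMPLER="$DGL_NUM_SAMPLER" -e DGL_NUM_CLIENT="$DGL_NUM_CLIENT" \
  -e DGL_CONF_PATH="$DGL_CONF_PATH" -e DGL_IP_CONFIG="$DGL_IP_CONFIG" \
  -e DGL_NUM_SERVER="$DGL_NUM_SERVER" -e DGL_GRAPH_FORMAT="$DGL_GRAPH_FORMAT" \
  -e OMP_NUM_THREADS="$OMP_NUM_THREADS" \
  dgl_container python3 baseline/node_classification.py \
  --graph_name $DATASET_NAME --backend $BACKEND --ip_config $IP_CONFIG_FILE \
  --num_epochs 100 --batch_size 2000 --num_gpus $GPUS_PER_NODE \
  --summary_filepath $SUMMARYFILE --profile_dir $PROFILE_DIR \
  --part_config $PARTITION_DIR --rpc_log_dir $RPC_LOG_DIR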
Thanks for your help!