How to run DistDGL using a Docker container on Slurm?

Hi,

I have been trying to run DistDGL using this NVIDIA Docker container, and I am having trouble figuring out the run command. I’m using Slurm on AWS ParallelCluster with two g4dn.2xlarge compute nodes and a t2.micro head node running alinux2, all in private subnets.

docker run --gpus all \
      --network=host \
      --ipc=host \
      --privileged \
      -v /home/ec2-user:/home/ec2-user \
      -v /home/ec2-user/.ssh:/root/.ssh \
      -w $PROJ_PATH \
      --name dgl_container \
      --rm nvcr.io/nvidia/dgl:24.07-py3 \
        python3 $PROJ_PATH/launch.py \
            --ssh_username ec2-user \
            --workspace $PROJ_PATH \
            --num_trainers $GPUS_PER_NODE \
            --num_samplers $SAMPLER_PROCESSES \
            --num_servers 1 \
            --part_config $PARTITION_DIR \
            --ip_config $IP_CONFIG_FILE \
            "docker exec dgl_container python3 baseline/node_classification.py --graph_name $DATASET_NAME \
            --backend $BACKEND \
            --ip_config $IP_CONFIG_FILE --num_epochs 100 --batch_size 2000 \
            --num_gpus $GPUS_PER_NODE --summary_filepath $SUMMARYFILE \
            --profile_dir $PROFILE_DIR \
            --part_config $PARTITION_DIR \
            --rpc_log_dir $RPC_LOG_DIR"

This run command seems to fail to set local_rank, and I suspect this is because of how I am launching node_classification.py through docker. I did not have this issue with conda previously.

The number of OMP threads per trainer is set to 4
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 ec2-user@10.15.216.47 'cd /home/ec2-user/MassiveGNN; (export DGL_ROLE=server DGL_NUM_SAMPLER=0 OMP_NUM_THREADS=1 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json DGL_IP_CONFIG=/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc  DGL_SERVER_ID=1; docker exec dgl_container python3 baseline/node_classification.py --graph_name ogbn-arxiv             --backend nccl             --ip_config /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --num_epochs 100 --batch_size 2000             --num_gpus 1 --summary_filepath /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt             --profile_dir /home/ec2-user/MassiveGNN/profiles             --part_config /home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json             --rpc_log_dir /home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40)'' returned non-zero exit status 1.
Called process error Command 'ssh -o StrictHostKeyChecking=no -p 22 ec2-user@10.15.216.47 'cd /home/ec2-user/MassiveGNN; (export DGL_DIST_MODE=distributed DGL_ROLE=client DGL_NUM_SAMPLER=0 DGL_NUM_CLIENT=2 DGL_CONF_PATH=/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json DGL_IP_CONFIG=/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt DGL_NUM_SERVER=1 DGL_GRAPH_FORMAT=csc OMP_NUM_THREADS=4 DGL_GROUP_ID=0 ; docker exec dgl_container python3 -m torch.distributed.run --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=10.15.216.25 --master_port=1234 baseline/node_classification.py --graph_name ogbn-arxiv             --backend nccl             --ip_config /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt --num_epochs 100 --batch_size 2000             --num_gpus 1 --summary_filepath /home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt             --profile_dir /home/ec2-user/MassiveGNN/profiles             --part_config /home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json             --rpc_log_dir /home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40)'' returned non-zero exit status 1.
Arguments: Namespace(graph_name='ogbn-arxiv', ip_config='/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ip_config/ip_config_ogbn-arxiv_metis_n2_samp0_trainer1_40.txt', part_config='/home/ec2-user/MassiveGNN/partitions/ogbn-arxiv/2_parts/ogbn-arxiv.json', n_classes=0, backend='nccl', num_gpus=1, num_epochs=100, num_hidden=16, num_layers=2, fan_out='10,25', batch_size=2000, batch_size_eval=100000, log_every=20, eval_every=5, lr=0.003, dropout=0.5, local_rank=None, pad_data=False, summary_filepath='/home/ec2-user/MassiveGNN/baseline_job_script/logs/ogbn-arxiv/distdgl/logs_perlmutter_gpu_nccl/sage/ogbn-arxiv_metis_n2_samp0_trainer1_40.txt', profile_dir='/home/ec2-user/MassiveGNN/profiles', rpc_log_dir='/home/ec2-user/MassiveGNN/baseline_job_script/logs/rpc_logs/baseline/logs_40', model='sage', num_heads=1)
g4q-st-g4q-cr-0-1: Initializing DistDGL.
Initialize the distributed services with graphbolt: False
g4q-st-g4q-cr-0-1: Initializing PyTorch process group.
Traceback (most recent call last):
  File "/home/ec2-user/MassiveGNN/baseline/node_classification.py", line 505, in <module>
    main(args)
  File "/home/ec2-user/MassiveGNN/baseline/node_classification.py", line 283, in main
    th.distributed.init_process_group(backend=args.backend)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 78, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 92, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1360, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 235, in _env_rendezvous_handler
    rank = int(_get_env_or_raise("RANK"))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 220, in _get_env_or_raise
    raise _env_error(env_var)
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set
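
One way to see what the failing process actually has in its environment is a small check placed just before the init_process_group call. The sketch below is only an illustration: missing_vars and ENV_RENDEZVOUS_VARS are hypothetical names, not part of DGL or PyTorch. The four variables listed are the ones the env:// rendezvous reads when rank and world size are not passed explicitly.

```python
import os

# Variables read by torch.distributed's env:// rendezvous.
# RANK is the one named in the traceback above.
ENV_RENDEZVOUS_VARS = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def missing_vars(names, environ=None):
    """Return the names from `names` that are absent from the environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if name not in environ]
```

Printing missing_vars(ENV_RENDEZVOUS_VARS) together with os.environ.get("DGL_ROLE") from inside the container shows whether the values exported by the launcher reached the Python process at all.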

Initially, when I passed the command the usual way (shown below), node_classification.py failed to import any packages from the container during the SSH session. Adding docker exec solved that issue, and the script was able to run single-rank jobs. The same error occurred in those runs too, but it was not in the critical path.

 "python3 baseline/node_classification.py --graph_name $DATASET_NAME \
            --backend $BACKEND \
            --ip_config $IP_CONFIG_FILE --num_epochs 100 --batch_size 2000 \
            --num_gpus $GPUS_PER_NODE --summary_filepath $SUMMARYFILE \
            --profile_dir $PROFILE_DIR \
            --part_config $PARTITION_DIR \
            --rpc_log_dir $RPC_LOG_DIR "

Thanks for your help!

For anyone encountering the same issue: the root cause was that environment variables such as DGL_ROLE were not being passed into the Docker container when using docker exec as above. Make sure these variables are exported inside the container to avoid this problem.
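
In the logged commands, the export of DGL_ROLE and the other variables happens in the SSH shell on the host, while the Python process starts inside the container through docker exec, which does not inherit the calling shell's environment. One possible fix is to pass each variable explicitly with docker exec's -e flag. The sketch below builds such a command line. It is not what launch.py does. The docker_exec_argv helper and the choice of which names to forward (the DGL_ prefix and OMP_NUM_THREADS, taken from the export line in the log) are assumptions.

```python
import os

# Assumption: forward everything the launcher exports in the log above.
FORWARD_PREFIXES = ("DGL_",)
FORWARD_NAMES = ("OMP_NUM_THREADS",)


def docker_exec_argv(container, command, environ=None):
    """Build a `docker exec` argv that forwards selected variables with -e."""
    environ = os.environ if environ is None else environ
    argv = ["docker", "exec"]
    for name in sorted(environ):
        if name.startswith(FORWARD_PREFIXES) or name in FORWARD_NAMES:
            # NAME=value is passed explicitly, so the result does not depend
            # on the environment of whoever finally runs the command.
            argv += ["-e", f"{name}={environ[name]}"]
    return argv + [container] + list(command)
```

The returned list can be handed to subprocess.run, or joined with shlex.join to get a string to use in place of the bare docker exec dgl_container prefix. It has to be evaluated in the shell where the launcher has already exported the variables, i.e. on the remote host side of the SSH session.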