DGL with Kubernetes

Do we have any dgl distributed training with kubernetes cases? I can see the set up guide from the DGL website is for unbuntu. Would love to learn the instruction of distributed mode in kubernetes.

Hi,

We have a community contributed DGL operator which you can found at GitHub - Qihoo360/dgl-operator: The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes. However this might not be flexible enough in your cases. If you are familiar enough with kubernetes you can follow the yaml posted below, which utilize tfjobs to launch dgl tasks. By parsing the environment variable config into the config that DGL needed.

Feel free to ask more questions if anything unclear.

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dgl-dist-k8s-demo" # Set any name as you like
spec:
  tfReplicaSpecs:
    PS:
      replicas: 4
      restartPolicy: Never # Don't restart if failed
      template:
        spec:      
          volumes:
            - name: asail-data # This is a fsx folder on the cluster, which has high performance
              persistentVolumeClaim:
                claimName: asail-k8s-data-claim
            - name: partition-efs
              persistentVolumeClaim:
                claimName: efs-dist-demo
            - name: dshm # Shared memeory
              emptyDir:
                medium: Memory
          containers:
            - name: tensorflow # Can only be named as tensorflow. Don't change this
              image: public.ecr.aws/s1o7b3d9/asail-public-dev:pytorch1.9-cuda11.1-debug
              imagePullPolicy: Always
              securityContext:
                capabilities:
                  add: ["SYS_PTRACE"] # Enable GDB inside container
              volumeMounts:
              - name: partition-efs
                mountPath: /root/data
              - name: asail-data
                mountPath: /data
              - name: dshm
                mountPath: /dev/shm
              env: 
              - name: DGL_NUM_TRAINER
                value: "2"
              - name: DGL_NUM_SAMPLER
                value: "2"
              - name: DGL_NUM_SERVER
                value: "2"
              - name: OMP_NUM_THREADS
                value: "24"
              args: 
              - "bash"
              - "-c"
              - |
                # Part 1 install dgl
                mkdir -p /workspace
                cd /workspace
                pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html
                python -c "import dgl; print(dgl.__path__)" # init

                # Part 2 Prepare environment variable for dgl
                cd /workspace
                wget https://github.com/VoVAllen/dgl-k8s-example/raw/master/gen_env_script.py
                python gen_env_script.py
                cat env.sh
                source env.sh
                export DGL_DIST_MODE=distributed
                export PYTHONUNBUFFERED=1

                # Part 3 Custom part
                export DGL_GRAPH_FORMAT=csc
                export DGL_CONF_PATH=/data/distdgl/ogb-papers100M-4-128p/ogb-paper100M.json
                ## Start server
                export DGL_SERVER_START_IDX=$(($MACHINE_ID*$DGL_NUM_SERVER))
                for ((i = 0; i < $DGL_NUM_SERVER; i++)); do
                    export DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i))
                    echo "Start server $DGL_SERVER_ID"
                    DGL_ROLE=server DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i)) python3 -c "import dgl; dgl.distributed.initialize(None);" &
                done

                ## Start client
                cd /root/data/dgl/examples/pytorch/graphsage/experimental/
                
                export PYTHONPATH=..:$PYTHONPATH                
                DGL_ROLE=client python3 -m torch.distributed.launch --nproc_per_node=$DGL_NUM_TRAINER --nnodes=$NUM_MACHINES --node_rank=$MACHINE_ID --master_addr=$MASTER_ADDRESS --master_port=1234 \
                train_dist.py --graph_name ogb-paper100M --ip_config $DGL_IP_CONFIG --num_epochs 30 --batch_size 1000

Thanks for the reply! I have checked out the DGL Operator and it couldn’t be directly work on our environment.

So what’s suggestion to set up the nfs in K8s in order to fulfill the requirement described in Distributed Node Classification — DGL 0.7.2 documentation?

Or does a volume mount could satisfy the nts need above and we don’t need to do additional setting?

Thanks

Hi,

It depends on your environment. NFS would be a good choice, or other similar file system such as alluxio, which supports read by multiple machines. Or if you are on cloud such as aws, you can try EFS/FSx or similar services.

I am not in aws, all my machines are in Kubernetes, what’s the suggestion for DGL dist training here? Basicly I want to know does adding a NFS server and other training pods as NFS clients would work fine for DGL.

This works. Basically it’s just for reading the dataset. You just need to make the dataset accessible for all the training pods

Thanks for the yaml example, do you use it for kubectl create? the args format somehow doesnt work for me.

Yes. Could you post the error message? This is the expansion of our launching tools, some args might need to change for different scenarios.