Do we have any DGL distributed training with Kubernetes cases? I can see that the setup guide on the DGL website is for Ubuntu. I would love to learn how to run distributed mode on Kubernetes.
We have a community-contributed DGL operator, which you can find at GitHub - Qihoo360/dgl-operator: The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes. However, it might not be flexible enough for your case. If you are familiar enough with Kubernetes, you can follow the YAML posted below, which utilizes TFJobs to launch DGL tasks by parsing the environment variables into the config that DGL needs.
Feel free to ask more questions if anything is unclear.
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dgl-dist-k8s-demo"  # Set any name you like
spec:
  tfReplicaSpecs:
    PS:
      replicas: 4
      restartPolicy: Never  # Don't restart if failed
      template:
        spec:
          volumes:
            - name: asail-data  # This is an FSx folder on the cluster, which has high performance
              persistentVolumeClaim:
                claimName: asail-k8s-data-claim
            - name: partition-efs
              persistentVolumeClaim:
                claimName: efs-dist-demo
            - name: dshm  # Shared memory
              emptyDir:
                medium: Memory
          containers:
            - name: tensorflow  # Can only be named "tensorflow". Don't change this
              image: public.ecr.aws/s1o7b3d9/asail-public-dev:pytorch1.9-cuda11.1-debug
              imagePullPolicy: Always
              securityContext:
                capabilities:
                  add: ["SYS_PTRACE"]  # Enable GDB inside the container
              volumeMounts:
                - name: partition-efs
                  mountPath: /root/data
                - name: asail-data
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
              env:
                - name: DGL_NUM_TRAINER
                  value: "2"
                - name: DGL_NUM_SAMPLER
                  value: "2"
                - name: DGL_NUM_SERVER
                  value: "2"
                - name: OMP_NUM_THREADS
                  value: "24"
              args:
                - "bash"
                - "-c"
                - |
                  # Part 1: install DGL
                  mkdir -p /workspace
                  cd /workspace
                  pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html
                  python -c "import dgl; print(dgl.__path__)"  # init

                  # Part 2: prepare environment variables for DGL
                  cd /workspace
                  wget https://github.com/VoVAllen/dgl-k8s-example/raw/master/gen_env_script.py
                  python gen_env_script.py
                  cat env.sh
                  source env.sh
                  export DGL_DIST_MODE=distributed
                  export PYTHONUNBUFFERED=1

                  # Part 3: custom part
                  export DGL_GRAPH_FORMAT=csc
                  export DGL_CONF_PATH=/data/distdgl/ogb-papers100M-4-128p/ogb-paper100M.json

                  ## Start servers
                  export DGL_SERVER_START_IDX=$(($MACHINE_ID*$DGL_NUM_SERVER))
                  for ((i = 0; i < $DGL_NUM_SERVER; i++)); do
                    export DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i))
                    echo "Start server $DGL_SERVER_ID"
                    DGL_ROLE=server DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i)) python3 -c "import dgl; dgl.distributed.initialize(None);" &
                  done

                  ## Start clients
                  cd /root/data/dgl/examples/pytorch/graphsage/experimental/
                  export PYTHONPATH=..:$PYTHONPATH
                  DGL_ROLE=client python3 -m torch.distributed.launch \
                    --nproc_per_node=$DGL_NUM_TRAINER \
                    --nnodes=$NUM_MACHINES \
                    --node_rank=$MACHINE_ID \
                    --master_addr=$MASTER_ADDRESS \
                    --master_port=1234 \
                    train_dist.py --graph_name ogb-paper100M --ip_config $DGL_IP_CONFIG --num_epochs 30 --batch_size 1000
```
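The `gen_env_script.py` step above turns the `TF_CONFIG` environment variable that the TFJob operator injects into each pod into the variables the launch script expects (`MACHINE_ID`, `NUM_MACHINES`, `MASTER_ADDRESS`, `DGL_IP_CONFIG`). The script's actual contents aren't reproduced here; the following is only a minimal Python sketch of the idea, with a made-up `TF_CONFIG` value for illustration:

```python
import json
import os

# Illustrative TF_CONFIG, shaped like what the TFJob operator would inject
# (hostnames and ports here are made up)
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["dgl-dist-k8s-demo-ps-0:2222",
                       "dgl-dist-k8s-demo-ps-1:2222"]},
    "task": {"type": "ps", "index": 1},
})

conf = json.loads(os.environ["TF_CONFIG"])
hosts = conf["cluster"]["ps"]       # one entry per machine in the job
machine_id = conf["task"]["index"]  # this pod's rank within the job

# Variables the bash script in the manifest expects
print(f"export MACHINE_ID={machine_id}")
print(f"export NUM_MACHINES={len(hosts)}")
print(f"export MASTER_ADDRESS={hosts[0].split(':')[0]}")

# DGL's distributed mode also needs an ip_config file,
# one hostname/IP per line
with open("ip_config.txt", "w") as f:
    for h in hosts:
        f.write(h.split(":")[0] + "\n")
print("export DGL_IP_CONFIG=ip_config.txt")
```

In the real job the printed `export` lines would be written to `env.sh` and then `source`d, as the manifest does.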
Thanks for the reply! I have checked out the DGL Operator, and it doesn't work directly in our environment.
So what is the suggested way to set up NFS in K8s in order to fulfill the requirement described in Distributed Node Classification — DGL 0.7.2 documentation?
Or could a volume mount satisfy the NFS need above, so that we don't need any additional setup?
It depends on your environment. NFS would be a good choice, as would similar file systems such as Alluxio that support reads from multiple machines. Or, if you are on a cloud such as AWS, you can try EFS/FSx or similar services.
I am not on AWS; all my machines are in Kubernetes. What is the suggestion for DGL distributed training here? Basically, I want to know whether adding an NFS server, with the other training pods as NFS clients, would work fine for DGL.
This works. The shared storage is basically just for reading the dataset; you only need to make the dataset accessible to all of the training pods.
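For reference, one way to expose an in-cluster NFS server's data to every training pod is a `ReadOnlyMany` PersistentVolume plus a claim that the pods mount. This is only a minimal sketch; the server address, export path, and sizes below are placeholders you would replace with your own:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dgl-dataset-pv
spec:
  capacity:
    storage: 100Gi          # placeholder size
  accessModes:
    - ReadOnlyMany          # multiple training pods read the same partitions
  nfs:
    server: 10.0.0.10       # placeholder: your NFS server's address
    path: /exports/dgl-data # placeholder: your export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dgl-dataset-claim
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""      # bind to the pre-provisioned PV above
  resources:
    requests:
      storage: 100Gi
```

The claim can then be referenced from the TFJob's `volumes` section in the same way as the `persistentVolumeClaim` entries in the manifest posted earlier.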
Thanks for the YAML example! Do you use it with kubectl create? The args format somehow doesn't work for me.
Yes. Could you post the error message? This YAML is an expansion of our launching tools, so some args might need to be changed for different scenarios.