Do we have any DGL distributed training with Kubernetes cases? I can see that the setup guide on the DGL website is for Ubuntu. I would love to learn how to run distributed mode on Kubernetes.
We have a community-contributed DGL operator, which you can find at GitHub - Qihoo360/dgl-operator: The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes. However, it might not be flexible enough for your case. If you are familiar enough with Kubernetes, you can follow the YAML posted below, which utilizes TFJobs to launch DGL tasks by parsing the environment variables into the config that DGL needs.
Feel free to ask more questions if anything is unclear.
```yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "dgl-dist-k8s-demo"  # Set any name you like
spec:
  tfReplicaSpecs:
    PS:
      replicas: 4
      restartPolicy: Never  # Don't restart if failed
      template:
        spec:
          volumes:
            - name: asail-data  # This is an FSx folder on the cluster, which has high performance
              persistentVolumeClaim:
                claimName: asail-k8s-data-claim
            - name: partition-efs
              persistentVolumeClaim:
                claimName: efs-dist-demo
            - name: dshm  # Shared memory
              emptyDir:
                medium: Memory
          containers:
            - name: tensorflow  # Can only be named "tensorflow". Don't change this
              image: public.ecr.aws/s1o7b3d9/asail-public-dev:pytorch1.9-cuda11.1-debug
              imagePullPolicy: Always
              securityContext:
                capabilities:
                  add: ["SYS_PTRACE"]  # Enable GDB inside the container
              volumeMounts:
                - name: partition-efs
                  mountPath: /root/data
                - name: asail-data
                  mountPath: /data
                - name: dshm
                  mountPath: /dev/shm
              env:
                - name: DGL_NUM_TRAINER
                  value: "2"
                - name: DGL_NUM_SAMPLER
                  value: "2"
                - name: DGL_NUM_SERVER
                  value: "2"
                - name: OMP_NUM_THREADS
                  value: "24"
              args:
                - "bash"
                - "-c"
                - |
                  # Part 1: install DGL
                  mkdir -p /workspace
                  cd /workspace
                  pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html
                  python -c "import dgl; print(dgl.__path__)"  # init

                  # Part 2: prepare environment variables for DGL
                  cd /workspace
                  wget https://github.com/VoVAllen/dgl-k8s-example/raw/master/gen_env_script.py
                  python gen_env_script.py
                  cat env.sh
                  source env.sh
                  export DGL_DIST_MODE=distributed
                  export PYTHONUNBUFFERED=1

                  # Part 3: custom part
                  export DGL_GRAPH_FORMAT=csc
                  export DGL_CONF_PATH=/data/distdgl/ogb-papers100M-4-128p/ogb-paper100M.json

                  ## Start servers
                  export DGL_SERVER_START_IDX=$(($MACHINE_ID*$DGL_NUM_SERVER))
                  for ((i = 0; i < $DGL_NUM_SERVER; i++)); do
                    export DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i))
                    echo "Start server $DGL_SERVER_ID"
                    DGL_ROLE=server DGL_SERVER_ID=$(($DGL_SERVER_START_IDX+$i)) python3 -c "import dgl; dgl.distributed.initialize(None);" &
                  done

                  ## Start clients
                  cd /root/data/dgl/examples/pytorch/graphsage/experimental/
                  export PYTHONPATH=..:$PYTHONPATH
                  DGL_ROLE=client python3 -m torch.distributed.launch \
                    --nproc_per_node=$DGL_NUM_TRAINER \
                    --nnodes=$NUM_MACHINES \
                    --node_rank=$MACHINE_ID \
                    --master_addr=$MASTER_ADDRESS \
                    --master_port=1234 \
                    train_dist.py --graph_name ogb-paper100M --ip_config $DGL_IP_CONFIG --num_epochs 30 --batch_size 1000
```
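The `gen_env_script.py` step above turns the `TF_CONFIG` environment variable that the TFJob operator injects into each pod into the variables the launch script expects (`MACHINE_ID`, `NUM_MACHINES`, `MASTER_ADDRESS`, `DGL_IP_CONFIG`). The script's actual contents aren't reproduced here; the following is only a minimal Python sketch of the idea, with a made-up `TF_CONFIG` value for illustration:

```python
import json
import os

# Illustrative TF_CONFIG, shaped like what the TFJob operator would inject
# (hostnames and ports here are made up)
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"ps": ["dgl-dist-k8s-demo-ps-0:2222",
                       "dgl-dist-k8s-demo-ps-1:2222"]},
    "task": {"type": "ps", "index": 1},
})

conf = json.loads(os.environ["TF_CONFIG"])
hosts = conf["cluster"]["ps"]       # one entry per machine in the job
machine_id = conf["task"]["index"]  # this pod's rank within the job

# Variables the bash script in the manifest expects
print(f"export MACHINE_ID={machine_id}")
print(f"export NUM_MACHINES={len(hosts)}")
print(f"export MASTER_ADDRESS={hosts[0].split(':')[0]}")

# DGL's distributed mode also needs an ip_config file,
# one hostname/IP per line
with open("ip_config.txt", "w") as f:
    for h in hosts:
        f.write(h.split(":")[0] + "\n")
print("export DGL_IP_CONFIG=ip_config.txt")
```

In the real job the printed `export` lines would be written to `env.sh` and then `source`d, as the manifest does.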
Thanks for the reply! I have checked out the DGL Operator, and it doesn't work directly in our environment.
So what is the suggested way to set up NFS in K8s in order to fulfill the requirement described in Distributed Node Classification — DGL 0.7.2 documentation?
Or could a volume mount satisfy the NFS need above, so that we don't need any additional setup?
It depends on your environment. NFS would be a good choice, as would similar file systems such as Alluxio that support reads from multiple machines. Or, if you are on a cloud such as AWS, you can try EFS/FSx or similar services.
I am not on AWS; all my machines are in Kubernetes. What is the suggestion for DGL distributed training here? Basically, I want to know whether adding an NFS server, with the other training pods as NFS clients, would work fine for DGL.
This works. The shared storage is basically just for reading the dataset; you only need to make the dataset accessible to all of the training pods.
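For reference, one way to expose an in-cluster NFS server's data to every training pod is a `ReadOnlyMany` PersistentVolume plus a claim that the pods mount. This is only a minimal sketch; the server address, export path, and sizes below are placeholders you would replace with your own:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: dgl-dataset-pv
spec:
  capacity:
    storage: 100Gi          # placeholder size
  accessModes:
    - ReadOnlyMany          # multiple training pods read the same partitions
  nfs:
    server: 10.0.0.10       # placeholder: your NFS server's address
    path: /exports/dgl-data # placeholder: your export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dgl-dataset-claim
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: ""      # bind to the pre-provisioned PV above
  resources:
    requests:
      storage: 100Gi
```

The claim can then be referenced from the TFJob's `volumes` section in the same way as the `persistentVolumeClaim` entries in the manifest posted earlier.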
Thanks for the YAML example! Do you use it with kubectl create? The args format somehow doesn't work for me.
Yes. Could you post the error message? This YAML is an expansion of our launching tools, so some args might need to be changed for different scenarios.