When I run a DGL distributed training job, each machine sends data to and receives data from the other machines. I want to measure the exact amount of data: how many bytes does each machine need to send and receive?
However, I haven't found which part of the DGL code performs the push and pull during distributed training. I tried KVServer: I added some print() calls to distributed/kvstore.py and dis_kvstore.py, but none of them produced any output.
The comments say that "For now, KVServer can only support CPU-to-CPU communication", but I used GPUs to run the distributed training. Is using GPUs the reason why I didn't get any output?
Should I try RPC_server?
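In the meantime, as a coarse workaround I am considering measuring total NIC traffic on each machine around the training run instead of instrumenting DGL internals. This is only a sketch: it reads /proc/net/dev (so it is Linux-only), counts traffic across all interfaces (including anything unrelated running on the machine), and `train()` below is a hypothetical stand-in for the real per-machine training entry point:

```python
def nic_bytes():
    """Sum received/transmitted bytes over all interfaces from /proc/net/dev (Linux)."""
    rx = tx = 0
    with open("/proc/net/dev") as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            _name, data = line.split(":", 1)
            fields = data.split()
            rx += int(fields[0])  # bytes received on this interface
            tx += int(fields[8])  # bytes transmitted on this interface
    return rx, tx

def measure_traffic(workload):
    """Run `workload` and return (bytes_received, bytes_sent) deltas for this machine."""
    rx0, tx0 = nic_bytes()
    workload()
    rx1, tx1 = nic_bytes()
    return rx1 - rx0, tx1 - tx0

# Hypothetical usage: wrap whatever starts training on this machine.
# recv_bytes, sent_bytes = measure_traffic(lambda: train())
```

Would this be a reasonable approximation, or is there a hook inside DGL's RPC layer I could instrument instead?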