How to measure the exact amount of data sent/received by a machine during distributed training?

When I run a DGL distributed training job, each machine sends data to and receives data from the other machines. I want to measure the exact amount of data: how many bytes does a machine need to send/receive?

However, I haven’t found which part of the DGL code does the push and pull during distributed training. I tried KVServer: I added some print() calls to distributed/kvstore.py and dis_kvstore.py, but none of them produced any output.

The comments say that “For now, KVServer can only support CPU-to-CPU communication”, and I used a GPU to run the distributed training. Is using a GPU the reason why I didn’t get any output?

Should I try RPC_server?

Hi,

Currently there’s no built-in way to count the bytes sent/received. Adding print() calls to dgl/kvstore.py at 0b3a6216f57891d5b34e4d5d1318128829580fc1 · dmlc/dgl · GitHub should work. The other file, dis_kvstore.py, is deprecated.
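
If plain print() calls get too noisy, a small counter can keep running totals instead. The following is only a rough sketch, not part of DGL: `payload_nbytes`, `ByteCounter` and `COUNTER` are hypothetical helper names, and where exactly to call them inside kvstore.py is something you would have to work out from the code. It assumes the payloads are bytes-like objects, PyTorch tensors (sized via `numel()` and `element_size()`), NumPy-style arrays (sized via `nbytes`), or lists/tuples of those.

```python
from collections import defaultdict


def payload_nbytes(obj):
    """Best-effort size in bytes of a payload that is sent or received."""
    if isinstance(obj, (bytes, bytearray, memoryview)):
        return memoryview(obj).nbytes
    # torch.Tensor-like objects: number of elements times bytes per element.
    if hasattr(obj, "numel") and hasattr(obj, "element_size"):
        return obj.numel() * obj.element_size()
    # NumPy-like arrays expose their size directly.
    if hasattr(obj, "nbytes"):
        return int(obj.nbytes)
    # Containers: sum the sizes of their items.
    if isinstance(obj, (list, tuple)):
        return sum(payload_nbytes(item) for item in obj)
    raise TypeError(f"don't know how to size {type(obj).__name__}")


class ByteCounter:
    """Hypothetical helper that accumulates byte totals per direction."""

    def __init__(self):
        self.totals = defaultdict(int)

    def record(self, direction, payload):
        """Add the payload size to the total for 'send' or 'recv'."""
        n = payload_nbytes(payload)
        self.totals[direction] += n
        return n


# One module-level counter, so the instrumented code can share it.
COUNTER = ByteCounter()
```

You would call something like `COUNTER.record("send", data)` next to the places where you would otherwise put a print(), and print `dict(COUNTER.totals)` at the end of training. Note that this counts payload sizes only, so serialization and network protocol overhead are not included.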

Using a GPU won’t affect the process, since the communication still happens on the CPU and the data is copied to the GPU afterwards.
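
Because the traffic goes through the CPU and the network stack, another option that doesn’t touch DGL at all is to read the per-interface byte counters from the OS before and after the run. This is a minimal sketch that assumes a Linux machine (it parses `/proc/net/dev`); the function names are made up for illustration. Keep in mind these counters include all traffic on the interface (ssh, NFS, etc.) plus protocol overhead, so the result is an upper bound rather than the exact DGL payload size.

```python
from pathlib import Path


def parse_net_dev(text):
    """Parse /proc/net/dev contents into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def read_counters(path="/proc/net/dev"):
    """Read the current counters from the kernel (Linux only)."""
    return parse_net_dev(Path(path).read_text())


def traffic_delta(before, after):
    """Bytes received/sent per interface between two snapshots."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }
```

Usage would be roughly: `before = read_counters()`, run the training job, `after = read_counters()`, then look at `traffic_delta(before, after)` for the interface your machines talk over.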
