Calculate bytes sent and received with communication time per machine in distributed training

I want to calculate the communication time and the total number of bytes sent and received among all machines for feature transfer.

Specifically, I want to calculate how much time each machine spends communicating with every other machine, how many bytes every machine sends to every other machine, and how many bytes every machine receives from every other machine.

I want to calculate these statistics per-machine. For example, if there are 4 machines A, B, C, D, I want something like this:

A <-> B: total bytes sent: xxxx bytes, total time spent sending: xxxx ms, total bytes received: xxxx bytes, total time spent receiving: xxxx ms.
A <-> C: ...
A <-> D: ...
B <-> A: ...
B <-> C: ...
   ⋮

As far as I understand, the remote requests are sent in rpc.cc. But where are the requests received and handled on the other machines?

Is it possible to achieve this? If so, what would be an efficient way to do it?

Messages are sent or received via https://github.com/dmlc/dgl/blob/b8886900837e3fc73972215b7c5e9b3e127acbfc/src/rpc/rpc.cc#L34 or https://github.com/dmlc/dgl/blob/b8886900837e3fc73972215b7c5e9b3e127acbfc/src/rpc/rpc.cc#L39. client_id and server_id can be retrieved from RPCMessage, as can the data and payloads.
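
To make the bookkeeping side concrete, here is a minimal sketch of a per-peer accumulator in plain Python. It is not part of DGL: `CommStats`, `record`, `timed`, and `report` are hypothetical names, and the sketch assumes you can obtain the peer ID and the message size in bytes wherever the send and receive calls are made (for example from client_id, server_id, and the payloads of the RPCMessage). Note that timing a send call measures the time spent inside that call on the local machine, which may differ from the actual network transfer time if sending is asynchronous.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class CommStats:
    """Accumulates bytes and elapsed time per peer for one local machine.

    Hypothetical helper, not a DGL API.
    """

    def __init__(self, local_id):
        self.local_id = local_id
        # peer_id -> running totals for both directions
        self.per_peer = defaultdict(
            lambda: {"send_bytes": 0, "send_ms": 0.0,
                     "recv_bytes": 0, "recv_ms": 0.0}
        )

    def record(self, peer_id, direction, nbytes, elapsed_ms):
        """Add one message's size and duration to the peer's totals."""
        if direction not in ("send", "recv"):
            raise ValueError("direction must be 'send' or 'recv'")
        entry = self.per_peer[peer_id]
        entry[direction + "_bytes"] += nbytes
        entry[direction + "_ms"] += elapsed_ms

    @contextmanager
    def timed(self, peer_id, direction, nbytes):
        """Time the wrapped block and record it against the peer."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.record(peer_id, direction, nbytes, elapsed_ms)

    def report(self):
        """Return one summary line per peer, sorted by peer ID."""
        lines = []
        for peer_id in sorted(self.per_peer):
            e = self.per_peer[peer_id]
            lines.append(
                f"{self.local_id} <-> {peer_id}: "
                f"total bytes sent: {e['send_bytes']} bytes, "
                f"total time spent sending: {e['send_ms']:.3f} ms, "
                f"total bytes received: {e['recv_bytes']} bytes, "
                f"total time spent receiving: {e['recv_ms']:.3f} ms."
            )
        return lines
```

Each machine would keep its own `CommStats` instance, wrap every send or receive in `with stats.timed(peer_id, "send", nbytes): ...`, and print `stats.report()` at the end of training. Collecting the reports from all machines then gives the full A <-> B, A <-> C, ... table.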
