Fetching halo node's features using RPC

sark777 · October 6, 2023, 8:32pm

Hi,

This question is similar to "Calculate bytes sent and received with communication time per machine in distributed training", but I didn’t find a straightforward answer there.

1. What is the best way to measure the time spent on RPC per minibatch per trainer?
2. As far as I understand, g.ndata["features"][input_nodes] also pulls the features of local nodes, so some time is also spent retrieving those features from the local KVStore. Is this overlapped with RPC?
3. How do I measure how many RPC calls were made per minibatch per trainer?
4. In the post "When and how to fetch features on the remote machine", you mention:

In short, local ids will be converted to global ids and send request to target machines

However, don’t the input_nodes returned by the dataloader (as in for step, (input_nodes, seeds, blocks) in enumerate(dataloader)) already hold global IDs?

Thank you for your time!

minjie · October 7, 2023, 10:41pm

1. I don’t think there is a straightforward way to do that, other than digging into the codebase and timing the remote and local operations separately.
2. Yes, it is overlapped with RPC.
3. Similar answer to 1.
4. Yes, the IDs returned by the dataloader are always global IDs. The local IDs are internal implementation details.
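
Short of instrumenting the library, a coarse measurement can be taken from the training script itself. The following is a minimal sketch, not DGL's own mechanism. It times the whole feature pull per minibatch and counts how many input nodes belong to other partitions. Because local and remote fetches overlap, the timing covers both together, and the remote-node count is only a proxy, not the number of RPC calls. StepTimer, split_local_remote, toy_owner_of and toy_fetch are hypothetical names introduced for illustration. In a real script the timed line would be g.ndata["features"][input_nodes], and the owner lookup would come from the graph's partition book (I believe the method is nid2partid, but check this against your DGL version).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named region for each minibatch."""

    def __init__(self):
        self.records = []                    # one {region: seconds} dict per minibatch
        self._current = defaultdict(float)

    @contextmanager
    def region(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self._current[name] += time.perf_counter() - start

    def end_step(self):
        """Close the current minibatch and start a fresh one."""
        self.records.append(dict(self._current))
        self._current = defaultdict(float)


def split_local_remote(input_nodes, owner_of, my_part):
    """Split global node IDs by whether this trainer's partition owns them."""
    local = [n for n in input_nodes if owner_of(n) == my_part]
    remote = [n for n in input_nodes if owner_of(n) != my_part]
    return local, remote


# Stand-ins (assumptions) so the sketch runs without a cluster.
def toy_owner_of(nid):
    return nid % 2                           # pretend even IDs live on partition 0


def toy_fetch(nodes):
    return [float(n) for n in nodes]         # placeholder for the feature pull


timer = StepTimer()
remote_counts = []
toy_batches = [[0, 1, 2, 3, 5], [4, 6, 7]]   # placeholder for the dataloader

for input_nodes in toy_batches:
    with timer.region("feature_fetch"):
        feats = toy_fetch(input_nodes)
    local, remote = split_local_remote(input_nodes, toy_owner_of, my_part=0)
    remote_counts.append(len(remote))
    timer.end_step()

for step, (rec, n_remote) in enumerate(zip(timer.records, remote_counts)):
    print(f"step {step}: fetch {rec['feature_fetch']:.6f}s, remote nodes {n_remote}")
```

Each trainer process would keep its own StepTimer, so the records are per trainer by construction. Separating remote time from local time would still require timing inside the library, as noted above.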
