[DistDGL related] When and how to fetch features on the remote machine

I’m looking into DistDGL’s implementation of neighbor sampling. I found that it first checks whether the required nodes are on the local machine. If a node is local, it calls the sample_neighbor function on the local partition; otherwise, it sends a SampleRequest to the remote machines. The results are then merged together to form a sampled graph. (If there’s any mistake in my understanding, please point it out! Thanks.)
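To make sure I’m describing it right, here is a toy sketch of the dispatch I have in mind (the helper callables and the merging step are my own placeholders, not DistDGL’s actual internals):

```python
# Toy sketch of my mental model of the dispatch; owner_of, local_sampler and
# send_sample_request are hypothetical callables, not DistDGL's real internals.
def sample_neighbors_dist(seeds, fanout, my_part, owner_of,
                          local_sampler, send_sample_request):
    """Split seed nodes by owner machine, sample locally where possible,
    and issue one SampleRequest per remote machine for the rest."""
    local_seeds, remote_seeds = [], {}
    for s in seeds:
        part = owner_of(s)
        if part == my_part:
            local_seeds.append(s)
        else:
            remote_seeds.setdefault(part, []).append(s)

    frontiers = [local_sampler(local_seeds, fanout)]          # sample on this machine
    for part, part_seeds in remote_seeds.items():
        frontiers.append(send_sample_request(part, part_seeds, fanout))  # RPC to owner

    # In DistDGL the per-machine results are merged into one sampled graph;
    # here we just return the pieces.
    return frontiers
```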

But my confusion is about how node/edge features on remote machines are fetched. I notice that the SampleRequest returns the node/edge structure & IDs, but I can’t find the code that sends the feature messages. Does the server send the features along with the IDs, or are the features fetched in some other way?

Features are stored in and fetched via DistTensor in DistDGL. Please refer to the link below for more details: dgl.distributed — DGL 0.8 documentation
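For example, usage looks roughly like this (a sketch only: the graph name and IP config file are placeholders, and it assumes a partitioned graph plus a launched DistDGL cluster):

```python
import torch
import dgl

# Sketch: 'ip_config.txt' and 'graph_name' are placeholders for your setup.
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('graph_name')

# Node features live in DistTensors, sharded row-wise across the machines.
feat = dgl.distributed.DistTensor((g.num_nodes(), 16), torch.float32, name='h')
g.ndata['h'] = feat

# Reading a slice returns an ordinary local torch.Tensor, no matter which
# machines actually hold those rows.
local_rows = g.ndata['h'][torch.arange(0, 10)]
```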

I read the detailed description of DistTensor, but I’m still unclear on how one machine fetches the part of a DistTensor that is stored on another machine. Could you please clarify a little more?

I know that features are stored in DistTensor data structures. A DistTensor is sharded across a cluster of machines, which means each machine holds one part of the feature matrix (the rows for the nodes in its partition). So when machine 0 needs feature rows that are stored on machine 1, how does it fetch them? Is it still an RPC request, similar to what I found for the communication on graph structures, or is there some other mechanism?

Thanks

Yes, remote data is fetched via RPC. The details of the implementation are a bit complicated. In short, local IDs are converted to global IDs, a request is sent to each target machine that stores the requested data, and the responses are collected. Please refer to the code/call path below for more details.
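Conceptually it works like the toy example below (an illustration of the idea only, not DistDGL’s real code; the partition book and the RPC are simulated):

```python
import numpy as np

# Toy illustration: the feature matrix is sharded row-wise, a partition book
# maps each global node ID to its owner machine, and a fetch groups the
# requested IDs by owner, reads local rows directly, and issues one
# (simulated) RPC per remote machine for the rest.

NUM_MACHINES = 2
FEAT_DIM = 4

# Machine 0 owns global nodes 0-4, machine 1 owns global nodes 5-9.
full_feat = np.arange(10 * FEAT_DIM, dtype=np.float32).reshape(10, FEAT_DIM)
shards = {0: full_feat[0:5], 1: full_feat[5:10]}

def owner_of(gid):            # partition book: global ID -> owner machine
    return 0 if gid < 5 else 1

def to_local(gid):            # global ID -> row index inside the owner's shard
    return gid if gid < 5 else gid - 5

def rpc_pull(machine_id, local_ids):
    """Stand-in for the RPC round trip to a remote machine's server process."""
    return shards[machine_id][local_ids]

def fetch(global_ids, my_machine=0):
    global_ids = np.asarray(global_ids)
    out = np.empty((len(global_ids), FEAT_DIM), dtype=np.float32)
    for m in range(NUM_MACHINES):
        mask = np.array([owner_of(g) == m for g in global_ids])
        if not mask.any():
            continue
        local_ids = np.array([to_local(g) for g in global_ids[mask]])
        if m == my_machine:
            out[mask] = shards[m][local_ids]    # local read, no network
        else:
            out[mask] = rpc_pull(m, local_ids)  # one request per remote machine
    return out

print(fetch([1, 7, 3, 9]))  # rows come back in request order, regardless of owner
```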

Got it! My last question: when do we fetch remote data via RPC? Does it happen when forming the blocks, when doing the forward computation for a batch, or at some other point?
Thanks for the clarification.

When the tensor data is accessed, i.e. when __getitem__ is called, such as g.ndata['h'][0:10] or in dgl/train_dist.py at 39121dfdb80e36620fa62517da020f7a8c12e51a · dmlc/dgl · GitHub
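So in a typical distributed training loop the remote rows are pulled once per mini-batch, at the point where the input features are sliced out of the DistTensor, before the forward pass. A sketch (the field names 'features'/'labels' just follow the example script and are placeholders):

```python
def train_one_epoch(g, dataloader, model, loss_fcn, optimizer):
    # Sketch of a distributed training epoch; g is a DistGraph and dataloader
    # yields (input_nodes, seeds, blocks) as in DGL's distributed examples.
    for input_nodes, seeds, blocks in dataloader:
        # These slices are the __getitem__ calls on DistTensors: any rows
        # stored on other machines are pulled over RPC at this point.
        batch_inputs = g.ndata['features'][input_nodes]
        batch_labels = g.ndata['labels'][seeds]

        batch_pred = model(blocks, batch_inputs)  # forward uses local tensors only
        loss = loss_fcn(batch_pred, batch_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```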

