I have a few questions about how distributed GNN training distributes data and work across hosts, and I would appreciate corrections/clarifications.
1) Each host gets a disjoint partition of the graph. However, to my understanding, it is still possible to access node/edge data that lives on other hosts through the Distributed Tensor interface, which sends out requests for data that is not local. Is this correct?
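To make concrete what I mean by "sends requests out," here is a toy model in plain Python (not any real API, and the contiguous sharding is just an assumption) where each host owns a disjoint shard and a read of a non-local index counts as a simulated network request to the owning host:

```python
# Toy model of a distributed tensor read: each host owns a disjoint
# contiguous shard; reads of non-local indices are "forwarded" to the
# owning host (here just counted, in a real system it would be an RPC).

class ToyDistTensor:
    def __init__(self, num_hosts, values):
        n = len(values)
        self.per = (n + num_hosts - 1) // num_hosts  # shard size
        self.shards = [values[i * self.per:(i + 1) * self.per]
                       for i in range(num_hosts)]
        self.remote_reads = 0  # simulated network requests

    def read(self, host, idx):
        owner = idx // self.per
        if owner != host:
            self.remote_reads += 1  # would be a network round trip
        return self.shards[owner][idx % self.per]

t = ToyDistTensor(2, [10, 20, 30, 40])
assert t.read(0, 1) == 20 and t.remote_reads == 0  # local read on host 0
assert t.read(0, 3) == 40 and t.remote_reads == 1  # remote read from host 1
```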
2) I'm assuming the Distributed Tensor works something like Distributed Shared Memory, caching non-local items for a period of time. Is this correct?
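By "DSM-style caching" I mean something like the following toy sketch (again plain Python, purely my guess at the behavior, not a real implementation): a host keeps a local cache of remotely fetched entries so repeated reads of the same index do not hit the network again.

```python
# Toy read-through cache over a simulated remote fetch: the first read
# of an index pays a "network" round trip, later reads are served locally.

def make_reader(remote_fetch):
    cache = {}
    stats = {"remote": 0}
    def read(idx):
        if idx not in cache:
            stats["remote"] += 1      # simulated network round trip
            cache[idx] = remote_fetch(idx)
        return cache[idx]
    return read, stats

data = {5: 3.14, 7: 2.71}
read, stats = make_reader(lambda i: data[i])
assert read(5) == 3.14 and stats["remote"] == 1
assert read(5) == 3.14 and stats["remote"] == 1  # served from cache
```

(Whether a real Distributed Tensor ever invalidates such a cache, and when, is exactly what I'm unsure about.)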
3) Are there limits on which parts of a distributed tensor each host can write? For example, is each host limited to writing its own local shard? And what about consistency: what happens if many hosts write to the same location in the tensor?
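The two alternatives I'm imagining look like this toy sketch (plain Python, not a real API): either non-owner writes are rejected, or writes go through unsynchronized and concurrent writers to one slot race, with the last writer winning.

```python
# Toy sharded writes: optionally restrict writes to the local shard;
# otherwise writes are unsynchronized and the last writer wins.

class ToyShardedWrites:
    def __init__(self, num_hosts, size):
        self.per = (size + num_hosts - 1) // num_hosts
        self.data = [0] * size

    def owner(self, idx):
        return idx // self.per

    def write(self, host, idx, value, local_only=True):
        if local_only and self.owner(idx) != host:
            raise PermissionError("non-local write rejected")
        self.data[idx] = value  # no synchronization: last writer wins

t = ToyShardedWrites(2, 4)
t.write(0, 1, 7)                      # host 0 writes its own shard: allowed
t.write(0, 3, 8, local_only=False)    # cross-shard write, if permitted
t.write(1, 3, 9, local_only=False)    # second writer to slot 3: 9 survives
assert t.data == [0, 7, 0, 9]
```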
-
4) All hosts will end up seeing the same “result” due to the nature of the Distributed Tensor, correct? That is, even though the hosts work on disjoint sets of training nodes, they all write into the same result matrix.
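In toy form (plain Python, simulating the hosts sequentially, with an assumed round-robin node ownership), this is what I picture: each host computes results only for its own nodes but writes them into one shared matrix, so after a barrier every host would observe the same combined result.

```python
# Toy version of "disjoint work, one shared result": node i is owned by
# host i % num_hosts; each host fills in only its own rows of the shared
# result, and the union covers every node exactly once.

def run_hosts(num_nodes, num_hosts):
    result = [None] * num_nodes          # stands in for the shared tensor
    for host in range(num_hosts):
        for i in range(host, num_nodes, num_hosts):  # this host's nodes
            result[i] = f"emb_from_host{host}_node{i}"
    return result

result = run_hosts(4, 2)
assert result == ["emb_from_host0_node0", "emb_from_host1_node1",
                  "emb_from_host0_node2", "emb_from_host1_node3"]
```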
EDIT:
5) How are the parameters of each layer updated on each host? To my understanding, each host only trains its own copy of the model on its local subgraph, so the models on the different hosts will end up different at the end of training. Is this correct?
Please correct any wrong assumptions I may have made in these questions.
Thank you.