Distributed GNN Training questions

I have a few questions about how distributed GNN training works and would appreciate corrections/clarifications.

  1. Each host gets a disjoint partition of the graph. However, to my understanding, it is possible to access the node/edge data from other hosts using the Distributed Tensor interface which will send requests out for data that is not local. Is this correct?

  2. I’m assuming that the Distributed Tensor works something like a Distributed Shared Memory where it will cache non-local items for a period of time. Is this correct?

  3. Are there limits on which parts of a distributed tensor each host can write? e.g., is each host limited to writing its local shard? What about consistency? e.g., what happens if many hosts write to the same location in the tensor?

  4. All hosts will end up seeing the same “result” due to the nature of the Distributed Tensor, correct? i.e., even though all hosts work on a disjoint set of training nodes, they will all write to the same result matrix.

EDIT:
5) How are the parameters of each layer updated on each host? To my understanding, each host will only train its copy of the model on its local subgraph, so the models on the different hosts will end up different at the end of training. Is this correct?

Please correct any wrong assumptions I may have made in these questions.

Thank you.

  1. Yes. You can read/write data from/to any partition (the short sketch after this list shows both reads and writes).

  2. Currently there is no cache, but the design does take locality into account: through the partitioning strategy, we try to keep most of the data a machine needs on that machine.

  3. Consistency is currently guaranteed by a barrier, which can be inefficient in some cases, but not in our scenario: the system is designed for synchronous training, where the parameters need to be allreduced/synchronized anyway. There is a sync point, and all of the write operations happen right after it, so there isn’t much of a performance issue in the current scenario. There is only one copy of each ndata/edata across the cluster; if two workers write to the same ndata/edata, the later write is the one that is preserved.

  4. Yes. What’s the difference between this question and the first one?

  5. Parameters are updated through framework (PyTorch) operators. Currently this happens during `loss.backward()`, using something like `torch.distributed.all_reduce`.
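
To make 1) and 3) a bit more concrete, here is a rough sketch of reading and writing node data through `dgl.distributed`. The file name `ip_config.txt`, graph name `'my_graph'`, and feature name `'feat'` are placeholders; treat the snippet as illustrative rather than exact:

```python
import torch
import dgl

# Assumes this runs inside each trainer process of a cluster that was set up
# with DGL's distributed tooling; 'ip_config.txt' and 'my_graph' are placeholders.
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('my_graph')

# g.ndata['feat'] behaves like one big tensor spread over the cluster.
nids = torch.tensor([0, 1, 2])     # node IDs that may live on any partition
x = g.ndata['feat'][nids]          # read: remote rows are pulled over the network

# Writes go to whichever partition owns each row. If two trainers write the
# same row, the later write wins, which is why writes normally happen right
# after a synchronization point (barrier).
g.ndata['feat'][nids] = x * 2.0
```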

Feel free to ask if there’s anything I didn’t explain clearly.

Thank you for the replies.

re: 1) and 4): The difference between 1 and 4 is that in 1 I was more interested in reads, while in 4 I was asking more about writes.

For the reads: I’m assuming that non-local reads are blocking, correct? If not, how exactly do non-local reads occur?

Some additional questions:

  • re: parameters of layers: Are the parameters of the layers duplicated on all hosts since the entire parameter set is needed to do forward propagation?
  • re: allreduce for gradient synchronization: Correct me if I’m wrong, but does this mean that the individual gradients from each host are summed together and then used to update the layer weights? If that’s the case, no explicit sync of the parameters should be necessary, since every host would apply the same update with the same gradients, correct? (I’ve tried to sanity-check this in the small sketch after this list.)
  • Are there any accuracy tradeoffs when moving to the distributed setting? For example, since gradients are presumably combined with an allreduce, the gradients the system ends up with in the distributed setting would be different from those in the single-host setting.
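
To sanity-check my own reasoning about the combining step, here is a tiny single-process sketch (plain PyTorch, no DGL; the toy weights, data, learning rate, and two-host split are all made up). It checks that two hosts applying the same averaged gradient end up with identical parameters, and that the average matches the single-host full-batch gradient when the loss is a mean:

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(8, 4), torch.randn(8, 3)

def grad(W, xb, yb):
    """Gradient of a mean-squared-error loss for a toy linear model."""
    loss = ((xb @ W - yb) ** 2).mean()
    return torch.autograd.grad(loss, W)[0]

# Two "hosts" start from identical weights and see disjoint half-batches.
W0 = torch.randn(4, 3)
W_host0 = W0.clone().requires_grad_()
W_host1 = W0.clone().requires_grad_()
g_avg = (grad(W_host0, X[:4], y[:4]) + grad(W_host1, X[4:], y[4:])) / 2

# Both hosts apply the same averaged gradient -> parameters stay identical,
# with no separate parameter-synchronization step.
lr = 0.1
print(torch.equal(W_host0.detach() - lr * g_avg,
                  W_host1.detach() - lr * g_avg))                      # True

# The averaged gradient also matches the single-host full-batch gradient.
print(torch.allclose(g_avg, grad(W0.clone().requires_grad_(), X, y)))  # True
```

If that reflects what actually happens, the averaging itself shouldn’t introduce a difference from the single-host gradients; please correct me if that assumption is off.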

Thank you!

Both local and non-local reads are blocking. They are packed into a pull message.

  • The graph is partitioned across the machines and each machine holds only its own part.
  • Gradients are averaged before the update (see the sketch after this list).
  • We are looking into asynchronous distributed methods to replace allreduce.
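
To illustrate what “averaged before update” means, here is a rough manual sketch; in practice `DistributedDataParallel` does this for you inside `loss.backward()`, and `model`, `compute_loss`, and `optimizer` below are placeholders:

```python
import torch.distributed as dist

def average_gradients(model):
    """After loss.backward(), make every rank hold the mean gradient."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across ranks
            p.grad /= world_size                           # then average

# Per-iteration pattern on every trainer (placeholders, for illustration):
#   loss = compute_loss(local_minibatch)   # disjoint mini-batch per host
#   loss.backward()                        # local gradients
#   average_gradients(model)               # identical gradients on every rank
#   optimizer.step()                       # identical update -> identical parameters
```

Because every rank starts from the same parameters and applies the same averaged gradient, the parameter copies stay in sync without a separate parameter broadcast, which matches your reasoning above.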

More information can be found here: https://docs.dgl.ai/guide/distributed.html

Thanks!

Thanks! I’ll ask if I have any more questions, but this is it for now.

Hello again. I have a follow-up question that I don’t think I got an answer to before.

The weights of the DNN model for each layer have to be duplicated across partitions since all of the weights are required to do forward propagation, correct?

Many thanks,
Loc

Sorry I missed this question. You are right.
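
Concretely, every trainer builds the full model locally; only the graph and its features are partitioned. A rough sketch (a plain MLP stands in for the GNN layers, the `'gloo'` backend and layer sizes are placeholders, and the process-group environment is assumed to be set up by the launcher):

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Every host constructs the *whole* model (all layer weights), even though it
# only owns one graph partition. The MLP below is a placeholder for the GNN.
dist.init_process_group('gloo')  # assumes MASTER_ADDR/RANK/etc. come from the launcher
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 16))

# DistributedDataParallel broadcasts rank 0's weights at construction time and
# all-reduces gradients during backward, so the replicas stay identical.
model = DistributedDataParallel(model)
```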

Thanks for the reply!