Why is num_hops in the partition scripts not the same value as the GNN num_layers?

Hello, I have a question about distributed GNN training with DistDGL, and I would appreciate some advice on whether my understanding is correct.

  • My understanding of distributed GNN training: after the graph is partitioned, each worker should hold its own partition together with all of that partition’s k-hop neighbors, where k is the number of GNN layers, as well as the partition’s node/edge embeddings. During training, workers then use DistTensor to request embeddings from other hosts according to their own partitions.
  • Given this, I wonder why the partition example does not set num_hops to match the GNN num_layers but instead uses the default value 1, while num_layers in the training scripts defaults to 2 (see the sketch after this list). Is this combination correct for training?
  • Also, if I change num_layers in the training scripts, is the training process still correct in this example?
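For concreteness, here is a minimal sketch of the two settings being compared. The graph, names, and paths are placeholders, and the values shown are just the defaults mentioned above:

```python
import dgl
import torch

# Stand-in for the real dataset: a small random graph with node features.
g = dgl.rand_graph(1000, 5000)
g.ndata['feat'] = torch.randn(1000, 16)

# Partition-script side: num_hops controls how many hops of neighbor nodes
# are stored alongside each partition, not how deep the GNN can be.
dgl.distributed.partition_graph(
    g,
    graph_name='toy',            # placeholder name
    num_parts=2,
    out_path='toy_partitions/',  # placeholder path
    num_hops=1,                  # default used by the example partition script
    part_method='metis',
)

# Training-script side: the model depth is configured independently.
num_layers = 2                   # default in the example training script
```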

Hi,

Each partition’s num_hops is not tied to the number of hops used later during sampling. A minimum of one hop is required for each partition; any additional hops simply act as a local cache. If a neighbor is not in the local cache, the sampler will reach out to the partition that owns it to fetch the required neighbors.

For example, if you do 2-hop sampling on a 1-hop partition, the second sampling hop will query other partitions for the missing neighbors.
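As a hypothetical sketch of that setup: the file names and fan-outs below are placeholders, and the exact dataloader class names differ slightly across DGL versions, but the point is that a two-hop NeighborSampler runs on partitions built with num_hops=1 because remote neighbors are fetched on demand.

```python
import dgl
import torch

# Connect this trainer to the distributed graph (placeholder config files).
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('toy', part_config='toy_partitions/toy.json')

# Two sampling hops, even though the partitions were built with num_hops=1.
sampler = dgl.dataloading.NeighborSampler([10, 10])

# Split training nodes across trainers (here: all nodes, for illustration).
train_nids = dgl.distributed.node_split(
    torch.ones(g.num_nodes(), dtype=torch.bool), g.get_partition_book())

dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True)

for input_nodes, output_nodes, blocks in dataloader:
    # The outer block's neighbors may have been fetched from remote partitions
    # when they were not cached locally.
    pass
```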

Each node is owned by exactly one partition, but a node can also be duplicated in multiple partitions, like a local cache in a distributed system.
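If you want to see this ownership-plus-duplication layout directly, you can load one partition offline and inspect its inner_node mask. The exact return signature of load_partition varies across DGL versions, so treat this as a sketch:

```python
import dgl

# Load partition 0 from the (placeholder) partition config written earlier.
part_graph, node_feats, edge_feats, gpb, *_ = dgl.distributed.load_partition(
    'toy_partitions/toy.json', part_id=0)

inner = part_graph.ndata['inner_node'].bool()
print('owned nodes:', int(inner.sum()))      # nodes this partition owns
print('duplicated copies:', int((~inner).sum()))  # neighbors cached from other partitions
```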