Why is the num_hops in partition scripts not the same value as GNN num_layers?

Hello, I have a question about distributed GNN training with DistDGL, and I hope you can tell me whether my understanding is correct.

  • My understanding of distributed GNN training: after partitioning the graph, each worker should hold its own partition together with all of that partition’s k-hop neighbors, where k is the number of GNN layers, plus the node/edge embeddings of its own partition. During training, workers then use DistTensor to request embeddings from other hosts according to their own partitions.
  • So I wonder why the partition example does not set num_hops according to the GNN num_layers but instead uses the default value of 1, while num_layers in the training script defaults to 2 (see the sketch below). Is this setting correct for training?
  • Also, if I change num_layers in the training script, is the training process in this example still correct?
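For reference, this is roughly the mismatch I mean, as a minimal sketch (the dataset, paths, and fanouts are illustrative; the exact arguments in the DGL example scripts may differ):

```python
import dgl

# Partitioning side: the example partitions the graph with the default
# num_hops=1, i.e. each partition only keeps its 1-hop halo structure.
g = dgl.rand_graph(1000, 5000)        # stand-in for the real dataset
dgl.distributed.partition_graph(
    g, graph_name='demo', num_parts=4,
    out_path='data/demo_partitioned',
    num_hops=1,                       # default in the partition script
    part_method='metis')

# Training side: the model in the training script defaults to num_layers=2,
# i.e. it samples 2-hop neighborhoods.
num_layers = 2
fanouts = [10, 25]                    # one fanout per GNN layer
assert len(fanouts) == num_layers
```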

Hi,

Each partition’s num_hops is not tied to the number of hops used later in the sampling step. A minimum of one hop is required for each partition; any additional hops can be thought of as a local cache. If the needed neighbors are not cached locally, the sampler will reach out to the other partitions to get them.

For example, if you want to do 2-hop sampling on a 1-hop partition, the second sampling hop will query other partitions for the neighbors.

Each node is owned by exactly one partition, but a node can be duplicated in multiple partitions, like a local cache in a distributed system.
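Here is a hedged sketch of what that looks like for a 2-hop sampler on a 1-hop partition: each hop is a distributed call, so hops that leave the local halo are served by whichever partitions own those nodes. It assumes an already-launched DistDGL cluster; the ip_config path, graph name, and fanouts are illustrative.

```python
import dgl
import torch

# Assumes the usual DistDGL setup: a launched cluster, an ip_config file,
# and a graph partitioned with num_hops=1 (all names below are illustrative).
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('demo', part_config='data/demo_partitioned/demo.json')

seeds = torch.tensor([0, 1, 2])       # seed nodes assigned to this trainer
blocks = []
for fanout in [10, 25]:               # 2 sampling hops on a 1-hop partition
    # Each hop is a distributed call: neighbors that live outside the local
    # partition (and its 1-hop halo) are fetched from the partitions that own them.
    frontier = dgl.distributed.sample_neighbors(g, seeds, fanout)
    block = dgl.to_block(frontier, seeds)
    seeds = block.srcdata[dgl.NID]    # this hop's frontier seeds the next hop
    blocks.insert(0, block)
```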

@VoVAllen

“For example, if you want to do 2-hop sampling on a 1-hop partition, the second sampling hop will query other partitions for the neighbors.”

Yes, I agree with this point. But my question is: if only one hop is set during partitioning, each partition ends up with only a one-hop graph structure. Then, since the local graph structure itself does not contain two-hop information, the second sampling hop will be unable to query the two-hop nodes.

So my point is: the partitioning process should use the same k hops as the sampling algorithm, keeping the k-hop graph structure information plus only the local nodes’ feature information, rather than only the 1-hop graph structure and the local nodes’ feature information.

I wonder whether my point is correct.

You are right, this is what we do now. Node features have only one storage location and there is no feature cache in each partition. The num_hops setting only affects the graph structure, acting as extra cached hops of structure in each partition.
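In other words (a sketch of my reading of this, not official example code): features live only on the partition that owns each node and are fetched on demand through the DistTensor behind g.ndata, while num_hops only adds extra graph structure to each partition. Continuing the sampling sketch above, with 'feat' as an assumed feature name:

```python
# After sampling, pull the input features for the first block.
# g.ndata['feat'] is backed by a DistTensor holding a single copy of each
# node's feature on its owner partition; indexing it triggers remote fetches
# for any nodes that are not stored locally.
input_nodes = blocks[0].srcdata[dgl.NID]
batch_feats = g.ndata['feat'][input_nodes]
```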

Thanks! Now I’ve got the point.
