Why is the num_hops in partition scripts not the same value as GNN num_layers?

Hello, I have a question about distributed GNN training with DistDGL, and I hope you can tell me whether my understanding is correct.

  • My understanding of distributed GNN training: after partitioning the graph, each worker should hold its own partition together with all of that partition’s k-hop neighbors, where k is the number of GNN layers, plus the node/edge embeddings of its own partition. During training, workers then use DistTensor to request embeddings from other hosts according to their own partitions.
  • So I wonder why the partition example does not set num_hops according to the GNN num_layers but instead uses the default value of 1, while num_layers in the training script defaults to 2 (see the sketch below). Is this setting correct for training?
  • Also, if I change num_layers in the training script, is the training process in this example still correct?
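For reference, this is roughly the mismatch I mean, as a minimal sketch (the dataset, paths, and fanouts are illustrative; the exact arguments in the DGL example scripts may differ):

```python
import dgl

# Partitioning side: the example partitions the graph with the default
# num_hops=1, i.e. each partition only keeps its 1-hop halo structure.
g = dgl.rand_graph(1000, 5000)        # stand-in for the real dataset
dgl.distributed.partition_graph(
    g, graph_name='demo', num_parts=4,
    out_path='data/demo_partitioned',
    num_hops=1,                       # default in the partition script
    part_method='metis')

# Training side: the model in the training script defaults to num_layers=2,
# i.e. it samples 2-hop neighborhoods.
num_layers = 2
fanouts = [10, 25]                    # one fanout per GNN layer
assert len(fanouts) == num_layers
```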

Hi,

Each partition’s num_hops is not tied to the number of hops used later in the sampling step. A minimum of one hop is required for each partition; any additional hops can be thought of as a local cache. If the needed neighbors are not cached locally, the sampler will reach out to the other partitions to get them.

For example, if you want to do 2-hop sampling on a 1-hop partition, the second sampling hop will query other partitions for the neighbors.

Each node is owned by exactly one partition, but a node can be duplicated in multiple partitions, like a local cache in a distributed system.
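Here is a hedged sketch of what that looks like for a 2-hop sampler on a 1-hop partition: each hop is a distributed call, so hops that leave the local halo are served by whichever partitions own those nodes. It assumes an already-launched DistDGL cluster; the ip_config path, graph name, and fanouts are illustrative.

```python
import dgl
import torch

# Assumes the usual DistDGL setup: a launched cluster, an ip_config file,
# and a graph partitioned with num_hops=1 (all names below are illustrative).
dgl.distributed.initialize('ip_config.txt')
g = dgl.distributed.DistGraph('demo', part_config='data/demo_partitioned/demo.json')

seeds = torch.tensor([0, 1, 2])       # seed nodes assigned to this trainer
blocks = []
for fanout in [10, 25]:               # 2 sampling hops on a 1-hop partition
    # Each hop is a distributed call: neighbors that live outside the local
    # partition (and its 1-hop halo) are fetched from the partitions that own them.
    frontier = dgl.distributed.sample_neighbors(g, seeds, fanout)
    block = dgl.to_block(frontier, seeds)
    seeds = block.srcdata[dgl.NID]    # this hop's frontier seeds the next hop
    blocks.insert(0, block)
```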

@VoVAllen

“For example, if you want to do 2-hop sampling on a 1-hop partition, the second sampling hop will query other partitions for the neighbors.”

Yes, I agree with this point. But my question is: if only one hop is set during partitioning, each partition ends up with only a one-hop graph structure. Then, since the local graph structure itself does not contain two-hop information, the second sampling hop will be unable to query the two-hop nodes.

So my point is: the partitioning process should use the same k hops as the sampling algorithm, keeping the k-hop graph structure information plus only the local nodes’ feature information, rather than only the 1-hop graph structure and the local nodes’ feature information.

I wonder whether my point is correct.

You are right, this is what we do now. Node features have only one storage location and there is no feature cache in each partition. The num_hops setting only affects the graph structure, acting as extra cached hops of structure in each partition.
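In other words (a sketch of my reading of this, not official example code): features live only on the partition that owns each node and are fetched on demand through the DistTensor behind g.ndata, while num_hops only adds extra graph structure to each partition. Continuing the sampling sketch above, with 'feat' as an assumed feature name:

```python
# After sampling, pull the input features for the first block.
# g.ndata['feat'] is backed by a DistTensor holding a single copy of each
# node's feature on its owner partition; indexing it triggers remote fetches
# for any nodes that are not stored locally.
input_nodes = blocks[0].srcdata[dgl.NID]
batch_feats = g.ndata['feat'][input_nodes]
```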

Thanks! Now I’ve got the point.
