Finding halos within k hops

Hi,

I am working with the example distributed node-classification script on the 'products' dataset. I generated the set of halo nodes for each partition with:

remote_nid = th.nonzero(g.local_partition.ndata["inner_node"] == False).squeeze()

However, what I am struggling with is filtering the halo set per rank (trainer). My expectation is that, given num_layers (hops), I should be able to find a smaller set of unique halo nodes within k hops of train_nid.

remote_nid = th.nonzero(g.local_partition.ndata["inner_node"] == False).squeeze()  # True indicates local nodes, False indicates remote (halo) nodes.
src, dst = g.local_partition.edges()
hop_halo_nodes_set = check_halo_nodes(g, train_nid, g.local_partition.ndata["inner_node"], src, dst, args.num_layers)

check_halo_nodes() checks how many, and which, of the current partition's halo nodes are reachable within num_layers hops.

import numpy as np

def check_halo_nodes(g, train_nid, inner_node, src, dst, num_hops):
    hop_halo_nodes_set = set()
    current_nodes_set = set(train_nid.tolist())
    visited_nodes_set = set()

    for _ in range(num_hops):
        next_nodes_set = set()

        # Find indices where dst nodes are in the current frontier.
        # Note: np.isin needs a list/array, not a set; passing a set
        # silently matches nothing.
        current_dst_indices = np.where(np.isin(dst, list(current_nodes_set)))[0]
        current_src_nodes = src[current_dst_indices]

        # Exclude nodes that point back to already-visited nodes.
        mask = ~np.isin(current_src_nodes, list(visited_nodes_set))
        current_src_nodes = current_src_nodes[mask]

        for node in current_src_nodes:
            if not inner_node[node]:
                hop_halo_nodes_set.add(node.item())
            else:
                next_nodes_set.add(node.item())

        visited_nodes_set.update(current_nodes_set)
        current_nodes_set = next_nodes_set

    print(f"Number of unique halo nodes after {num_hops} hops: {len(hop_halo_nodes_set)}")
    return hop_halo_nodes_set
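To sanity-check the traversal itself, here is a minimal self-contained toy example (NumPy only, no DGL; halo_nodes_within is just a hypothetical stripped-down version of the same reverse-BFS, using local IDs throughout). Node 4 is three hops from the train node, so with num_hops=2 only node 3 should be reported:

```python
import numpy as np

# Toy graph in local IDs: 0 <- 1 <- 2 <- 4(halo), and 0 <- 3(halo).
src = np.array([1, 3, 2, 4])
dst = np.array([0, 0, 1, 2])
inner_node = np.array([True, True, True, False, False])

def halo_nodes_within(train_nid, inner_node, src, dst, num_hops):
    halos, visited = set(), set()
    current = set(train_nid)
    for _ in range(num_hops):
        nxt = set()
        # Edges whose destination is in the current frontier.
        idx = np.where(np.isin(dst, list(current)))[0]
        srcs = src[idx]
        # Drop sources we have already expanded (pass a list, not a set).
        srcs = srcs[~np.isin(srcs, list(visited))]
        for n in srcs:
            if not inner_node[n]:
                halos.add(int(n))   # remote (halo) node
            else:
                nxt.add(int(n))     # local node, expand next hop
        visited |= current
        current = nxt
    return halos

print(halo_nodes_within([0], inner_node, src, dst, 2))  # {3}
```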

This works with 4 or 8 machines.

But with 16 or 32 machines, the function returns 0 unique halo nodes for most trainers; only a very few trainers return a non-empty halo set. I can't figure out why.

An empty set isn't impossible in itself, since it depends on the structure of the graph, but I'd like to know whether my approach matches what I am trying to achieve.

Is train_nid returned from node_split()? If so, it contains global IDs, but in the body of check_halo_nodes() you compare them against local IDs of the partition (src and dst from g.local_partition.edges() are local). That mismatch is the root cause.

Here's a good reference for how to fetch local/remote sources given train_nid: https://github.com/dmlc/dgl/blob/297e120feabdb430d7d1597b4657864aadd33797/python/dgl/distributed/graph_services.py#L467.
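Concretely, the remap could look something like this (a self-contained toy sketch, no DGL: local_to_global stands in for g.local_partition.ndata[dgl.NID], which in a DGL partitioned graph stores each local node's global ID):

```python
import numpy as np

# Toy partition: local ID = array index, value = that node's global ID.
# This mimics g.local_partition.ndata[dgl.NID].
local_to_global = np.array([40, 41, 7, 8, 9])

# Build the reverse map: global ID -> local ID.
global_to_local = {int(g): l for l, g in enumerate(local_to_global)}

# train_nid from node_split() is in GLOBAL IDs; remap it to local IDs
# before comparing against src/dst from g.local_partition.edges().
train_nid_global = np.array([7, 41])
train_nid_local = np.array([global_to_local[int(g)] for g in train_nid_global])
print(train_nid_local.tolist())  # [2, 1]
```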

Thank you so much for your response! That fixed the issue.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.