Counting the nodes for a node type in a heterogeneous graph

In the following code segment, which creates a heterogeneous graph, it seems to me that nodes 3 and 4 are of node type 'gene'. So g.nodes('gene') should return tensor([3, 4]). However, it returns tensor([0, 1, 2, 3, 4]). I am not sure whether I am missing something.

import dgl
import torch as th
graph_data = {
    ('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([3, 4]))
}
g = dgl.heterograph(graph_data)

DGL numbers node IDs consecutively from 0 within each node type. If you pass tensor([3, 4]) as the destination IDs of the edges, DGL assumes that 'gene' nodes 0, 1, and 2 also exist; they simply have no edges.

You could also refer to the user guide chapter on heterograph for more details.

Thanks a lot for your response. However, my confusion still prevails. Let me rephrase the question.

I assume that g.nodes('gene') returns the set of nodes that are of type 'gene'. If that understanding is correct, the statement print(g.nodes('gene')) should return tensor([3,4]) but not tensor([0, 1, 2, 3, 4]). I suppose that nodes 0, 1, 2 do not belong to the category 'gene'.

Any explanation in this direction will be of great help.

Hi, sorry for the late reply.

I assume that g.nodes('gene') returns the set of nodes that are of type 'gene'.

Your understanding up to this point is correct. The discrepancy lies in how DGL represents a node in a heterogeneous graph. In general, there are two approaches:

  1. Put all nodes in one ID space. For example, 0, 1, 2 are drug nodes and 3, 4 are gene nodes. I think this is what you have in mind.
  2. Give each node type its own ID space, starting from zero. One can then think of each node as a (type, ID) pair. For example, ('drug', 0), ('drug', 1), ('drug', 2) are three drug nodes, while ('gene', 0), ('gene', 1) are two gene nodes.

DGL uses the second approach because it makes internal storage more efficient. Therefore, when you specify graph_data as {('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([3, 4]))}, DGL treats it as two edges, ('drug', 0) → ('gene', 3) and ('drug', 1) → ('gene', 4). Since node IDs must start from zero, DGL infers that the 'gene' type has 5 nodes, of which only ('gene', 3) and ('gene', 4) have incoming edges. That's why g.nodes('gene') returns [0, 1, 2, 3, 4].
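
If your data uses a single shared ID space (the first approach), one option is to renumber each node type from zero before building the graph. The sketch below is plain Python and makes no DGL calls; `to_local_ids` is a hypothetical helper name, and it assumes that only the gene IDs appearing in the edge list should become nodes.

```python
def to_local_ids(global_ids):
    """Map arbitrary IDs to consecutive IDs starting at 0.

    Returns the remapped list and a lookup list in which position k
    holds the original ID of local node k.
    """
    originals = sorted(set(global_ids))
    local_of = {orig: local for local, orig in enumerate(originals)}
    return [local_of[i] for i in global_ids], originals


# Edges from the question, with gene IDs 3 and 4 taken from a shared ID space.
drug_src = [0, 1]
gene_dst_global = [3, 4]

gene_dst, gene_originals = to_local_ids(gene_dst_global)
print(gene_dst)        # [0, 1]
print(gene_originals)  # [3, 4]
```

Building the graph from th.tensor(drug_src) and th.tensor(gene_dst) would then give a 'gene' type with two nodes, and gene_originals[k] recovers the original ID of local node k.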
