Hi, sorry for the late reply.
I assume that g.nodes('gene') returns the set of nodes that are of type 'gene'.
Your understanding up to this point is correct. The discrepancy lies in how DGL represents nodes in a heterogeneous graph. In general, there are two approaches:
- Group all nodes into a single ID space. For example, 0, 1, 2 are drug nodes and 3, 4 are gene nodes. I think this is what you have in mind.
- Give each node type its own ID space, starting from zero. Each node is then effectively a (type, ID) pair. For example, ('drug', 0), ('drug', 1), ('drug', 2) are three drug nodes, while ('gene', 0), ('gene', 1) are two gene nodes.
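To make the difference between the two schemes concrete, here is a minimal pure-Python sketch of converting between them. The names TYPE_OFFSETS, to_typed and to_global are made up for illustration and are not part of DGL; the layout (drugs at global IDs 0-2, genes at 3-4) is just the example from the list above.

```python
from bisect import bisect_right

# Hypothetical layout for illustration: in the single ID space,
# global IDs 0-2 are drugs and 3-4 are genes.
TYPE_OFFSETS = [("drug", 0), ("gene", 3)]  # (node type, first global ID)


def to_typed(global_id):
    """Convert a global node ID into a (type, per-type ID) pair."""
    if global_id < 0:
        raise ValueError("node IDs must be non-negative")
    starts = [start for _, start in TYPE_OFFSETS]
    # Find the last type whose first global ID is <= global_id.
    idx = bisect_right(starts, global_id) - 1
    ntype, start = TYPE_OFFSETS[idx]
    return ntype, global_id - start


def to_global(ntype, local_id):
    """Convert a (type, per-type ID) pair back into a global node ID."""
    return dict(TYPE_OFFSETS)[ntype] + local_id


typed = [to_typed(i) for i in range(5)]
print(typed)
```

Under the second scheme, global node 3 becomes ('gene', 0), so the gene IDs start from zero again.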
DGL uses the second approach because it makes internal storage more efficient. Therefore, when you specify the graph_data as {('drug', 'interacts', 'gene'): (th.tensor([0, 1]), th.tensor([3, 4]))}, DGL treats it as two edges, ('drug', 0) → ('gene', 3) and ('drug', 1) → ('gene', 4). Since the node IDs of each type must start from zero, DGL infers that the 'gene' type has 5 nodes, of which only ('gene', 3) and ('gene', 4) have incoming edges. That's why g.nodes('gene') returns [0, 1, 2, 3, 4].