Convolutions with mini-batches of heterogeneous graphs

I initially created

input_features = {
                ntype: blocks[0].srcdata['feature'][ntype]
                for ntype in blocks[0].ntypes
            }

which is

{'disease': tensor([[-5.1274e-02,  2.6597e-02, -1.2490e-16,  ..., -1.8296e-02,
         -2.0168e-02, -1.1136e-02],
      ...,
        [-7.0330e-01, -5.1164e-01,  4.9960e-16,  ...,  6.5890e-02,
          1.1018e-01, -4.0044e-01]]), 'drug': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
       ...,
        [0., 0., 0.,  ..., 0., 0., 0.]]), 'protein': tensor([[-0.1390, -0.1390, -0.0860,  ...,  1.7840,  1.7840,  1.7840],
       ...,
        [ 0.8630,  0.8630, -2.8150,  ...,  0.4730,  0.4730,  0.4730]])}

I pass these input_features to the model, where they are used as the second argument of forward.

When I inspect inputs in the forward method of RelGraphConv, this is its value:

ParameterDict(
    (disease): Parameter containing: [torch.FloatTensor of size 24x348]
    (drug): Parameter containing: [torch.FloatTensor of size 50x348]
    (protein): Parameter containing: [torch.FloatTensor of size 372x348]
)

The inputs are then passed on to CustomHeteroGraphConv, and when I check the value of the second parameter h in its forward function, I get this:

>>> h
(Parameter containing:
tensor([[-0.0417,  0.0290, -0.0552,  ...,  0.1027, -0.1511, -0.0123],
      ..., 
        [ 0.0842, -0.1540,  0.0074,  ...,  0.0233,  0.0525, -0.0698]],
       requires_grad=True), tensor([[-0.1006, -0.1324, -0.0903,  ...,  0.0811, -0.0562,  0.1083],
       ..., 
        [-0.0921, -0.1670, -0.1631,  ..., -0.0653,  0.0133,  0.0732]],
       grad_fn=<SliceBackward>))

Since the number of source and destination nodes is different even for the same node type, you need to prepare source and destination node features separately. Can you try replacing

input_features = {
                ntype: blocks[0].srcdata['feature'][ntype]
                for ntype in blocks[0].ntypes
            }

by

src_features = blocks[0].srcdata['feature']
dst_features = blocks[0].dstdata['feature']
input_features = (src_features, dst_features)

?

I changed it accordingly and it no longer throws an error! Thank you!!

So now it runs through the training, but when it reaches the prediction of the score within the ScorePredictor, I receive an error:

  File "/.../lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/.../LinkPredictHetero.py", line 199, in forward
    pos_score = self.predictor(positive_graph, outputs, eval_edge_type)
  File "/Users/sophiakrix/Envs/deeplink/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/.../ScorePredictor.py", line 40, in forward
    dgl.function.u_dot_v('x', 'x', 'score'), etype=eval_edge_type)
  File "/.../lib/python3.7/site-packages/dgl/heterograph.py", line 4064, in apply_edges
    edata = core.invoke_gsddmm(g, func)
  File "/.../lib/python3.7/site-packages/dgl/core.py", line 195, in invoke_gsddmm
    x = alldata[func.lhs][func.lhs_field]
  File ".../lib/python3.7/site-packages/dgl/view.py", line 66, in __getitem__
    return self._graph._get_n_repr(self._ntid, self._nodes)[key]
  File "/.../lib/python3.7/site-packages/dgl/frame.py", line 373, in __getitem__
    return self._columns[name].data
KeyError: 'x'

My implementation of the ScorePredictor looks like this right now:

class ScorePredictor(nn.Module):
    def forward(
        self,
        edge_subgraph: dgl.DGLHeteroGraph,
        x: Dict[str, torch.Tensor],
        eval_edge_type: str,
    ) -> torch.Tensor:
        """Perform score prediction only on the evaluation edge type.

        :param edge_subgraph: subgraph to be evaluated
        :param x: dictionary mapping node type to features
        :param eval_edge_type: edge type to be evaluated
        :return: dictionary mapping edge type to the scores for the subgraph
        """
        with edge_subgraph.local_scope():

            edge_subgraph.ndata['x'] = x

            edge_subgraph.apply_edges(
                dgl.function.u_dot_v('x', 'x', 'score'), etype=eval_edge_type)
            return edge_subgraph.edata['score']

@mufeili Actually, an error occurs when I define block.dstdata and block.srcdata like this. The number of source node features does not match the number of source nodes:

  File "/.../lib/python3.7/site-packages/dgl/heterograph.py", line 3752, in _set_n_repr
    ' Got %d and %d instead.' % (nfeats, num_nodes))
dgl._ffi.base.DGLError: Expect number of features to match number of nodes (len(u)). Got 24 and 20 instead.
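
A minimal sketch for localizing this kind of mismatch, written in plain Python rather than with DGL calls: it compares the number of feature rows with the number of nodes for each node type. The function name find_row_mismatches and the example counts are hypothetical, chosen only to mirror the 24-versus-20 message above; attributing the mismatch to the disease type is an assumption.

```python
from typing import Dict, List, Sequence


def find_row_mismatches(
    feats: Dict[str, Sequence[Sequence[float]]],
    num_nodes: Dict[str, int],
) -> List[str]:
    """Return one message per node type whose feature row count differs from its node count."""
    problems = []
    for ntype, expected in num_nodes.items():
        rows = len(feats.get(ntype, []))
        if rows != expected:
            problems.append(f"{ntype}: got {rows} feature rows, expected {expected}")
    return problems


# Hypothetical counts: 24 feature rows for a type that has only 20 nodes.
feats = {"disease": [[0.0]] * 24, "drug": [[0.0]] * 50}
num_nodes = {"disease": 20, "drug": 50}
mismatches = find_row_mismatches(feats, num_nodes)
print(mismatches)
```

Running a check like this per node type, right before assigning features, shows which type is off instead of only the two raw numbers.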

I was able to avoid this error by changing the lines as follows:

h_src, h_dst = h
block.dstdata['h_dst'] = h_dst
# add [:block.num_src_nodes()] here to select only number of source nodes
block.srcdata['h_src'] = h_src[:block.num_src_nodes()]

Is this correct, even when the source nodes can be of different node types?

This is not correct. Does changing

input_features = {
                ntype: blocks[0].srcdata['feature'][ntype]
                for ntype in blocks[0].ntypes
            }

to

src_features = blocks[0].srcdata['feature']
dst_features = blocks[0].dstdata['feature']
input_features = (src_features, dst_features)

work?

This is weird. What is edge_subgraph, what edge types does edge_subgraph have and what is eval_edge_type?

@mufeili I already changed this line

before I applied your other definition of input_features, so it only worked when slicing h_src. When I then removed the slicing and passed the entire h_src, the error about the mismatch between the number of source nodes and features was thrown again.

How can it be that I have more source nodes in the block than I have features?

edge_subgraph is the first parameter that is passed to the forward function of the ScorePredictor, here self.predictor, and is either:

>>> positive_graph
Graph(num_nodes={'disease': 0, 'drug': 0, 'protein': 24},
     ...,)
>>> negative_graph
Graph(num_nodes={'disease': 0, 'drug': 0, 'protein': 24},
     ...,
     )

and they are passed here:

class LinkPredictHetero(nn.Module):
    ...,
    def forward(
        self,
        positive_graph: dgl.DGLHeteroGraph,
        negative_graph: dgl.DGLHeteroGraph,
        blocks: List[dgl.DGLHeteroGraph],
        h: Dict[str, th.Tensor],
        eval_edge_type: str = 'drug-disease',
    ):
        """Custom forward method with the BaseRGCN as the encoder and the score predictor as the decoder.

        :param positive_graph: sampled heterograph made out of positive edges
        :param negative_graph: sampled heterograph made out of negative edges
        :param blocks: list of mini-batched heterographs from the given big graph
        :param h: dictionary mapping node type to the feature of the src node type (blocks[0].srcdata[ntype])
        :param eval_edge_type: the edge type to be evaluated on
        :return: scores for positive and negative graph
        :rtype: dictionary mapping edge type to scores of edges for this edge type
        """       
        outputs = self.rgcn.forward(blocks, h)
    
        pos_score = self.predictor(positive_graph, outputs, eval_edge_type)
        neg_score = self.predictor(negative_graph, outputs, eval_edge_type)
        return pos_score, neg_score

Could it be that you specified the wrong eval_edge_type in edge_subgraph.apply_edges(dgl.function.u_dot_v('x', 'x', 'score'), etype=eval_edge_type)?

With this definition of the forward() function of RelGraphConv, hs gets redefined at every layer. What I am wondering is whether the dictionary that is returned should actually be redefined at every layer, or whether it should instead be updated so that it still stores the results from the previous layers. I will explain in more detail below:

What I receive after the first convolution layer is a dictionary with all the node types:

>>> hs
{'disease': tensor([[ 0.1046,  0.0523,  0.2368,  ..., -0.2478, -0.0778,  0.0166]
       ..., , grad_fn=<SumBackward1>), 
'drug': tensor([-0.3362,  0.1057,  0.1976,  ...,  0.1237,  0.2159, -0.0796],
       ...,grad_fn=<SumBackward1>),
'protein': tensor([[ 0.1074, -0.1772, -0.1734,  ..., -0.2601,  0.0980, -0.3836],
       ...,grad_fn=<SumBackward1>)}
>>> {key: v.shape for key, v in hs.items()}
{'disease': torch.Size([13, 140]), 'drug': torch.Size([34, 140]), 'protein': torch.Size([357, 140])}

After the second layer, it is this:

>>> hs
{'disease': tensor([[ 1.6336e-01, -2.4944e-01, ...,  9.7798e-03]],
       grad_fn=<SumBackward1>),
 'drug': tensor([[-0.1506,  0.1218,  0.0553,  ...,  0.1870, -0.4286,  0.0218], ...,    grad_fn=<SumBackward1>),
 'protein': tensor([[-0.2897,  0.1191, -0.1739,  ...,  0.5679,  0.5015,  0.1734], ..., grad_fn=<SumBackward1>)}
>>> {key: v.shape for key, v in hs.items()}
{'disease': torch.Size([1, 140]), 'drug': torch.Size([12, 140]), 'protein': torch.Size([214, 140])}

And after the third layer it has only protein:

>>> hs
{'protein': tensor([[-0.4364,  0.1131, -0.0278,  0.1530],
       ...,  grad_fn=<SumBackward1>)}
>>> {key: v.shape for key, v in hs.items()}
{'protein': torch.Size([23, 4])}

I am wondering whether this hs, which only has protein representations, is actually the correct one to output from the model and then use as the input x for the ScorePredictor.

Update

I changed the forward function of BaseRGCN to handle a pair of tensors as input:

    def forward(self, blocks, h):
        h_src, h_dst = h
        h_src = self.embed_layer(h_src)
        h_dst = self.embed_layer(h_dst)
        h = (h_src, h_dst)
        for idx, layer in enumerate(self.layers):
            h = layer.forward(blocks[idx], h=h)
        return h

Now in every iteration h should be a pair of tensors. But the forward of RelGraphConv, through which h is passed, returns a dictionary:

Should this return statement then be changed to return a pair of tensors? Which tensor should it return additionally, h_src or h_dst?
I think my question is also what the output of self.conv represents. Is it the updated source node features or the updated destination node features?

@mufeili What do you mean by specifying the wrong evaluation edge type? I am actually only interested in one edge type for the prediction, so I want to check for each edge of the graph whether this edge type would be predicted.

One concern I have now is that this edge type directly specifies the node types of its source and destination nodes: for the edge type 'drug-disease', the canonical form is ('drug', 'drug-disease', 'disease'). Therefore, can a graph whose edges involve only protein nodes still be evaluated?

What I am wondering is whether the dictionary that is returned should actually be redefined at every layer, or whether it should instead be updated so that it still stores the results from the previous layers. I will explain in more detail below:

What do you mean by redefining it every layer?

I am wondering whether this hs, which only has protein representations, is actually the correct one to output from the model and then use as the input x for the ScorePredictor.

What task are you working on? Is this correct in terms of the task?

Now in every iteration h should be a pair of tensors. But the forward of RelGraphConv, through which h is passed, returns a dictionary:

Should this return statement then be changed to return a pair of tensors? Which tensor should it return additionally, h_src or h_dst?
I think my question is also what the output of self.conv represents. Is it the updated source node features or the updated destination node features?

By RelGraphConv, I assume it’s actually HeteroGraphConv based on RelGraphConv, right? The output of a HeteroGraphConv layer is a dictionary mapping node types to the updated features of the corresponding destination nodes in the input block.

When using multiple HeteroGraphConv sequentially, you can directly pass the output of one HeteroGraphConv to the input of the next HeteroGraphConv. HeteroGraphConv will handle slicing internally here.
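
The chaining described above can be sketched with a toy stand-in, without any DGL calls. ToyBlock, toy_layer and run_layers are hypothetical names, and the doubling is a placeholder for a real convolution. The sketch assumes that the destination nodes of a block are the first rows of its source nodes and that the destination nodes of one block are the source nodes of the next.

```python
from dataclasses import dataclass
from typing import Dict, List

Features = Dict[str, List[List[float]]]


@dataclass
class ToyBlock:
    """Stand-in for a block: only per-type source/destination node counts."""
    num_src: Dict[str, int]
    num_dst: Dict[str, int]


def toy_layer(block: ToyBlock, h: Features) -> Features:
    """Return one (doubled) feature row per destination node of each type."""
    out = {}
    for ntype, n_dst in block.num_dst.items():
        assert len(h[ntype]) == block.num_src[ntype], f"row mismatch for {ntype}"
        # Assumed convention: destination nodes are the first rows of the source nodes.
        out[ntype] = [[2.0 * x for x in row] for row in h[ntype][:n_dst]]
    return out


def run_layers(blocks: List[ToyBlock], h: Features) -> Features:
    """Feed the output of each layer directly into the next one."""
    for block in blocks:
        h = toy_layer(block, h)
    return h


blocks = [
    ToyBlock(num_src={"drug": 4, "disease": 3}, num_dst={"drug": 2, "disease": 2}),
    ToyBlock(num_src={"drug": 2, "disease": 2}, num_dst={"drug": 1, "disease": 1}),
]
h0 = {
    "drug": [[float(i)] for i in range(1, 5)],
    "disease": [[float(i)] for i in range(1, 4)],
}
out = run_layers(blocks, h0)
print({ntype: len(rows) for ntype, rows in out.items()})
```

The number of rows per type shrinks from block to block, which matches the shrinking shapes reported earlier in the thread.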

What do you mean by specifying the wrong evaluation edge type? I am actually only interested in one edge type for the prediction, so I want to check for each edge of the graph whether this edge type would be predicted.

Regarding the error you encountered previously: you got KeyError: 'x' even though you had just assigned ndata['x'] = x. This makes me wonder whether the issue is due to a wrong edge type specified in edge_subgraph.apply_edges(dgl.function.u_dot_v('x', 'x', 'score'), etype=eval_edge_type).
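
One possibility worth ruling out, sketched in plain Python: the feature dictionary may lack an entry for one of the endpoint node types of the evaluated edge type. The helper missing_endpoint_types is hypothetical, and the example dictionary only mimics the protein-only output shown earlier in the thread. Whether this is the actual cause of the KeyError is not confirmed here.

```python
from typing import Dict, List, Sequence, Tuple


def missing_endpoint_types(
    canonical_etype: Tuple[str, str, str],
    feats: Dict[str, Sequence[Sequence[float]]],
) -> List[str]:
    """List endpoint node types of the edge type that have no entry in feats."""
    src_type, _, dst_type = canonical_etype
    needed = [src_type] if src_type == dst_type else [src_type, dst_type]
    return [ntype for ntype in needed if ntype not in feats]


# Hypothetical model output that contains only protein representations.
outputs = {"protein": [[0.1, 0.2, 0.3, 0.4]]}
missing = missing_endpoint_types(("drug", "drug-disease", "disease"), outputs)
print(missing)
```

If the list is non-empty, a dot product over that edge type has nothing to read for at least one endpoint.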

One concern I have now is that this edge type directly specifies the node types of its source and destination nodes: for the edge type 'drug-disease', the canonical form is ('drug', 'drug-disease', 'disease'). Therefore, can a graph whose edges involve only protein nodes still be evaluated?

In this case, you should train a model to update the representations of drug and disease nodes and then combine them for prediction. Updated representations for protein nodes don’t seem to be correct.

  1. By redefining hs at each layer, I meant that every time it is passed into RelGraphConv, a dictionary with different node types (as keys) is returned, as shown in the post above.

  2. I am working on link prediction on a heterogeneous knowledge graph with node types drug, disease, protein, but I am only interested in predicting links between drug and disease nodes. Therefore, I only need to test on the edge type 'drug-disease'.

  3. How exactly can I pass the output of one HeteroGraphConv to the next? And in which function should this happen?

  4. Would it be helpful for you to see the entire package and how it is connected? I’ll show a summary here, but I can also provide the package if needed.
    For the structure of the modules I followed the implementation of rgcn-hetero closely. In short, the module structure is:

class BaseRGCN(nn.Module):
    def __init__(...):
        self.layers: nn.ModuleList = nn.ModuleList()
        # append 3x RelGraphConvLayer
        self.layers.append(RelGraphConvLayer())

    def forward(self, blocks, h):      
        ...
        for idx, layer in enumerate(self.layers):  
            h = layer.forward(blocks[idx], h=h)
        return h

class RelGraphConvLayer(nn.Module):
    def __init__(self, ...):
        self.conv = HeteroGraphConv({
                rel: CustomHeteroGraphConv(...) for utype, rel, vtype in rel_names
            })

    def forward(self, g, h):
        ...
        hs = self.conv(g, inputs_src, mod_kwargs=wdict)
        ...
        return {ntype: _apply(ntype, h) for ntype, h in hs.items()}
   
class HeteroGraphConv(nn.Module):
    def forward(self, g, inputs, mod_args=None, mod_kwargs=None):
        ...
        rsts = {}
        for nty, alist in outputs.items():
            if len(alist) != 0:
                rsts[nty] = self.agg_fn(alist, nty)
        return rsts

class CustomHeteroGraphConv(nn.Module):
    def forward(self, block, h):
        ...
        return {ntype: block.dstnodes[ntype].data['h_dst'] for ntype in block.dsttypes}

When I am using the entire graph consisting of protein, drug and disease nodes, wouldn’t the connection to the protein nodes affect the representation of the drug and disease nodes? What I mean is that protein nodes have an effect on the updated representation of the drug and disease nodes, right? And they are in the sampled graphs from the EdgeDataLoader, so they will be trained on. In my held-out test set I only have drug and disease nodes, and therefore no protein representations.

By redefining hs at each layer, I meant that every time it is passed into RelGraphConv, a dictionary with different node types (as keys) is returned, as shown in the post above.

Which function did you use for block construction?

I am working on link prediction on a heterogeneous knowledge graph with node types drug, disease, protein, but I am only interested in predicting links between drug and disease nodes. Therefore, I only need to test on the edge type 'drug-disease'.

In that case, you want to update the representations of drug and disease nodes using GNN and then score pairs of drug and disease nodes.
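
A minimal sketch of the scoring step, assuming dot-product scoring (the same operation as u_dot_v in the ScorePredictor above) on already updated drug and disease representations. It uses plain lists instead of tensors; score_pairs and all values are hypothetical.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_pairs(
    drug_h: Sequence[Sequence[float]],
    disease_h: Sequence[Sequence[float]],
    pairs: Sequence[Tuple[int, int]],
) -> List[float]:
    """Score each (drug index, disease index) pair by the dot product of its embeddings."""
    return [dot(drug_h[d], disease_h[s]) for d, s in pairs]


# Hypothetical two-dimensional embeddings for two drugs and two diseases.
drug_h = [[1.0, 0.0], [0.5, 0.5]]
disease_h = [[2.0, 1.0], [0.0, 4.0]]
pairs = [(0, 0), (1, 1), (0, 1)]
scores = score_pairs(drug_h, disease_h, pairs)
print(scores)
```

Positive and negative drug-disease pairs would both be scored this way, so both endpoint types need representations from the encoder.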

When I am using the entire graph consisting of protein, drug and disease nodes, wouldn’t the connection to the protein nodes affect the representation of the drug and disease nodes? What I mean is that protein nodes have an effect on the updated representation of the drug and disease nodes, right? And they are in the sampled graphs from the EdgeDataLoader, so they will be trained on.

Yes, you are right.

In my held-out test set I only have drug and disease nodes, and therefore no protein representations.

Is there any overlap between the drug/disease nodes in the test set and the drug/disease nodes in the training set?

I used the EdgeDataLoader to construct the blocks.

And I think I found the bug. It was about passing the entire inputs to self.conv and handling the format (tuple of tensors or not) already in RelGraphConv. I added this to RelGraphConv and adapted the rest of the pipeline accordingly, and now it seems to work!

        if isinstance(h, tuple) or g.is_block:
            if isinstance(h, tuple):
                _, inputs_dst = h
            else:
                inputs_dst = {k: v[:g.number_of_dst_nodes(k)] for k, v in h.items()}

        hs = self.conv(g, h, mod_kwargs=wdict)
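
The format handling in this snippet can be isolated into a small helper. This is a sketch in plain Python with lists in place of tensors: as_src_dst_pair is a hypothetical name, and num_dst stands in for the per-type values of g.number_of_dst_nodes(k). It assumes the destination nodes are the first rows of the source features.

```python
from typing import Dict, List, Tuple, Union

Features = Dict[str, List[List[float]]]


def as_src_dst_pair(
    h: Union[Features, Tuple[Features, Features]],
    num_dst: Dict[str, int],
) -> Tuple[Features, Features]:
    """Return (source features, destination features) for either input format."""
    if isinstance(h, tuple):
        # Already a (source, destination) pair: pass it through unchanged.
        return h
    # Single dictionary: slice the destination rows off the front of each type.
    dst = {ntype: rows[: num_dst.get(ntype, 0)] for ntype, rows in h.items()}
    return h, dst


h = {"drug": [[1.0], [2.0], [3.0]], "protein": [[4.0], [5.0]]}
src, dst = as_src_dst_pair(h, {"drug": 2, "protein": 1})
print(dst)
```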

No, there is no overlap between the training set and the held-out test set.

Are test drug/disease nodes connected to the training drug/disease nodes via some relations like drug-treats-disease or drug-interacts-drug? If not, then incorporating relations involving proteins might not be helpful at all.
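
A rough way to answer this question on an edge list, sketched in plain Python. Nodes are (node type, id) pairs and edges are (source, relation, destination) triples. The function name linked_test_nodes, the relation names and all example data are hypothetical.

```python
from typing import Iterable, Optional, Set, Tuple

Node = Tuple[str, int]
Edge = Tuple[Node, str, Node]


def linked_test_nodes(
    edges: Iterable[Edge],
    train_nodes: Set[Node],
    test_nodes: Set[Node],
    relations: Optional[Set[str]] = None,
) -> Set[Node]:
    """Return the test nodes that share at least one edge with a training node.

    If relations is given, only edges of those relation names are considered.
    """
    linked = set()
    for u, rel, v in edges:
        if relations is not None and rel not in relations:
            continue
        if u in test_nodes and v in train_nodes:
            linked.add(u)
        if v in test_nodes and u in train_nodes:
            linked.add(v)
    return linked


edges = [
    (("drug", 0), "drug-drug", ("drug", 7)),
    (("drug", 7), "drug-disease", ("disease", 3)),
    (("disease", 9), "disease-protein", ("protein", 1)),
]
train_nodes = {("drug", 0), ("disease", 3), ("protein", 1)}
test_nodes = {("drug", 7), ("drug", 8), ("disease", 9)}

any_relation = linked_test_nodes(edges, train_nodes, test_nodes)
no_protein = linked_test_nodes(
    edges, train_nodes, test_nodes, relations={"drug-drug", "drug-disease"}
)
print(any_relation, no_protein)
```

Test nodes that never show up in the result have no direct neighbor in the training set under the chosen relations.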