Questions about DGL HeteroGraphs that you never dared to ask

Hi,
I was wondering what the recommended practice is for storing a large number (~600K) of HeteroGraphs on disk.
It seems that dgl.data.utils.save_graphs does not support HeteroGraphs right now.

My current approach is to store the hetero edges as a dict and the corresponding node features as np.arrays. However, data loading becomes incredibly slow, since I need to reconstruct every HeteroGraph from its edge dict. Moreover, I cannot pre-load the HeteroGraphs into main memory because the whole dataset is too large.
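Roughly, the reconstruction step looks like this simplified sketch (load_one_graph, edges_dict and feats_dict are just illustrative names):

import dgl
import torch

def load_one_graph(edges_dict, feats_dict):
    # edges_dict: {('user', 'buys', 'item'): (src_ids, dst_ids), ...}
    g = dgl.heterograph({
        etype: (torch.as_tensor(src), torch.as_tensor(dst))
        for etype, (src, dst) in edges_dict.items()
    })
    # attach the stored np.array features per node type
    for ntype, feat in feats_dict.items():
        g.nodes[ntype].data['feat'] = torch.as_tensor(feat)
    return g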

Thanks!

Sorry for the delay in answering. Regarding multi_node_readout, you wrote that we apply a readout function to each type of node and then combine the resulting readouts using the cross reducer parameter. This answers my question perfectly. However, by generating a representation of the whole graph this way, won't we lose a lot of information? (Is it the best way to get a graph-level representation?)

Another question about heterograph graph-level representation learning:
Given a heterograph with N types of nodes and M types of edges, we can pass it through an RGCN-based architecture and obtain a node-level representation for all types of nodes (a dictionary of hidden representations).
To obtain the graph-level representation: can we just concatenate all the node representations? Do you think that applying another readout function (sum, average) will lead to a poor-quality representation?

You mean graph classification? You can try concatenating all the node representations, just like in homogeneous graph classification.

For the first question, the easiest way is to split the raw input data into triplets according to edge type. We can then keep the split files.
If you follow generate_sampled_graph_and_labels(), adding reverse edges ensures that the nodes in the sampled graph have connected edges.
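For illustration, adding reverse edges to a (src, rel, dst) triplet array, as done in the RGCN example, looks roughly like the sketch below (num_rels is the number of original relation types):

import numpy as np

def add_reverse_edges(triplets, num_rels):
    # triplets: (N, 3) array of (src, rel, dst)
    src, rel, dst = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    # give reverse edges a shifted relation id so both directions stay distinguishable
    reverse = np.stack([dst, rel + num_rels, src], axis=1)
    return np.concatenate([triplets, reverse], axis=0)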

Can you try the approach suggested in this issue?

I think graph-level representation learning for heterogeneous graphs is generally an under-explored area, and I am not sure what the best approach to this problem is. Meanwhile, if we take a look at graph-level representations for homogeneous graphs, in most cases they are computed by some function of the latest node representations.
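For example, for homogeneous graphs you can use DGL's built-in readout functions on the latest node representations:

import dgl
import torch

g = dgl.graph(([0, 1, 2], [1, 2, 0]))
g.ndata['h'] = torch.randn(3, 16)   # node representations from some GNN
hg = dgl.mean_nodes(g, 'h')         # graph-level representation of shape (1, 16)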

In general, concatenating node representations for a graph-level representation is not a good idea because the result depends on the order of the nodes, whereas in most cases we consider two graphs to be the same if we can get one by reordering the nodes of the other.
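As a sketch, an order-invariant alternative for heterographs is a per-type readout followed by a cross-type reduction (this assumes all node types share the same hidden size; hetero_readout is just an illustrative name):

import torch

def hetero_readout(h, cross_reducer='mean'):
    # h: dict of per-type node representations, e.g. {'user': (N_u, D), 'item': (N_i, D)}
    per_type = torch.stack([feat.sum(dim=0) for feat in h.values()])
    # combine the per-type readouts (the 'cross reducer' mentioned above)
    return per_type.mean(dim=0) if cross_reducer == 'mean' else per_type.sum(dim=0)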

Thanks for the suggestion. Pickle does allow storing heterographs on disk. However, my dataset is quite large (600k graphs; 100 nodes of 2 types per graph; 100+ types of edges per graph), so using pickle does not seem practical in terms of storage efficiency and data loading.

Yes, we are working on a more efficient way of saving heterographs; you may use pickle as a temporary workaround.
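For reference, the workaround is simply the following (graph_list being your list of heterographs):

import pickle

# save (works, but not optimized for storage size or loading speed)
with open('graphs.pkl', 'wb') as f:
    pickle.dump(graph_list, f)

# load
with open('graphs.pkl', 'rb') as f:
    graph_list = pickle.load(f)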

Thanks! @classicsong

Hi! Newbie in the DGL / GCN world here, thanks for this discussion.

I have a heterograph with two node types: user and item. I have feature vectors of different dimensionality for the two node types (5 features for user and 3 for item). I am trying to generate node embeddings: my goal is to have a 100-dimensional embedding for each node (100 is an arbitrary choice). I seem to hit a problem when passing my nodes through any layer.

# building the layer
import dgl.nn as dglnn
import torch
import torch.nn.functional as F

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.conv.SAGEConv(3, 100, 'gcn',
                                feat_drop=0.5, activation=F.relu),
    'bought-by': dglnn.conv.SAGEConv(5, 100, 'gcn',
                                     feat_drop=0.5, activation=F.relu)},
    aggregate='sum')

# assigning features
item_feat = torch.tensor(item_feat)   # item features loaded earlier
user_feat = torch.ones(g.number_of_nodes('user'), 5)
features = {'item': item_feat, 'user': user_feat}

h = features
out = layer(g, h)

This gives me the error
DGLError: Expect number of features to match number of nodes (len(u)). Got 1892 and 2492 instead.

I have 1892 nodes of type user and 2492 nodes of type item. My item_feat and user_feat tensors are of shape (number_of_nodes, number_of_features), e.g. (2492, 3).

Thanks a lot in advance!

Try replacing

out = layer(g, h)

with

out = layer(g, (h, h))

When passing layer(g, h), HeteroGraphConv assumes that the conv modules only require the source node features, while SAGEConv requires the features of both the source and the destination nodes due to the skip connection.

Thanks, it worked! However, I would like my node types to have different feature sizes. When I use 3 features for item and 5 features for user, I get the error
DGLError: The feature shape of source nodes: (5,) should be equal to the feature shape of destination nodes: (3,).
How can I have different feature shapes per node type? Thanks again!

As suggested here, if you use gcn as the aggregator_type, SAGEConv expects the feature shapes of the source and destination nodes to be the same. You can choose a different aggregator type instead.
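For instance, with the mean aggregator SAGEConv accepts a (source, destination) pair of input sizes, so something like the sketch below should work, assuming 'buys' goes from user (5 features) to item (3 features) and 'bought-by' goes the other way:

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.conv.SAGEConv((5, 3), 100, 'mean',
                                feat_drop=0.5, activation=F.relu),
    'bought-by': dglnn.conv.SAGEConv((3, 5), 100, 'mean',
                                     feat_drop=0.5, activation=F.relu)},
    aggregate='sum')

out = layer(g, (features, features))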

Thanks for pointing out the documentation, I had not noticed this detail. It now works, but I am having trouble with custom layers: I am trying to build a HeteroGraphConv with custom layers, and again my dimensions do not match.

Here is my custom Test_layer:

import torch
import torch.nn as nn
import dgl
import dgl.function as fn


class Test_layer(nn.Module):  # replaces the SAGEConv layer
    def __init__(self,
                 in_feats,
                 out_feats):
        super(Test_layer, self).__init__()

        self._in_src_feats, self._in_dst_feats = dgl.utils.expand_as_pair(in_feats)
        self._out_feats = out_feats

        # NN through which the initial 'h' state passes
        self.fc_self = nn.Linear(self._in_dst_feats, out_feats)

        # NN through which the neighborhood messages pass
        self.fc_neigh = nn.Linear(self._in_src_feats, out_feats)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_uniform_(self.fc_self.weight, gain=gain)
        nn.init.xavier_uniform_(self.fc_neigh.weight, gain=gain)

    def forward(self, graph, feat):
        h_self = feat

        # get input features
        graph.srcdata['h'] = h_self

        # update node values with a mean over incoming messages
        graph.update_all(
            fn.copy_src('h', 'm'),
            fn.mean('m', 'neigh'))

        h_neigh = graph.dstdata['neigh']

        print(h_self.shape)
        print(h_neigh.shape)

        # result is the output of the NN for h_self plus the output of the NN for h_neigh
        rst = self.fc_self(h_self) + self.fc_neigh(h_neigh)

        self.reset_parameters()

        return rst

Then, I call the HeteroGraphConv with some user and item features.

layer = dglnn.HeteroGraphConv({'buys':Test_layer(5,100),
                               'bought-by':Test_layer(3,100)},
                               aggregate='sum')

item_feat = torch.randn((g.number_of_nodes('item'), 3))
user_feat = torch.randn((g.number_of_nodes('user'), 5))
features = {'item' : item_feat, 'user' : user_feat }

out = layer(g, features)

I get the following error:
RuntimeError: The size of tensor a (1892) must match the size of tensor b (2492) at non-singleton dimension 0
which makes sense, since my h_self has shape torch.Size([1892, 5]) and my h_neigh has shape torch.Size([2492, 5]).

How can I arrange this so that my h_neigh also ends up with the right shape, i.e. 1892 rows? I thought that my reduce function fn.mean would do this, but it appears I am not using it properly.
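(I wonder whether my forward needs to treat feat as a (source, destination) pair, something like the guess below, but I am not sure:)

def forward(self, graph, feat):
    # guess: under HeteroGraphConv, feat may be a (src_feat, dst_feat) pair
    feat_src, feat_dst = dgl.utils.expand_as_pair(feat, graph)
    graph.srcdata['h'] = feat_src
    graph.update_all(fn.copy_src('h', 'm'), fn.mean('m', 'neigh'))
    h_neigh = graph.dstdata['neigh']   # shaped by the destination nodes
    return self.fc_self(feat_dst) + self.fc_neigh(h_neigh)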

Again, thanks for your answers; newbie here trying to better understand heterographs in DGL!

Hi @jedbl ,

Could you please open a new post for your question? It's a bit hard to track it in this long thread. Thanks!

Hi @VoVAllen,
Thanks for pointing this out. I created a new thread here.

Thanks!

Hi,

This thread is closed now. If you have any further questions, please feel free to open a new post.