Questions about DGL HeteroGraphs that you never dared to ask

Hi,
I was wondering what the recommended practice is for storing a large number (~600K) of HeteroGraphs on disk.
It seems that dgl.data.utils.save_graphs does not support HeteroGraphs right now.

My current approach is to store the hetero edges as a dict and the corresponding node features as np.arrays. However, data loading becomes incredibly slow, since I need to reconstruct every HeteroGraph from its edge dict. Moreover, I cannot pre-load the HeteroGraphs into main memory because the whole dataset is too large.
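Roughly, the reconstruction step looks like this simplified sketch (load_one_graph, edges_dict and feats_dict are just illustrative names):

import dgl
import torch

def load_one_graph(edges_dict, feats_dict):
    # edges_dict: {('user', 'buys', 'item'): (src_ids, dst_ids), ...}
    g = dgl.heterograph({
        etype: (torch.as_tensor(src), torch.as_tensor(dst))
        for etype, (src, dst) in edges_dict.items()
    })
    # attach the stored np.array features per node type
    for ntype, feat in feats_dict.items():
        g.nodes[ntype].data['feat'] = torch.as_tensor(feat)
    return g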

Thanks!

Sorry for the delay in answering. Regarding multi_node_readout, you wrote that we apply a readout function to each type of node and then combine the resulting readouts using the cross reducer parameter. This answers my question perfectly. However, by generating a representation of the whole graph this way, won't we lose a lot of information? (Is it the best way to get a graph-level representation?)

Another question about heterograph graph-level representation learning:
Given a heterograph with N types of nodes and M types of edges, we can pass it through an RGCN-based architecture and obtain a node-level representation for all types of nodes (a dictionary of hidden representations).
To obtain the graph-level representation: can we just concatenate all the node representations? Do you think that applying another readout function (sum, average) will lead to a poor-quality representation?

You mean graph classification? You can try concatenating all the node representations, just like in homogeneous graph classification.

For the first question, the easiest way is to split the raw input data into triplets according to edge type. We can then keep the split files.
If you follow generate_sampled_graph_and_labels(), adding reverse edges ensures that the nodes in the sampled graph have connected edges.
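For illustration, adding reverse edges to a (src, rel, dst) triplet array, as done in the RGCN example, looks roughly like the sketch below (num_rels is the number of original relation types):

import numpy as np

def add_reverse_edges(triplets, num_rels):
    # triplets: (N, 3) array of (src, rel, dst)
    src, rel, dst = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    # give reverse edges a shifted relation id so both directions stay distinguishable
    reverse = np.stack([dst, rel + num_rels, src], axis=1)
    return np.concatenate([triplets, reverse], axis=0)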

Can you try the approach suggested in this issue?

I think graph-level representation learning for heterogeneous graphs is generally an under-explored area, and I am not sure what the best approach to this problem is. Meanwhile, if we take a look at graph-level representations for homogeneous graphs, in most cases they are computed by some function of the latest node representations.
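For example, for homogeneous graphs you can use DGL's built-in readout functions on the latest node representations:

import dgl
import torch

g = dgl.graph(([0, 1, 2], [1, 2, 0]))
g.ndata['h'] = torch.randn(3, 16)   # node representations from some GNN
hg = dgl.mean_nodes(g, 'h')         # graph-level representation of shape (1, 16)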

In general, concatenating node representations for a graph-level representation is not a good idea because the result depends on the order of the nodes, whereas in most cases we consider two graphs to be the same if we can get one by reordering the nodes of the other.
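As a sketch, an order-invariant alternative for heterographs is a per-type readout followed by a cross-type reduction (this assumes all node types share the same hidden size; hetero_readout is just an illustrative name):

import torch

def hetero_readout(h, cross_reducer='mean'):
    # h: dict of per-type node representations, e.g. {'user': (N_u, D), 'item': (N_i, D)}
    per_type = torch.stack([feat.sum(dim=0) for feat in h.values()])
    # combine the per-type readouts (the 'cross reducer' mentioned above)
    return per_type.mean(dim=0) if cross_reducer == 'mean' else per_type.sum(dim=0)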

Thanks for the suggestion. Pickle does allow storing heterographs on disk. However, my dataset is quite large (600k graphs; 100 nodes of 2 types per graph; 100+ types of edges per graph), so using pickle does not seem practical in terms of storage efficiency and data loading.

Yes, we are working on a more efficient way of saving heterographs; you may use pickle as a temporary workaround.
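For reference, the workaround is simply the following (graph_list being your list of heterographs):

import pickle

# save (works, but not optimized for storage size or loading speed)
with open('graphs.pkl', 'wb') as f:
    pickle.dump(graph_list, f)

# load
with open('graphs.pkl', 'rb') as f:
    graph_list = pickle.load(f)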

Thanks! @classicsong

Hi! Newbie in the DGL / GCN world here, thanks for this discussion.

I have a heterograph with two node types: user and item. I have feature vectors of different dimensionality for the two node types (5 features for user and 3 for item). I am trying to generate node embeddings: my goal is to have a 100-dimensional embedding for each node (100 is an arbitrary choice). I seem to hit a problem when passing my nodes through any layer.

# building the layer
import dgl.nn as dglnn
import torch
import torch.nn.functional as F

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.conv.SAGEConv(3, 100, 'gcn',
                                feat_drop=0.5, activation=F.relu),
    'bought-by': dglnn.conv.SAGEConv(5, 100, 'gcn',
                                     feat_drop=0.5, activation=F.relu)},
    aggregate='sum')

# assigning features
item_feat = torch.tensor(item_feat)   # item features loaded earlier
user_feat = torch.ones(g.number_of_nodes('user'), 5)
features = {'item': item_feat, 'user': user_feat}

h = features
out = layer(g, h)

This gives me the error
DGLError: Expect number of features to match number of nodes (len(u)). Got 1892 and 2492 instead.

I have 1892 nodes of type user and 2492 nodes of type item. My item_feat and user_feat tensors are of shape (number_of_nodes, number_of_features), e.g. (2492, 3).

Thanks a lot in advance!

Try replacing

out = layer(g, h)

with

out = layer(g, (h, h))

When passing layer(g, h), HeteroGraphConv assumes that the conv modules only require the source node features, while SAGEConv requires the features of both the source and the destination nodes due to the skip connection.

Thanks, it worked! However, I would like my node types to have different feature sizes. When I use 3 features for item and 5 features for user, I get the error
DGLError: The feature shape of source nodes: (5,) should be equal to the feature shape of destination nodes: (3,).
How can I have different feature shapes per node type? Thanks again!

As suggested here, if you use gcn as the aggregator_type, SAGEConv expects the feature shapes of the source and destination nodes to be the same. You can choose a different aggregator type instead.
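For instance, with the mean aggregator SAGEConv accepts a (source, destination) pair of input sizes, so something like the sketch below should work, assuming 'buys' goes from user (5 features) to item (3 features) and 'bought-by' goes the other way:

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.conv.SAGEConv((5, 3), 100, 'mean',
                                feat_drop=0.5, activation=F.relu),
    'bought-by': dglnn.conv.SAGEConv((3, 5), 100, 'mean',
                                     feat_drop=0.5, activation=F.relu)},
    aggregate='sum')

out = layer(g, (features, features))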

Thanks for pointing out the documentation, I had not noticed this detail. It now works, but I am having trouble with custom layers: I am trying to build a HeteroGraphConv with custom layers, and again my dimensions do not match.

Here is my custom Test_layer:

import torch
import torch.nn as nn
import dgl
import dgl.function as fn


class Test_layer(nn.Module):  # replaces the SAGEConv layer
    def __init__(self,
                 in_feats,
                 out_feats):
        super(Test_layer, self).__init__()

        self._in_src_feats, self._in_dst_feats = dgl.utils.expand_as_pair(in_feats)
        self._out_feats = out_feats

        # NN through which the initial 'h' state passes
        self.fc_self = nn.Linear(self._in_dst_feats, out_feats)

        # NN through which the neighborhood messages pass
        self.fc_neigh = nn.Linear(self._in_src_feats, out_feats)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_uniform_(self.fc_self.weight, gain=gain)
        nn.init.xavier_uniform_(self.fc_neigh.weight, gain=gain)

    def forward(self, graph, feat):
        h_self = feat

        # get input features
        graph.srcdata['h'] = h_self

        # update node values with a mean over incoming messages
        graph.update_all(
            fn.copy_src('h', 'm'),
            fn.mean('m', 'neigh'))

        h_neigh = graph.dstdata['neigh']

        print(h_self.shape)
        print(h_neigh.shape)

        # result is the output of the NN for h_self plus the output of the NN for h_neigh
        rst = self.fc_self(h_self) + self.fc_neigh(h_neigh)

        self.reset_parameters()

        return rst

Then, I call the HeteroGraphConv with some user and item features.

layer = dglnn.HeteroGraphConv({'buys':Test_layer(5,100),
                               'bought-by':Test_layer(3,100)},
                               aggregate='sum')

item_feat = torch.randn((g.number_of_nodes('item'), 3))
user_feat = torch.randn((g.number_of_nodes('user'), 5))
features = {'item' : item_feat, 'user' : user_feat }

out = layer(g, features)

I get the following error:
RuntimeError: The size of tensor a (1892) must match the size of tensor b (2492) at non-singleton dimension 0
which makes sense, since my h_self has shape torch.Size([1892, 5]) and my h_neigh has shape torch.Size([2492, 5]).

How can I arrange this so that my h_neigh also ends up with the right shape, i.e. 1892 rows? I thought that my reduce function fn.mean would do this, but it appears I am not using it properly.
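(I wonder whether my forward needs to treat feat as a (source, destination) pair, something like the guess below, but I am not sure:)

def forward(self, graph, feat):
    # guess: under HeteroGraphConv, feat may be a (src_feat, dst_feat) pair
    feat_src, feat_dst = dgl.utils.expand_as_pair(feat, graph)
    graph.srcdata['h'] = feat_src
    graph.update_all(fn.copy_src('h', 'm'), fn.mean('m', 'neigh'))
    h_neigh = graph.dstdata['neigh']   # shaped by the destination nodes
    return self.fc_self(feat_dst) + self.fc_neigh(h_neigh)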

Again, thanks for your answers; newbie here trying to better understand heterographs in DGL!

Hi @jedbl ,

Could you please open a new post for your question? It's a bit hard to track it in this long thread. Thanks!

Hi @VoVAllen,
Thanks for pointing this out. I created a new thread here.

Thanks!

Hi,

This thread is closed now. If you have any further questions, please feel free to open a new post.