Questions about DGL HeteroGraphs that you never dared to ask

Hi,
I have a question regarding graph representation learning:
Is it possible to learn a graph-level representation for a heterograph?
Thanks.

If you have multiple heterogeneous graphs and you want to perform graph property prediction, then I think graph-level representations for heterogeneous graphs are natural.

Thanks for your reply.
In the case of a heterograph with multiple node types and multiple edge types, we can obtain node-level representations (using, for example, an RGCN model). Based on these representations, can we generate a single vector representation for the whole graph (using readout functions, the same way we do with homogeneous graphs)?

In general, we could develop something like the following, modeled on the pattern of our DGLHeteroGraph.multi_update_all API:

def multi_node_readout(g, nfeats_dict, readout_dict, cross_reducer=None):
    """Readout for heterogeneous graphs based on node features.

    Parameters
    ----------
    g : DGLHeteroGraph
    nfeats_dict : dict
        Mapping node type to the corresponding node features
    readout_dict : dict
        Mapping node type to the corresponding readout function for 
        node features
    cross_reducer : str or None
        The way to combine readout from different node types, which can be 
        sum, min, max, mean, stack. If None, simply return them in a list.
    """

We can also have a counterpart of the one above for edge features dgl.multi_edge_readout(). What do you think?
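For reference, here is a minimal sketch of what the node-feature version could look like. It relies on DGL's existing dgl.readout_nodes and supports only a few reducers; multi_node_readout itself is the proposal above, not a function DGL currently ships:

import torch
import dgl

def multi_node_readout(g, nfeats_dict, readout_dict, cross_reducer=None):
    # Per-type readout: readout_dict maps each node type to 'sum',
    # 'mean', 'max' or 'min', which we forward to dgl.readout_nodes.
    readouts = []
    for ntype, feats in nfeats_dict.items():
        g.nodes[ntype].data['_h'] = feats
        readouts.append(dgl.readout_nodes(g, '_h', op=readout_dict[ntype], ntype=ntype))
    if cross_reducer is None:
        return readouts
    # Combining across node types assumes all per-type readouts
    # have the same dimensionality.
    stacked = torch.stack(readouts, dim=0)  # (num_ntypes, batch_size, dim)
    if cross_reducer == 'stack':
        return stacked
    if cross_reducer == 'sum':
        return stacked.sum(0)
    if cross_reducer == 'mean':
        return stacked.mean(0)
    if cross_reducer == 'max':
        return stacked.max(0).values
    if cross_reducer == 'min':
        return stacked.min(0).values
    raise ValueError('Unknown cross_reducer: {}'.format(cross_reducer))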

Thank you @mufeili for the reply. This is exactly what I was looking for! Sometimes the answer can be that simple and logical.

Is there a function to generate a train/validation/test split for a heterograph?

I am trying to do link prediction on a heterograph with 3 different node types and 6 different edge types.
What I am considering here is that for a heterograph, it would be necessary to ensure that

  1. for every edge type, the same proportion of edges appears in the train and test sets, so that there is no bias towards any edge type
  2. all nodes remain connected after sampling, i.e. there are no nodes left without edges

I was thinking of some function similar to the generate_sampled_graph_and_labels() function for the RGCN.

Is there any function in the DGL library that could produce train and test heterographs fulfilling conditions 1. and 2. above?

Thank you a lot in advance!

Hi,
I was wondering what the recommended practice is for storing a large collection (~600K) of HeteroGraphs to disk.
It seems that dgl.data.utils.save_graphs does not support HeteroGraph right now.

My current approach is to store the heterogeneous edges as dicts and the corresponding node features as np.arrays. However, data loading becomes incredibly slow since I need to reconstruct every HeteroGraph from its edge dict. Moreover, I cannot pre-load the HeteroGraphs into main memory because the whole dataset is too large.

Thanks!

Sorry for the delayed reply. Regarding multi_node_readout: you wrote that we apply a readout function for each node type and then combine the resulting readouts using the cross_reducer parameter. This answers my question perfectly. However, by generating a representation of the whole graph this way, won't we lose a lot of information? (Is this the best way to get a graph-level representation?)

Another question about heterograph graph-level representation learning:
Given a heterograph with N node types and M edge types, we can pass it through an RGCN-based architecture and then get node-level representations for all node types (a dictionary of hidden representations).
To obtain the graph-level representation, can we just concatenate all the node representations? Do you think that applying other readout functions (sum, average) will lead to a poor-quality representation?

Do you mean graph classification? You can try concatenating all the node representations, just like in homogeneous graph classification.

For the first question, the easiest way is to split the raw input data by splitting the triplets according to edge type, and we can keep the split files (see the sketch below).
If you follow generate_sampled_graph_and_labels(), adding reverse edges ensures that the nodes in the sampled graph have connected edges.
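A minimal sketch of such a per-edge-type split (assuming a recent DGL version where dgl.edge_subgraph accepts a dict mapping canonical edge types to edge IDs; the helper name split_edges_per_type is illustrative):

import torch
import dgl

def split_edges_per_type(g, train_frac=0.8):
    # Shuffle and split the edge IDs of every canonical edge type
    # independently, so each type keeps the same train/test proportion.
    train_eids, test_eids = {}, {}
    for etype in g.canonical_etypes:
        eids = torch.randperm(g.num_edges(etype))
        cut = int(train_frac * len(eids))
        train_eids[etype] = eids[:cut]
        test_eids[etype] = eids[cut:]
    return train_eids, test_eids

# relabel_nodes=False keeps all original nodes, so no node disappears from
# the training graph even if all of its edges land in the test set.
# train_eids, test_eids = split_edges_per_type(g)
# train_g = dgl.edge_subgraph(g, train_eids, relabel_nodes=False)

Note that this addresses condition 1 but does not by itself guarantee that every node keeps at least one training edge; adding reverse edges as above helps with that.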

Can you try the approach suggested in this issue?

I think graph-level representation learning for heterogeneous graphs is generally an under-explored area and I am not sure what the best approach to this problem would be. Meanwhile, if we take a look at graph-level representations for homogeneous graphs, in most cases they are computed by some function of the final node representations.

In general, concatenating node representations for a graph-level representation is not a good idea because the result depends on the order of the nodes. In most cases we consider two graphs to be the same if we can obtain one by reordering the nodes of the other.
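A quick sanity check of this point in PyTorch: a sum readout is invariant to the node ordering, while concatenation is not:

import torch

h = torch.randn(4, 8)     # 4 node representations of dimension 8
perm = torch.randperm(4)  # a reordering of the nodes

# Sum readout: the same under any node ordering.
assert torch.allclose(h.sum(0), h[perm].sum(0), atol=1e-5)

# Concatenation: the flattened vector changes with the ordering.
print(torch.equal(h.reshape(-1), h[perm].reshape(-1)))  # almost surely False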

Thanks for the suggestion. Pickle does allow storing heterographs to disk. However, my dataset is quite large (600k graphs, 100 nodes of 2 types per graph, 100+ edge types per graph). Pickle does not seem practical in terms of storage efficiency and data-loading speed.

Yes, we are working on a more efficient way of saving heterographs; you may use pickle as a temporary workaround.
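For example (a minimal sketch; the toy graph and feature sizes are illustrative):

import pickle
import torch
import dgl

# A toy heterograph with node features.
g = dgl.heterograph({
    ('user', 'buys', 'item'): (torch.tensor([0, 1]), torch.tensor([0, 0]))})
g.nodes['user'].data['h'] = torch.randn(2, 5)

# Node/edge features are pickled together with the graph structure.
with open('graphs.pkl', 'wb') as f:
    pickle.dump(g, f)
with open('graphs.pkl', 'rb') as f:
    g2 = pickle.load(f)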

Thanks! @classicsong

Hi! Newbie in the DGL / GCN world here, thanks for this discussion.

I have a heterograph with two node types: user and item. I have feature vectors of different dimensionality for the two node types (5 features for user and 3 for item). I am trying to generate node embeddings: my goal is to have a 100-dimensional embedding for each node (100 is an arbitrary choice). I run into a problem when passing my node features through any layer.

# Building the layer
import torch
import torch.nn.functional as F
import dgl.nn as dglnn

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.conv.SAGEConv(3, 100, 'gcn',
                                feat_drop=0.5, activation=F.relu),
    'bought-by': dglnn.conv.SAGEConv(5, 100, 'gcn',
                                     feat_drop=0.5, activation=F.relu)},
    aggregate='sum')

# Assigning features
item_feat = torch.tensor(item_feat)  # precomputed (2492, 3) item features
user_feat = torch.ones(g.number_of_nodes('user'), 5)
features = {'item': item_feat, 'user': user_feat}

h = features
out = layer(g, h)

This gives me the error
DGLError: Expect number of features to match number of nodes (len(u)). Got 1892 and 2492 instead.

I have 1892 nodes of type user and 2492 nodes of type item. My item_feat and user_feat tensors are of shape (number_of_nodes, number_of_features), e.g. (2492, 3).

Thanks a lot in advance!

Try replacing

out = layer(g, h)

with

out = layer(g, (h, h))

By passing layer(g, h), HeteroGraphConv assumes that the conv modules only require source node features while SAGEConv requires the features of both the source nodes and the destination nodes due to the skip connection.
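For reference, a minimal runnable sketch of the fix on a toy graph (the graph, relation names, and feature sizes are illustrative; both node types use 5 input features here so that the 'gcn' aggregator accepts them):

import torch
import torch.nn.functional as F
import dgl
import dgl.nn as dglnn

# Toy bipartite graph: 3 users, 2 items.
g = dgl.heterograph({
    ('user', 'buys', 'item'): (torch.tensor([0, 1, 2]), torch.tensor([0, 1, 1])),
    ('item', 'bought-by', 'user'): (torch.tensor([0, 1, 1]), torch.tensor([0, 1, 2]))})
h = {'user': torch.randn(3, 5), 'item': torch.randn(2, 5)}

layer = dglnn.HeteroGraphConv({
    'buys': dglnn.SAGEConv(5, 100, 'gcn', activation=F.relu),
    'bought-by': dglnn.SAGEConv(5, 100, 'gcn', activation=F.relu)},
    aggregate='sum')

# Passing the dict twice gives each SAGEConv both the source and
# the destination node features.
out = layer(g, (h, h))
print({k: v.shape for k, v in out.items()})  # user: (3, 100), item: (2, 100)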

Thanks, it worked! However, I would like my node types to have different feature sizes. When I use 3 features for item and 5 features for user, I get the error
DGLError: The feature shape of source nodes: (5,) should be equal to the feature shape of destination nodes: (3,).
How can I have node types with different feature shapes? Thanks again!