Questions about DGL HeteroGraphs that you never dared to ask

(Also for @qillbel)

The tensors assigned to dgl.NTYPE and dgl.ETYPE should be int64 vectors, that is, they must be one-dimensional. Moreover, the values should be indices into the lists of node type names and edge type names passed to dgl.to_hetero.

So let’s say that you have node types and edge types as follows:

import dgl
import networkx
import torch

g_nx_pet = networkx.Graph([(1, 2), (1, 3)])
g_dgl_pet = dgl.graph(g_nx_pet)
ntypes = ['tit', 'hea']
etypes = ['connects']

Then this doesn’t work, since the tensors are two-dimensional floats:

g_dgl_pet.ndata[dgl.NTYPE] = torch.tensor([[1.], [0.], [1.]])
g_dgl_pet.edata[dgl.ETYPE] = torch.tensor([[1.],[1.],[1.],[1.]])
hg_dgl_pet = dgl.to_hetero(g_dgl_pet, ntypes, etypes)

This also doesn’t work, since there is no second edge type in your edge type list (node and edge type IDs are labeled from 0):

g_dgl_pet.ndata[dgl.NTYPE] = torch.LongTensor([1, 0, 1])
g_dgl_pet.edata[dgl.ETYPE] = torch.LongTensor([1, 1, 1, 1])
hg_dgl_pet = dgl.to_hetero(g_dgl_pet, ntypes, etypes)

This will work:

g_dgl_pet.ndata[dgl.NTYPE] = torch.LongTensor([1, 0, 1])
g_dgl_pet.edata[dgl.ETYPE] = torch.LongTensor([0, 0, 0, 0])
hg_dgl_pet = dgl.to_hetero(g_dgl_pet, ntypes, etypes)
hg_dgl_pet.metagraph.edges()
# OutMultiEdgeDataView([('hea', 'tit'), ('hea', 'hea'), ('tit', 'hea')])
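As a quick sanity check, the type vector above marks nodes 0 and 2 as 'hea' and node 1 as 'tit', which you can confirm by counting nodes per type:

hg_dgl_pet.number_of_nodes('hea')
# 2
hg_dgl_pet.number_of_nodes('tit')
# 1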

Massive thanks @BarclayII, finally got the logic behind this and now it works perfectly!

Back with more questions. :slight_smile:

  1. Is it correct that dgl.nn modules cannot be used with DGLHeteroGraphs?

More generally, I would be very interested in learning how graph neural networks handle typed nodes. If someone can point me to a useful, relatively simple explanation, I would be very thankful.

  2. Is it possible to get a readout for a DGLHeteroGraph by averaging node features, if the node features are of different dimensionality?

Hi,
for the first question (well, its general version), RGCN or HAN (Heterogeneous Graph Attention Network) are good answers.
For the second question I have the same issue; it is not clear to me how we can apply readout on a heterograph or a batched heterograph.


To add to the answer from @ar795, currently DGL GNN modules support unidirectional bipartite graphs as well. Simply supply a unidirectional bipartite graph together with a pair of feature tensors for the source/destination node types and you should be good:

module = dgl.nn.SAGEConv(...)
g = dgl.bipartite(..., 'user', 'clicks', 'item')
result = module(g, (user_features, item_features))
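For instance, a fully spelled-out version might look like this (a sketch with a made-up edge list and feature sizes, assuming a DGL version where SAGEConv accepts a bipartite graph with a pair of feature tensors as described above):

import dgl
import torch

# Toy bipartite graph: 3 users, 2 items (the edges are arbitrary example data)
g = dgl.bipartite([(0, 0), (1, 0), (2, 1)], 'user', 'clicks', 'item')
user_features = torch.randn(3, 16)  # one 16-dim feature vector per user
item_features = torch.randn(2, 16)  # one 16-dim feature vector per item

module = dgl.nn.SAGEConv(16, 8, aggregator_type='mean')
# Output has one row per destination ('item') node: shape (2, 8)
result = module(g, (user_features, item_features))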

For the second problem, we currently don’t have a one-liner for batched-heterograph readout, so you may need to do that yourself. @mufeili could probably add his thoughts on this.


@thiippal See if the workaround here is good for you.


Thank you @thiippal for this thread! Perfect for my question:

Is it possible to have multiple feature matrices for one node type?

I have a heterograph consisting of, let’s say, node types A, B and C. There are n nodes of node type A.
I would like to add a feature matrix of shape (n x l) and another feature matrix of shape (n x k) for node type A. For node type B with m nodes, I would also like to add multiple feature matrices of different sizes, say one of shape (m x p) and one of shape (m x q).

How could I implement this?

Thank you in advance for your answer!

Hi @sopkri, does the example below help?

import dgl
import torch

g = dgl.heterograph({
    ('user', 'follows', 'user'): [(0, 1), (1, 2)],
    ('user', 'plays', 'game'): [(0, 0), (1, 0), (1, 1), (2, 1)],
    ('developer', 'develops', 'game'): [(0, 0), (1, 1)],
})
# Two feature matrices of different dimensionality on the same node type
g.nodes['user'].data['h1'] = torch.randn(3, 1)
g.nodes['user'].data['h2'] = torch.randn(3, 2)
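# The same pattern extends to the other node types with their own
# feature sizes (the shapes below are just illustrative):
g.nodes['game'].data['h1'] = torch.randn(2, 3)
g.nodes['game'].data['h2'] = torch.randn(2, 5)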

Hi,
I have a question regarding graph representation learning:
Is it possible to learn a graph-level representation for a heterograph?
Thanks.

If you have multiple heterogeneous graphs and you want to perform graph property prediction, then I think graph-level representations for heterogeneous graphs are natural.

Thanks for your reply.
In the case of a heterograph with multiple types of nodes and multiple types of edges, we can obtain node-level representations (using, for example, an RGCN model). Based on these representations, can we generate a vector representation for the whole graph (using readout functions, the same way we do with homogeneous graphs)?

In general, we could develop something like the following, modeled on our DGLHeteroGraph.multi_update_all API:

def multi_node_readout(g, nfeats_dict, readout_dict, cross_reducer=None):
    """Readout for heterogeneous graphs based on node features.

    Parameters
    ----------
    g : DGLHeteroGraph
    nfeats_dict : dict
        Mapping node type to the corresponding node features
    readout_dict : dict
        Mapping node type to the corresponding readout function for 
        node features
    cross_reducer : str or None
        The way to combine readout from different node types, which can be 
        sum, min, max, mean, stack. If None, simply return them in a list.
    """

We could also have a counterpart of the one above for edge features, dgl.multi_edge_readout(). What do you think?
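To make the idea concrete, here is a rough sketch of what multi_node_readout could do internally, in plain PyTorch (only an illustration of the proposed interface, not an existing DGL function; each function in readout_dict is assumed to map the (N, d) feature matrix of one node type to a single vector):

import torch

def multi_node_readout(g, nfeats_dict, readout_dict, cross_reducer=None):
    # g is unused in this sketch because the features are passed in explicitly.
    # Per-type readout: one vector per node type.
    per_type = [readout_dict[ntype](nfeats_dict[ntype]) for ntype in nfeats_dict]
    if cross_reducer is None:
        return per_type
    # Combining across types requires the per-type readouts to have the same size.
    stacked = torch.stack(per_type, dim=0)
    if cross_reducer == 'stack':
        return stacked
    if cross_reducer == 'sum':
        return stacked.sum(dim=0)
    if cross_reducer == 'mean':
        return stacked.mean(dim=0)
    if cross_reducer == 'max':
        return torch.max(stacked, dim=0)[0]
    if cross_reducer == 'min':
        return torch.min(stacked, dim=0)[0]
    raise ValueError('Unknown cross_reducer: {}'.format(cross_reducer))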


Thank you @mufeili for the reply. This is exactly what I was looking for! Sometimes it can be so easy and logical.


Is there a function to generate a train/validation/test split for a heterograph?

I am trying to do link prediction on a heterograph with 3 different node and 6 different edge types.
What I am considering here is that for a heterograph, it would be necessary to ensure that

  1. for every edge type, there is the same proportion of edges in the train and test sets, so that there is no bias towards any type of edges
  2. all nodes are connected after sampling and there are no nodes without edges

I was thinking of a function similar to the generate_sampled_graph_and_labels() function from the RGCN example.

Is there any function in the DGL library that could produce train and test heterographs fulfilling these two conditions (1. and 2.)?

Thank you a lot in advance!

Hi,
I was wondering what the recommended practice is for storing a large number (~600K) of HeteroGraphs to disk?
It seems that dgl.data.utils.save_graphs does not support HeteroGraph right now.

My current approach is to store the hetero edges as dicts and the corresponding node features as np.arrays. However, data loading becomes incredibly slow since I need to reconstruct every HeteroGraph from its edge dict. Moreover, I cannot pre-load the HeteroGraphs into main memory because the whole dataset is too large.

Thanks!

Sorry for the delay in answering. Regarding multi_node_readout, you wrote that we apply a readout function to each node type and then combine the resulting readouts using the cross_reducer parameter. This answers my question perfectly. However, by generating a representation of the whole graph this way, won’t we lose a lot of information? (Is this the best way to get a graph-level representation?)

Another question about heterograph graph-level representation learning:
Given a heterograph with N types of nodes and M types of edges, we can pass it through an RGCN-based architecture and get node-level representations for all node types (a dictionary of hidden representations).
To obtain the graph-level representation: can we just concatenate all the node representations? Do you think that applying another readout function (sum, average) would lead to a poor-quality representation?

You mean graph classification? You can try concatenating all the node representations, just like in homogeneous graph classification.
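For example (a small sketch with made-up dimensions), one way to do this is to apply a readout per node type and concatenate the per-type vectors:

import torch

# Hypothetical per-type node representations, e.g. the output of an RGCN
h_dict = {
    'user': torch.randn(3, 8),
    'game': torch.randn(2, 8),
    'developer': torch.randn(2, 8),
}
# Mean-readout per node type, then concatenate across types
graph_repr = torch.cat([h.mean(dim=0) for h in h_dict.values()])  # shape: (24,)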

For the first question, the easiest way is to split the raw input data by splitting the triplets according to edge type, and we can keep the split files.
If you follow generate_sampled_graph_and_labels(), adding reverse edges ensures that the nodes in the sampled graph have connected edges.
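A per-edge-type split could look roughly like this (a sketch, not a built-in DGL function; it only covers point 1, so reverse edges and connectivity from point 2 still need separate handling):

import numpy as np

def split_edges_per_type(hg, train_ratio=0.8, seed=0):
    """Split edge IDs per canonical edge type so that every edge type
    keeps the same train/test proportion."""
    rng = np.random.RandomState(seed)
    train_eids, test_eids = {}, {}
    for canonical_etype in hg.canonical_etypes:
        num_edges = hg.number_of_edges(canonical_etype)
        perm = rng.permutation(num_edges)
        cut = int(train_ratio * num_edges)
        train_eids[canonical_etype] = perm[:cut]
        test_eids[canonical_etype] = perm[cut:]
    return train_eids, test_eids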

Can you try the approach suggested in this issue?