How to predict the directional edges in a heterogeneous network?

Our input data consist of a table of potential edges and a table of nodes. The ground truth is a set of edges that are known to exist.
For example:
edgetable.csv (potential edges)
dst src attr
A B AtoB
A B BtoA
A C AtoC
B D BtoD

nodetable
node type  attr
A    type2 5
B    type1 3
C    type1 2
D    type2 5

groundtruth
B D negative

Our purpose is twofold:

  1. to determine whether there is a relationship between two types of nodes
  2. to determine the direction of that relationship

We know that RGCN or GraphSAGE might do this, but we don't know how to design the training, validation and test datasets. Could you kindly recommend relevant ideas and tutorials?

You can treat relationships with different directions as different relationships (i.e. a relationship and a separate "reverse" relationship), and edges with different directions as different edges (i.e. the edge from u to v and the edge from v to u are distinct).
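
As a rough illustration of this idea, the sketch below groups directed edges by a (source type, relation, destination type) key and optionally adds an explicit reverse relation for each edge. The edge list is only loosely modeled on the tables in the question, and its orientation, the `rev_` prefix and the function name `build_typed_edges` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical directed edges as (src, dst, relation) triples.
# The orientation of each edge is an assumption made for this example.
edges = [
    ("A", "B", "AtoB"),
    ("B", "A", "BtoA"),
    ("A", "C", "AtoC"),
    ("B", "D", "BtoD"),
]
node_type = {"A": "type2", "B": "type1", "C": "type1", "D": "type2"}


def build_typed_edges(edges, node_type, add_reverse=True):
    """Group edges by (src_type, relation, dst_type).

    The edge u -> v and the edge v -> u are stored under different
    relations, so a model can score them independently.
    """
    typed = defaultdict(list)
    for src, dst, rel in edges:
        typed[(node_type[src], rel, node_type[dst])].append((src, dst))
        if add_reverse:
            # Explicit "reverse" relation; the "rev_" prefix is illustrative.
            typed[(node_type[dst], "rev_" + rel, node_type[src])].append((dst, src))
    return dict(typed)


typed_edges = build_typed_edges(edges, node_type)
for key, pairs in sorted(typed_edges.items()):
    print(key, pairs)
```

A dictionary keyed this way mirrors the canonical edge types (source type, edge type, destination type) that DGL uses for heterogeneous graphs; mapping node names to integer IDs is left out here.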

For link prediction on a heterogeneous graph, you can refer to dgl/link_predict.py at master · dmlc/dgl · GitHub.

Thank you for your kind suggestion. However, we have no predefined training, validation and test sets for our samples. What would you suggest for setting up those sets?

The most common strategy is to split the edges uniformly at random into training, validation and test sets, with something like an 8:1:1 ratio. Depending on your data, task and use case, you may want a strategy other than uniform splitting. I'm not sure what exactly your case is, so I can only give some examples:

  • If your use case is predicting future edges, you may need to split the training/validation/test sets according to timestamps.
  • If your use case is to predict connections to/from unseen nodes, then your edge split should be grouped by their incident nodes instead.
  • If your dataset is tiny, like a hundred edges or so, then cross-validation might be your best bet.
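
The first three strategies above can be sketched in plain Python. This is a minimal illustration rather than code from the DGL examples: the function names and the `timestamps` and `held_out_nodes` arguments are hypothetical, and edges are assumed to be (src, dst) tuples.

```python
import random


def _cut(ordered, ratios):
    """Cut an ordered list into three consecutive parts by the given ratios."""
    n_train = int(len(ordered) * ratios[0])
    n_valid = int(len(ordered) * ratios[1])
    return (
        ordered[:n_train],
        ordered[n_train:n_train + n_valid],
        ordered[n_train + n_valid:],
    )


def uniform_split(edges, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the edges, then cut them into train/valid/test."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    return _cut(shuffled, ratios)


def temporal_split(edges, timestamps, ratios=(0.8, 0.1, 0.1)):
    """Sort by timestamp so validation and test edges come after training edges."""
    order = sorted(range(len(edges)), key=lambda i: timestamps[i])
    return _cut([edges[i] for i in order], ratios)


def node_grouped_split(edges, held_out_nodes):
    """Send every edge that touches a held-out node to the test set."""
    held = set(held_out_nodes)
    train = [e for e in edges if e[0] not in held and e[1] not in held]
    test = [e for e in edges if e[0] in held or e[1] in held]
    return train, test


# Toy data: 100 hypothetical edges between integer node IDs.
edges = [(i, (i * 7 + 3) % 100) for i in range(100)]
train, valid, test = uniform_split(edges)
print(len(train), len(valid), len(test))
```

For link prediction, negative (non-existent) edges usually also have to be sampled for each split; that step is omitted here.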

Your timely reply is helpful! Specifically, the goal in our case is to select the most essential interactions (edges) in edgetable.csv. We plan to compute a weight for each edge with a GNN and rank the edges by that weight. Might unsupervised GraphSAGE be helpful (dgl/train_sampling_unsupervised.py at master · dmlc/dgl · GitHub)? However, since it is an unsupervised method, why does this code contain test and validation parts?

We do not have enough known direction information for training. Could you kindly recommend an unsupervised method to infer direction?

If you have very few labels, you can adapt traditional unsupervised learning methods or semi-supervised learning methods from deep learning. The following might give you an idea of how to adapt existing approaches to your problem. It's just an example, so I can't guarantee that it will work for your case.

For instance, a typical strategy for dealing with few labels is to first learn a general representation of each data instance (e.g. using an autoencoder), and then train a small model (like a linear classifier) on the learned representations. To adapt this to your problem, you can:

  • First learn a general representation of each edge, which can be expressed as a function of learned general node representations and edge features.
  • There are multiple ways to learn general node representations, e.g. using GraphSAGE or Graph Autoencoders.
  • Once the general node representations are learned, you can then train a simple linear classifier that takes the incident node representations (and edge features if applicable) as inputs. You may also want to consider higher-order features like what you would do in feature engineering.
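
The last step can be sketched as follows, under loudly stated assumptions: the node embeddings are random placeholders standing in for the output of GraphSAGE or a graph autoencoder, the direction labels are synthetic (derived from the first embedding coordinate purely so the example has something to learn), and the logistic regression is hand-written to keep the block free of dependencies. None of the names come from DGL.

```python
import math
import random

rng = random.Random(0)
DIM = 4
NUM_NODES = 40

# Placeholder embeddings; in practice these would come from a trained GNN.
emb = {n: [rng.uniform(-1, 1) for _ in range(DIM)] for n in range(NUM_NODES)}


def edge_features(emb, src, dst):
    """Ordered concatenation [h_src, h_dst].

    Swapping src and dst changes the input, so a classifier on these
    features can tell the two directions apart.
    """
    return emb[src] + emb[dst]


def predict_proba(w, b, x):
    """Sigmoid of the linear score w . x + b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=300):
    """Logistic regression fitted by full-batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = predict_proba(w, b, x) - t
            for j, xj in enumerate(x):
                gw[j] += err * xj
            gb += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


# Synthetic labels (an assumption for this demo only): the ordered pair
# (u, v) is "positive" when u's first coordinate exceeds v's.
pairs = [(u, v) for u in range(NUM_NODES) for v in range(NUM_NODES) if u != v]
rng.shuffle(pairs)
labeled, held_out = pairs[:200], pairs[200:300]


def label(u, v):
    return 1 if emb[u][0] > emb[v][0] else 0


X = [edge_features(emb, u, v) for u, v in labeled]
y = [label(u, v) for u, v in labeled]
w, b = train_logreg(X, y)

correct = sum(
    (predict_proba(w, b, edge_features(emb, u, v)) > 0.5) == bool(label(u, v))
    for u, v in held_out
)
accuracy = correct / len(held_out)
print("held-out accuracy:", accuracy)
```

The predicted probabilities can also serve as edge scores for ranking. With real data, the few known edges would replace the synthetic labels, and edge attributes could be appended to the feature vector.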