Node Classification over Multiple Graphs


I am currently trying to get started using DGL, by building a model that determines the maximum independent set of a graph. Recent (and older) papers do this by classifying each node of a graph with a score in [0,1], and then explore the solution space from this assignment.

However, I already struggle with the dataset preparation. The DGL documentation states how to create a dataset for node classification and graph classification. However, the node classification example assumes there only is a single graph, which is not true for MIS prediction. In MIS, we want to find one MIS per graph, i.e., label all vertices of a graph, but we have to train on a lot of graphs to learn this. I don’t understand how I should build a DGLDataset that contains multiple graphs, but labels per vertex on a graph, not per graph. Things I am not sure about include:

  1. how is the index in __getitem__ mapped to a vertex/graph? Do we return all labels for all vertices of graph idx? How should the Dataset class be structured in general?

  2. How does batching work in this case? Because we will have to calculate the loss using all vertex labels of a single graph, but then somehow deal with multiple losses.

Any help/pointers are very much appreciated. Thank you ver ymuch.


1 Like

For node classification, let’s say you have two graphs, one with 3 nodes and one with 5 nodes. If __getitem__ receives a value i in \{0, 1, 2\}, it should return the label of the corresponding node in the first graph. If it receives a value i in \{3, 4, 5, 6, 7\}, it should return the label of node i - 3 in the second graph. The high level idea is that we can index the nodes in all graphs as if we batch all the graphs. When __getitem__ receives a node ID, it will first map the ID to the graph ID and then the node ID in the corresponding graph. You can perform loss computation in a similar fashion.

Thank you very much for your reply!

But does this mean that already within the DGLDataset implementation, we batch our several graphs into one graph that consists of multiple independent components? Because what I got from the documentation, e.g. in snippets like this:

from dgl.dataloading import GraphDataLoader
dataloader = GraphDataLoader(

was that the dataset should not define batches themselves, because the GraphDataLoader is responsible for batching multiple graphs.

The issue I see currently is that within our training loop, we need to have something along these lines:

for epoch in epochs:
    for graph in graphs:

And if I just have an incremental mapping of node to graph, we have an issue, as the training loop doesn’t know whether a node ID maps to a new graph or not. So maybe the problem wasn’t too clear: For the MWIS, we classify each node of a graph, but as a group, i.e., each node is either path of the set of vertices we are looking for or it is not. Hence, the loss is calculated per graph, but we need to classify each node separately, in order to derive the loss. It’s not something we do node-by-node.

This is why I thought __getitem__(i) should return a list of all labels for graph i, instead of “flattening” this labelmapping. But this is not what you suggest.

Thank you very much again for your help. I hope I made my problem more clear now.

I think GraphDataLoader is more about graph property prediction with multiple graphs. Another possibility is to have one dataset object per graph. During training, you can first shuffle the dataset objects and then iterate over the dataset objects as you did in the for loop.

1 Like


I implemented the model as described in the original paper [1] that introduced the idea of calculating the MIS using a GCN. As you suggested, I have one dataset object per graph, and iterate over the numerous dataset objects. The training loop can be seen in [2], and the basic GCN as in the DGL documentation (+ hindsight loss) in [3].

However, the training is very slow, and as a newbie to DGL, I am not too sure on how to debug this. In our sever, we have a Tesla T4. The authors of [1] train on a Titan GPU, and claim to train 200 epochs of 38000 graphs in 16 hours. They implemented their model on top on the original GCN implementation in Tensorflow 1. For me, a single epoch in DGL takes around 2 hours (with 2 instead of 32 outputs, so less computation!) (cuda is enabled and works).

Could this be an issue with my training loop? I don’t think that the original GCN implementation is much more efficient than DGL. Any idea on how to approach an issue like this is very much appreciated, because it’s like a blackbox that for some reason is slow to me, currently.

Thank you so much for your help.


[1] [1810.10659] Combinatorial Optimization with Graph Convolutional Networks and Guided Tree Search
[2] import argparseimport timeimport numpy as npimport networkx as nximport -
[3] import torchimport torch.nn as nnfrom dgl.nn.pytorch import GraphConvimpor -

Hi Maximilian,

  1. How dense are the graphs?
  2. Did the authors release the code? Do you know if they train over one graph a time? My guess is that you probably still need to batch multiple graphs for each iteration. One possibility is that for each iteration you batch a list of randomly sampled graphs and keep the original membership of the nodes for later loss computation over graphs.
1 Like


thank you for your answer!

  1. Is there a density measure you want to know? Other than that, the graphs are converted SAT instances (i.e. generated by reducing a SAT instance to a MIS problem on the graph), and have ~1200 vertices each (because we have 400 clauses with 3 literals each), where the three vertices that represent a clause are all connected, and the vertices of the inverted literals are connected as well. So I’d say it’s rather dense. However, it’s exactly the same training data than for the original paper, so it (in theory) shouldn’t be the root cause of the issue.
  1. Yes, the authors released the code: NPHard/ at master · intel-isl/NPHard · GitHub While I think it’s kinda hard to read, in the paper, they state they “use Adam with single-graph mini-batches […}. Training proceeds for 200 epochs and takes about 16 hours on a desktop…”. I interpret single-graph mini batches that they did it the same way like I did it in my code. So I could try to batch, but a) I’m not sure whether it would work, as I believe the SAT training graphs could in theory also contain multiple unconnected components, such that the network couldn’t understand what the difference between another graph and a connected component is (if necessary, I’m not sure about that one) and b) I still think maybe there is an issue in my implementation (even though it’s very simple) because I just don’t believe that DGL in the same setting would be that much slower than the original 2017 Tensorflow1 GCN code the author’s repository uses…

But already thank you so much for your help.


  1. How many edges are there in a graph of 1200 vertices?
  2. Have you tried running the authors’ code? Did you see the same speed? The hardware you have might also be different from the one the authors have.

Hi. I have a question related to this. Is it possible to train a GNN for link prediction on multiple graphs instead of just one? Thank you.

Absolutely. You can batch multiple graphs, and then training a link prediction model on multiple graphs will be the same as training a link prediction model on a single graph.

1 Like

Hi. I also have a question related to Node Classification over Multiple Graphs. When you batch multiple graphs, How does the machine know that node 1 of graph 1 is the same as node 1 of graph 2?

I have trained my model and it gives me de same % of accuracy with batch multiple graphs and just with one of them. So I am guessing they do not know.

  1. Do these graphs have the same structure?
  2. Why do you expect node 1 in both graphs to be of the same class?

Thanks for the answer.

  1. Yes, they do have the same structure
  2. The graph has values of a specific point in time. But as only using one graph means only 80 nodes, I have decided to get more data of different times. Meaning I have 10 graphs with the same structure, but different values. Moreover, node 1 of graph 1 represents the same node as node 1 graph 2, the only difference is the values. My question is if with batch they know they are the same node only with different values.

If you only use one graph with multidimensional input features representing data of different times, then you won’t have this issue. If you need to explicitly create multiple graphs and batch them, most likely you will need to explicitly assign labels to the nodes and make copies of the node labels across different times.

1 Like

Ok. Thank you for your help.

Hi again.

And is it possible to train a R-GCN on multiple graphs without batching, with a loop? A problem I see is that since they have different number of nodes, it could be a problem with the model parameters, since one of the them (the hidden feats) should be the number of nodes, right?

RGCN(in_features, hidden_features, out_features, rel_names)

How could I solve this?

Thanks again.

To clarify, are we learning node embeddings from scratch? If yes, then the number of nodes will matter, otherwise no. Do you have a pointer to the RGCN implementation you are speaking of?

1 Like

Thanks for your answer. I am following: 5.3 Link Prediction — DGL 0.6.1 documentation

We dont learn them from scratch, since they come with their own attributes (tensors). Lets say every graph represents information from a single text, where nodes are entitites and we have two types of links. I would like to train the model in each of these graphs (which have different number of nodes, since one represents a different text) and then predict the links of new unseen graphs.

Another way of doing it would be by batching, but in that case I have another problem. How should I adapt the next negative graph function from the previous tutorial, so when I build the negative graph every graph from the batch doesn’t mix with eachother?

def construct_negative_graph(graph, k, etype):
    utype, _, vtype = etype
    src, dst = graph.edges(etype=etype)
    neg_src = src.repeat_interleave(k)
    neg_dst = torch.randint(0, graph.num_nodes(vtype), (len(src) * k,))
    return dgl.heterograph(
        {etype: (neg_src, neg_dst)},
        num_nodes_dict={ntype: graph.num_nodes(ntype) for ntype in graph.ntypes})

Thank you so much.

Perhaps you can do a for loop for negative graph construction to avoid mixing the nodes across graphs. Other than that you can still train the model on batched graphs.

1 Like

Thank you so much for your help.

And how could I do the split between train and test following that same tutorial, which doesn’t include that step?