Load_dataset for other datasets besides citation?

In the init.py for data, there is a function load_dataset that works for Cora/Citeseer/Pubmed. However, that same file imports many other datasets as well. Is there an analogous load_dataset function to access these datasets too? If not, is there a particular reason why that function does not support all of the included datasets?

It’s just due to historical reasons: the original GCN implementation used the load_dataset interface. We now recommend using ds = dgl.data.CoraDataset(), following PyTorch’s dataset interface.

Gotcha, so for Cora (and Pubmed, etc.) we should now use the dgl.data.MyDataSet() way of doing things, as with the other standard datasets in DGL. One more question: is there any way to programmatically get a list of supported datasets? Something like dgl.data.datasets?


You can find the list at https://docs.dgl.ai/en/latest/api/python/data.html#dataset-classes

Hey @VoVAllen,

Do you have an example of loading a NetworkX graph to be used in GCN (the GCN example)? I found it very difficult to load my NetworkX graph to train/test it.

Thanks a lot


Do you have edge/node data now? I would suggest creating a DGL graph from the edge list.

For example

import dgl

# G is a networkx graph
edge_list = list(G.edges())
src, dst = zip(*edge_list)   # unzip into source and destination node lists
g = dgl.DGLGraph()
g.add_edges(src, dst)
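As a side note, the zip(*edge_list) idiom above just transposes a list of (src, dst) pairs into two parallel tuples, which is the form add_edges expects. A minimal, DGL-free illustration:

```python
# Transpose a list of (src, dst) edge pairs into parallel src/dst tuples.
edge_list = [(0, 1), (0, 2), (1, 2)]
src, dst = zip(*edge_list)
print(src)  # (0, 0, 1)
print(dst)  # (1, 2, 2)
```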

Thanks @VoVAllen,

I can convert my networkX graph into a dgl graph (which can be found here https://drive.google.com/file/d/1g9eJdyGuzDzEp_JES8yUXFaQV1Zt9iId/view?usp=sharing).

My question is: how can I load my DGL graph into the GCN example for training and testing?

I have been told to create a new data class (similar to Cora), but I simply don’t know how to do it without step-by-step instructions. I can make my dataset similar to cora.content and cora.cites, but where to put those files and how to configure the code is confusing to me.

Thanks in advance.


Instead of directly looking at how the data class is implemented, I would recommend inspecting which members of the data object the GCN example uses. By looking at the code here we can see the concrete list of things needed to make the GCN example work:

  • features: the node feature tensor.
  • labels: the node label tensor.
  • train_mask, val_mask, test_mask: 0-1 masks on the nodes indicating whether each node belongs to the training, validation, or test set.
  • num_labels: number of possible labels (or number of classes).
  • graph: the DGLGraph itself.

So you don’t have to implement a brand-new data object like CoraDataset in order to run the GCN example with your data. Instead, you can prepare those variables yourself and replace all the data.foo occurrences with your own values.
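As a rough, hypothetical sketch of preparing those split variables yourself (plain Python lists here for clarity; in practice you would build torch tensors of the same shape):

```python
# Hypothetical split: first 60% of nodes train, next 20% validation, last 20% test.
num_nodes = 10
train_mask = [1 if i < 6 else 0 for i in range(num_nodes)]
val_mask = [1 if 6 <= i < 8 else 0 for i in range(num_nodes)]
test_mask = [1 if i >= 8 else 0 for i in range(num_nodes)]

# Sanity check: every node falls into exactly one split.
assert all(t + v + s == 1 for t, v, s in zip(train_mask, val_mask, test_mask))
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 6 2 2
```

The features, labels, and graph members would similarly come straight from your own data.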

Please feel free to follow up. Thanks.

Thanks @BarclayII,

it becomes a bit clearer now. Let’s say I have trained the model and saved it with this command

torch.save(model.state_dict(), PATH)

and load it again with

model.load_state_dict(torch.load(PATH))

If I have a new graph and want to classify its nodes using my trained model, how should I proceed? Does DGL have something like model.infer(model, DGLgraph) for node classification on a new graph?

Thanks so much

If your model’s forward function is written to accept a graph argument, and your model’s parameter list does not depend on the graph itself:

def forward(self, g, features):
    # g: the graph
    # features: input node features

Then you can just load in your new graph and call your loaded model with it

# Say you trained with this:
pred = model(g, features)
loss = compute_loss(pred, ...)
# Later on you can just run this:
new_pred = model(new_g, new_features)
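To illustrate the principle with a toy, DGL-free stand-in (every name below is made up for illustration): the “model” is just a weight vector, and the “graph” is an adjacency list that only determines which neighbors get averaged. Because the parameters never reference a particular graph, the same trained weights apply unchanged to a new graph:

```python
def forward(weights, adj, features):
    # One round of mean-aggregation over each node's neighbors,
    # followed by a weighted sum with the (graph-independent) weights.
    out = []
    for neighbors in adj:
        agg = [sum(features[j][k] for j in neighbors) / max(len(neighbors), 1)
               for k in range(len(weights))]
        out.append(sum(w * a for w, a in zip(weights, agg)))
    return out

weights = [0.5, -0.25]                         # "trained" parameters
g1 = [[1], [0]]                                # 2-node training graph (adjacency lists)
feats1 = [[1.0, 2.0], [3.0, 4.0]]
g2 = [[1, 2], [0], [0]]                        # a different, 3-node graph
feats2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(forward(weights, g1, feats1))            # [0.5, 0.0]
print(forward(weights, g2, feats2))            # [0.125, 0.5, 0.5]
```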

There are two caveats though:

  1. If your model’s parameter list depends on the graph itself (e.g. you specified some trainable parameters for every node), then you cannot do this. In fact, you may want to think about how to define the parameters of new, unseen nodes on the new graph.
  2. If the graph is too big to fit on a single GPU and you were training with minibatch training and neighbor sampling, you may want to check the tutorial to see the difference between minibatch training and inference. The graph during inference doesn’t have to be the same as the one during training, and you can just replace the graph argument with your new graph when calling the inference function.

Thanks @BarclayII,

I still use the GCN example, which I think uses forward(self, features) as its forward function. In this case, how can I alter/prepare a graph whose nodes I want to infer, so that I can do training and inference separately?

Thanks, and sorry for asking some basic questions

For the GCN example, since the model doesn’t actually depend on the graph to instantiate, you can simply move the graph from __init__() to forward(). That is, rewrite the model to use a g argument in the forward function instead of self.g, and pass g in directly:

class GCN(nn.Module):
    def __init__(self, ...):
        # remove g from the arguments and don't store it as a member
        ...

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(g, h)
        return h

model = GCN(...)
pred = model(g, features)

In this case, you can then do

train_pred = model(train_g, train_features)
test_pred = model(test_g, test_features)

Thanks @BarclayII,

I also found out that the example doesn’t consider edge features for node classification with GCN. As my graph has edge features, how shall I incorporate them in the GCN class?

Thanks a lot