Reading a Set of GML / JSON Files as Input Dataset

Hi,
I couldn't find any sample code that reads the train/test dataset from a set of graph files generated by an external tool. In my case, I have a few hundred GML files (a JSON version is also available), each of which defines a single labeled multi-digraph. Each graph is a single instance in my binary-classification scenario. I want to read in this set, experiment with several GNNs (GCN, GAT, etc.), and report the evaluation results.

I'm just getting started with DGL, and would highly appreciate it if anyone could help me out with sample code for reading these GML/JSON files in a format that is suitable to work with DGL.

Hi,

You can use networkx to read them (link), and then use DGLGraph.from_networkx to convert each one to a DGLGraph.

1 Like

I was aware of that capability in networkx, and I currently have the code below for reading GML files:

import dgl
from networkx.readwrite import read_gml

nx_graph = read_gml('dataset/test3.gml', label='id')
# Convert the networkx graph to a DGL graph
# (in recent DGL versions, dgl.from_networkx(nx_graph) is preferred).
dgl_graph = dgl.DGLGraph(nx_graph)

But this is just how to read individual graphs.

All examples on DGL's GitHub do something like this at the beginning of main:

import argparse
from dgl.data import register_data_args, load_data

parser = argparse.ArgumentParser(description='GAT')
register_data_args(parser)
args = parser.parse_args()
data = load_data(args)
...

With such example code, it is very unclear how I should read a set of graph files and perform my training/testing.

For a dataset of many graphs, check out the treelstm example and the SST dataset it uses. Generally you want to define:
(1) A dataset class that implements __getitem__ and __len__. Each __getitem__ returns one graph.
(2) A collate function to batch multiple graphs into one using dgl.batch.

Other similar datasets:

  • MiniGC: A synthetic dataset for graph classification.
  • TUDataset: Classical dataset for graph kernel benchmark.

Thanks for the tip. I have to say this doesn’t look very nice/clean for such a basic task as loading your train/test data!

In my searches I did come across some utility functions for saving/loading graphs:

dgl.data.utils.save_graphs(...)
dgl.data.utils.load_graphs(...)

These functions look promising and much cleaner than defining a new Dataset class.

Is there any example which shows how to use these functions for training/evaluating DGL GNNs (GCN, GAT, etc)?

Hey @ghaffarian,

Do you have an example on how to load your networkX dataset to train/test?

I find it very difficult to accommodate my networkX dataset into GCN. It is such a basic task, as you said, to load the dataset, but I couldn't find it anywhere in this library…

Thanks so much

Here is a simple example. Suppose your graph is stored in networkx's adjacency JSON format:

{
  "directed": false,
  "multigraph": false,
  "graph": [],
  "nodes": [{"id": 0}, {"id": 1}],
  "adjacency": [
    [{"id": 1}], [{"id": 0}]
  ]
}

You can load it into networkx with the json_graph helpers and then convert the result to a DGL graph:

import dgl
from networkx.readwrite import json_graph

data = ...  # your graph data in JSON
nx_g = json_graph.adjacency_graph(data)
# Note: use from_networkx here; dgl.graph() expects edge lists, not a networkx graph.
g = dgl.from_networkx(nx_g)

The example can be adapted to GML files too.